What are Top-K & Top-P in LLM?

In this blog, we explain how top-k and top-p influence large language models by controlling token sampling probabilities, balancing randomness, and improving output consistency.

For detailed information, please watch the YouTube video: What are Top-K & Top-P in LLM?: Simply Explained

When working with large language models, a common challenge is adjusting parameters like top-k, top-p, and temperature. These parameters directly influence the model’s behavior and output. In this blog, we’ll explore their impact on text generation.

The Role of Parameters in Large Language Models

Imagine we prompt a model with “how are you?”. The model processes this input to predict the next token. Here’s what happens:

  1. Token Representation:
    The model computes vectorized representations for each token in the input (e.g., “how,” “are,” “you,” and “?”). These vectors encode the semantic meaning of each token.
  2. Probability Distribution:
    For the last token (“?”), the model predicts a probability distribution over all tokens in its vocabulary (e.g., 100,000 tokens). Each token receives a probability indicating how likely it is to be the next token.
  3. Token Sampling:
    The next token is sampled based on these probabilities. Higher probabilities increase the likelihood of selection. However, low-probability tokens, while unlikely, can still be sampled, much like winning a low-odds lottery (see the sketch after this list).
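
To make this concrete, here is a minimal sketch in Python with NumPy of how a distribution over the vocabulary is formed and sampled without any controls. The five-token vocabulary and the logit values are made up purely for illustration; a real model scores on the order of 100,000 tokens.

```python
import numpy as np

# Toy example: one made-up logit (raw score) per token in a tiny five-token
# vocabulary, for the position right after "?".
vocab = ["I", "fine", "am", "thanks", "banana"]
logits = np.array([2.3, 1.9, 1.1, 0.4, -3.0])

# Softmax turns the logits into a probability distribution that sums to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Unrestricted sampling: every token can be picked in proportion to its
# probability, even a very unlikely one like "banana".
rng = np.random.default_rng(0)
next_token = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```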

Without controls, this randomness can lead to irrelevant or incoherent outputs. That’s where top-k and top-p parameters come in, allowing us to refine the sampling process.

Top-k Sampling

Top-k restricts sampling to the k tokens with the highest probabilities.

Example:

  • Vocabulary probabilities:
    • “I”: 0.1
    • “fine”: 0.07
    • “am”: 0.03
    • Other tokens: negligible probabilities

If k = 3, only the tokens “I,” “fine,” and “am” are considered for sampling. All others are ignored.

Normalizing Probabilities:

Before sampling, the probabilities of the kept tokens (which sum to 0.2) are renormalized to sum to 1 by dividing each by 0.2:

  • “I”: 0.1 / 0.2 = 0.5
  • “fine”: 0.07 / 0.2 = 0.35
  • “am”: 0.03 / 0.2 = 0.15

Now, the model samples from this set. While “I” is the most probable, “fine” or “am” may still be chosen.
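
As a sketch of how this works in practice, the hypothetical helper below (not code from any specific library) keeps the k most probable tokens, renormalizes them, and samples one. The probabilities are the illustrative values from the example above.

```python
import numpy as np

def top_k_sample(tokens, probs, k, rng):
    """Keep only the k most probable tokens, renormalize, then sample one of them."""
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[::-1][:k]        # indices of the k highest-probability tokens
    kept = probs[top] / probs[top].sum()     # renormalize so the kept probabilities sum to 1
    return tokens[rng.choice(top, p=kept)]   # sample only from the kept tokens

# Illustrative values from the example above; everything else is negligible.
tokens = ["I", "fine", "am", "<rest of the vocabulary>"]
probs = [0.10, 0.07, 0.03, 0.00]

rng = np.random.default_rng(0)
print(top_k_sample(tokens, probs, k=3, rng=rng))  # only "I", "fine", or "am" can appear
```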

Key Benefit:
Top-k eliminates unlikely tokens, reducing randomness and making outputs more predictable.

Top-p Sampling (Nucleus Sampling)

Top-p takes a different approach, focusing on cumulative probability. Instead of fixing the number of tokens, it dynamically adjusts the candidate pool based on a probability threshold.

Example:

  • Vocabulary probabilities:
    • “I”: 0.1
    • “fine”: 0.07
    • “am”: 0.03
    • … (other tokens follow)

If p = 0.7, tokens are included in the pool until their cumulative probability reaches or exceeds 0.7:

  • “I” (0.1) → cumulative = 0.1
  • “fine” (0.07) → cumulative = 0.17
  • “am” (0.03) → cumulative = 0.2
  • … continue until the cumulative probability reaches or exceeds 0.7

Tokens outside this pool are ignored. The model then samples from the remaining tokens.
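
Here is a minimal sketch of top-p sampling. The function and the probability values are hypothetical (chosen so the pool stays small), but the cumulative-threshold logic is exactly the one described above.

```python
import numpy as np

def top_p_sample(tokens, probs, p, rng):
    """Keep the smallest set of most probable tokens whose cumulative probability
    reaches or exceeds p, renormalize, then sample one of them."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]              # tokens sorted from most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # how many tokens are needed to reach p
    pool = order[:cutoff]
    kept = probs[pool] / probs[pool].sum()       # renormalize the pool before sampling
    return tokens[rng.choice(pool, p=kept)]

# Hypothetical probabilities, chosen so the pool stays small; a real vocabulary
# would have ~100,000 entries with much flatter values.
tokens = ["I", "fine", "am", "good", "today", "banana"]
probs = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]

rng = np.random.default_rng(0)
print(top_p_sample(tokens, probs, p=0.7, rng=rng))  # pool = "I", "fine", "am" (0.30 + 0.25 + 0.20 = 0.75)
```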

Key Benefit:
Top-p adapts the candidate pool size dynamically, providing a balance between randomness and control.

Conclusion

Parameters like top-k, top-p, and temperature are powerful tools for tailoring the behavior of large language models. By adjusting these settings, we can fine-tune the model’s output to be more consistent, relevant, or creative—depending on the use case. Mastering these parameters enables better control and unlocks the true potential of language models.
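
As an illustration of how the three parameters can work together, here is one possible way to combine temperature, top-k, and top-p in a single sampler. The function name and the logits are made up, and real inference libraries may apply these steps in a slightly different order.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Illustrative sampler combining temperature, top-k, and top-p.
    Real inference libraries differ in details, but the overall flow is similar."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                          # temperature-scaled softmax
    order = np.argsort(probs)[::-1]               # most probable tokens first
    if top_k is not None:
        order = order[:top_k]                     # keep only the k most probable tokens
    if top_p is not None:
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        order = order[:cutoff]                    # shrink the pool to cumulative probability >= top_p
    kept = probs[order] / probs[order].sum()      # renormalize the surviving tokens
    return rng.choice(order, p=kept)              # returns the index of the sampled token

# Made-up logits for a tiny vocabulary.
logits = [2.0, 1.5, 1.0, 0.2, -1.0]
print(sample_next_token(logits, temperature=0.8, top_k=4, top_p=0.9))
```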
