How do large language models like GPT decide what word comes next? Why do they sometimes produce creative, surprising answers, and other times sound deterministic and safe? The magic lies in decoding strategies and how we control the “creativity” of the model.
In this post, I will break down one of the most important processes in LLMs: decoding. Let’s see how sampling works, how temperature influences it, and how top-k and top-p sampling balance randomness and coherence. My aim is to build both an intuitive and a mathematical grasp of how modern LLMs generate language.
🔢 Step-by-Step: How Decoding Works
1. Model Outputs Logits
When you prompt an LLM (e.g., “The cat sat on the”), it doesn’t immediately give you a word; it first produces a vector of logits, one score per token in its vocabulary. These scores come from the trained neural network and are raw, unbounded numbers (they can be positive or negative) representing how likely each token is to come next.
Example:
"mat" -> 3.0
"couch" -> 2.0
"floor" -> 1.0
"bed" -> 0.5
"banana" -> -1.0
When a language model outputs raw values (called logits) for each token in its vocabulary, those values aren’t probabilities yet. They’re just unnormalized scores — some might be negative, some positive, some large, some small.
To turn these into a probability distribution, we use a mathematical function called softmax.
The softmax function takes a vector of real numbers (logits) and squashes them into a probability distribution — i.e., all values between 0 and 1, and all summing to 1.
Given logits $$z_1, z_2, \ldots, z_n,$$ the probability $$P_i$$ for each token is computed using the softmax function: $$P_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$
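To make this concrete, here is a minimal NumPy sketch of softmax (the max-subtraction inside it is a standard numerical-stability trick, not part of the definition above):

```python
import numpy as np

# Logits from the example above, one raw score per candidate token
tokens = ["mat", "couch", "floor", "bed", "banana"]
logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])

def softmax(z):
    """Squash a vector of logits into a probability distribution."""
    z = z - np.max(z)          # subtracting the max avoids overflow in exp
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token:>8} -> {p:.2f}")
```

Every value lands between 0 and 1 and the vector sums to 1; the printed numbers closely match the rounded example probabilities shown later in this post.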
2. Temperature Scaling with Softmax
Temperature is a hyperparameter that adjusts the sharpness of the probability distribution.
- Low temperature (< 1) → more deterministic
- High temperature (> 1) → more creative/random
Mathematically:
$$ P_i = \mathrm{softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{n} e^{z_j / T}} $$
Where $$z_i$$ is the logit and $$T$$ is the temperature. We first scale the logits by dividing them by T, which either flattens the distribution (T > 1) or sharpens it (T < 1). Softmax then converts the scaled logits into a probability distribution.
Example probabilities (softmax of the logits above at T = 1):
"mat" -> 0.62
"couch" -> 0.23
"floor" -> 0.08
"bed" -> 0.05
"banana" -> 0.02
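The effect of T is easy to see numerically. A small sketch, reusing the logits from the example above:

```python
import numpy as np

logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])  # "mat" ... "banana"

def softmax_with_temperature(z, T=1.0):
    """Divide the logits by T, then apply softmax."""
    z = z / T
    z = z - np.max(z)  # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {np.round(softmax_with_temperature(logits, T), 2)}")
```

At T = 0.5 the top token (“mat”) grabs roughly 86% of the probability mass; at T = 2.0 it drops to about 42%, leaving much more room for the long tail.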
3. Sampling — Choosing the Next Token
Sampling means randomly choosing the next token based on probabilities.
- “mat” has a 62% chance
- “banana” still has a 2% chance → creativity unlocked
- Unlike greedy decoding, this allows diverse outputs
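Sampling itself is one line once the distribution is ready. A sketch using NumPy’s random generator (the seed is an arbitrary choice for reproducibility):

```python
import numpy as np

tokens = ["mat", "couch", "floor", "bed", "banana"]
logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()

rng = np.random.default_rng(42)

# Draw the next token in proportion to its probability:
# "mat" wins most of the time, but "banana" still shows up occasionally
next_token = rng.choice(tokens, p=probs)
print(next_token)
```

Greedy decoding would instead take `np.argmax(probs)` every time and always answer “mat”.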
🎯 Adding Control: Top-k and Top-p Sampling
What is Top-k?
Top-k keeps only the k most probable tokens, discards the rest, and renormalizes.
Top-k = 3 Example:
Keep: ["mat", "couch", "floor"]
Discard: ["bed", "banana"]
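A sketch of top-k filtering over the same example distribution (the helper name `top_k_filter` is mine, not a library API):

```python
import numpy as np

tokens = ["mat", "couch", "floor", "bed", "banana"]
logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()

def top_k_filter(probs, k):
    """Keep the k most probable tokens, discard the rest, renormalize."""
    top = np.argsort(probs)[::-1][:k]   # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[top] = probs[top]
    return filtered / filtered.sum()    # rescale so the survivors sum to 1

probs_k3 = top_k_filter(probs, k=3)
# "bed" and "banana" now have zero probability;
# "mat", "couch", "floor" are rescaled upward to sum to 1
```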
What is Top-p (Nucleus Sampling)?
Top-p includes the smallest set of tokens whose cumulative probability ≥ p (e.g., 0.9).
Top-p = 0.9 Example:
"mat" -> 0.62
"couch" -> 0.23
"floor" -> 0.08 ✅ (total = 0.93)
Only these are kept for sampling.
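Top-p can be sketched the same way; note how the cutoff adapts to the shape of the distribution rather than being a fixed count (again, the helper name is mine):

```python
import numpy as np

tokens = ["mat", "couch", "floor", "bed", "banana"]
logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]              # tokens sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    keep_n = np.searchsorted(cumulative, p) + 1  # number of tokens needed to reach p
    filtered = np.zeros_like(probs)
    filtered[order[:keep_n]] = probs[order[:keep_n]]
    return filtered / filtered.sum()

probs_p = top_p_filter(probs, p=0.9)
# With p = 0.9, "mat" + "couch" + "floor" (≈ 0.93 together) form the nucleus

kept = [t for t, q in zip(tokens, probs_p) if q > 0]
print(kept)
```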
How Temperature, Top-k, and Top-p Interact
They are complementary:
- Temperature reshapes the probability distribution
- Top-k and Top-p filter the candidates
- Their effects compound
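Putting the pieces together, one decoding step might look like the sketch below (`decode_step` is a hypothetical helper; real libraries implement the same ideas with more edge-case handling):

```python
import numpy as np

def decode_step(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Temperature-scale, softmax, filter with top-k/top-p, then sample one token index."""
    rng = rng if rng is not None else np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()                              # softmax of the scaled logits
    if top_k is not None:
        probs[np.argsort(probs)[::-1][top_k:]] = 0.0  # drop everything outside the top k
        probs /= probs.sum()
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        keep_n = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        probs[order[keep_n:]] = 0.0                   # drop everything outside the nucleus
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
idx = decode_step(logits, temperature=0.8, top_k=3, top_p=0.9, rng=np.random.default_rng(0))
```

The order here follows the text: temperature reshapes the distribution first, then top-k and top-p filter the candidates before the final draw.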
⚙️ Interaction Table:
| Temp | Top-k | Top-p | Behavior |
|---|---|---|---|
| Low | Small | 0.8 | Deterministic, safe |
| Medium | Mid | 0.9 | Balanced creativity |
| High | Large | 0.95 | Diverse, possibly risky |
🧪 Practical Tips
- Use low T + top-p = 0.8 for reliable, fact-based outputs
- Use high T + top-p = 0.95 for poetry, storytelling, idea generation
- Use greedy decoding (T → 0, i.e., always pick the highest-probability token) when accuracy matters more than diversity
🧩 Final Takeaway
Decoding isn’t just a technical step — it’s a creative control lever.
With temperature, top-k, and top-p, you shape how a model speaks:
🎯 Focused and precise?
🎭 Wild and imaginative?
🧠 Balanced and human-like?
Understanding decoding gives you the power to make LLMs work the way you want.