How do large language models like GPT decide what word comes next? Why do they sometimes produce creative, surprising answers, and other times sound deterministic and safe? The magic lies in decoding strategies and how we control the “creativity” of the model.
In this post, I will break down one of the most important processes in LLMs: decoding. Let’s see how sampling works, how temperature influences it, and how top-k and top-p sampling balance randomness and coherence. My aim is to build both an intuitive and a mathematical grasp of how modern LLMs generate language.
🔢 Step-by-Step: How Decoding Works
1. Model Outputs Logits
When you prompt an LLM (e.g., “The cat sat on the”), it doesn’t immediately give you a word; it first produces a vector of logits, one score per token in its vocabulary. These scores come from the trained neural network and are raw, unbounded numbers (they can be positive or negative) representing how likely each token is to come next.
Example:
"mat" -> 3.0
"couch" -> 2.0
"floor" -> 1.0
"bed" -> 0.5
"banana" -> -1.0
When a language model outputs raw values (called logits) for each token in its vocabulary, those values aren’t probabilities yet. They’re just unnormalized scores — some might be negative, some positive, some large, some small.
To turn these into a probability distribution, we use a mathematical function called softmax.
The softmax function takes a vector of real numbers (logits) and squashes them into a probability distribution — i.e., all values between 0 and 1, and all summing to 1.
Given logits $$z_1, z_2, \ldots, z_n,$$ the probability $$P_i$$ for each token is computed using the softmax function: $$P_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$
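To make this concrete, here is a minimal NumPy sketch of softmax (the max-subtraction inside it is a standard numerical-stability trick, not part of the definition above):

```python
import numpy as np

# Logits from the example above, one raw score per candidate token
tokens = ["mat", "couch", "floor", "bed", "banana"]
logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])

def softmax(z):
    """Squash a vector of logits into a probability distribution."""
    z = z - np.max(z)          # subtracting the max avoids overflow in exp
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token:>8} -> {p:.2f}")
```

Every value lands between 0 and 1 and the vector sums to 1; the printed numbers closely match the rounded example probabilities shown later in this post.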
2. Temperature Scaling with Softmax
Temperature is a hyperparameter that adjusts the sharpness of the probability distribution.
- Low temperature (< 1) → more deterministic
- High temperature (> 1) → more creative/random
Mathematically:
$$ P_i = \mathrm{softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{n} e^{z_j / T}} $$
Where $$z_i$$ is the logit and $$T$$ is the temperature. We first scale the logits by dividing them by T, which either flattens the distribution (T > 1) or sharpens it (T < 1). Softmax then converts the scaled logits into a probability distribution.
Example probabilities (softmax of the logits above at T = 1):
"mat" -> 0.62
"couch" -> 0.23
"floor" -> 0.08
"bed" -> 0.05
"banana" -> 0.02
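The effect of T is easy to see numerically. A small sketch, reusing the logits from the example above:

```python
import numpy as np

logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])  # "mat" ... "banana"

def softmax_with_temperature(z, T=1.0):
    """Divide the logits by T, then apply softmax."""
    z = z / T
    z = z - np.max(z)  # numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {np.round(softmax_with_temperature(logits, T), 2)}")
```

At T = 0.5 the top token (“mat”) grabs roughly 86% of the probability mass; at T = 2.0 it drops to about 42%, leaving much more room for the long tail.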
3. Sampling — Choosing the Next Token
Sampling means randomly choosing the next token based on probabilities.
- “mat” has a 62% chance
- “banana” still has a 2% chance → creativity unlocked
- Unlike greedy decoding, this allows diverse outputs
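Sampling itself is one line once the distribution is ready. A sketch using NumPy’s random generator (the seed is an arbitrary choice for reproducibility):

```python
import numpy as np

tokens = ["mat", "couch", "floor", "bed", "banana"]
logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()

rng = np.random.default_rng(42)

# Draw the next token in proportion to its probability:
# "mat" wins most of the time, but "banana" still shows up occasionally
next_token = rng.choice(tokens, p=probs)
print(next_token)
```

Greedy decoding would instead take `np.argmax(probs)` every time and always answer “mat”.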
🎯 Adding Control: Top-k and Top-p Sampling
What is Top-k?
Top-k keeps only the k most probable tokens, discards the rest, and renormalizes.
Top-k = 3 Example:
Keep: ["mat", "couch", "floor"]
Discard: ["bed", "banana"]
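A sketch of top-k filtering over the same example distribution (the helper name `top_k_filter` is mine, not a library API):

```python
import numpy as np

tokens = ["mat", "couch", "floor", "bed", "banana"]
logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()

def top_k_filter(probs, k):
    """Keep the k most probable tokens, discard the rest, renormalize."""
    top = np.argsort(probs)[::-1][:k]   # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[top] = probs[top]
    return filtered / filtered.sum()    # rescale so the survivors sum to 1

probs_k3 = top_k_filter(probs, k=3)
# "bed" and "banana" now have zero probability;
# "mat", "couch", "floor" are rescaled upward to sum to 1
```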
What is Top-p (Nucleus Sampling)?
Top-p includes the smallest set of tokens whose cumulative probability ≥ p (e.g., 0.9).
Top-p = 0.9 Example:
"mat" -> 0.62
"couch" -> 0.23
"floor" -> 0.08 ✅ (total = 0.93)
Only these are kept for sampling.
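Top-p can be sketched the same way; note how the cutoff adapts to the shape of the distribution rather than being a fixed count (again, the helper name is mine):

```python
import numpy as np

tokens = ["mat", "couch", "floor", "bed", "banana"]
logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]              # tokens sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    keep_n = np.searchsorted(cumulative, p) + 1  # number of tokens needed to reach p
    filtered = np.zeros_like(probs)
    filtered[order[:keep_n]] = probs[order[:keep_n]]
    return filtered / filtered.sum()

probs_p = top_p_filter(probs, p=0.9)
# With p = 0.9, "mat" + "couch" + "floor" (≈ 0.93 together) form the nucleus

kept = [t for t, q in zip(tokens, probs_p) if q > 0]
print(kept)
```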
How Temperature, Top-k, and Top-p Interact
They are complementary:
- Temperature reshapes the probability distribution
- Top-k and Top-p filter the candidates
- Their effects compound
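Putting the pieces together, one decoding step might look like the sketch below (`decode_step` is a hypothetical helper; real libraries implement the same ideas with more edge-case handling):

```python
import numpy as np

def decode_step(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Temperature-scale, softmax, filter with top-k/top-p, then sample one token index."""
    rng = rng if rng is not None else np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()                              # softmax of the scaled logits
    if top_k is not None:
        probs[np.argsort(probs)[::-1][top_k:]] = 0.0  # drop everything outside the top k
        probs /= probs.sum()
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        keep_n = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        probs[order[keep_n:]] = 0.0                   # drop everything outside the nucleus
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([3.0, 2.0, 1.0, 0.5, -1.0])
idx = decode_step(logits, temperature=0.8, top_k=3, top_p=0.9, rng=np.random.default_rng(0))
```

The order here follows the text: temperature reshapes the distribution first, then top-k and top-p filter the candidates before the final draw.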
⚙️ Interaction Table:
| Temp | Top-k | Top-p | Behavior |
|---|---|---|---|
| Low | Small | 0.8 | Deterministic, safe |
| Medium | Mid | 0.9 | Balanced creativity |
| High | Large | 0.95 | Diverse, possibly risky |
🧪 Practical Tips
- Use low T + top-p = 0.8 for reliable, fact-based outputs
- Use high T + top-p = 0.95 for poetry, storytelling, idea generation
- Use greedy decoding (T → 0, i.e., always pick the highest-probability token) when accuracy matters more than diversity
🧩 Final Takeaway
Decoding isn’t just a technical step — it’s a creative control lever.
With temperature, top-k, and top-p, you shape how a model speaks:
🎯 Focused and precise?
🎭 Wild and imaginative?
🧠 Balanced and human-like?
Understanding decoding gives you the power to make LLMs work the way you want.