Inference & Generation
How the trained model generates new text
After training, we can use the model to generate new text! This is called inference or generation.
The Generation Code
Here's the inference code from microgpt:
```python
temperature = 0.5
print("\n--- generation ---")
for sample_idx in range(5):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    print(f"sample {sample_idx}: ", end="")
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        print(itos[token_id], end="")
    print()
```

Let's break this down!
Starting Generation
```python
keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
token_id = BOS
```

We start with:
- Empty keys and values (no context yet)
- The `<BOS>` token (beginning of sequence)
The Generation Loop
```python
for pos_id in range(block_size):
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax([l / temperature for l in logits])
    token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
    if token_id == BOS:
        break
    print(itos[token_id], end="")
```

For each position:
- Forward pass: Get logits for next token
- Apply temperature: Divide each logit by the temperature
- Softmax: Convert to probabilities
- Sample: Pick next token randomly
- Print: Show the character
- Repeat: Continue until BOS or max length
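The steps above can be run end-to-end with a stub standing in for the trained model. Everything here (`fake_gpt`, the tiny vocabulary, the logit table) is made up for illustration; a real run would use microgpt's trained weights and KV cache:

```python
import math
import random

# Toy setup standing in for the trained model (illustrative names only).
vocab = ["<BOS>", "a", "m", "e"]   # token id 0 is <BOS>
BOS, vocab_size, block_size = 0, 4, 8
temperature = 0.5

def softmax(logits):
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fake_gpt(token_id, pos_id):
    # Stub logits: loosely favors <BOS> -> 'e' -> 'm' -> 'a' -> <BOS>.
    table = {0: [0.1, 0.1, 0.1, 3.0],   # after <BOS>, prefer 'e'
             3: [0.1, 0.2, 3.0, 0.1],   # after 'e',   prefer 'm'
             2: [0.5, 2.0, 1.5, 0.1],   # after 'm',   prefer 'a'
             1: [3.0, 0.1, 0.1, 0.1]}   # after 'a',   prefer <BOS>
    return table[token_id]

random.seed(0)
token_id, out = BOS, []
for pos_id in range(block_size):
    logits = fake_gpt(token_id, pos_id)
    probs = softmax([l / temperature for l in logits])
    token_id = random.choices(range(vocab_size), weights=probs)[0]
    if token_id == BOS:          # model "ended" the name
        break
    out.append(vocab[token_id])
print("".join(out))
```

The structure of the loop is identical to microgpt's; only the model call differs.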
Temperature
The temperature controls randomness:
```python
probs = softmax([l / temperature for l in logits])
```

| Temperature | Effect | Example Output |
|---|---|---|
| 0.1 | Very deterministic | "emma" (always picks best) |
| 0.5 | Balanced | "emma", "emily", "emma" |
| 1.0 | More random | "emxa", "emma", "emua" |
| 2.0 | Very random | "xzqw", "emmm", "avva" |
Low Temperature
Dividing by a small temperature (like 0.1) stretches the gaps between logits, so the largest one dominates:

```
logits:   [2.0, 1.0, 0.1]
temp 1.0: softmax([2.0, 1.0, 0.1]) = [0.66, 0.24, 0.10]
temp 0.1: softmax([20, 10, 1])     ≈ [0.99995, 0.00005, 0.00000]
```

The model becomes very confident (nearly greedy).
High Temperature
Dividing by a large temperature squashes the logits toward each other:

```
temp 2.0: softmax([1.0, 0.5, 0.05]) = [0.50, 0.30, 0.19]
```

The model becomes more random.
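These numbers are easy to verify; this standalone `softmax` helper is a sketch that mirrors what microgpt computes:

```python
import math

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
for temp in (1.0, 0.5, 0.1, 2.0):
    probs = softmax([l / temp for l in logits])
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures push the distribution toward one-hot; higher temperatures push it toward uniform.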
Sampling
```python
token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
```

We don't just pick the most likely token (that would be greedy decoding). Instead, we sample from the probability distribution.
This gives us variety!
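Weighted sampling can be checked empirically: draw many samples and the observed frequencies approach the weights. The probabilities below are the example values used in this section:

```python
import random
from collections import Counter

random.seed(42)
probs = [0.65, 0.24, 0.11]

# Draw many samples and count how often each index comes up.
counts = Counter(random.choices(range(3), weights=probs, k=10_000))
for idx in range(3):
    print(idx, counts[idx] / 10_000)
```

Each printed frequency lands close to the corresponding weight.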
Probabilities: [0.65, 0.24, 0.11]
- 65% of the time: pick index 0
- 24% of the time: pick index 1
- 11% of the time: pick index 2

Stopping Conditions
```python
if token_id == BOS:
    break
```

We stop when:
- The model predicts `<BOS>` (end of sequence)
- Or we reach `block_size` positions
What Gets Generated
When you run `python microgpt.py`, after training you'll see something like:
```
--- generation ---
sample 0: emma
sample 1: ava
sample 2: olivi
sample 3: ela
sample 4: mia
```

These are new names the model invented!
How Generation Works
Step 1: Start with `<BOS>`
`<BOS>` → model predicts 'e' (high probability)
Step 2: Input is now `<BOS>` e
'e' → model predicts 'm' (high probability)
Step 3: Input is now `<BOS>` e m
'm' → model predicts 'm' (high probability)
Step 4: Input is now `<BOS>` e m m
'm' → model predicts 'a' (high probability)
Step 5: Input is now `<BOS>` e m m a
'a' → model predicts `<BOS>` (end!)

The model generates left-to-right, using its own previous predictions as context!
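The walkthrough above can be mimicked with a toy prefix-to-next-character table and greedy picks. The `predict` table is illustrative only; a real model produces probabilities from its weights, not a fixed lookup:

```python
# Toy "model": maps the full prefix to its most likely next character,
# echoing the emma walkthrough above (illustrative only).
predict = {
    "": "e",
    "e": "m",
    "em": "m",
    "emm": "a",
    "emma": "<BOS>",
}

prefix = ""
while True:
    nxt = predict[prefix]     # predict from everything generated so far
    if nxt == "<BOS>":        # the model signals the end of the name
        break
    prefix += nxt             # feed the prediction back in as context
print(prefix)  # emma
```

Note how each prediction extends the context used for the next one: that feedback is the autoregressive part.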
Training vs Inference
| Aspect | Training | Inference |
|---|---|---|
| Input | Real data | Model's own output |
| Keys/Values | Built from the real sequence | Accumulated from generated tokens |
| Temperature | Not used | Can vary |
| Purpose | Learn from examples | Generate new text |
Summary
Inference generates new text:
- Start with the `<BOS>` token
- Predict next-token probabilities
- Apply temperature for randomness
- Sample from the distribution
- Repeat until `<BOS>` or max length
This is how the trained model creates new names, sentences, or any text!