Training
The Training Loop
How the model learns from data
The training loop is where the model actually learns. It repeatedly:
- Takes examples from the data
- Makes predictions
- Computes how wrong it was
- Updates weights to be less wrong next time
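The same four steps apply to even the simplest model. Here's a hypothetical one-parameter example (plain gradient descent on a squared error, not microgpt's actual model) that learns a weight w so that w * x matches 2 * x:

```python
# Toy illustration of the training loop (hypothetical 1-parameter model):
# learn w so that w * x approximates 2 * x.
w = 0.0
lr = 0.1
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]

for step in range(100):
    x, y = data[step % len(data)]  # take an example from the data
    pred = w * x                   # make a prediction
    loss = (pred - y) ** 2         # compute how wrong it was
    grad = 2 * (pred - y) * x      # gradient of loss w.r.t. w
    w -= lr * grad                 # update the weight to be less wrong

print(f"learned w = {w:.3f}")  # approaches 2.0
```

microgpt does exactly this, just with 4125 weights instead of one, and with an autograd engine computing the gradients.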
Visual: The Training Pipeline
```
┌─────────────────────────────────────────────────────────────────┐
│                          TRAINING LOOP                          │
│                                                                 │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │  Input   │───►│ Forward  │───►│   Loss   │───►│ Backward │   │
│  │  "emma"  │    │   Pass   │    │ Compute  │    │   Pass   │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘   │
│                                                       │         │
│                                                       ▼         │
│                                                  ┌──────────┐   │
│                                                  │   Adam   │   │
│                                                  │  Update  │   │
│                                                  └──────────┘   │
│                                                       │         │
│                                                       ▼         │
│                                                 ┌───────────┐   │
│                                                 │New Weights│   │
│                                                 └───────────┘   │
└─────────────────────────────────────────────────────────────────┘
```
The Training Code
Here's the training loop from microgpt:
```python
for step in range(args.num_steps):
    # 1. Get a training example
    doc = docs[step % len(docs)]
    tokens = [BOS] + [stoi[ch] for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # 2. Forward pass for each position
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)

    # 3. Compute average loss
    loss = (1 / n) * sum(losses)

    # 4. Backward pass
    loss.backward()

    # 5. Update weights with Adam
    lr_t = learning_rate * (1 - step / args.num_steps)
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

    # 6. Print progress
    print(f"step {step+1} / {args.num_steps} | loss {loss.data:.4f}")
```

Let's break this down step by step!
Step 1: Get Training Data
```python
doc = docs[step % len(docs)]
tokens = [BOS] + [stoi[ch] for ch in doc] + [BOS]
n = min(block_size, len(tokens) - 1)
```

We pick one name from the dataset:
- `docs[step % len(docs)]` cycles through all names
- Wrap with BOS tokens: `<BOS> emma <BOS>`
- `n` is how many prediction steps we'll make
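As a concrete sketch (assuming a hypothetical mapping where BOS is token 0 and 'a' through 'z' map to 1 through 26 — microgpt's actual `stoi` may differ):

```python
# Hypothetical tokenizer setup: BOS token id 0, then 'a'..'z' as 1..26.
BOS = 0
stoi = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

doc = "emma"
tokens = [BOS] + [stoi[ch] for ch in doc] + [BOS]
print(tokens)  # [0, 5, 13, 13, 1, 0]

block_size = 16
n = min(block_size, len(tokens) - 1)  # number of (input, target) pairs
print(n)  # 5
```

With 6 tokens we get 5 prediction steps: each token predicts the one after it, and the last token has no target.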
Step 2: Forward Pass
```python
keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
losses = []
for pos_id in range(n):
    token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax(logits)
    loss_t = -probs[target_id].log()
    losses.append(loss_t)
```

For each position in the sequence:
- Input: current token
- Target: next token
- Compute loss (how wrong the prediction was)
- Store loss for averaging
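To make the per-position loss concrete, here's a standalone sketch with made-up logits over a 4-token vocabulary (plain floats rather than microgpt's autograd values):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a 4-token vocabulary (illustrative numbers).
logits = [2.0, 0.5, 0.1, -1.0]
target_id = 0

probs = softmax(logits)
loss_t = -math.log(probs[target_id])  # small when the target is likely
print(f"p(target) = {probs[target_id]:.3f}, loss = {loss_t:.3f}")
```

If the model put probability 1.0 on the target, the loss would be 0; the less likely it thinks the target is, the larger the loss grows.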
Step 3: Compute Loss
```python
loss = (1 / n) * sum(losses)
```

Average the losses across all positions in the sequence.
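For example, with illustrative per-position losses:

```python
losses = [1.2, 0.8, 1.0, 0.6]     # made-up per-position losses
n = len(losses)
loss = (1 / n) * sum(losses)      # average over all positions
print(loss)  # 0.9
```

Averaging keeps the loss comparable across names of different lengths.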
Step 4: Backward Pass
```python
loss.backward()
```

This is where the magic happens! The autograd engine:
- Traces back through all operations
- Computes the gradient for each parameter
- Tells us: "how should we change each weight to reduce loss?"
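The engine behind `loss.backward()` can be sketched in miniature. Here's a toy scalar autograd in the spirit of micrograd (a sketch, not microgpt's actual engine), supporting just multiply and log:

```python
import math

class Value:
    """A scalar that remembers how it was computed, so gradients
    can flow backward through the computation graph."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def log(self):
        out = Value(math.log(self.data), (self,))
        def _backward():
            # d(log x)/dx = 1/x
            self.grad += (1.0 / self.data) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order: children before parents
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = (a * b).log()   # loss = log(a * b)
loss.backward()
print(a.grad, b.grad)  # d/da log(ab) = 1/a = 0.5 ; d/db = 1/b ≈ 0.333
```

Every parameter in microgpt gets its `grad` filled in the same way: trace back through the graph, applying the chain rule one operation at a time.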
Step 5: Update Weights
```python
lr_t = learning_rate * (1 - step / args.num_steps)
for i, p in enumerate(params):
    # Adam optimizer update (m_hat and v_hat come from the running
    # gradient moments; see the full loop above)
    p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    p.grad = 0
```

Use the gradients to update each parameter. We'll cover the Adam optimizer in detail in the next section.
The learning rate decreases over time (linearly to 0).
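The decay schedule itself is easy to verify in isolation (the base `learning_rate` of 0.001 here is an assumption for illustration):

```python
learning_rate = 1e-3   # assumed base rate for illustration
num_steps = 1000

def lr_at(step):
    # linear decay from learning_rate down to 0 over training
    return learning_rate * (1 - step / num_steps)

print(lr_at(0))    # 0.001
print(lr_at(500))  # 0.0005
print(lr_at(999))  # nearly 0 by the final step
```

Large steps early (when the model is very wrong) and tiny steps late (to settle into a good solution).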
Step 6: Print Progress
```python
print(f"step {step+1} / {args.num_steps} | loss {loss.data:.4f}")
```

This shows how training is going.
What You'll See
When you run `python microgpt.py`:
```
vocab size: 26, num docs: 32033
num params: 4125
step 1 / 1000 | loss 3.2589
step 2 / 1000 | loss 3.1892
step 3 / 1000 | loss 3.1421
...
step 1000 / 1000 | loss 1.4234
```

The loss should go down over time!
Training Visualization
```
Step 1:    loss ≈ 3.2  (random guessing)
     ↓
Step 250:  loss ≈ 2.1  (starting to learn patterns)
     ↓
Step 500:  loss ≈ 1.5  (pretty good)
     ↓
Step 1000: loss ≈ 1.4  (trained!)
```

The model starts knowing nothing and gradually learns to predict names!
Summary
The training loop:
- Pick a training example
- Forward pass to get predictions
- Compute loss (how wrong we were)
- Backward pass to compute gradients
- Update weights to reduce loss
- Repeat millions of times!
This is the core of how all neural networks learn!