Training
The Training Loop
How the model learns from data
The training loop is where the model actually learns. It repeatedly:
- Takes examples from the data
- Makes predictions
- Computes how wrong it was
- Updates weights to be less wrong next time
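The same four steps apply to even the simplest model. Here's a hypothetical one-parameter example (plain gradient descent on a squared error, not microgpt's actual model) that learns a weight w so that w * x matches 2 * x:

```python
# Toy illustration of the training loop (hypothetical 1-parameter model):
# learn w so that w * x approximates 2 * x.
w = 0.0
lr = 0.1
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]

for step in range(100):
    x, y = data[step % len(data)]  # take an example from the data
    pred = w * x                   # make a prediction
    loss = (pred - y) ** 2         # compute how wrong it was
    grad = 2 * (pred - y) * x      # gradient of loss w.r.t. w
    w -= lr * grad                 # update the weight to be less wrong

print(f"learned w = {w:.3f}")  # approaches 2.0
```

microgpt does exactly this, just with 4125 weights instead of one, and with an autograd engine computing the gradients.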
Visual: The Training Pipeline
```
┌─────────────────────────────────────────────────────────────────┐
│                          TRAINING LOOP                          │
│                                                                 │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │  Input   │───►│ Forward  │───►│   Loss   │───►│ Backward │   │
│  │  "emma"  │    │   Pass   │    │ Compute  │    │   Pass   │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘   │
│                                                       │         │
│                                                       ▼         │
│                                                  ┌──────────┐   │
│                                                  │   Adam   │   │
│                                                  │  Update  │   │
│                                                  └──────────┘   │
│                                                       │         │
│                                                       ▼         │
│                                                 ┌───────────┐   │
│                                                 │New Weights│   │
│                                                 └───────────┘   │
└─────────────────────────────────────────────────────────────────┘
```
The Training Code
Here's the training loop from microgpt:
```python
for step in range(args.num_steps):
    # 1. Get a training example
    doc = docs[step % len(docs)]
    tokens = [BOS] + [stoi[ch] for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # 2. Forward pass for each position
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)

    # 3. Compute average loss
    loss = (1 / n) * sum(losses)

    # 4. Backward pass
    loss.backward()

    # 5. Update weights with Adam
    lr_t = learning_rate * (1 - step / args.num_steps)
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

    # 6. Print progress
    print(f"step {step+1} / {args.num_steps} | loss {loss.data:.4f}")
```

Let's break this down step by step!
Step 1: Get Training Data
```python
doc = docs[step % len(docs)]
tokens = [BOS] + [stoi[ch] for ch in doc] + [BOS]
n = min(block_size, len(tokens) - 1)
```

We pick one name from the dataset:
- `docs[step % len(docs)]` cycles through all names
- Wrap with BOS tokens: `<BOS> emma <BOS>`
- `n` is how many prediction steps we'll make
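As a concrete sketch (assuming a hypothetical mapping where BOS is token 0 and 'a' through 'z' map to 1 through 26 — microgpt's actual `stoi` may differ):

```python
# Hypothetical tokenizer setup: BOS token id 0, then 'a'..'z' as 1..26.
BOS = 0
stoi = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

doc = "emma"
tokens = [BOS] + [stoi[ch] for ch in doc] + [BOS]
print(tokens)  # [0, 5, 13, 13, 1, 0]

block_size = 16
n = min(block_size, len(tokens) - 1)  # number of (input, target) pairs
print(n)  # 5
```

With 6 tokens we get 5 prediction steps: each token predicts the one after it, and the last token has no target.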
Step 2: Forward Pass
```python
keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
losses = []
for pos_id in range(n):
    token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax(logits)
    loss_t = -probs[target_id].log()
    losses.append(loss_t)
```

For each position in the sequence:
- Input: current token
- Target: next token
- Compute loss (how wrong the prediction was)
- Store loss for averaging
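To make the per-position loss concrete, here's a standalone sketch with made-up logits over a 4-token vocabulary (plain floats rather than microgpt's autograd values):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a 4-token vocabulary (illustrative numbers).
logits = [2.0, 0.5, 0.1, -1.0]
target_id = 0

probs = softmax(logits)
loss_t = -math.log(probs[target_id])  # small when the target is likely
print(f"p(target) = {probs[target_id]:.3f}, loss = {loss_t:.3f}")
```

If the model put probability 1.0 on the target, the loss would be 0; the less likely it thinks the target is, the larger the loss grows.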
Step 3: Compute Loss
```python
loss = (1 / n) * sum(losses)
```

Average the losses across all positions in the sequence.
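For example, with illustrative per-position losses:

```python
losses = [1.2, 0.8, 1.0, 0.6]     # made-up per-position losses
n = len(losses)
loss = (1 / n) * sum(losses)      # average over all positions
print(loss)  # 0.9
```

Averaging keeps the loss comparable across names of different lengths.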
Step 4: Backward Pass
```python
loss.backward()
```

This is where the magic happens! The autograd engine:
- Traces back through all operations
- Computes the gradient for each parameter
- Tells us: "how should we change each weight to reduce loss?"
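The engine behind `loss.backward()` can be sketched in miniature. Here's a toy scalar autograd in the spirit of micrograd (a sketch, not microgpt's actual engine), supporting just multiply and log:

```python
import math

class Value:
    """A scalar that remembers how it was computed, so gradients
    can flow backward through the computation graph."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def log(self):
        out = Value(math.log(self.data), (self,))
        def _backward():
            # d(log x)/dx = 1/x
            self.grad += (1.0 / self.data) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order: children before parents
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = (a * b).log()   # loss = log(a * b)
loss.backward()
print(a.grad, b.grad)  # d/da log(ab) = 1/a = 0.5 ; d/db = 1/b ≈ 0.333
```

Every parameter in microgpt gets its `grad` filled in the same way: trace back through the graph, applying the chain rule one operation at a time.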
Step 5: Update Weights
```python
lr_t = learning_rate * (1 - step / args.num_steps)
for i, p in enumerate(params):
    # Adam optimizer update (m_hat and v_hat come from the running
    # gradient moments; see the full loop above)
    p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
    p.grad = 0
```

Use the gradients to update each parameter. We'll cover the Adam optimizer in detail in the next section.
The learning rate decreases over time (linearly to 0).
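The decay schedule itself is easy to verify in isolation (the base `learning_rate` of 0.001 here is an assumption for illustration):

```python
learning_rate = 1e-3   # assumed base rate for illustration
num_steps = 1000

def lr_at(step):
    # linear decay from learning_rate down to 0 over training
    return learning_rate * (1 - step / num_steps)

print(lr_at(0))    # 0.001
print(lr_at(500))  # 0.0005
print(lr_at(999))  # nearly 0 by the final step
```

Large steps early (when the model is very wrong) and tiny steps late (to settle into a good solution).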
Step 6: Print Progress
```python
print(f"step {step+1} / {args.num_steps} | loss {loss.data:.4f}")
```

This shows how training is going.
What You'll See
When you run `python microgpt.py`:
```
vocab size: 26, num docs: 32033
num params: 4125
step 1 / 1000 | loss 3.2589
step 2 / 1000 | loss 3.1892
step 3 / 1000 | loss 3.1421
...
step 1000 / 1000 | loss 1.4234
```

The loss should go down over time!
Training Visualization
```
Step 1:    loss ≈ 3.2  (random guessing)
     ↓
Step 250:  loss ≈ 2.1  (starting to learn patterns)
     ↓
Step 500:  loss ≈ 1.5  (pretty good)
     ↓
Step 1000: loss ≈ 1.4  (trained!)
```

The model starts knowing nothing and gradually learns to predict names!
Summary
The training loop:
- Pick a training example
- Forward pass to get predictions
- Compute loss (how wrong we were)
- Backward pass to compute gradients
- Update weights to reduce loss
- Repeat millions of times!
This is the core of how all neural networks learn!