The MLP Block
The feed-forward network in transformers
After attention, each position goes through a feed-forward network (also called the MLP, for Multi-Layer Perceptron). This processes the information gathered by attention.
What Does the MLP Do?
The MLP applies a simple transformation to each position independently:
# Expand dimension (4x)
x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
# Apply square ReLU activation
x = [xi.relu() ** 2 for xi in x]
# Contract back to original dimension
x = linear(x, state_dict[f'layer{li}.mlp_fc2'])

Step by Step
Step 1: Expand
Input (n_embd=16): [x0, x1, ..., x15]
↓ linear (4× expansion)
Output: [x0, ..., x63] (64 = 4 × 16)

The vector grows 4× larger. This gives the model more "capacity" to process information.
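The expansion step above can be sketched with a plain matrix-vector multiply. This is a minimal illustration, not the actual microgpt code: it uses a shrunken `n_embd=4` and dummy constant weights just to show the shape change.

```python
# Minimal sketch of the expansion linear layer (hypothetical toy sizes:
# n_embd=4 instead of 16; the 4x fan-out ratio matches the text).
n_embd = 4
n_hidden = 4 * n_embd

def linear(x, w):
    # w is a list of n_out rows, each row of length n_in
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

x = [1.0, 2.0, -1.0, 0.5]
w_fc1 = [[0.1] * n_embd for _ in range(n_hidden)]  # placeholder weights

h = linear(x, w_fc1)
print(len(h))  # 16, i.e. 4 * n_embd: the vector has grown 4x
```

Every output element here is a weighted sum over the *whole* input vector, which is what lets the expanded representation mix all the input features.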
Step 2: Activate (Square ReLU)
x = [xi.relu() ** 2 for xi in x]

This applies square ReLU:
- If x > 0: output = x²
- If x ≤ 0: output = 0
Input: [2.0, -1.0, 0.5, -0.5]
Output: [4.0, 0.0, 0.25, 0.0]

This introduces nonlinearity, allowing the network to learn complex patterns.
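A quick sketch verifying the example values above, using a standalone `square_relu` helper (a stand-in for the list comprehension in the original code):

```python
# Square ReLU: zero out negatives, square positives.
def square_relu(x):
    return [max(0.0, xi) ** 2 for xi in x]

print(square_relu([2.0, -1.0, 0.5, -0.5]))  # [4.0, 0.0, 0.25, 0.0]
```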
Step 3: Contract
Input (64): [x0, ..., x63]
↓ linear (4× contraction)
Output: [x0, ..., x15] (back to 16)

The vector shrinks back to its original size.
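Putting the three steps together, here is an end-to-end sketch of the MLP block at the dimensions used in the text (16 → 64 → 16). The random weights are stand-ins, not trained values:

```python
# End-to-end MLP sketch: expand -> square ReLU -> contract.
import random

random.seed(0)
n_embd, n_hidden = 16, 64  # dimensions from the text

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w_fc1 = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(n_hidden)]
w_fc2 = [[random.gauss(0, 0.02) for _ in range(n_hidden)] for _ in range(n_embd)]

x = [random.gauss(0, 1) for _ in range(n_embd)]
h = linear(x, w_fc1)                  # Step 1: expand, 16 -> 64
h = [max(0.0, hi) ** 2 for hi in h]   # Step 2: square ReLU
y = linear(h, w_fc2)                  # Step 3: contract, 64 -> 16

print(len(h), len(y))  # 64 16
```

Matching input and output sizes is what makes the residual connection (shown later) possible.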
Why Expand and Contract?
This pattern (expand → nonlinear → contract) is common in neural networks:
| Benefit | Explanation |
|---|---|
| More parameters | The expanded hidden layer adds trainable weights |
| Non-linearity | The activation function adds complexity |
| Information processing | Expand gives "thinking space", contract summarizes |
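The "more parameters" row can be made concrete by counting the weights in the two linears. Assuming bias-free layers (as in minimal microgpt-style implementations; an assumption here):

```python
# Parameter count for the two MLP linears at n_embd=16, 4x expansion
# (bias-free layers assumed).
n_embd = 16
n_hidden = 4 * n_embd

fc1 = n_embd * n_hidden   # 16 * 64 = 1024 weights to expand
fc2 = n_hidden * n_embd   # 64 * 16 = 1024 weights to contract
print(fc1 + fc2)  # 2048
```

In large models the MLP's two linears typically dominate the parameter count for exactly this reason.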
Square ReLU
microgpt uses a specific activation: Square ReLU (ReLU squared):
def relu(self):
return max(0, self) # Standard ReLU
# Then squared:
output = relu(x) ** 2

This is different from both standard ReLU and the GELU used in GPT-2. It's simpler!
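To see how square ReLU differs from standard ReLU, here is a small side-by-side sketch. For negatives both are zero; for positives, square ReLU damps values below 1 and amplifies values above 1:

```python
# Standard ReLU vs. square ReLU on a few sample inputs.
def relu(x):
    return max(0.0, x)

for v in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(v, relu(v), relu(v) ** 2)
# e.g. 0.5 -> relu 0.5, squared 0.25; 2.0 -> relu 2.0, squared 4.0
```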
The MLP in Context
Here's how the MLP fits into a transformer block:
# 1. Attention block
x_residual = x # Save for residual
x = attention_block(x) # Process with attention
x = x + x_residual # Add residual
# 2. MLP block
x_residual = x # Save for residual
x = rmsnorm(x) # Normalize
x = mlp_block(x) # Process with MLP
x = x + x_residual # Add residual

Each block runs Attention followed by the MLP, each with its own residual connection.
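The residual wiring above can be sketched as ordinary Python. The `attention_block` and `mlp_block` bodies below are identity-like placeholders (hypothetical, just to show the dataflow); `rmsnorm` is a straightforward root-mean-square normalization:

```python
# Sketch of one transformer block's residual wiring. The sublayer bodies
# are toy stand-ins; only the add-back-the-residual structure matters here.
def attention_block(x):
    return [0.1 * xi for xi in x]  # placeholder for real attention

def mlp_block(x):
    return [0.1 * xi for xi in x]  # placeholder for the real MLP

def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / (ms + 1e-5) ** 0.5 for xi in x]

def block(x):
    x = [a + b for a, b in zip(x, attention_block(x))]          # attention + residual
    x = [a + b for a, b in zip(x, mlp_block(rmsnorm(x)))]       # MLP + residual
    return x

y = block([1.0, -2.0, 0.5, 3.0])
print(len(y))  # 4: residual adds require input and output sizes to match
```

Because each sublayer's output is *added* to its input rather than replacing it, gradients can flow straight through the block, which is what makes deep stacks of these blocks trainable.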
Visual Representation
Input x
│
▼
┌────────────────────────────────────┐
│ 1. Linear (expand) │
│ 16 → 64 dimensions │
│ │
│ 2. Square ReLU │
│ nonlinearity │
│ │
│ 3. Linear (contract) │
│ 64 → 16 dimensions │
└────────────────────────────────────┘
│
▼
Output x (same size as input)

Why Per-Position?
The MLP processes each position independently:
- Attention shares information between positions
- MLP processes each position's information locally
This is a key distinction:
- Attention: Information from multiple positions
- MLP: Information from a single position
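This independence can be demonstrated directly: change one position's vector and every other position's MLP output stays identical. The `mlp` below is a toy stand-in (just the square-ReLU step), since independence depends only on applying it per position:

```python
# Per-position independence: each position's MLP output depends only on
# that position's own vector (toy stand-in for the full MLP).
def mlp(x):
    return [max(0.0, xi) ** 2 for xi in x]

seq_a = [[1.0, -1.0], [2.0, 0.5]]
seq_b = [[9.0, 9.0], [2.0, 0.5]]   # position 0 changed, position 1 identical

out_a = [mlp(pos) for pos in seq_a]
out_b = [mlp(pos) for pos in seq_b]
print(out_a[1] == out_b[1])  # True: position 1 is unaffected by position 0
```

Contrast this with attention, where changing position 0 would change every later position's output.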
Summary
The MLP block in transformers:
- Expands dimension 4× for more capacity
- Applies square ReLU for non-linearity
- Contracts back to original dimension
- Processes each position independently
This gives the model extra capacity to process the information gathered by attention!