Understanding Gradients
How the backward pass learns from mistakes
In the last section, we saw how operations have gradients. Now let's understand what gradients actually are and why they're useful.
The Intuition: Learning from Mistakes
Imagine you're trying to hit a target:
```
Target: 0
Your prediction: 10
Difference (error): 10
```

To get closer to the target, you should adjust your prediction. But how much should you adjust it?
That's what the gradient tells you!
What is a Gradient?
A gradient is simply the answer to this question:
"If I change this input slightly, how much does the output change?"
```
Gradient = "How much does the output change" / "How much did I change the input"
```

In math notation: ∂output / ∂input
Concrete Example
Say we have:
```
y = 2 * x
```

If x = 3, then y = 6.
Now, if we change x by a tiny amount (say, +0.001):
- New x = 3.001
- New y = 2 * 3.001 = 6.002
So:
- Input changed by: 0.001
- Output changed by: 0.002
- Gradient = 0.002 / 0.001 = 2
The gradient is 2!
This means: "if I increase x, y increases twice as fast."
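This "nudge and measure" procedure is easy to check numerically. A minimal sketch, using the example function above (the step size `h` is just the tiny nudge from the text):

```python
def f(x):
    return 2 * x  # the example function y = 2 * x

x = 3.0
h = 0.001  # a tiny change to the input

# gradient = change in output / change in input
grad = (f(x + h) - f(x)) / h
print(grad)  # ≈ 2.0 (up to floating-point rounding)
```

Shrinking `h` makes the estimate approach the true gradient; this finite-difference trick is also a handy way to sanity-check a hand-derived gradient.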
Positive vs Negative Gradient
| Gradient | Meaning | Action |
|---|---|---|
| Positive | Increasing the weight increases the loss | Decrease the weight |
| Negative | Increasing the weight decreases the loss | Increase the weight |
| Zero | The weight doesn't affect the loss | Don't change it |
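The table can be verified numerically. A sketch using the hypothetical loss L(w) = w², whose gradient 2w flips sign with w:

```python
def loss(w):
    return w ** 2  # hypothetical loss, chosen just for illustration

def grad(w, h=1e-6):
    # finite-difference estimate of the gradient
    return (loss(w + h) - loss(w)) / h

print(grad(3.0) > 0)    # positive gradient: decrease w to reduce the loss
print(grad(-3.0) < 0)   # negative gradient: increase w to reduce the loss
print(abs(grad(0.0)))   # near zero at the minimum: nothing to change
```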
The Chain Rule
When we have multiple operations chained together:
```
a → [×2] → b → [+3] → c
```

The gradient of c with respect to a is:

```
∂c/∂a = ∂c/∂b × ∂b/∂a
```

This is called the chain rule. It's how we trace gradients through complex computations.
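For this chain, the two local gradients are ∂b/∂a = 2 and ∂c/∂b = 1, so the chain rule predicts ∂c/∂a = 1 × 2 = 2. A quick finite-difference sketch (the starting value of `a` is arbitrary):

```python
def forward(a):
    b = 2 * a   # ∂b/∂a = 2
    c = b + 3   # ∂c/∂b = 1
    return c

a, h = 5.0, 1e-6
dc_da = (forward(a + h) - forward(a)) / h
print(dc_da)  # ≈ 2, matching ∂c/∂b × ∂b/∂a
```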
The Backward Pass
The backward pass computes gradients from output to input:
```python
def backward(self):
    # Step 1: Build topological order
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # Step 2: Apply chain rule in reverse
    self.grad = 1
    for v in reversed(topo):
        v._backward()
```

Why Reverse Order?
We start from the output (where we know the gradient is 1) and work backward:
```
Output gradient = 1
        ↓
Backward through last operation
        ↓
Backward through second-to-last
        ↓
       ...
        ↓
Input gradient
```

This is because each step needs the gradient from the next step.
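The topological ordering can be seen on a tiny stand-in graph. `Node` here is a hypothetical stub with only the `_prev` field that `build_topo` relies on:

```python
class Node:
    def __init__(self, name, prev=()):
        self.name = name
        self._prev = prev  # nodes this one was computed from

# a → b → c  (c is the output)
a = Node("a")
b = Node("b", (a,))
c = Node("c", (b,))

topo, visited = [], set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)

build_topo(c)
order = [v.name for v in reversed(topo)]
print(order)  # ['c', 'b', 'a']: output first, inputs last
```

Each node is appended only after all of its children, so reversing the list guarantees a node's gradient is ready before it is propagated further back.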
Visual Example
Let's trace through a simple computation:
```python
a = Value(2.0, _op='a')
b = Value(3.0, _op='b')
c = a * b   # c = 6
d = c + a   # d = 8
d.backward()
```

The backward pass:
```
d.grad = 1

Step 1: c + a backward
  a.grad += 1 (from c + a)
  c.grad += 1 (from c + a)

Step 2: a * b backward
  a.grad += b.data * c.grad = 3.0 * 1 = 3.0
  b.grad += a.data * c.grad = 2.0 * 1 = 2.0

Final gradients:
  a.grad = 1 + 3 = 4
  b.grad = 2
  c.grad = 1
```

Gradient Descent
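The whole trace can be reproduced end to end with a minimal `Value` class. This is a sketch supporting only `+` and `*` (the `_op` label from the earlier snippet is omitted for brevity):

```python
class Value:
    def __init__(self, data, _prev=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_prev)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # ∂(a+b)/∂a = 1
            other.grad += out.grad  # ∂(a+b)/∂b = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # ∂(a*b)/∂a = b
            other.grad += self.data * out.grad  # ∂(a*b)/∂b = a
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a = Value(2.0)
b = Value(3.0)
c = a * b  # c = 6
d = c + a  # d = 8
d.backward()
print(a.grad, b.grad, c.grad)  # 4.0 2.0 1.0
```

Note that `a.grad` accumulates contributions from both paths (`a * b` and `c + a`), which is why the backward functions use `+=` rather than `=`.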
Once we have gradients, we use them to update weights:
```python
# The update rule
weight.data -= learning_rate * weight.grad
```

This is called gradient descent:
- We go "downhill" (reduce loss)
- The step size is the learning rate
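Repeating this update drives the loss downhill step by step. A sketch minimizing the hypothetical loss (w − 5)², whose gradient is 2(w − 5):

```python
w = 0.0
learning_rate = 0.1

for _ in range(100):
    grad = 2 * (w - 5)         # gradient of the loss (w - 5)**2
    w -= learning_rate * grad  # step downhill

print(w)  # converges toward 5, the minimum of the loss
```

Too large a learning rate can overshoot the minimum and diverge; too small a rate just takes more steps to get there.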
```
New weight = Old weight - (learning rate × gradient)
```

Why We Need Gradients
Without gradients, we'd have to:
- Guess random changes to weights
- See if it helps
- Repeat billions of times
With gradients, we know exactly:
- Which direction helps
- How much to change
That's the magic of neural network training!
Summary
A gradient tells you:
- Direction: Which way to change the weight
- Magnitude: How much to change it
The backward pass uses the chain rule to compute gradients through the entire computation graph.
Gradient descent uses these gradients to update weights and reduce loss.