Every time a neural network learns, it asks one question over and over: "If I nudge this parameter slightly, does the error go up or down — and by how much?" That question is answered by the derivative. Before we talk about gradients or optimizers, we need to understand derivatives from scratch.

Part 1 — Lines and Slopes
The Equation of a Line
The simplest relationship between two quantities is a straight line:

$y = mx + b$

Where:
- $x$ is the input
- $y$ is the output
- $m$ is the slope — how steeply the line rises or falls
- $b$ is the y-intercept — where the line crosses the vertical axis
Example: $y = 2x + 1$

| $x$ | $y$ |
|---|---|
| 0 | 1 |
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |

Every time $x$ increases by 1, $y$ increases by exactly 2. The slope $m = 2$ captures this constant rate.
Computing the Slope Between Two Points
Given any two points $(x_1, y_1)$ and $(x_2, y_2)$ on a line, the slope is:

$m = \dfrac{y_2 - y_1}{x_2 - x_1}$

This is the rise over run formula — how much $y$ changes (rise) per unit change in $x$ (run).
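Here is a tiny Python sketch of the rise-over-run formula; the two points come from the table above, and the helper name `slope` is just an illustrative choice.

```python
def slope(x1, y1, x2, y2):
    """Rise over run: how much y changes per unit change in x."""
    return (y2 - y1) / (x2 - x1)

# Two points from the y = 2x + 1 table above: (0, 1) and (3, 7)
print(slope(0, 1, 3, 7))  # 2.0, the constant rate of the line
```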
Why Does Slope Matter?
Slope tells you the rate of change. A slope of 2 means "for every 1 unit step in x, y changes by 2." A slope of −3 means y decreases by 3 for every step forward. A slope of 0 means y doesn't change at all — it's flat.
Part 2 — When Lines Become Curves
A line has a constant slope — it's the same everywhere. But most interesting functions in mathematics (and in machine learning) are curves whose steepness changes at every point.
Consider the parabola $f(x) = x^2$:

| $x$ | $f(x)$ |
|---|---|
| −3 | 9 |
| −1 | 1 |
| 0 | 0 |
| 1 | 1 |
| 3 | 9 |

Near $x = 0$ the curve is nearly flat. Near $x = 3$ it rises steeply. The slope is different at every point — which means the single slope formula between two distant points only gives us an average.
Average Rate of Change
For two points $x = a$ and $x = a + h$ on a curve $f$, the average rate of change over that interval is:

$\dfrac{f(a + h) - f(a)}{h}$

This is the slope of the secant line — the straight line connecting the two points on the curve.

Example on $f(x) = x^2$ between $x = 1$ and $x = 3$ (so $a = 1$, $h = 2$):

$\dfrac{f(3) - f(1)}{3 - 1} = \dfrac{9 - 1}{2} = 4$

That is the average steepness between $x = 1$ and $x = 3$, but it doesn't tell us what the slope is at a specific point.
Part 3 — The Limit: Zooming In to a Single Point
To find the slope at one exact point, we shrink the interval down toward zero. As $h$ gets smaller and smaller, the secant line rotates until it becomes the tangent line — touching the curve at exactly one point and matching its steepness there.
Formally, the instantaneous rate of change at $x = a$ is the limit:

$\lim_{h \to 0} \dfrac{f(a + h) - f(a)}{h}$
This is the core idea of a derivative.
Limits Intuitively
A limit asks: "What value does an expression approach as a variable gets closer and closer to some number — even if it never arrives?"
For $f(x) = x^2$, apply this to the difference quotient:

$\dfrac{f(x + h) - f(x)}{h} = \dfrac{(x + h)^2 - x^2}{h}$

Expand the numerator:

$\dfrac{x^2 + 2xh + h^2 - x^2}{h} = \dfrac{2xh + h^2}{h} = 2x + h$

As $h \to 0$:

$2x + h \to 2x$

The slope of $x^2$ at any point $x$ is exactly $2x$.
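You can watch this limit emerge numerically. A minimal sketch (the test point $x = 1.5$ is an arbitrary choice) evaluates the difference quotient of $f(x) = x^2$ for shrinking $h$:

```python
def f(x):
    return x**2

x = 1.5  # arbitrary point; the slope there should approach 2x = 3.0
for h in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    secant_slope = (f(x + h) - f(x)) / h  # slope of the secant over [x, x + h]
    print(f"h = {h:>7}   slope = {secant_slope:.6f}")
# As h shrinks, the printed slope approaches 3.0, i.e. 2x at x = 1.5
```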
Part 4 — The Derivative
Definition
The derivative of a function $f$ at point $x$, written $f'(x)$ or $\frac{df}{dx}$, is:

$f'(x) = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}$
It gives the instantaneous rate of change — the slope of the tangent line at every point.
Geometric Meaning
| Derivative Value | Meaning |
|---|---|
| $f'(x) > 0$ | Function is increasing at $x$ |
| $f'(x) < 0$ | Function is decreasing at $x$ |
| $f'(x) = 0$ | Function has a flat point (possible minimum, maximum, or saddle) |
| Large magnitude of $f'(x)$ | Function is changing rapidly |
| Small magnitude of $f'(x)$ | Function is changing slowly |
Part 5 — Differentiation Rules
Computing limits by hand every time would be exhausting. Mathematicians have derived shortcut rules that cover almost every function you'll encounter.
Power Rule
For $f(x) = x^n$:

$f'(x) = n \cdot x^{n-1}$

Examples:
| Function | Derivative |
|---|---|
| $x^2$ | $2x$ |
| $x^3$ | $3x^2$ |
| $x$ (i.e. $x^1$) | $1$ |
| $c$ (constant, $n = 0$) | $0$ |
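If you have SymPy installed, you can double-check the power rule symbolically — a quick sketch, not something the rest of the walkthrough depends on:

```python
import sympy as sp

x = sp.symbols('x')
for expr in [x**3, x**2, x, sp.Integer(7)]:
    # Symbolic differentiation of each expression with respect to x
    print(expr, "->", sp.diff(expr, x))   # 3*x**2, 2*x, 1, 0
```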
Constant Multiple Rule
If $g(x) = c \cdot f(x)$, then $g'(x) = c \cdot f'(x)$.
Sum Rule
If $h(x) = f(x) + g(x)$, differentiate term by term:

$h'(x) = f'(x) + g'(x)$
Chain Rule
For a composition of functions $f(g(x))$:

$\dfrac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$
Read as: "derivative of outer, evaluated at inner — times derivative of inner."
Example: $h(x) = (3x + 1)^2$.
Let $f(u) = u^2$ and $g(x) = 3x + 1$:

$h'(x) = f'(g(x)) \cdot g'(x) = 2(3x + 1) \cdot 3 = 6(3x + 1)$
The chain rule is everywhere in machine learning — backpropagation is essentially repeated application of it through layers of a neural network.
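Here is a minimal numerical check of the chain rule for the example above (the point $x = 0.7$ and the finite-difference step are arbitrary choices): the chain-rule answer matches a brute-force slope estimate.

```python
def g(x):                 # inner function
    return 3 * x + 1

def h(x):                 # composition f(g(x)) with f(u) = u**2
    return g(x) ** 2

def h_prime(x):
    return 2 * g(x) * 3   # f'(g(x)) * g'(x) = 2(3x + 1) * 3

x, eps = 0.7, 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)  # central finite difference
print(round(h_prime(x), 4), round(numeric, 4))   # both ≈ 18.6
```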
Common Derivatives Reference
| Function | Derivative |
|---|---|
| $e^x$ | $e^x$ |
| $\ln x$ | $1/x$ |
| $\sin x$ | $\cos x$ |
| $\cos x$ | $-\sin x$ |
| $\sigma(x) = \frac{1}{1 + e^{-x}}$ (sigmoid) | $\sigma(x)\,(1 - \sigma(x))$ |
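The sigmoid row is worth a quick numerical check, since it appears constantly in neural networks. A small sketch comparing the formula against a finite-difference estimate (the test point is arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 0.3, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))   # σ'(x) = σ(x)(1 − σ(x))
print(f"numeric ≈ {numeric:.8f}, analytic = {analytic:.8f}")  # they match
```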
(Interactive visualization: move x₀ along the curve, then watch h shrink toward 0 — the red secant line rotates into the green tangent, revealing the derivative as an instantaneous slope.)

Geometric Meaning of the Derivative
- $f'(x) > 0$ — increasing: the function rises as $x$ increases and the tangent line slopes upward. To find the minimum, gradient descent moves left.
- $f'(x) = 0$ — flat: the tangent line is horizontal. This is a critical point — possibly a minimum, maximum, or saddle point.
- $f'(x) < 0$ — decreasing: the function falls as $x$ increases and the tangent line slopes downward. To find the minimum, gradient descent moves right.
Part 6 — Derivatives in Practice
Finding Minima and Maxima
If $f'(x) = 0$, the function is momentarily flat — this is a critical point. There are three types:
- Local minimum: function dips down then rises → $f'$ changes from negative to positive
- Local maximum: function rises then dips → $f'$ changes from positive to negative
- Saddle point: function is flat but continues in the same general direction

Example: Find the minimum of $f(x) = x^2 - 4x + 5$.

$f'(x) = 2x - 4 = 0 \;\Rightarrow\; x = 2$

At $x = 2$: $f(2) = 4 - 8 + 5 = 1$, and $f'$ changes from negative to positive there — this is the minimum.
```python
def f(x):
    return x**2 - 4*x + 5

def f_prime(x):
    return 2*x - 4

# Find where the derivative equals 0:
# 2x - 4 = 0  =>  x = 2
x_min = 2
print(f"Minimum at x={x_min}, f(x)={f(x_min)}")  # Minimum at x=2, f(x)=1
```
The Derivative as a Direction Signal
This is the key insight that bridges calculus to machine learning:
If $f'(x) > 0$ at some point, moving to the right increases $f$. Moving to the left decreases $f$.
If $f'(x) < 0$, the opposite is true.
To minimize $f$, we should always move in the direction opposite to the derivative:

$x_{\text{new}} = x_{\text{old}} - \alpha \cdot f'(x_{\text{old}})$

Where $\alpha$ is a small step size. Notice anything? This is exactly the gradient descent update rule.
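A minimal sketch of that rule in one dimension, reusing $f(x) = x^2 - 4x + 5$ from above (the starting point, step size, and iteration count are arbitrary choices):

```python
def f_prime(x):
    return 2 * x - 4        # derivative of f(x) = x**2 - 4x + 5

x = 10.0                    # arbitrary starting point
alpha = 0.1                 # small step size
for _ in range(100):
    x = x - alpha * f_prime(x)   # step opposite to the derivative
print(round(x, 4))          # 2.0, the minimum we found analytically
```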
Part 7 — From One Variable to Many: The Gradient
Machine learning models have not one parameter, but millions. A loss function might depend on weights $w_1, w_2, \ldots, w_n$. We need derivatives with respect to each parameter simultaneously.
Partial Derivatives
A partial derivative holds all other variables constant and differentiates with respect to one; it is written $\frac{\partial f}{\partial w_i}$.
Example: for $f(x, y) = x^2 y + 3y$,

$\dfrac{\partial f}{\partial x} = 2xy \quad (\text{treat } y \text{ as a constant}), \qquad \dfrac{\partial f}{\partial y} = x^2 + 3 \quad (\text{treat } x \text{ as a constant})$
The Gradient Vector
Stack all partial derivatives into a single vector — this is the gradient $\nabla f$:

$\nabla f = \left[ \dfrac{\partial f}{\partial w_1}, \; \dfrac{\partial f}{\partial w_2}, \; \ldots, \; \dfrac{\partial f}{\partial w_n} \right]$
The gradient is the multi-dimensional equivalent of the derivative. It points in the direction of steepest ascent in the loss landscape. To minimize the loss, we move in the opposite direction — exactly what gradient descent does.
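A small sketch of the gradient as "one partial derivative per parameter", estimated here with finite differences; the two-parameter toy loss is a made-up stand-in for a real model's loss:

```python
import numpy as np

def loss(w):
    # Toy "loss" of two parameters: J(w1, w2) = w1**2 + 3*w2**2
    return w[0]**2 + 3 * w[1]**2

def numerical_gradient(fn, w, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)      # nudge one parameter at a time,
        step[i] = eps                # holding all the others constant
        grad[i] = (fn(w + step) - fn(w - step)) / (2 * eps)
    return grad

w = np.array([1.0, 2.0])
print(numerical_gradient(loss, w))   # ≈ [2. 12.]  (analytic: [2·w1, 6·w2])
```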
The Bridge to Machine Learning
In ML, the loss function $J(\theta)$ measures how wrong the model is. The gradient $\nabla J(\theta)$ tells us which direction in parameter space increases the error most. By stepping in the opposite direction, we reduce the error — step by step, iteration by iteration.
Part 8 — A Complete Example: Linear Regression
Let's see all of this in action.
Setup: We have $m$ data points $(x_i, y_i)$ and want to fit $\hat{y} = wx + b$.
Loss function (Mean Squared Error):

$J(w, b) = \dfrac{1}{m} \sum_{i=1}^{m} \left( (w x_i + b) - y_i \right)^2$

Partial derivative w.r.t. $w$ (using the chain rule — derivative of the outer squared term times the derivative of the inner $(w x_i + b) - y_i$):

$\dfrac{\partial J}{\partial w} = \dfrac{2}{m} \sum_{i=1}^{m} \left( (w x_i + b) - y_i \right) x_i$

Partial derivative w.r.t. $b$:

$\dfrac{\partial J}{\partial b} = \dfrac{2}{m} \sum_{i=1}^{m} \left( (w x_i + b) - y_i \right)$

Gradient descent updates — move opposite to the gradient:

$w \leftarrow w - \alpha \dfrac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \dfrac{\partial J}{\partial b}$
```python
import numpy as np

# Data: true relationship y = 3x + 2
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 8.0, 11.0, 14.0, 17.0])

w, b = 0.0, 0.0   # start at zero
alpha = 0.01      # learning rate (step size)
m = len(y)

# 5000 iterations gives the slowly converging intercept b time to settle
for epoch in range(5000):
    y_pred = w * X + b                   # forward pass
    error = y_pred - y                   # residuals: ŷ - y

    # Partial derivatives (the gradient)
    dw = (2 / m) * np.dot(error, X)      # ∂J/∂w
    db = (2 / m) * np.sum(error)         # ∂J/∂b

    # Gradient descent step
    w = w - alpha * dw
    b = b - alpha * db

print(f"Fitted: ŷ = {w:.4f}·x + {b:.4f}")
# Output: Fitted: ŷ = 3.0000·x + 2.0000
```
The derivative — computed analytically with calculus, then applied iteratively — is what drives the entire learning process.
Summary
| Concept | One-Line Definition |
|---|---|
| Slope of a line | $m = \frac{y_2 - y_1}{x_2 - x_1}$ — constant rate of change |
| Average rate of change | $\frac{f(a + h) - f(a)}{h}$ — slope of the secant over an interval |
| Limit | The value an expression approaches as a variable approaches some target (e.g. $h \to 0$) |
| Derivative | $f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$ — instantaneous rate of change |
| Power rule | $\frac{d}{dx} x^n = n x^{n-1}$ |
| Chain rule | $\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$ — essential for backprop |
| Partial derivative | Derivative holding all other variables fixed |
| Gradient | Vector of all partial derivatives — points toward steepest ascent |
The derivative is the mathematical answer to the question "which way is uphill?" In machine learning we use its negative — downhill — to train every model.
What's Next?
You now have the calculus foundation. The gradient descent algorithm takes this one concept — move opposite to the derivative — and turns it into a complete optimization engine for machine learning.
Next Room: Gradient Descent
See how the derivative becomes an optimization algorithm — with interactive experiments, full Python code, and a walk through every step of the math.
Enter the Gradient Descent Room →