
Derivatives — The Language of Change

From slopes of lines to the calculus engine behind machine learning


Chau Dara - Founder of TFDevs

April 10, 2026



Every time a neural network learns, it asks one question over and over: "If I nudge this parameter slightly, does the error go up or down — and by how much?" That question is answered by the derivative. Before we talk about gradients or optimizers, we need to understand derivatives from scratch.


Part 1 — Lines and Slopes

The Equation of a Line

The simplest relationship between two quantities is a straight line:

$$y = mx + b$$

Where:

  • $x$ is the input
  • $y$ is the output
  • $m$ is the slope — how steeply the line rises or falls
  • $b$ is the y-intercept — where the line crosses the vertical axis

Example: $y = 2x + 1$

| $x$ | $y = 2x + 1$ |
|---|---|
| 0 | 1 |
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |

Every time $x$ increases by 1, $y$ increases by exactly 2. The slope $m = 2$ captures this constant rate.

Computing the Slope Between Two Points

Given any two points $(x_1, y_1)$ and $(x_2, y_2)$ on a line, the slope is:

$$m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$$

This is the rise over run formula — how much $y$ changes (rise) per unit change in $x$ (run).
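As a quick sanity check, here is a minimal Python sketch of the rise-over-run formula, using two points from the table above:

def slope(x1, y1, x2, y2):
    """Slope of the line through (x1, y1) and (x2, y2): rise over run."""
    return (y2 - y1) / (x2 - x1)

# Two points on y = 2x + 1 from the table above
print(slope(0, 1, 3, 7))  # 2.0, matching m = 2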

Why Does Slope Matter?

Slope tells you the rate of change. A slope of 2 means "for every 1 unit step in x, y changes by 2." A slope of −3 means y decreases by 3 for every step forward. A slope of 0 means y doesn't change at all — it's flat.


Part 2 — When Lines Become Curves

A line has a constant slope — it's the same everywhere. But most interesting functions in mathematics (and in machine learning) are curves whose steepness changes at every point.

Consider the parabola:

$$f(x) = x^2$$

| $x$ | $f(x) = x^2$ |
|---|---|
| −3 | 9 |
| −1 | 1 |
| 0 | 0 |
| 1 | 1 |
| 3 | 9 |

Near $x = 0$ the curve is nearly flat. Near $x = 3$ it rises steeply. The slope is different at every point — which means the single formula $m = \frac{\Delta y}{\Delta x}$ between two distant points only gives us an average.

Average Rate of Change

For two points $x$ and $x + h$ on a curve $f$, the average rate of change over that interval is:

$$\frac{\Delta f}{\Delta x} = \frac{f(x + h) - f(x)}{h}$$

This is the slope of the secant line — the straight line connecting the two points on the curve.

Example on $f(x) = x^2$ between $x = 1$ and $x = 3$:

$$\frac{f(3) - f(1)}{3 - 1} = \frac{9 - 1}{2} = 4$$

That is the average steepness between $x = 1$ and $x = 3$, but it doesn't tell us what the slope is at a specific point.
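The same computation as a small Python sketch (the function and interval are the ones from the example above):

def avg_rate_of_change(f, x, h):
    """Slope of the secant line between (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

f = lambda x: x**2
print(avg_rate_of_change(f, 1, 2))  # 4.0, the secant slope between x = 1 and x = 3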


Part 3 — The Limit: Zooming In to a Single Point

To find the slope at one exact point, we shrink the interval $h$ down toward zero. As $h$ gets smaller and smaller, the secant line rotates until it becomes the tangent line — touching the curve at exactly one point and matching its steepness there.

Formally, the instantaneous rate of change at $x$ is the limit:

$$\lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

This is the core idea of a derivative.

Limits Intuitively

A limit asks: "What value does an expression approach as a variable gets closer and closer to some number — even if it never arrives?"

For example, apply the difference quotient to $f(x) = x^2$:

$$\lim_{h \to 0} \frac{(x+h)^2 - x^2}{h}$$

Expand the numerator:

$$= \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h)$$

As $h \to 0$:

$$= 2x$$

The slope of $f(x) = x^2$ at any point $x$ is exactly $2x$.
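You can watch this limit happen numerically: hold $x$ fixed and shrink $h$, and the difference quotient closes in on $2x$. A minimal sketch at $x = 3$, where the target value is 6:

f = lambda x: x**2
x0 = 3.0
for h in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    secant = (f(x0 + h) - f(x0)) / h   # average rate of change over [x0, x0 + h]
    print(f"h = {h:<7} secant slope = {secant:.5f}")   # approaches 2*x0 = 6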


Part 4 — The Derivative

Definition

The derivative of a function $f$ at point $x$, written $f'(x)$ or $\frac{df}{dx}$, is:

$$\boxed{f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}}$$

It gives the instantaneous rate of change — the slope of the tangent line at every point.

Geometric Meaning

| Derivative value | Meaning |
|---|---|
| $f'(x) > 0$ | Function is increasing at $x$ |
| $f'(x) < 0$ | Function is decreasing at $x$ |
| $f'(x) = 0$ | Function has a flat point (possible minimum, maximum, or saddle) |
| Large $\lvert f'(x)\rvert$ | Function is changing rapidly |
| Small $\lvert f'(x)\rvert$ | Function is changing slowly |

Part 5 — Differentiation Rules

Computing limits by hand every time would be exhausting. Mathematicians have derived shortcut rules that cover almost every function you'll encounter.

Power Rule

For $f(x) = x^n$:

$$\frac{d}{dx} x^n = n \cdot x^{n-1}$$

Examples:

| Function | Derivative |
|---|---|
| $x^2$ | $2x$ |
| $x^3$ | $3x^2$ |
| $x^{10}$ | $10x^9$ |
| $x$ (i.e. $x^1$) | $1$ |
| $5$ (constant, $5 \cdot x^0$) | $0$ |

Constant Multiple Rule

$$\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x)$$

If $f(x) = 3x^2$, then $f'(x) = 3 \cdot 2x = 6x$.

Sum Rule

$$\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$$

If $f(x) = x^3 + 5x^2 - 2x + 7$, differentiate term by term:

$$f'(x) = 3x^2 + 10x - 2$$

Chain Rule

For a composition of functions $f(g(x))$:

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

Read as: "derivative of outer, evaluated at inner — times derivative of inner."

Example: $h(x) = (3x + 1)^4$

Let $g(x) = 3x + 1$ and $f(u) = u^4$:

$$h'(x) = 4(3x+1)^3 \cdot 3 = 12(3x+1)^3$$
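If you want to double-check these rules symbolically, SymPy can differentiate the two examples above. A small sketch, assuming the sympy package is installed:

import sympy as sp

x = sp.symbols('x')

# Sum + power rules: d/dx (x^3 + 5x^2 - 2x + 7)
print(sp.diff(x**3 + 5*x**2 - 2*x + 7, x))   # 3*x**2 + 10*x - 2

# Chain rule: d/dx (3x + 1)^4
print(sp.diff((3*x + 1)**4, x))              # 12*(3*x + 1)**3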

The chain rule is everywhere in machine learning — backpropagation is essentially repeated application of it through layers of a neural network.

Common Derivatives Reference

| Function | Derivative |
|---|---|
| $e^x$ | $e^x$ |
| $\ln(x)$ | $\frac{1}{x}$ |
| $\sin(x)$ | $\cos(x)$ |
| $\cos(x)$ | $-\sin(x)$ |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ (sigmoid) | $\sigma(x)(1 - \sigma(x))$ |
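The sigmoid entry is the one you will meet most often in ML, so a quick numerical check of its derivative formula is worthwhile. A sketch comparing a finite-difference slope against $\sigma(x)(1 - \sigma(x))$:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x0, h = 0.7, 1e-6
numeric  = (sigmoid(x0 + h) - sigmoid(x0 - h)) / (2 * h)   # central-difference slope
analytic = sigmoid(x0) * (1 - sigmoid(x0))                 # closed-form derivative
print(numeric, analytic)   # both approximately 0.2217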

Interactive visualization: move x₀ along the curve and shrink h toward 0. The secant line rotates into the tangent line, revealing the derivative as an instantaneous slope; a live readout compares the exact f′(x₀) with the secant approximation [f(x₀ + h) − f(x₀)] / h and the shrinking error between them.

Geometric Meaning of the Derivative


f′(x) > 0 — Increasing

The function rises as x increases. The tangent line slopes upward. To find the minimum, gradient descent moves left.


f′(x) = 0 — Flat

The tangent line is horizontal. This is a critical point — possibly a minimum, maximum, or saddle point.

f′(x) < 0 — Decreasing

The function falls as x increases. The tangent line slopes downward. To find the minimum, gradient descent moves right.


Part 6 — Derivatives in Practice

Finding Minima and Maxima

If $f'(x) = 0$ the function is momentarily flat — this is a critical point. There are three types:

  • Local minimum: function dips down then rises → $f'(x)$ changes from negative to positive
  • Local maximum: function rises then dips → $f'(x)$ changes from positive to negative
  • Saddle point: function is flat but keeps going in the same general direction → $f'(x)$ touches zero without changing sign
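A simple way to tell these three cases apart in code is to sample the derivative just to the left and right of the critical point and compare signs. A minimal sketch (the helper and test functions here are illustrative, not from the text above):

def classify_critical_point(f_prime, x0, eps=1e-4):
    """Classify x0 by how f' changes sign across it."""
    left, right = f_prime(x0 - eps), f_prime(x0 + eps)
    if left < 0 < right:
        return "local minimum"
    if left > 0 > right:
        return "local maximum"
    return "saddle / flat point"

print(classify_critical_point(lambda x: 2*x, 0))      # f(x) = x**2   -> local minimum
print(classify_critical_point(lambda x: -2*x, 0))     # f(x) = -x**2  -> local maximum
print(classify_critical_point(lambda x: 3*x**2, 0))   # f(x) = x**3   -> saddle / flat point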

Example: Find the minimum of $f(x) = x^2 - 4x + 5$

$$f'(x) = 2x - 4 = 0 \implies x = 2$$

At $x = 2$: $f(2) = 4 - 8 + 5 = 1$ — this is the minimum.

def f(x):
    return x**2 - 4*x + 5

def f_prime(x):
    return 2*x - 4

# Find where derivative = 0
# 2x - 4 = 0  =>  x = 2
x_min = 2
print(f"Minimum at x={x_min}, f(x)={f(x_min)}")  # x=2, f(x)=1

The Derivative as a Direction Signal

This is the key insight that bridges calculus to machine learning:

If $f'(x) > 0$ at some point, moving $x$ to the right increases $f$. Moving $x$ to the left decreases $f$.

If $f'(x) < 0$, the opposite is true.

To minimize $f$, we should always move $x$ in the direction opposite to the derivative:

$$x_{\text{new}} = x_{\text{old}} - \alpha \cdot f'(x_{\text{old}})$$

Where $\alpha$ is a small step size. Notice anything? This is exactly the gradient descent update rule.
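Here is that update rule applied to the earlier example $f(x) = x^2 - 4x + 5$, a minimal sketch showing $x$ sliding toward the minimum at $x = 2$ (starting point and step size are arbitrary choices):

def f_prime(x):
    return 2*x - 4               # derivative of x^2 - 4x + 5

x, alpha = 6.0, 0.1              # arbitrary starting point and step size
for step in range(100):
    x = x - alpha * f_prime(x)   # move opposite to the derivative

print(round(x, 4))               # 2.0, the minimum found by following -f'(x)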


Part 7 — From One Variable to Many: The Gradient

Machine learning models have not one parameter, but millions. A loss function $J$ might depend on weights $w_1, w_2, \ldots, w_n$. We need derivatives with respect to each parameter simultaneously.

Partial Derivatives

A partial derivative holds all other variables constant and differentiates with respect to one:

$$\frac{\partial J}{\partial w_i} \quad = \quad \text{"how much does } J \text{ change if we nudge only } w_i \text{?"}$$

Example: $J(w_1, w_2) = w_1^2 + 3w_1 w_2 + w_2^2$

$$\frac{\partial J}{\partial w_1} = 2w_1 + 3w_2 \qquad \frac{\partial J}{\partial w_2} = 3w_1 + 2w_2$$

The Gradient Vector

Stack all partial derivatives into a single vector — this is the gradient $\nabla J$:

$$\nabla J(w_1, w_2, \ldots, w_n) = \begin{bmatrix} \frac{\partial J}{\partial w_1} \\[4pt] \frac{\partial J}{\partial w_2} \\ \vdots \\[4pt] \frac{\partial J}{\partial w_n} \end{bmatrix}$$
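For the two-variable example above, a short sketch that assembles the gradient vector and cross-checks each analytic partial derivative with a finite difference:

import numpy as np

def J(w):
    w1, w2 = w
    return w1**2 + 3*w1*w2 + w2**2

def grad_J(w):
    w1, w2 = w
    return np.array([2*w1 + 3*w2, 3*w1 + 2*w2])   # analytic partial derivatives

w = np.array([1.0, 2.0])
print(grad_J(w))   # [8. 7.]

# Finite-difference check of each partial derivative
h = 1e-6
for i in range(2):
    e = np.zeros(2)
    e[i] = h
    print((J(w + e) - J(w - e)) / (2 * h))   # approximately 8, then 7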

The gradient is the multi-dimensional equivalent of the derivative. It points in the direction of steepest ascent in the loss landscape. To minimize the loss, we move in the opposite direction — exactly what gradient descent does.

The Bridge to Machine Learning

In ML, the loss function $J(\theta)$ measures how wrong the model is. The gradient $\nabla J(\theta)$ tells us which direction in parameter space increases the error most. By stepping in the opposite direction, we reduce the error — step by step, iteration by iteration.


Part 8 — A Complete Example: Linear Regression

Let's see all of this in action.

Setup: We have data points $(x^{(i)}, y^{(i)})$ and want to fit $\hat{y} = wx + b$.

Loss function (Mean Squared Error):

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{m} \sum_{i=1}^{m} \left(wx^{(i)} + b - y^{(i)}\right)^2$$

Partial derivative w.r.t. $w$ (using the chain rule — derivative of the outer squared term times derivative of the inner $wx + b$):

$$\frac{\partial J}{\partial w} = \frac{2}{m} \sum_{i=1}^{m} \left(wx^{(i)} + b - y^{(i)}\right) \cdot x^{(i)}$$

Partial derivative w.r.t. $b$:

$$\frac{\partial J}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} \left(wx^{(i)} + b - y^{(i)}\right)$$

Gradient descent updates — move opposite to the gradient:

$$w \leftarrow w - \alpha \cdot \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \cdot \frac{\partial J}{\partial b}$$
import numpy as np

# Data: true relationship y = 3x + 2
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 8.0, 11.0, 14.0, 17.0])

w, b = 0.0, 0.0   # start at zero
alpha = 0.01
m = len(y)

for epoch in range(5000):           # enough iterations for w and b to converge
    y_pred = w * X + b              # forward pass
    error  = y_pred - y             # residuals: ŷ - y

    # Partial derivatives (the gradient)
    dw = (2 / m) * np.dot(error, X) # ∂J/∂w
    db = (2 / m) * np.sum(error)    # ∂J/∂b

    # Gradient descent step
    w = w - alpha * dw
    b = b - alpha * db

print(f"Fitted: ŷ = {w:.4f}·x + {b:.4f}")
# Output: ŷ = 3.0000·x + 2.0000

The derivative — computed analytically with calculus, then applied iteratively — is what drives the entire learning process.


Summary

| Concept | One-Line Definition |
|---|---|
| Slope of a line | $m = \frac{\Delta y}{\Delta x}$ — constant rate of change |
| Average rate of change | $\frac{f(x+h)-f(x)}{h}$ — slope of the secant over interval $h$ |
| Limit | The value an expression approaches as $h \to 0$ |
| Derivative | $f'(x) = \lim_{h\to 0}\frac{f(x+h)-f(x)}{h}$ — instantaneous rate of change |
| Power rule | $\frac{d}{dx} x^n = nx^{n-1}$ |
| Chain rule | $\frac{d}{dx}f(g(x)) = f'(g(x))\cdot g'(x)$ — essential for backprop |
| Partial derivative | Derivative holding all other variables fixed |
| Gradient | Vector of all partial derivatives — points toward steepest ascent |

The derivative is the mathematical answer to the question "which way is uphill?" In machine learning we use its negative — downhill — to train every model.


What's Next?

You now have the calculus foundation. The gradient descent algorithm takes this one concept — move opposite to the derivative — and turns it into a complete optimization engine for machine learning.

Next Room: Gradient Descent

See how the derivative becomes an optimization algorithm — with interactive experiments, full Python code, and a walk through every step of the math.

Enter the Gradient Descent Room →