The Math Behind Backpropagation in Neural Networks
Backpropagation is the algorithm used to train neural networks by computing how the loss changes with respect to every weight. It relies heavily on calculus, particularly the chain rule, to efficiently compute gradients.
Backprop ensures that weights update in the direction that reduces the error.
1. Basic Idea of Backpropagation
Goal: minimize the loss function $L$ by adjusting the weights $w$.
We use gradient descent:
$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$
So the key is computing the gradient $\frac{\partial L}{\partial w}$.
Backpropagation calculates these gradients efficiently by applying the chain rule backward through the network.
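To make the update rule concrete, here is a minimal gradient-descent sketch in Python on a made-up one-dimensional loss $L(w) = (w - 3)^2$; the loss, learning rate, and starting point are arbitrary choices for illustration.

```python
# A minimal gradient-descent sketch on the made-up loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2*(w - 3). Learning rate and start are arbitrary.
w = 0.0
eta = 0.1
for step in range(100):
    grad = 2.0 * (w - 3.0)   # dL/dw
    w = w - eta * grad       # w <- w - eta * dL/dw
print(w)                     # converges toward 3, the minimizer of L
```

Backpropagation supplies exactly this gradient, but for every weight in the network at once.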
2. Components of a Neural Network
Consider a single neuron:
$$z = w^T x + b, \qquad a = \sigma(z)$$
Where:
$x$ = inputs
$w$ = weights
$b$ = bias
$\sigma$ = activation function (sigmoid, ReLU, tanh, etc.)
$a$ = output
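As a quick sketch, this is the forward computation of a single sigmoid neuron in NumPy; the input values, weights, and bias are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single-neuron forward pass: z = w^T x + b, a = sigma(z).
x = np.array([0.5, -1.0, 2.0])   # inputs (made-up values)
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.3                          # bias

z = w @ x + b                    # pre-activation
a = sigmoid(z)                   # activation (neuron output)
print(z, a)
```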
3. The Chain Rule — Core of Backpropagation
If a variable $y$ depends on $u$, and $u$ depends on $x$:
$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x}$$
Neural networks repeatedly apply this through many layers.
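For a concrete, hypothetical example, take $y = u^2$ with $u = 3x + 1$, so the chain rule gives $\frac{dy}{dx} = 2u \cdot 3$. The short Python check below confirms this numerically with a finite difference.

```python
# Hypothetical composition: y = u^2 with u = 3x + 1.
# Chain rule: dy/dx = (dy/du) * (du/dx) = 2u * 3.
x = 2.0
u = 3 * x + 1
analytic = (2 * u) * 3

# Finite-difference check of the same derivative.
f = lambda t: (3 * t + 1) ** 2
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(analytic, numeric)  # both approximately 42
```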
4. Backprop in a Simple 2-Layer Network
Consider:
Forward Pass
$$z^{(1)} = W^{(1)} x + b^{(1)}, \qquad a^{(1)} = \sigma(z^{(1)})$$
$$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, \qquad a^{(2)} = \hat{y}$$
Loss (e.g., MSE):
$$L = \frac{1}{2}\,(y - \hat{y})^2$$
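Putting the forward pass and the loss together, here is a minimal NumPy sketch of this two-layer network. The layer sizes (3 inputs, 4 hidden units, 1 output), the random weights, and the target value are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Made-up sizes for illustration: 3 inputs, 4 hidden units, 1 output.
x = rng.normal(size=(3, 1))          # input column vector
y = np.array([[1.0]])                # target

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward pass, mirroring the equations above.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)                  # a^(2) = y_hat

L = 0.5 * np.sum((y - y_hat) ** 2)   # MSE loss
print(L)
```

The backward pass in the next section reuses the cached values `a1` and `y_hat`.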
5. Computing Gradients Step-by-Step
5.1 Gradient w.r.t. Output Layer Activation
$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$$
(From derivative of MSE.)
5.2 Gradient w.r.t. Output Layer Pre-activation $z^{(2)}$
$$\frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}}$$
If the output activation is sigmoid:
$$\frac{\partial \hat{y}}{\partial z^{(2)}} = \hat{y}\,(1 - \hat{y})$$
So:
$$\delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = (\hat{y} - y) \cdot \hat{y}\,(1 - \hat{y})$$
This term $\delta^{(2)}$ is the error signal for the output layer.
5.3 Gradient w.r.t. Output Layer Weights
$$\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} \cdot (a^{(1)})^T$$
And for biases:
$$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$$
6. Backpropagating to Hidden Layer
We now find how the error affects the previous layer.
6.1 Hidden layer error signal
$$\delta^{(1)} = \big( (W^{(2)})^T \delta^{(2)} \big) \odot \sigma'(z^{(1)})$$
Multiply the error by the transpose of the next layer's weights.
Multiply elementwise by the derivative of the activation.
Example: sigmoid derivative:
$$\sigma'(z^{(1)}) = a^{(1)}\,(1 - a^{(1)})$$
6.2 Hidden layer weight gradient
$$\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} \cdot x^T$$
And for biases:
$$\frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}$$
7. Full Backprop Summary
Forward pass:
Compute all activations layer by layer.
Backward pass:
Compute the output error $\delta^{(L)}$.
For each layer going backward:
$$\delta^{(l)} = \big( (W^{(l+1)})^T \delta^{(l+1)} \big) \odot \sigma'(z^{(l)})$$
Weight gradients:
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \, (a^{(l-1)})^T$$
Bias gradients:
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$
These gradients are then used to update weights and biases.
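The whole recipe fits in one function. The sketch below (a hypothetical `backprop` helper, assuming sigmoid activations at every layer and MSE loss) follows the summary above: a cached forward pass, the output error, and the layer-by-layer backward recursion.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    """One forward/backward pass for an all-sigmoid MLP with MSE loss.

    A minimal sketch: Ws and bs are lists of weight matrices and bias
    vectors, one per layer; returns gradient lists with matching shapes.
    """
    # Forward pass: cache every activation (for sigmoid, sigma'(z) = a(1 - a)).
    activations = [x]
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
        activations.append(a)

    # Output error: delta^(L) = (y_hat - y) * sigma'(z^(L)).
    y_hat = activations[-1]
    delta = (y_hat - y) * y_hat * (1 - y_hat)

    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    dWs[-1] = delta @ activations[-2].T
    dbs[-1] = delta

    # Backward recursion: delta^(l) = ((W^(l+1))^T delta^(l+1)) * sigma'(z^(l)).
    for l in range(len(Ws) - 2, -1, -1):
        a_l = activations[l + 1]
        delta = (Ws[l + 1].T @ delta) * a_l * (1 - a_l)
        dWs[l] = delta @ activations[l].T
        dbs[l] = delta
    return dWs, dbs
```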
8. Matrix Form of Backprop (General Case)
For layer $l$:
Error term:
$$\delta^{(l)} = \big( (W^{(l+1)})^T \delta^{(l+1)} \big) \odot \sigma'(z^{(l)})$$
Weight gradient:
$$\nabla_{W^{(l)}} L = \delta^{(l)} \, (a^{(l-1)})^T$$
Bias gradient:
$$\nabla_{b^{(l)}} L = \delta^{(l)}$$
This compact form allows training deep networks efficiently.
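In this matrix form, one gradient-descent update per step is just a few lines. The sketch below assumes the `backprop` function from the previous block and uses a made-up 3-4-1 network, target, and learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-4-1 network; backprop() is the sketch from the previous block.
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
bs = [np.zeros((4, 1)), np.zeros((1, 1))]

eta = 0.5   # learning rate (arbitrary choice)
for step in range(1000):
    dWs, dbs = backprop(x, y, Ws, bs)
    Ws = [W - eta * dW for W, dW in zip(Ws, dWs)]   # W^(l) <- W^(l) - eta * grad
    bs = [b - eta * db for b, db in zip(bs, dbs)]   # b^(l) <- b^(l) - eta * grad
```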
9. Why Backpropagation Works Efficiently
Avoids recomputing repeated derivatives
Uses cached values from the forward pass
Operates layer by layer using matrix multiplications
Scales well with modern hardware (GPUs/TPUs)
Without backpropagation, naively computing each gradient separately would repeat the same work over and over, making deep networks computationally impractical to train.
10. Common Activation Function Derivatives
Sigmoid:
$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$
ReLU:
$$\mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}$$
Tanh:
$$\tanh'(z) = 1 - \tanh^2(z)$$
These derivatives are used repeatedly during backprop.
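For reference, these derivatives translate directly into small NumPy helpers (the function names are my own choice):

```python
import numpy as np

# Derivatives of common activations, as used during the backward pass.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

def relu_prime(z):
    return (z > 0).astype(float)  # 1 where z > 0, else 0

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2  # tanh'(z) = 1 - tanh^2(z)
```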
Conclusion
Backpropagation is simply repeated application of the chain rule over a layered computation graph. It calculates gradients of the loss function with respect to every weight in a neural network efficiently.
This mathematical foundation enables deep learning to train millions or even billions of parameters and is the core engine behind modern AI advances.