The Math Behind Backpropagation in Neural Networks
Backpropagation is the algorithm used to train neural networks by computing how the loss changes with respect to every weight. It relies heavily on calculus, particularly the chain rule, to efficiently compute gradients.
Backprop ensures that weights update in the direction that reduces the error.
1. Basic Idea of Backpropagation
Goal: minimize the loss function $L$ by adjusting the weights $w$.
We use gradient descent:
$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$
So the key is computing the gradient $\frac{\partial L}{\partial w}$.
Backpropagation calculates these gradients efficiently by applying the chain rule backward through the network.
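To make the update rule concrete, here is a minimal gradient-descent sketch in Python on a made-up one-dimensional loss $L(w) = (w - 3)^2$; the loss, learning rate, and starting point are arbitrary choices for illustration.

```python
# A minimal gradient-descent sketch on the made-up loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2*(w - 3). Learning rate and start are arbitrary.
w = 0.0
eta = 0.1
for step in range(100):
    grad = 2.0 * (w - 3.0)   # dL/dw
    w = w - eta * grad       # w <- w - eta * dL/dw
print(w)                     # converges toward 3, the minimizer of L
```

Backpropagation supplies exactly this gradient, but for every weight in the network at once.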
2. Components of a Neural Network
Consider a single neuron:
$$z = w^T x + b, \qquad a = \sigma(z)$$
Where:
$x$ = inputs
$w$ = weights
$b$ = bias
$\sigma$ = activation function (sigmoid, ReLU, tanh, etc.)
$a$ = output
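As a quick sketch, this is the forward computation of a single sigmoid neuron in NumPy; the input values, weights, and bias are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single-neuron forward pass: z = w^T x + b, a = sigma(z).
x = np.array([0.5, -1.0, 2.0])   # inputs (made-up values)
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.3                          # bias

z = w @ x + b                    # pre-activation
a = sigmoid(z)                   # activation (neuron output)
print(z, a)
```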
3. The Chain Rule — Core of Backpropagation
If a variable $y$ depends on $u$, and $u$ depends on $x$:
$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x}$$
Neural networks repeatedly apply this through many layers.
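For a concrete, hypothetical example, take $y = u^2$ with $u = 3x + 1$, so the chain rule gives $\frac{dy}{dx} = 2u \cdot 3$. The short Python check below confirms this numerically with a finite difference.

```python
# Hypothetical composition: y = u^2 with u = 3x + 1.
# Chain rule: dy/dx = (dy/du) * (du/dx) = 2u * 3.
x = 2.0
u = 3 * x + 1
analytic = (2 * u) * 3

# Finite-difference check of the same derivative.
f = lambda t: (3 * t + 1) ** 2
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(analytic, numeric)  # both approximately 42
```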
4. Backprop in a Simple 2-Layer Network
Consider:
Forward Pass
$$z^{(1)} = W^{(1)} x + b^{(1)}, \qquad a^{(1)} = \sigma(z^{(1)})$$
$$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, \qquad a^{(2)} = \hat{y}$$
Loss (e.g., MSE):
$$L = \frac{1}{2}\,(y - \hat{y})^2$$
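Putting the forward pass and the loss together, here is a minimal NumPy sketch of this two-layer network. The layer sizes (3 inputs, 4 hidden units, 1 output), the random weights, and the target value are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Made-up sizes for illustration: 3 inputs, 4 hidden units, 1 output.
x = rng.normal(size=(3, 1))          # input column vector
y = np.array([[1.0]])                # target

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward pass, mirroring the equations above.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)                  # a^(2) = y_hat

L = 0.5 * np.sum((y - y_hat) ** 2)   # MSE loss
print(L)
```

The backward pass in the next section reuses the cached values `a1` and `y_hat`.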
5. Computing Gradients Step-by-Step
5.1 Gradient w.r.t. Output Layer Activation
$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$$
(From derivative of MSE.)
5.2 Gradient w.r.t. Output Layer Pre-activation $z^{(2)}$
$$\frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}}$$
If the output activation is sigmoid:
$$\frac{\partial \hat{y}}{\partial z^{(2)}} = \hat{y}\,(1 - \hat{y})$$
So:
$$\delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = (\hat{y} - y) \cdot \hat{y}\,(1 - \hat{y})$$
This term $\delta^{(2)}$ is the error signal for the output layer.
5.3 Gradient w.r.t. Output Layer Weights
$$\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} \cdot (a^{(1)})^T$$
And for biases:
$$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$$
6. Backpropagating to Hidden Layer
We now find how the error affects the previous layer.
6.1 Hidden layer error signal
$$\delta^{(1)} = \big( (W^{(2)})^T \delta^{(2)} \big) \odot \sigma'(z^{(1)})$$
Multiply the error by the transpose of the next layer's weights.
Multiply elementwise by the derivative of the activation.
Example: sigmoid derivative:
$$\sigma'(z^{(1)}) = a^{(1)}\,(1 - a^{(1)})$$
6.2 Hidden layer weight gradient
$$\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} \cdot x^T$$
And for biases:
$$\frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}$$
7. Full Backprop Summary
Forward pass:
Compute all activations layer by layer.
Backward pass:
Compute the output error $\delta^{(L)}$.
For each layer going backward:
$$\delta^{(l)} = \big( (W^{(l+1)})^T \delta^{(l+1)} \big) \odot \sigma'(z^{(l)})$$
Weight gradients:
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \, (a^{(l-1)})^T$$
Bias gradients:
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$
These gradients are then used to update weights and biases.
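The whole recipe fits in one function. The sketch below (a hypothetical `backprop` helper, assuming sigmoid activations at every layer and MSE loss) follows the summary above: a cached forward pass, the output error, and the layer-by-layer backward recursion.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    """One forward/backward pass for an all-sigmoid MLP with MSE loss.

    A minimal sketch: Ws and bs are lists of weight matrices and bias
    vectors, one per layer; returns gradient lists with matching shapes.
    """
    # Forward pass: cache every activation (for sigmoid, sigma'(z) = a(1 - a)).
    activations = [x]
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
        activations.append(a)

    # Output error: delta^(L) = (y_hat - y) * sigma'(z^(L)).
    y_hat = activations[-1]
    delta = (y_hat - y) * y_hat * (1 - y_hat)

    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    dWs[-1] = delta @ activations[-2].T
    dbs[-1] = delta

    # Backward recursion: delta^(l) = ((W^(l+1))^T delta^(l+1)) * sigma'(z^(l)).
    for l in range(len(Ws) - 2, -1, -1):
        a_l = activations[l + 1]
        delta = (Ws[l + 1].T @ delta) * a_l * (1 - a_l)
        dWs[l] = delta @ activations[l].T
        dbs[l] = delta
    return dWs, dbs
```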
8. Matrix Form of Backprop (General Case)
For layer $l$:
Error term:
$$\delta^{(l)} = \big( (W^{(l+1)})^T \delta^{(l+1)} \big) \odot \sigma'(z^{(l)})$$
Weight gradient:
$$\nabla_{W^{(l)}} L = \delta^{(l)} \, (a^{(l-1)})^T$$
Bias gradient:
$$\nabla_{b^{(l)}} L = \delta^{(l)}$$
This compact form allows training deep networks efficiently.
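In this matrix form, one gradient-descent update per step is just a few lines. The sketch below assumes the `backprop` function from the previous block and uses a made-up 3-4-1 network, target, and learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-4-1 network; backprop() is the sketch from the previous block.
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
bs = [np.zeros((4, 1)), np.zeros((1, 1))]

eta = 0.5   # learning rate (arbitrary choice)
for step in range(1000):
    dWs, dbs = backprop(x, y, Ws, bs)
    Ws = [W - eta * dW for W, dW in zip(Ws, dWs)]   # W^(l) <- W^(l) - eta * grad
    bs = [b - eta * db for b, db in zip(bs, dbs)]   # b^(l) <- b^(l) - eta * grad
```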
9. Why Backpropagation Works Efficiently
Avoids recomputing repeated derivatives
Uses cached values from the forward pass
Operates layer by layer using matrix multiplications
Scales well with modern hardware (GPUs/TPUs)
Without backpropagation, naively computing each gradient separately would repeat the same work over and over, making deep networks computationally impractical to train.
10. Common Activation Function Derivatives
Sigmoid:
$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$
ReLU:
$$\mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}$$
Tanh:
$$\tanh'(z) = 1 - \tanh^2(z)$$
These derivatives are used repeatedly during backprop.
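For reference, these derivatives translate directly into small NumPy helpers (the function names are my own choice):

```python
import numpy as np

# Derivatives of common activations, as used during the backward pass.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

def relu_prime(z):
    return (z > 0).astype(float)  # 1 where z > 0, else 0

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2  # tanh'(z) = 1 - tanh^2(z)
```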
Conclusion
Backpropagation is simply repeated application of the chain rule over a layered computation graph. It calculates gradients of the loss function with respect to every weight in a neural network efficiently.
This mathematical foundation enables deep learning to train millions or even billions of parameters and is the core engine behind modern AI advances.