Monday, November 24, 2025


The Math Behind Backpropagation in Neural Networks



Backpropagation is the algorithm used to train neural networks by computing how the loss changes with respect to every weight. It relies heavily on calculus, particularly the chain rule, to efficiently compute gradients.


Backprop ensures that weights update in the direction that reduces the error.


1. Basic Idea of Backpropagation


Goal:

Minimize the loss function $L$ by adjusting the weights $w$.


We use gradient descent:

$$ w \leftarrow w - \eta \, \frac{\partial L}{\partial w} $$

So the key is computing the gradient $\partial L / \partial w$.


Backpropagation calculates these gradients efficiently by applying the chain rule backward through the network.
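To make the update rule concrete, here is a minimal Python/NumPy sketch of a single gradient descent step. The weight and gradient values are made-up placeholders, and `grad_L_wrt_w` simply stands in for whatever gradient backpropagation would return.

```python
import numpy as np

# Hypothetical setup: w is a weight vector, grad_L_wrt_w is the gradient
# that backpropagation would compute for the current example or batch.
w = np.array([0.5, -1.2, 0.3])
grad_L_wrt_w = np.array([0.1, -0.4, 0.05])  # stand-in values for illustration
eta = 0.01                                   # learning rate

# One gradient descent step: move w against the gradient of the loss.
w = w - eta * grad_L_wrt_w
print(w)
```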


2. Components of a Neural Network


Consider a single neuron:

$$ z = w^T x + b, \qquad a = \sigma(z) $$


Where:

$x$ = inputs

$w$ = weights

$b$ = bias

$\sigma$ = activation function (sigmoid, ReLU, tanh, etc.)

$a$ = output
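As a quick numerical illustration of these definitions, the following sketch computes one neuron's output with a sigmoid activation; the input, weight, and bias values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2, 0.1])   # inputs
w = np.array([0.4, 0.7, -0.3])   # weights
b = 0.1                          # bias

z = w @ x + b     # pre-activation: z = w^T x + b
a = sigmoid(z)    # output: a = sigma(z)
print(z, a)
```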


3. The Chain Rule — Core of Backpropagation


If a variable $y$ depends on $u$, and $u$ depends on $x$:

$$ \frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x} $$



Neural networks repeatedly apply this through many layers.
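To see the chain rule in action numerically, here is a small illustrative check (not from the original article): it differentiates $y = \sigma(u)$ with $u = 3x$ via the chain rule and compares the result with a finite-difference estimate.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# y depends on u, and u depends on x: y = sigmoid(u), u = 3x
x = 0.7
u = 3.0 * x

# Chain rule: dy/dx = dy/du * du/dx
dy_du = sigmoid(u) * (1.0 - sigmoid(u))
du_dx = 3.0
dy_dx_chain = dy_du * du_dx

# Finite-difference check of the same derivative
eps = 1e-6
dy_dx_numeric = (sigmoid(3.0 * (x + eps)) - sigmoid(3.0 * (x - eps))) / (2 * eps)
print(dy_dx_chain, dy_dx_numeric)  # the two values agree closely
```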


4. Backprop in a Simple 2-Layer Network


Consider:


Forward Pass

$$ z^{(1)} = W^{(1)} x + b^{(1)}, \qquad a^{(1)} = \sigma(z^{(1)}) $$

$$ z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, \qquad a^{(2)} = \hat{y} $$



Loss (e.g., MSE):

$$ L = \frac{1}{2}\,(y - \hat{y})^2 $$
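Assuming sigmoid activations at both layers and arbitrary illustrative sizes (3 inputs, 4 hidden units, 1 output), a minimal NumPy sketch of this forward pass and loss might look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative sizes: 3 inputs, 4 hidden units, 1 output
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)          # a^(2) = y_hat

# MSE loss: L = 1/2 * (y - y_hat)^2
L = 0.5 * np.sum((y - y_hat) ** 2)
print(L)
```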

5. Computing Gradients Step-by-Step

5.1 Gradient w.r.t. Output Layer Activation

$$ \frac{\partial L}{\partial \hat{y}} = \hat{y} - y $$

(From the derivative of the MSE loss.)


5.2 Gradient w.r.t. Output Layer Pre-activation $z^{(2)}$

$$ \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} $$




If activation is sigmoid:

$$ \frac{\partial \hat{y}}{\partial z^{(2)}} = \hat{y}\,(1 - \hat{y}) $$


So:

$$ \delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = (\hat{y} - y)\cdot\hat{y}\,(1 - \hat{y}) $$

This term $\delta^{(2)}$ is the error signal for the output layer.


5.3 Gradient w.r.t. Output Layer Weights

$$ \frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} \, (a^{(1)})^T $$


And for biases:

$$ \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} $$
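A self-contained sketch of these output-layer computations, using made-up values for $y$, $\hat{y}$, and $a^{(1)}$ purely for illustration:

```python
import numpy as np

# Illustrative values (e.g. as produced by a forward pass like the one sketched earlier)
y     = np.array([[1.0]])
y_hat = np.array([[0.7]])                        # a^(2)
a1    = np.array([[0.2], [0.6], [0.9], [0.4]])   # hidden activations a^(1)

# Output-layer error signal: delta2 = (y_hat - y) * y_hat * (1 - y_hat)
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)

# Output-layer gradients
grad_W2 = delta2 @ a1.T   # dL/dW2 = delta2 (a1)^T, shape (1, 4)
grad_b2 = delta2          # dL/db2 = delta2
print(grad_W2, grad_b2)
```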

6. Backpropagating to Hidden Layer


We now find how the error affects the previous layer.


6.1 Hidden layer error signal

$$ \delta^{(1)} = \left( (W^{(2)})^T \delta^{(2)} \right) \odot \sigma'(z^{(1)}) $$


Multiply the error from the layer above by the transposed weights $(W^{(2)})^T$ that connect the two layers

Multiply elementwise by the derivative of the activation, $\sigma'(z^{(1)})$


Example: sigmoid derivative:

$$ \sigma'(z^{(1)}) = a^{(1)}\,(1 - a^{(1)}) $$

6.2 Hidden layer weight gradient

$$ \frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} \, x^T $$


And for biases:

$$ \frac{\partial L}{\partial b^{(1)}} = \delta^{(1)} $$
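And a corresponding sketch for the hidden layer, again with illustrative values carried over from the two-layer example (the `delta2` value matches the previous sketch):

```python
import numpy as np

# Illustrative values, continuing the two-layer example
x      = np.array([[0.5], [-0.2], [0.1]])         # input, shape (3, 1)
a1     = np.array([[0.2], [0.6], [0.9], [0.4]])   # hidden activations, shape (4, 1)
W2     = np.array([[0.3, -0.5, 0.8, 0.1]])        # output weights, shape (1, 4)
delta2 = np.array([[-0.063]])                     # output error signal from the previous sketch

# Hidden-layer error signal: delta1 = (W2^T delta2) * sigma'(z1),
# where for a sigmoid sigma'(z1) = a1 * (1 - a1)
delta1 = (W2.T @ delta2) * (a1 * (1.0 - a1))

# Hidden-layer gradients
grad_W1 = delta1 @ x.T    # dL/dW1 = delta1 x^T, shape (4, 3)
grad_b1 = delta1          # dL/db1 = delta1
print(grad_W1.shape, grad_b1.shape)
```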

7. Full Backprop Summary

Forward pass:


Compute all activations layer by layer.


Backward pass:


Compute output error $\delta^{(L)}$


For each layer going backward:

$$ \delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(z^{(l)}) $$


Weight gradients:

$$ \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \, (a^{(l-1)})^T $$


Bias gradients:

$$ \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)} $$


These gradients are then used to update weights and biases.
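Putting the whole summary together, here is a compact, self-contained sketch of repeated forward and backward passes with gradient descent updates for the two-layer sigmoid network used above. The toy data, layer sizes, learning rate, and iteration count are all arbitrary illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Toy data: 3 features, 1 target
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])

# Two-layer network: 3 -> 4 -> 1
W1, b1 = rng.normal(size=(4, 3)) * 0.5, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.5, np.zeros((1, 1))
eta = 0.5

for step in range(200):
    # Forward pass: compute all activations layer by layer
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y_hat = sigmoid(z2)
    loss = 0.5 * np.sum((y - y_hat) ** 2)

    # Backward pass: error signals
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)   # output-layer error
    delta1 = (W2.T @ delta2) * (a1 * (1.0 - a1))   # hidden-layer error

    # Gradients
    grad_W2, grad_b2 = delta2 @ a1.T, delta2
    grad_W1, grad_b1 = delta1 @ x.T, delta1

    # Gradient descent updates
    W2 -= eta * grad_W2; b2 -= eta * grad_b2
    W1 -= eta * grad_W1; b1 -= eta * grad_b1

print(loss)  # the loss after training is much smaller than at the start
```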


8. Matrix Form of Backprop (General Case)


For layer $l$:


Error term:

$$ \delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'(z^{(l)}) $$

Weight gradient:

$$ \nabla_{W^{(l)}} L = \delta^{(l)} \, (a^{(l-1)})^T $$

Bias gradient:

$$ \nabla_{b^{(l)}} L = \delta^{(l)} $$


This compact form allows training deep networks efficiently.
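One way this matrix form might be coded for an arbitrary stack of sigmoid layers is sketched below. The function names (`forward`, `backward`) and the list-of-arrays layout are illustrative choices rather than a standard API, and the output-layer error assumes the MSE-plus-sigmoid setup used earlier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """Return the cached activations [x, a(1), ..., a(L)] for sigmoid layers."""
    activations = [x]
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
        activations.append(a)
    return activations

def backward(y, activations, Ws):
    """Matrix-form backprop: dL/dW and dL/db for every layer (MSE loss, sigmoid)."""
    L = len(Ws)
    grads_W, grads_b = [None] * L, [None] * L
    a_out = activations[-1]
    # Output-layer error for MSE + sigmoid: (a_out - y) * sigma'(z(L))
    delta = (a_out - y) * a_out * (1.0 - a_out)
    for l in range(L - 1, -1, -1):
        grads_W[l] = delta @ activations[l].T      # delta(l) (a(l-1))^T
        grads_b[l] = delta                         # delta(l)
        if l > 0:
            a_prev = activations[l]
            # delta(l-1) = (W(l)^T delta(l)) * sigma'(z(l-1)),
            # using sigma'(z) = a (1 - a) for the sigmoid
            delta = (Ws[l].T @ delta) * (a_prev * (1.0 - a_prev))
    return grads_W, grads_b

# Example usage with hypothetical sizes: a 3 -> 5 -> 2 -> 1 network
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5)), rng.normal(size=(1, 2))]
bs = [np.zeros((5, 1)), np.zeros((2, 1)), np.zeros((1, 1))]
x, y = rng.normal(size=(3, 1)), np.array([[0.0]])
activations = forward(x, Ws, bs)
grads_W, grads_b = backward(y, activations, Ws)
```

Note that for the sigmoid, $\sigma'(z) = a(1-a)$, so caching the activations from the forward pass is all the backward pass needs.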


9. Why Backpropagation Works Efficiently


Avoids recomputing repeated derivatives


Uses cached values from forward pass


Operates layer-by-layer using matrix multiplications


Scales well with modern hardware (GPUs/TPUs)


Without backpropagation, computing these gradients for deep networks would be prohibitively expensive.


10. Common Activation Function Derivatives

Sigmoid:

$$ \sigma'(z) = \sigma(z)\,(1 - \sigma(z)) $$

ReLU:

$$ \text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases} $$


Tanh:

$$ \tanh'(z) = 1 - \tanh^2(z) $$


These derivatives are used repeatedly during backprop.
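These three derivatives translate directly into code; a small illustrative sketch in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma(z) (1 - sigma(z))

def relu_prime(z):
    return (z > 0).astype(float)  # 1 where z > 0, 0 where z <= 0

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2  # 1 - tanh^2(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid_prime(z))
print(relu_prime(z))
print(tanh_prime(z))
```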


Conclusion


Backpropagation is simply repeated application of the chain rule over a layered computation graph. It efficiently calculates the gradient of the loss function with respect to every weight in the network.


This mathematical foundation enables deep learning to train millions or even billions of parameters and is the core engine behind modern AI advances.
