Backpropagation and Forward Propagation in Neural Networks

This tutorial will cover the following sections:

  1. Introduction to Backpropagation
  2. The Chain Rule
  3. Derivation of Backpropagation
  4. Example of Backpropagation in Action
  5. Conclusion

1. Introduction to Backpropagation

Backpropagation is a fundamental algorithm used in training artificial neural networks. It optimizes the weights of the network by minimizing the error between the predicted output and the actual output. The algorithm works by propagating the error backward through the network, adjusting the weights based on the gradient of the error.
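Concretely, once the gradient of the loss \(L\) with respect to a parameter \(w\) is known, the parameter is nudged in the direction opposite to that gradient, scaled by a small learning rate (written here as \(\eta\), which is just the usual convention):

\[ w \leftarrow w - \eta \, \frac{\partial L}{\partial w} \]

The same update is applied to every weight and bias, so the core work of backpropagation is computing all of these partial derivatives efficiently.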

2. The Chain Rule

The chain rule is a key mathematical concept used in backpropagation. It allows us to compute the derivative of a composite function. For functions \(f\) and \(g\), the chain rule states:

\[ \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x) \]
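As a quick numerical sanity check, the chain rule can be compared against a finite-difference approximation of the same derivative. The sketch below is a minimal Python example; the choice of \(g(x) = \sin(x)\) and \(f(u) = u^2\) is arbitrary and only for illustration.

import numpy as np

# Composite function f(g(x)) with g(x) = sin(x) and f(u) = u^2
def g(x):
    return np.sin(x)

def f(u):
    return u ** 2

def composite(x):
    return f(g(x))

# Chain rule: d/dx f(g(x)) = f'(g(x)) * g'(x) = 2*sin(x) * cos(x)
def chain_rule_derivative(x):
    return 2.0 * g(x) * np.cos(x)

x0 = 1.3
eps = 1e-6
numerical = (composite(x0 + eps) - composite(x0 - eps)) / (2 * eps)
print(chain_rule_derivative(x0), numerical)  # the two values should agree closely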

3. Derivation of Backpropagation

Let's consider a simple neural network with one hidden layer.

The simple neural network under consideration

We denote:

  • \(x\) as the input
  • \(w_1, w_2\) as the weights
  • \(b_1, b_2\) as the biases
  • \(h\) as the hidden layer activation
  • \(y\) as the output
  • \(L\) as the loss function

The forward pass equations are:

\[ h = \sigma(w_1 x + b_1) \]
\[ y = w_2 h + b_2 \]

Assuming a mean squared error loss:

\[ L = \frac{1}{2} (y - t)^2 \]

where \(t\) is the target output.
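These forward pass and loss equations translate directly into a few lines of Python. The sketch below is only illustrative; the function names are not taken from any particular library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    # Hidden activation followed by a linear output, matching the equations above
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return h, y

def squared_error_loss(y, t):
    # One-half squared error for a single sample
    return 0.5 * (y - t) ** 2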

To minimize \(L\) with respect to the weights and biases, we need to compute the gradients of \(L\) with respect to \(w_1\), \(w_2\), \(b_1\), and \(b_2\). Using the chain rule, we get:

For \(w_2\):

\[ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w_2}, \qquad \frac{\partial L}{\partial y} = y - t, \qquad \frac{\partial y}{\partial w_2} = h \]

\[ \frac{\partial L}{\partial w_2} = (y - t) \cdot h \]

For \(b_2\):

\[ \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial b_2}, \qquad \frac{\partial y}{\partial b_2} = 1 \]

\[ \frac{\partial L}{\partial b_2} = y - t \]

For \(w_1\):

\[ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial w_1}, \qquad \frac{\partial y}{\partial h} = w_2, \qquad \frac{\partial h}{\partial w_1} = \sigma'(w_1 x + b_1) \cdot x \]

\[ \frac{\partial L}{\partial w_1} = (y - t) \cdot w_2 \cdot \sigma'(w_1 x + b_1) \cdot x \]

For \(b_1\):

\[ \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial b_1}, \qquad \frac{\partial h}{\partial b_1} = \sigma'(w_1 x + b_1) \]

\[ \frac{\partial L}{\partial b_1} = (y - t) \cdot w_2 \cdot \sigma'(w_1 x + b_1) \]

Illustration of how the gradient of the cost function with respect to the output flows backwards to feed the next stage of gradient calculation via the chain rule, until we derive the gradient of the loss with respect to all of the weights and biases.
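One way to build confidence in these formulas is to compare them against numerically estimated gradients. Here is a minimal sketch that checks the derived expression for \(\partial L / \partial w_1\) with central differences; the parameter values are arbitrary and chosen only to exercise the formulas.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(x, t, w1, b1, w2, b2):
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return 0.5 * (y - t) ** 2

# Arbitrary values, chosen only to exercise the formulas
x, t = 0.8, 0.4
w1, b1, w2, b2 = -0.2, 0.05, 0.6, -0.1

# Analytical gradient from the derivation above
z1 = w1 * x + b1
h = sigmoid(z1)
y = w2 * h + b2
sigma_prime = h * (1.0 - h)          # sigma'(z) = sigma(z) * (1 - sigma(z))
dL_dw1 = (y - t) * w2 * sigma_prime * x

# Numerical gradient via central differences
eps = 1e-6
numerical_dw1 = (loss_fn(x, t, w1 + eps, b1, w2, b2)
                 - loss_fn(x, t, w1 - eps, b1, w2, b2)) / (2 * eps)
print(dL_dw1, numerical_dw1)  # should agree to several decimal places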

4. Example of Backpropagation in Action

Let's consider a simple example with concrete values:

  • Input \(x = 0.5\)
  • Weights \(w_1 = 0.4\), \(w_2 = 0.3\)
  • Biases \(b_1 = 0.1\), \(b_2 = 0.2\)
  • Target \(t = 0.6\)
  • Sigmoid activation function \(\sigma(z) = \frac{1}{1 + e^{-z}}\)

Forward Pass:

\[ h = \sigma(w_1 x + b_1) = \sigma(0.4 \cdot 0.5 + 0.1) = \sigma(0.3) = 0.5744 \]
\[ y = w_2 h + b_2 = 0.3 \cdot 0.5744 + 0.2 = 0.3723 \]

Illustration of a forward pass.

Loss:

\[ L = \frac{1}{2} (y - t)^2 = \frac{1}{2} (0.3723 - 0.6)^2 = 0.0259 \]

Backward Pass:

\[ \frac{\partial L}{\partial y} = y - t = 0.3723 - 0.6 = -0.2277 \]
\[ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w_2} = -0.2277 \cdot h = -0.2277 \cdot 0.5744 = -0.1308 \]
\[ \frac{\partial L}{\partial b_2} = -0.2277 \]

\[ \sigma'(0.3) = \sigma(0.3)\,(1 - \sigma(0.3)) = 0.5744 \cdot (1 - 0.5744) = 0.2445 \]
\[ \frac{\partial L}{\partial w_1} = -0.2277 \cdot 0.3 \cdot 0.2445 \cdot 0.5 = -0.0083 \]
\[ \frac{\partial L}{\partial b_1} = -0.2277 \cdot 0.3 \cdot 0.2445 = -0.0167 \]
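A few lines of Python reproduce these hand computations; the printed values should match the rounded numbers above.

import numpy as np

x, t = 0.5, 0.6
w1, b1, w2, b2 = 0.4, 0.1, 0.3, 0.2

# Forward pass and loss
z1 = w1 * x + b1
h = 1.0 / (1.0 + np.exp(-z1))   # sigmoid(0.3)
y = w2 * h + b2
L = 0.5 * (y - t) ** 2

# Backward pass
dL_dy = y - t
sigma_prime = h * (1.0 - h)
print(f"h = {h:.4f}, y = {y:.4f}, L = {L:.4f}")
print(f"dL/dw2 = {dL_dy * h:.4f}, dL/db2 = {dL_dy:.4f}")
print(f"dL/dw1 = {dL_dy * w2 * sigma_prime * x:.4f}, dL/db1 = {dL_dy * w2 * sigma_prime:.4f}")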

Now that we have computed the derivatives of the loss with respect to all of the weights and biases, we concatenate them into a gradient vector and take a small step in the direction opposite to that vector. This yields new values for the weights and biases, which are used in the next iteration of the forward pass followed by backpropagation.

Here’s a Python code example that manually implements backpropagation for the simple neural network in the above example.

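A minimal sketch of such an implementation is shown below, following the steps described in the walkthrough that comes after it. The learning rate of 0.1 and the exact print and plot format are assumptions; they are illustrative choices rather than values prescribed by the text.

import numpy as np
import matplotlib.pyplot as plt

# Sigmoid activation and its derivative
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Initialization: values from the worked example above
x, target = 0.5, 0.6
w1, b1 = 0.4, 0.1       # hidden layer weight and bias
w2, b2 = 0.3, 0.2       # output layer weight and bias
learning_rate = 0.1     # assumed value, not fixed by the text

loss_history, grad_norm_history = [], []

for epoch in range(1000):
    # Forward pass
    z1 = w1 * x + b1
    h = sigmoid(z1)
    y = w2 * h + b2

    # Loss (one-half mean squared error)
    loss = 0.5 * (y - target) ** 2

    # Backward pass (chain rule)
    dL_dy = y - target
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    dL_dh = dL_dy * w2
    dL_dz1 = dL_dh * sigmoid_derivative(z1)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1

    # Track the loss and the norm of the gradient vector
    grad_norm = np.linalg.norm([dL_dw1, dL_db1, dL_dw2, dL_db2])
    loss_history.append(loss)
    grad_norm_history.append(grad_norm)

    # Parameter updates (one gradient descent step)
    w1 -= learning_rate * dL_dw1
    b1 -= learning_rate * dL_db1
    w2 -= learning_rate * dL_dw2
    b2 -= learning_rate * dL_db2

    if epoch % 100 == 0:
        print(f"Epoch {epoch:4d}: loss = {loss:.6f}, |grad| = {grad_norm:.6f}, y = {y:.4f}")

# Final forward pass
y_final = w2 * sigmoid(w1 * x + b1) + b2
print(f"Final output: {y_final:.4f} (target {target})")
print(f"Final parameters: w1 = {w1:.4f}, b1 = {b1:.4f}, w2 = {w2:.4f}, b2 = {b2:.4f}")

# Plot the training curves
plt.plot(loss_history, label="loss")
plt.plot(grad_norm_history, label="gradient norm")
plt.xlabel("epoch")
plt.legend()
plt.show()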

Observe the following:

  1. The loss (and the norm of the gradient vector) decreases over the iterations (also known as epochs).
  2. The output \(y\) gets closer to the target (0.6) with each epoch.

Now, let's break down the code and explain each step:

  1. Import and Helper Functions:
    • We import NumPy for numerical operations.
    • We define the sigmoid function and its derivative, which we'll use for the activation function in the hidden layer.
  2. Initialization:
    • We set up the initial values for the input (x), weights (w1, w2), biases (b1, b2), and the target output.
    • We also define a learning rate, which determines the step size for parameter updates.
  3. Training Loop:
    • We run the training process for 1000 epochs.
  4. Forward Pass:
    • We compute the hidden layer activation (h) using the sigmoid function.
    • We then compute the output (y) using a linear combination of the hidden layer activation.
  5. Loss Computation:
    • We calculate the mean squared error loss.
  6. Backward Pass:
    • We compute the gradients of the loss with respect to each parameter using the chain rule:
      a. dL_dy: the gradient of the loss with respect to the output
      b. dL_dw2 and dL_db2: the gradients for the output layer parameters
      c. dL_dh: the gradient of the loss with respect to the hidden layer activation
      d. dL_dz1: the gradient of the loss with respect to the input of the sigmoid function
      e. dL_dw1 and dL_db1: the gradients for the hidden layer parameters
  7. Parameter Updates:
    • We update each parameter (w1, b1, w2, b2) by subtracting the product of the learning rate and the corresponding gradient.
  8. Progress Tracking:
    • We print the loss and output every 100 epochs to monitor the training progress.
  9. Final Results:
    • After training, we perform a final forward pass, print the output along with the final parameter values, and plot the results.

This code demonstrates the manual implementation of backpropagation for a simple neural network with one hidden layer. It shows how the gradients are computed using the chain rule and how the parameters are updated to minimize the loss function.

The key aspects of backpropagation demonstrated here are:

  1. Forward pass to compute the network's output
  2. Computation of the loss
  3. Backward pass to compute gradients using the chain rule
  4. Parameter updates using gradient descent

By running this code, you can observe how the network learns to approximate the target output over time, with the loss decreasing and the output approaching the target value.

5. Conclusion

This tutorial has walked through backpropagation in detail, including its derivation using the chain rule, a worked numerical example, and a manual Python implementation. Feel free to ask if you need further details or explanations on any part of this tutorial!