This tutorial will cover the following sections:
- Introduction to Backpropagation
- The Chain Rule
- Derivation of Backpropagation
- Example of Backpropagation in Action
- Conclusion
1. Introduction to Backpropagation
Backpropagation is a fundamental algorithm used in training artificial neural networks. It optimizes the weights of the network by minimizing the error between the predicted output and the actual output. The algorithm works by propagating the error backward through the network, adjusting the weights based on the gradient of the error.
2. The Chain Rule
The chain rule is the key mathematical tool behind backpropagation. It allows us to compute the derivative of a composite function. For functions \(f\) and \(g\), the chain rule states:

\[
\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)
\]
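As a quick sanity check (not part of the original tutorial), the short sketch below compares the chain-rule derivative of a composite function against a central finite-difference estimate; the choice \(f(u) = \sin u\), \(g(x) = x^2\) is purely illustrative.

```python
import numpy as np

# Composite function f(g(x)) with f(u) = sin(u) and g(x) = x**2 (illustrative choice)
f, f_prime = np.sin, np.cos
g = lambda x: x ** 2
g_prime = lambda x: 2 * x

x = 1.3
analytic = f_prime(g(x)) * g_prime(x)                   # chain rule: f'(g(x)) * g'(x)
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)   # central finite difference

print(analytic, numeric)  # the two values should agree to ~6 decimal places
```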
3. Derivation of Backpropagation
Let's consider a simple neural network with one hidden layer.
We denote:
- \(x\) as the input
- \(w_1, w_2\) as the weights
- \(b_1, b_2\) as the biases
- \(h\) as the hidden layer activation
- \(y\) as the output
- \(L\) as the loss function
The forward pass equations are:

\[
z_1 = w_1 x + b_1, \qquad h = \sigma(z_1), \qquad y = w_2 h + b_2
\]

where \(\sigma\) is the sigmoid activation function.
Assuming a mean squared error loss:

\[
L = \frac{1}{2}(y - t)^2
\]

where \(t\) is the target output (the factor \(\tfrac{1}{2}\) simply cancels the 2 that appears when differentiating).
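To make the forward pass and the loss concrete, here is a minimal sketch in Python; the numeric values for the input, weights, biases, and target are illustrative assumptions, not values from the original example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumed): input, weights, biases, and target
x, w1, b1, w2, b2, t = 1.0, 0.5, 0.0, 0.5, 0.0, 0.6

z1 = w1 * x + b1         # hidden pre-activation
h = sigmoid(z1)          # hidden activation
y = w2 * h + b2          # network output
L = 0.5 * (y - t) ** 2   # mean squared error loss

print(f"h={h:.4f}, y={y:.4f}, L={L:.6f}")
```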
To minimize \(L\) with respect to the weights and biases, we need to compute the gradients of \(L\) with respect to \(w_1\), \(b_1\), \(w_2\), and \(b_2\). Using the chain rule, we get:
For \(w_2\):

\[
\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w_2} = (y - t)\, h
\]

For \(b_2\):

\[
\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial b_2} = (y - t)
\]

For \(w_1\):

\[
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} = (y - t)\, w_2\, \sigma'(z_1)\, x
\]

For \(b_1\):

\[
\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} = (y - t)\, w_2\, \sigma'(z_1)
\]

where \(\sigma'(z_1) = \sigma(z_1)\,(1 - \sigma(z_1))\).
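A useful way to convince yourself these formulas are right is a gradient check: compare the analytic gradients above with central finite differences of the loss. The sketch below does this for one set of illustrative (assumed) parameter values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(params, x=1.0, t=0.6):
    """Forward pass and MSE loss for the one-hidden-unit network."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return 0.5 * (y - t) ** 2

# Illustrative parameter values (assumed)
params = np.array([0.5, 0.0, 0.5, 0.0])   # [w1, b1, w2, b2]
x, t = 1.0, 0.6

# Analytic gradients from the derivation above
w1, b1, w2, b2 = params
z1 = w1 * x + b1
h = sigmoid(z1)
y = w2 * h + b2
sig_prime = h * (1.0 - h)
analytic = np.array([
    (y - t) * w2 * sig_prime * x,   # dL/dw1
    (y - t) * w2 * sig_prime,       # dL/db1
    (y - t) * h,                    # dL/dw2
    (y - t),                        # dL/db2
])

# Central finite-difference check
eps = 1e-6
numeric = np.zeros_like(params)
for i in range(len(params)):
    p_plus, p_minus = params.copy(), params.copy()
    p_plus[i] += eps
    p_minus[i] -= eps
    numeric[i] = (loss(p_plus, x, t) - loss(p_minus, x, t)) / (2 * eps)

print("analytic:", analytic)
print("numeric: ", numeric)   # should agree to ~6 decimal places
```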
4. Example of Backpropagation in Action
Let's consider a simple example with concrete values (the same setup is used in the code below):
- Input \(x\)
- Weights \(w_1\), \(w_2\)
- Biases \(b_1\), \(b_2\)
- Target \(t = 0.6\)
- Sigmoid activation function in the hidden layer

Forward Pass:

\[
z_1 = w_1 x + b_1, \qquad h = \sigma(z_1), \qquad y = w_2 h + b_2
\]

Loss:

\[
L = \frac{1}{2}(y - t)^2
\]

Backward Pass:

\[
\frac{\partial L}{\partial w_2} = (y - t)\,h, \qquad
\frac{\partial L}{\partial b_2} = (y - t), \qquad
\frac{\partial L}{\partial w_1} = (y - t)\,w_2\,\sigma'(z_1)\,x, \qquad
\frac{\partial L}{\partial b_1} = (y - t)\,w_2\,\sigma'(z_1)
\]
Now that we have computed the derivatives of the loss with respect to all the weights and biases, we collect them into a single gradient vector and take a small step in the direction opposite to that gradient, \(\theta \leftarrow \theta - \eta\,\nabla_\theta L\), where \(\theta = (w_1, b_1, w_2, b_2)\) and \(\eta\) is the learning rate. The updated weights and biases are then used in the next iteration of the forward pass and backward pass.
Here’s a Python code example that manually implements backpropagation for the simple neural network in the above example.
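The listing below is a minimal sketch consistent with the description that follows; the initial values for the input, weights, biases, and the learning rate are illustrative assumptions, while the target of 0.6, the 1000 epochs, and the progress printout every 100 epochs follow the text.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sigmoid activation and its derivative (used in the hidden layer)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Initialization (input, weights, biases, and learning rate are assumed values)
x = 1.0
w1, w2 = 0.5, 0.5
b1, b2 = 0.0, 0.0
target = 0.6          # target output, as stated in the tutorial
learning_rate = 0.1

loss_history, grad_history = [], []

for epoch in range(1000):
    # Forward pass
    z1 = w1 * x + b1              # hidden pre-activation
    h = sigmoid(z1)               # hidden layer activation
    y = w2 * h + b2               # network output (linear output layer)

    # Loss: mean squared error (with the 1/2 convention used in the derivation)
    loss = 0.5 * (y - target) ** 2

    # Backward pass (chain rule)
    dL_dy = y - target                        # gradient of loss w.r.t. the output
    dL_dw2 = dL_dy * h                        # output-layer weight gradient
    dL_db2 = dL_dy                            # output-layer bias gradient
    dL_dh = dL_dy * w2                        # gradient w.r.t. hidden activation
    dL_dz1 = dL_dh * sigmoid_derivative(z1)   # gradient w.r.t. hidden pre-activation
    dL_dw1 = dL_dz1 * x                       # hidden-layer weight gradient
    dL_db1 = dL_dz1                           # hidden-layer bias gradient

    # Track loss and gradient norm to watch them shrink during training
    grad_norm = np.linalg.norm([dL_dw1, dL_db1, dL_dw2, dL_db2])
    loss_history.append(loss)
    grad_history.append(grad_norm)

    # Parameter updates (gradient descent)
    w1 -= learning_rate * dL_dw1
    b1 -= learning_rate * dL_db1
    w2 -= learning_rate * dL_dw2
    b2 -= learning_rate * dL_db2

    # Progress tracking
    if epoch % 100 == 0:
        print(f"epoch {epoch:4d}  loss={loss:.6f}  |grad|={grad_norm:.6f}  y={y:.4f}")

# Final forward pass with the trained parameters
z1 = w1 * x + b1
h = sigmoid(z1)
y = w2 * h + b2
print(f"final output y={y:.4f}, target={target}")
print(f"w1={w1:.4f}, b1={b1:.4f}, w2={w2:.4f}, b2={b2:.4f}")

# Plot the training progress
plt.plot(loss_history, label="loss")
plt.plot(grad_history, label="gradient norm")
plt.xlabel("epoch")
plt.legend()
plt.show()
```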
Observe the following:
- The loss (and the norm of the gradient vector) decreases over the iterations (also called epochs).
- The output y gets closer to the target (0.6) with each epoch.
Now, let's break down the code and explain each step:
- Import and Helper Functions:
- We import NumPy for numerical operations.
- We define the sigmoid function and its derivative, which we'll use for the activation function in the hidden layer.
- Initialization:
- We set up the initial values for the input (x), weights (w1, w2), biases (b1, b2), and the target output.
- We also define a learning rate, which determines the step size for parameter updates.
- Training Loop:
- We run the training process for 1000 epochs.
- Forward Pass:
- We compute the hidden layer activation (h) using the sigmoid function.
- We then compute the output (y) using a linear combination of the hidden layer activation.
- Loss Computation:
- We calculate the mean squared error loss.
- Backward Pass:
- We compute the gradients of the loss with respect to each parameter using the chain rule:
  a. dL_dy: gradient of the loss with respect to the output
  b. dL_dw2 and dL_db2: gradients for the output layer parameters
  c. dL_dh: gradient of the loss with respect to the hidden layer activation
  d. dL_dz1: gradient of the loss with respect to the input of the sigmoid function
  e. dL_dw1 and dL_db1: gradients for the hidden layer parameters
- Parameter Updates:
- We update each parameter (w1, b1, w2, b2) by subtracting the product of the learning rate and the corresponding gradient.
- Progress Tracking:
- We print the loss and output every 100 epochs to monitor the training progress.
- Final Results:
- After training, we perform a final forward pass, print the output along with the final parameter values, and plot the training progress.
This code demonstrates the manual implementation of backpropagation for a simple neural network with one hidden layer. It shows how the gradients are computed using the chain rule and how the parameters are updated to minimize the loss function.
The key aspects of backpropagation demonstrated here are:
- Forward pass to compute the network's output
- Computation of the loss
- Backward pass to compute gradients using the chain rule
- Parameter updates using gradient descent
By running this code, you can observe how the network learns to approximate the target output over time, with the loss decreasing and the output approaching the target value.
5. Conclusion
This tutorial provides a comprehensive understanding of backpropagation, including the derivation using the chain rule and a detailed example. Feel free to ask if you need further details or explanations on any part of this tutorial!