This tutorial will cover the following sections:
- Introduction to Backpropagation
- The Chain Rule
- Derivation of Backpropagation
- Example of Backpropagation in Action (with code and plots)
- Conclusion
1. Introduction to Backpropagation
Backpropagation is a fundamental algorithm used in training artificial neural networks. For every weight and bias in the network, it computes the gradient of the error between the predicted output and the target output by propagating the error signal backward through the network, layer by layer. These gradients are then used, typically with gradient descent, to adjust the weights so that the error decreases.
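To fix notation for the rest of the tutorial: once the gradients are available, each parameter is nudged a small step against its gradient. A minimal sketch of this update rule, with \(\eta\) denoting the learning rate:
\[
w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}
\]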
2. The Chain Rule
The chain rule is a key mathematical concept used in backpropagation. It allows us to compute the derivative of a composite function. For functions \(f\) and \(g\), the chain rule states:
\[
\frac{d}{dx} f\bigl(g(x)\bigr) = f'\bigl(g(x)\bigr) \cdot g'(x)
\]
Because a neural network's loss is a composition of many simple functions, the chain rule lets us unravel its derivative one layer at a time.
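To make this concrete, here is a short numerical check (an illustrative sketch, not part of the derivation) using an arbitrary composition with \(f(u) = u^2\) and \(g(x) = \sin x\); the analytic chain-rule derivative should match a central finite difference:

import numpy as np

# f(u) = u^2 and g(x) = sin(x), so f(g(x)) = sin(x)^2
f = lambda u: u**2
g = lambda x: np.sin(x)
df = lambda u: 2 * u          # f'(u)
dg = lambda x: np.cos(x)      # g'(x)

x = 0.7
chain_rule = df(g(x)) * dg(x)                              # f'(g(x)) * g'(x)
eps = 1e-6
finite_diff = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # numerical derivative

print(chain_rule, finite_diff)  # the two values agree to about 1e-9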
3. Derivation of Backpropagation
Let's consider a simple neural network with one hidden layer.
We denote:
- \(x\) as the input
- \(w_1, w_2\) as the weights
- \(b_1, b_2\) as the biases
- \(h\) as the hidden layer activation
- \(y\) as the output
- \(L\) as the loss function
The forward pass equations are:
\[
z_1 = w_1 x + b_1, \qquad h = \sigma(z_1), \qquad y = w_2 h + b_2
\]
where \(\sigma\) is the activation function (here, the sigmoid). Assuming a mean squared error loss:
\[
L = \tfrac{1}{2}(y - t)^2
\]
where \(t\) is the target output.
To minimize \(L\), we need to compute its gradients with respect to the weights and biases \(w_1, b_1, w_2, b_2\). Using the chain rule, we get:
For \(w_2\):
\[
\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w_2} = (y - t)\, h
\]
For \(b_2\):
\[
\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial b_2} = (y - t)
\]
For \(w_1\):
\[
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} = (y - t)\, w_2\, \sigma'(z_1)\, x
\]
For \(b_1\):
\[
\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_1} = (y - t)\, w_2\, \sigma'(z_1)
\]
where, for the sigmoid, \(\sigma'(z_1) = \sigma(z_1)\bigl(1 - \sigma(z_1)\bigr)\).
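Before moving on, it is worth sanity-checking these formulas. The sketch below (an addition for illustration, using the same parameter values as the example in the next section) compares each analytic gradient with a central finite difference of the loss; the two should agree to several decimal places:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w1, b1, w2, b2, x=0.5, t=0.6):
    # Forward pass followed by the squared-error loss
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return 0.5 * (y - t)**2

# Analytic gradients from the derivation above
x, t = 0.5, 0.6
w1, b1, w2, b2 = 0.4, 0.1, 0.3, 0.2
z1 = w1 * x + b1
h = sigmoid(z1)
y = w2 * h + b2
sig_prime = h * (1 - h)
analytic = {
    "w2": (y - t) * h,
    "b2": (y - t),
    "w1": (y - t) * w2 * sig_prime * x,
    "b1": (y - t) * w2 * sig_prime,
}

# Central finite differences for comparison
eps = 1e-6
params = {"w1": w1, "b1": b1, "w2": w2, "b2": b2}
for name in params:
    plus = dict(params); plus[name] += eps
    minus = dict(params); minus[name] -= eps
    numeric = (loss(**plus) - loss(**minus)) / (2 * eps)
    print(name, analytic[name], numeric)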
4. Example of Backpropagation in Action
Let's consider a simple example with concrete values:
- Input \(x = 0.5\)
- Weights \(w_1 = 0.4\), \(w_2 = 0.3\)
- Biases \(b_1 = 0.1\), \(b_2 = 0.2\)
- Target \(t = 0.6\)
- Sigmoid activation function
Forward Pass:
\[
z_1 = w_1 x + b_1 = 0.4 \cdot 0.5 + 0.1 = 0.3, \qquad h = \sigma(0.3) \approx 0.5744, \qquad y = w_2 h + b_2 \approx 0.3723
\]
Loss:
\[
L = \tfrac{1}{2}(y - t)^2 = \tfrac{1}{2}(0.3723 - 0.6)^2 \approx 0.0259
\]
Backward Pass:
\[
\frac{\partial L}{\partial y} = y - t \approx -0.2277
\]
\[
\frac{\partial L}{\partial w_2} = (y - t)\, h \approx -0.1308, \qquad \frac{\partial L}{\partial b_2} = y - t \approx -0.2277
\]
\[
\sigma'(z_1) = \sigma(z_1)\bigl(1 - \sigma(z_1)\bigr) \approx 0.2445
\]
\[
\frac{\partial L}{\partial w_1} = (y - t)\, w_2\, \sigma'(z_1)\, x \approx -0.0083, \qquad \frac{\partial L}{\partial b_1} = (y - t)\, w_2\, \sigma'(z_1) \approx -0.0167
\]
Now that we have the derivatives of the loss with respect to all the weights and biases, we collect them into a single gradient vector and take a small step in the direction opposite to that gradient to obtain new values for the weights and biases. These updated parameters are used in the next forward pass, and the forward/backward cycle repeats; a minimal sketch of this vectorized update follows.
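For illustration only, here is what that vectorized update looks like, using the gradients computed in the worked example above (the full example below updates each parameter individually instead):

import numpy as np

# Parameters and their gradients, ordered as [w1, b1, w2, b2]
params = np.array([0.4, 0.1, 0.3, 0.2])
grads = np.array([-0.0083, -0.0167, -0.1308, -0.2277])  # from the worked example
learning_rate = 0.1

# One gradient-descent step: move against the gradient
params = params - learning_rate * grads
print(params)  # new values of w1, b1, w2, b2 for the next forward pass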
Here’s a Python code example that manually implements backpropagation for the simple neural network in the above example.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Initialize parameters
x = 0.5
w1, w2 = 0.4, 0.3
b1, b2 = 0.1, 0.2
target = 0.6
learning_rate = 0.1

# Lists to store loss and gradient norm for plotting
losses = []
grad_norms = []

# Training loop
num_epochs = 60
for epoch in range(num_epochs):
    # Forward pass
    z1 = w1 * x + b1
    h = sigmoid(z1)
    y = w2 * h + b2

    # Compute loss
    loss = 0.5 * (y - target)**2
    losses.append(loss)

    # Backward pass (backpropagation)
    dL_dy = y - target

    # Gradients for output layer
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy

    # Gradients for hidden layer
    dL_dh = dL_dy * w2
    dL_dz1 = dL_dh * sigmoid_derivative(z1)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1

    # Compute gradient norm
    grad_norm = np.sqrt(dL_dw1**2 + dL_db1**2 + dL_dw2**2 + dL_db2**2)
    grad_norms.append(grad_norm)

    # Update parameters
    w1 -= learning_rate * dL_dw1
    b1 -= learning_rate * dL_db1
    w2 -= learning_rate * dL_dw2
    b2 -= learning_rate * dL_db2

    # Print progress every 5 epochs
    if epoch % 5 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}, Output: {y:.4f}")

# Final forward pass
z1 = w1 * x + b1
h = sigmoid(z1)
y = w2 * h + b2

print(f"\nFinal output: {y:.4f}")
print(f"Final parameters: w1 = {w1:.4f}, b1 = {b1:.4f}, w2 = {w2:.4f}, b2 = {b2:.4f}")

# Plotting
plt.figure(figsize=(12, 5))

# Plot loss
plt.subplot(1, 2, 1)
plt.plot(range(num_epochs), losses)
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')

# Plot gradient norm
plt.subplot(1, 2, 2)
plt.plot(range(num_epochs), grad_norms)
plt.title('Gradient Norm over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Gradient Norm')

plt.tight_layout()
plt.show()
Running the code produces the following output:
Epoch 0, Loss: 0.0259, Output: 0.3723
Epoch 5, Loss: 0.0062, Output: 0.4891
Epoch 10, Loss: 0.0015, Output: 0.5460
Epoch 15, Loss: 0.0003, Output: 0.5738
Epoch 20, Loss: 0.0001, Output: 0.5873
Epoch 25, Loss: 0.0000, Output: 0.5938
Epoch 30, Loss: 0.0000, Output: 0.5970
Epoch 35, Loss: 0.0000, Output: 0.5985
Epoch 40, Loss: 0.0000, Output: 0.5993
Epoch 45, Loss: 0.0000, Output: 0.5997
Epoch 50, Loss: 0.0000, Output: 0.5998
Epoch 55, Loss: 0.0000, Output: 0.5999
Final output: 0.6000
Final parameters: w1 = 0.4072, b1 = 0.1143, w2 = 0.3978, b2 = 0.3697
Observe the following:
- The loss (and the norm of the gradient vector) decreases over the iterations (also called epochs).
- The output \(y\) gets closer to the target (0.6) with each epoch.
Now, let's break down the code and explain each step:
- Import and Helper Functions:
- We import NumPy for numerical operations.
- We define the sigmoid function and its derivative, which we'll use for the activation function in the hidden layer (a short derivation of the derivative identity is given after this list).
- Initialization:
- We set up the initial values for the input (x), weights (w1, w2), biases (b1, b2), and the target output.
- We also define a learning rate, which determines the step size for parameter updates.
- Training Loop:
- We run the training process for 60 epochs.
- Forward Pass:
- We compute the hidden layer activation (h) using the sigmoid function.
- We then compute the output (y) using a linear combination of the hidden layer activation.
- Loss Computation:
- We calculate the mean squared error loss.
- Backward Pass:
- We compute the gradients of the loss with respect to each parameter using the chain rule:
  - dL_dy: gradient of the loss with respect to the output
  - dL_dw2 and dL_db2: gradients for the output layer parameters
  - dL_dh: gradient of the loss with respect to the hidden layer activation
  - dL_dz1: gradient of the loss with respect to the input of the sigmoid function
  - dL_dw1 and dL_db1: gradients for the hidden layer parameters
- Parameter Updates:
- We update each parameter (w1, b1, w2, b2) by subtracting the product of the learning rate and the corresponding gradient.
- Progress Tracking:
- We print the loss and output every 5 epochs to monitor the training progress.
- Final Results:
- After training, we perform a final forward pass, print the final output and parameter values, and plot the loss and gradient norm over the epochs.
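As referenced above, the sigmoid_derivative helper relies on a standard identity, sketched here for completeness:
\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\bigl(1 - \sigma(x)\bigr)
\]
This identity is what sigmoid_derivative implements: the derivative is expressed in terms of the sigmoid value itself.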
This code demonstrates the manual implementation of backpropagation for a simple neural network with one hidden layer. It shows how the gradients are computed using the chain rule and how the parameters are updated to minimize the loss function.
The key aspects of backpropagation demonstrated here are:
- Forward pass to compute the network's output
- Computation of the loss
- Backward pass to compute gradients using the chain rule
- Parameter updates using gradient descent
By running this code, you can observe how the network learns to approximate the target output over time, with the loss decreasing and the output approaching the target value.
5. Conclusion
This tutorial provides a comprehensive understanding of backpropagation, including the derivation using the chain rule and a detailed example. Feel free to ask if you need further details or explanations on any part of this tutorial!