Back-propagation and forward propagation in Neural Networks

This tutorial will cover the following sections:

  1. Introduction to Backpropagation
  2. The Chain Rule
  3. Derivation of Backpropagation
  4. Example of Backpropagation in Action
  5. Conclusion

1. Introduction to Backpropagation

Backpropagation is a fundamental algorithm used in training artificial neural networks. It computes the gradient of the error between the predicted output and the actual output with respect to every weight in the network. The algorithm works by propagating the error backward through the network, layer by layer, so that gradient descent can adjust each weight in the direction that reduces the error.

2. The Chain Rule

The chain rule is a key mathematical concept used in backpropagation. It allows us to compute the derivative of a composite function. For functions \(f\) and \(g\), the chain rule states:

\[ \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x) \]
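
As a quick sanity check of the chain rule, the sketch below compares the analytic derivative with a central finite-difference estimate. The choices \(f(u) = \sin u\), \(g(x) = x^2\), the evaluation point and the step size are illustrative assumptions, not part of the tutorial's network.

```python
import math

# Illustrative (assumed) functions: f(u) = sin(u), g(x) = x^2
def f(u):
    return math.sin(u)

def f_prime(u):
    return math.cos(u)

def g(x):
    return x ** 2

def g_prime(x):
    return 2 * x

def chain_rule_derivative(x):
    # d/dx f(g(x)) = f'(g(x)) * g'(x)
    return f_prime(g(x)) * g_prime(x)

def numerical_derivative(func, x, eps=1e-6):
    # Central finite-difference approximation of func'(x)
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x0 = 0.7  # arbitrary evaluation point
analytic = chain_rule_derivative(x0)
numeric = numerical_derivative(lambda x: f(g(x)), x0)
print(f"analytic: {analytic:.6f}, numeric: {numeric:.6f}")
```

The two printed values should agree to several decimal places; the same finite-difference trick is commonly used to check hand-derived gradients.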

3. Derivation of Backpropagation

Let's consider a simple neural network with one hidden layer.

The simple neural-network under consideration

We denote:

  • \(x\) as the input
  • \(w_1, w_2\) as the weights
  • \(b_1, b_2\) as the biases
  • \(h\) as the hidden layer activation
  • \(y\) as the output
  • \(L\) as the loss function

The forward pass equations are:

\[ h = \sigma(w_1 x + b_1) \]
\[ y = w_2 h + b_2 \]

Assuming a mean squared error loss:

\[ L = \frac{1}{2} (y - t)^2 \]

where \(t\) is the target output.
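
The forward pass and loss can be sketched as two small functions. This is a minimal illustration, assuming the sigmoid for \(\sigma\); the function names `forward` and `loss_fn` and the input values are hypothetical and chosen only so the result is easy to verify by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    # Hidden activation: h = sigma(w1 * x + b1)
    h = sigmoid(w1 * x + b1)
    # Linear output: y = w2 * h + b2
    y = w2 * h + b2
    return h, y

def loss_fn(y, t):
    # L = 1/2 * (y - t)^2
    return 0.5 * (y - t) ** 2

# Hypothetical values: with w1 = b1 = 0 the pre-activation is 0, so h = 0.5
h, y = forward(x=1.0, w1=0.0, b1=0.0, w2=2.0, b2=-1.0)
loss = loss_fn(y, t=1.0)
print(h, y, loss)
```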

To minimize \(L\) with respect to the weights and biases, we need to compute the gradients of \(L\) with respect to \(w_1\), \(w_2\), \(b_1\), and \(b_2\). Using the chain rule, we get:

For \(w_2\):

\[ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w_2} \]
\[ \frac{\partial L}{\partial y} = y - t \]
\[ \frac{\partial y}{\partial w_2} = h \]
\[ \frac{\partial L}{\partial w_2} = (y - t) \cdot h \]

For \(b_2\):

\[ \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial b_2} \]
\[ \frac{\partial y}{\partial b_2} = 1 \]
\[ \frac{\partial L}{\partial b_2} = y - t \]

For \(w_1\):

\[ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial w_1} \]
\[ \frac{\partial y}{\partial h} = w_2 \]
\[ \frac{\partial h}{\partial w_1} = \sigma'(w_1 x + b_1) \cdot x \]
\[ \frac{\partial L}{\partial w_1} = (y - t) \cdot w_2 \cdot \sigma'(w_1 x + b_1) \cdot x \]

For \(b_1\):

\[ \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial b_1} \]
\[ \frac{\partial h}{\partial b_1} = \sigma'(w_1 x + b_1) \]
\[ \frac{\partial L}{\partial b_1} = (y - t) \cdot w_2 \cdot \sigma'(w_1 x + b_1) \]
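
One way to gain confidence in the four gradient formulas is a numerical gradient check. The sketch below assumes a sigmoid activation and uses arbitrary, made-up parameter values; the helper names `loss_for`, `analytic_gradients` and `numerical_gradients` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(params, x, t):
    # params is [w1, b1, w2, b2]
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return 0.5 * (y - t) ** 2

def analytic_gradients(params, x, t):
    # Gradients from the chain-rule derivation, in the order [w1, b1, w2, b2]
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    dL_dy = y - t
    dL_dz1 = dL_dy * w2 * h * (1.0 - h)  # sigma'(z) = sigma(z) * (1 - sigma(z))
    return [dL_dz1 * x, dL_dz1, dL_dy * h, dL_dy]

def numerical_gradients(params, x, t, eps=1e-6):
    # Central finite differences, perturbing one parameter at a time
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss_for(plus, x, t) - loss_for(minus, x, t)) / (2 * eps))
    return grads

# Arbitrary illustrative values (assumptions, not taken from the tutorial)
params = [0.8, -0.3, 1.1, 0.05]
x, t = 1.5, -0.2
analytic = analytic_gradients(params, x, t)
numeric = numerical_gradients(params, x, t)
for name, a, n in zip(["w1", "b1", "w2", "b2"], analytic, numeric):
    print(f"dL/d{name}: analytic = {a:.6f}, numeric = {n:.6f}")
```

If a formula were derived incorrectly, the corresponding pair of values would visibly disagree.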

Illustration of how the gradient of the cost function with respect to the output flows backwards to feed the next stage of gradient calculation via the chain rule, till we derive the gradient of the loss with respect to all of the weights and biases.

4. Example of Backpropagation in Action

Let's consider a simple example with concrete values:

  • Input \(x = 0.5\)
  • Weights \(w_1 = 0.4\), \(w_2 = 0.3\)
  • Biases \(b_1 = 0.1\), \(b_2 = 0.2\)
  • Target \(t = 0.6\)
  • Sigmoid activation function \(\sigma(z) = \frac{1}{1 + e^{-z}}\)

Forward Pass:

\[ h = \sigma(w_1 x + b_1) = \sigma(0.4 \cdot 0.5 + 0.1) = \sigma(0.3) = 0.5744 \]
\[ y = w_2 h + b_2 = 0.3 \cdot 0.5744 + 0.2 = 0.3723 \]

Illustration of a forward pass.

Loss:

\[ L = \frac{1}{2} (y - t)^2 = \frac{1}{2} (0.3723 - 0.6)^2 = 0.0259 \]

Backward Pass:

\[ \frac{\partial L}{\partial y} = (y - t) = 0.3723 - 0.6 = -0.2277 \]
\[ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w_2} = -0.2277 \cdot h = -0.2277 \cdot 0.5744 = -0.1308 \]
\[ \frac{\partial L}{\partial b_2} = -0.2277 \]

\[ \sigma'(0.3) = \sigma(0.3)(1 - \sigma(0.3)) = 0.5744 (1 - 0.5744) = 0.2445 \]
\[ \frac{\partial L}{\partial w_1} = -0.2277 \cdot 0.3 \cdot 0.2445 \cdot 0.5 = -0.0083 \]
\[ \frac{\partial L}{\partial b_1} = -0.2277 \cdot 0.3 \cdot 0.2445 = -0.0167 \]
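
The hand calculation can be replayed in a few lines of plain Python. This sketch uses the example's own values and prints each intermediate quantity rounded to four decimals, so it can be compared line by line with the numbers in this section; the variable names are chosen here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the worked example
x, t = 0.5, 0.6
w1, w2 = 0.4, 0.3
b1, b2 = 0.1, 0.2

# Forward pass
z1 = w1 * x + b1
h = sigmoid(z1)
y = w2 * h + b2
loss = 0.5 * (y - t) ** 2

# Backward pass
dL_dy = y - t
dL_dw2 = dL_dy * h
dL_db2 = dL_dy
sigma_prime = h * (1.0 - h)  # sigma'(z1) = sigma(z1) * (1 - sigma(z1))
dL_db1 = dL_dy * w2 * sigma_prime
dL_dw1 = dL_db1 * x

for name, value in [("h", h), ("y", y), ("L", loss), ("dL/dy", dL_dy),
                    ("dL/dw2", dL_dw2), ("dL/db2", dL_db2),
                    ("sigma'(z1)", sigma_prime),
                    ("dL/dw1", dL_dw1), ("dL/db1", dL_db1)]:
    print(f"{name} = {value:.4f}")
```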

Now that we have the derivatives of the loss function with respect to all the weights and biases, we collect them into a single gradient vector and take a small step in the direction opposite to that vector. This gives new values for the weights and biases, which are used in the next iteration of forward pass followed by backpropagation.
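
A single update step of this kind can be sketched as follows. The parameter ordering `[w1, b1, w2, b2]`, the helper names, and the learning rate of 0.1 are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_gradient(params, x, t):
    # params and the returned gradient both use the order [w1, b1, w2, b2]
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    dL_dy = y - t
    dL_dz1 = dL_dy * w2 * h * (1.0 - h)
    grad = [dL_dz1 * x, dL_dz1, dL_dy * h, dL_dy]
    return 0.5 * dL_dy ** 2, grad

def gradient_step(params, grad, learning_rate):
    # Move each parameter a small step against its gradient component
    return [p - learning_rate * g for p, g in zip(params, grad)]

params = [0.4, 0.1, 0.3, 0.2]  # initial values from the worked example
loss_before, grad = loss_and_gradient(params, x=0.5, t=0.6)
new_params = gradient_step(params, grad, learning_rate=0.1)  # assumed step size
loss_after, _ = loss_and_gradient(new_params, x=0.5, t=0.6)

print("gradient:  ", [round(g, 4) for g in grad])
print("new params:", [round(p, 4) for p in new_params])
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

Because every gradient component is negative in this example, each parameter increases slightly, which pushes the output up toward the target.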

Here’s a Python code example that manually implements backpropagation for the simple neural network in the above example.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Initialize parameters
x = 0.5
w1, w2 = 0.4, 0.3
b1, b2 = 0.1, 0.2
target = 0.6
learning_rate = 0.1

# Lists to store loss and gradient norm for plotting
losses = []
grad_norms = []

# Training loop
num_epochs = 60
for epoch in range(num_epochs):
    # Forward pass
    z1 = w1 * x + b1
    h = sigmoid(z1)
    y = w2 * h + b2
    
    # Compute loss
    loss = 0.5 * (y - target)**2
    losses.append(loss)
    
    # Backward pass (back propagation)
    dL_dy = y - target
    
    # Gradients for output layer
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    
    # Gradients for hidden layer
    dL_dh = dL_dy * w2
    dL_dz1 = dL_dh * sigmoid_derivative(z1)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1
    
    # Compute gradient norm
    grad_norm = np.sqrt(dL_dw1**2 + dL_db1**2 + dL_dw2**2 + dL_db2**2)
    grad_norms.append(grad_norm)
    
    # Update parameters
    w1 -= learning_rate * dL_dw1
    b1 -= learning_rate * dL_db1
    w2 -= learning_rate * dL_dw2
    b2 -= learning_rate * dL_db2
    
    # Print progress every 5 epochs
    if epoch % 5 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}, Output: {y:.4f}")

# Final forward pass
z1 = w1 * x + b1
h = sigmoid(z1)
y = w2 * h + b2

print(f"\nFinal output: {y:.4f}")
print(f"Final parameters: w1 = {w1:.4f}, b1 = {b1:.4f}, w2 = {w2:.4f}, b2 = {b2:.4f}")

# Plotting
plt.figure(figsize=(12, 5))

# Plot loss
plt.subplot(1, 2, 1)
plt.plot(range(num_epochs), losses)
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')

# Plot gradient norm
plt.subplot(1, 2, 2)
plt.plot(range(num_epochs), grad_norms)
plt.title('Gradient Norm over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Gradient Norm')

plt.tight_layout()
plt.show()

Output:

Epoch 0, Loss: 0.0259, Output: 0.3723
Epoch 5, Loss: 0.0062, Output: 0.4891
Epoch 10, Loss: 0.0015, Output: 0.5460
Epoch 15, Loss: 0.0003, Output: 0.5738
Epoch 20, Loss: 0.0001, Output: 0.5873
Epoch 25, Loss: 0.0000, Output: 0.5938
Epoch 30, Loss: 0.0000, Output: 0.5970
Epoch 35, Loss: 0.0000, Output: 0.5985
Epoch 40, Loss: 0.0000, Output: 0.5993
Epoch 45, Loss: 0.0000, Output: 0.5997
Epoch 50, Loss: 0.0000, Output: 0.5998
Epoch 55, Loss: 0.0000, Output: 0.5999

Final output: 0.6000
Final parameters: w1 = 0.4072, b1 = 0.1143, w2 = 0.3978, b2 = 0.3697
Plots of the loss (left) and the gradient norm (right) over epochs.

Observe the following:

  1. The loss (and the norm, or length, of the gradient vector) decreases with each iteration (also called an epoch).
  2. The output y gets closer to the target (0.6) with each epoch.
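
These two observations can also be checked programmatically rather than read off the plots. The sketch below reruns the same training loop without plotting, using the same initial values, learning rate (0.1) and epoch count (60); the `train` function and its return values are hypothetical names introduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(num_epochs=60, learning_rate=0.1):
    # Same initial values as the example above
    x, target = 0.5, 0.6
    w1, w2, b1, b2 = 0.4, 0.3, 0.1, 0.2
    losses, outputs = [], []
    for _ in range(num_epochs):
        # Forward pass
        h = sigmoid(w1 * x + b1)
        y = w2 * h + b2
        losses.append(0.5 * (y - target) ** 2)
        outputs.append(y)
        # Backward pass
        dL_dy = y - target
        dL_dz1 = dL_dy * w2 * h * (1.0 - h)
        dL_dw2, dL_db2 = dL_dy * h, dL_dy
        dL_dw1, dL_db1 = dL_dz1 * x, dL_dz1
        # Gradient descent update
        w1 -= learning_rate * dL_dw1
        b1 -= learning_rate * dL_db1
        w2 -= learning_rate * dL_dw2
        b2 -= learning_rate * dL_db2
    return losses, outputs

losses, outputs = train()
loss_decreases = all(later < earlier for earlier, later in zip(losses, losses[1:]))
print("loss decreases every epoch:", loss_decreases)
print(f"last output: {outputs[-1]:.4f}")
```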

Now, let's break down the code and explain each step:

  1. Import and Helper Functions:
    • We import NumPy for numerical operations and Matplotlib for plotting.
    • We define the sigmoid function and its derivative, which we'll use for the activation function in the hidden layer.
  2. Initialization:
    • We set up the initial values for the input (x), weights (w1, w2), biases (b1, b2), and the target output.
    • We also define a learning rate, which determines the step size for parameter updates.
  3. Training Loop:
    • We run the training process for 60 epochs.
  4. Forward Pass:
    • We compute the hidden layer activation (h) using the sigmoid function.
    • We then compute the output (y) using a linear combination of the hidden layer activation.
  5. Loss Computation:
    • We calculate the mean squared error loss.
  6. Backward Pass:
    • We compute the gradients of the loss with respect to each parameter using the chain rule:
      a. dL_dy: Gradient of loss with respect to the output
      b. dL_dw2 and dL_db2: Gradients for the output layer parameters
      c. dL_dh: Gradient of loss with respect to the hidden layer activation
      d. dL_dz1: Gradient of loss with respect to the input of the sigmoid function
      e. dL_dw1 and dL_db1: Gradients for the hidden layer parameters
  7. Parameter Updates:
    • We update each parameter (w1, b1, w2, b2) by subtracting the product of the learning rate and the corresponding gradient.
  8. Progress Tracking:
    • We print the loss and output every 5 epochs to monitor the training progress.
  9. Final Results:
    • After training, we perform a final forward pass, print the output along with the final parameter values, and plot the loss and gradient norm over the epochs.

This code demonstrates the manual implementation of backpropagation for a simple neural network with one hidden layer. It shows how the gradients are computed using the chain rule and how the parameters are updated to minimize the loss function.

The key aspects of backpropagation demonstrated here are:

  1. Forward pass to compute the network's output
  2. Computation of the loss
  3. Backward pass to compute gradients using the chain rule
  4. Parameter updates using gradient descent

By running this code, you can observe how the network learns to approximate the target output over time, with the loss decreasing and the output approaching the target value.

5. Conclusion

This tutorial covered backpropagation from the chain rule through the derivation of the gradients for a simple one-hidden-layer network, and then worked through a concrete numerical example and a manual Python implementation.