Chapter 6: Neural Networks and Deep Learning

Welcome to the captivating world of neural networks and deep learning! In this chapter, we'll embark on a journey to understand how these remarkable techniques are transforming the field of artificial intelligence. Prepare to unravel the mysteries behind machines that can learn and solve complex problems, much like the incredible human brain!

1. Introduction to Neural Networks

1.1 What are Neural Networks?

Imagine your brain as a vast, interconnected network of tiny cells called neurons. These neurons work together harmoniously to process information, make decisions, and control your every action. In the realm of AI, we have artificial neural networks that strive to emulate the functioning of the human brain.

An artificial neural network is essentially a computer program designed to recognize patterns and learn from examples, similar to how you learn from your own experiences. It comprises layers of interconnected nodes, analogous to the neurons in your brain. Each connection between nodes has an associated weight that determines how strongly one node influences the next and, ultimately, the final output.

Let's take a look at a simple example in Python to create a basic neural network layer using the powerful Keras library:

from keras.layers import Dense

# Create a dense layer with 10 nodes and an input shape of 5
layer = Dense(10, input_shape=(5,))

In this code snippet, we create a dense (fully connected) layer with 10 nodes that expects inputs with 5 features. Dense layers are among the most commonly used layer types in neural networks.

1.2 The Building Blocks: Neurons and Layers

Just as your brain consists of billions of interconnected neurons, artificial neural networks are composed of nodes organized into layers. Each layer plays a specific role in processing the input data and passing it to the next layer. The most common types of layers are:

  • Input Layer: This layer receives the initial input data and passes it to the subsequent layers for processing.
  • Hidden Layers: These are the intermediate layers between the input and output layers. They perform complex computations and feature extraction on the input data.
  • Output Layer: The final layer produces the desired output based on the processed data from the previous layers.

A fully connected network (called “Dense” in Keras code, and also known as a Multi-Layer Perceptron) is shown below.

A fully-connected Neural Network illustration. Credit: Designing Your Neural Networks

In the image above, each node (neuron) weights its inputs, sums them, and adds a bias term. The resulting scalar (a single number) then undergoes a nonlinear transformation through an activation function. The illustration below shows how a neuron with 3 inputs weights them, sums them together with a bias term (a constant scalar), and feeds the result into an activation function $f(\cdot)$.

An illustration of a Neuron/Node in a fully connected network.
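
To make the single-neuron computation concrete, here is a minimal sketch in plain Python. The input values, weights, and bias are hypothetical numbers chosen for illustration, and sigmoid is just one possible choice for the activation function.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values for a neuron with 3 inputs
inputs = [0.5, 0.8, 0.2]
weights = [0.4, -0.6, 0.9]
bias = 0.1

output = neuron(inputs, weights, bias)
print(output)  # approximately 0.5, because the weighted sum is about 0
```

A layer with 10 nodes simply repeats this computation 10 times, each node with its own weights and bias.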

Here's an example of a simple neural network architecture (fully-connected network) with one hidden layer:

from keras.models import Sequential
from keras.layers import Dense

# Create a sequential model
model = Sequential()

# Add an input layer and a hidden layer with 10 nodes
model.add(Dense(10, input_shape=(5,)))

# Add an output layer with 1 node
model.add(Dense(1))

# Generate and print a summary of the model
model.summary()
Model: "sequential_1"
┌─────────────────────────────────┬────────────────────────┬───────────────┐
│ Layer (type)                    │ Output Shape           │       Param # │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 10)             │            60 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 1)              │            11 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 71 (284.00 B)
 Trainable params: 71 (284.00 B)
 Non-trainable params: 0 (0.00 B)

In this example, we create a sequential model using the Sequential class from Keras. We add an input layer and a hidden layer with 10 nodes/neurons, followed by an output layer with a single node.
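
The "Param #" column in the summary can be checked by hand: a dense layer has one weight per input-to-node connection plus one bias per node. The helper below is a small illustrative function (its name is ours, not part of Keras).

```python
def dense_param_count(n_inputs, n_units):
    """Weights (one per input-to-unit connection) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_param_count(5, 10)   # 5 inputs feeding 10 nodes
output = dense_param_count(10, 1)   # 10 hidden nodes feeding 1 node
print(hidden, output, hidden + output)  # 60 11 71
```

These numbers match the 60, 11, and 71 reported by model.summary().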

2. Activation Functions: Bringing Life to Neurons

2.1 The Role of Activation Functions

Just as your brain's neurons fire electrical signals, the nodes in a neural network need a way to determine when to activate and pass information along. This is where activation functions come into play!

Activation functions introduce non-linearity into the network, allowing it to learn complex patterns and make sophisticated decisions. They determine the output of a node based on its input. Without activation functions, neural networks would be limited to learning linear relationships, severely restricting their capabilities.

2.2 Common Activation Functions

There are several commonly used activation functions, each with its own characteristics and use cases. Let's explore a few of them:

  1. Sigmoid Function: The sigmoid function squishes the input value between 0 and 1, resembling a smooth switch. It is defined as:

     $\sigma(x) = \frac{1}{1 + e^{-x}}$

    from keras.layers import Dense
    
    # Create a dense layer with sigmoid activation
    layer = Dense(10, activation='sigmoid')
  2. ReLU (Rectified Linear Unit): ReLU is a simple yet powerful activation function. It acts like a gate that allows positive values to pass through unchanged while setting negative values to zero. Mathematically, it is defined as:

     $f(x) = \max(0, x)$

    from keras.layers import Dense
    
    # Create a dense layer with ReLU activation
    layer = Dense(10, activation='relu')
  3. Tanh (Hyperbolic Tangent): The tanh function is similar to the sigmoid function but outputs values between -1 and 1. It is defined as:

     $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

    from keras.layers import Dense
    
    # Create a dense layer with tanh activation
    layer = Dense(10, activation='tanh')

Here’s Python code that plots these three functions, which will help us see how they behave across a range of input values:

import numpy as np
import matplotlib.pyplot as plt  # For creating graphs

# Define the activation functions
def sigmoid(x):
    # The sigmoid function outputs values between 0 and 1, which models probability.
    return 1 / (1 + np.exp(-x))

def relu(x):
    # The ReLU function outputs the input directly if it is positive; otherwise, it outputs zero.
    return np.maximum(0, x)

def tanh(x):
    # The tanh function outputs values between -1 and 1, which makes its output centered around zero.
    return np.tanh(x)

# Generate a range of values from -10 to 10, which we will use as inputs to our functions
x = np.linspace(-10, 10, 100)

# Compute the activations
y_sigmoid = sigmoid(x)
y_relu = relu(x)
y_tanh = tanh(x)

# Create the plots to visualize these functions
plt.figure(figsize=(10, 8))

# Plot Sigmoid
plt.subplot(3, 1, 1)
plt.plot(x, y_sigmoid, label="Sigmoid")
plt.title("Sigmoid Activation Function")
plt.grid(True)

# Plot ReLU
plt.subplot(3, 1, 2)
plt.plot(x, y_relu, label="ReLU")
plt.title("ReLU Activation Function")
plt.grid(True)

# Plot Tanh
plt.subplot(3, 1, 3)
plt.plot(x, y_tanh, label="Tanh")
plt.title("Tanh Activation Function")
plt.grid(True)

# Adjust layout and show the plots
plt.tight_layout()
plt.show()

What Can We Learn from the Output?

After running the above code, you'll see three graphs:

  • Sigmoid: Notice how the curve is S-shaped. It's great for scenarios where you want to classify things into two categories (like yes/no decisions).
  • ReLU: This looks like a ramp. It's very popular in deep learning because it helps models learn fast and efficiently.
  • Tanh: This curve is also S-shaped, like the sigmoid, but it ranges from -1 to 1 and is centered at zero. It's useful when you want zero-centered outputs that can be strongly negative as well as positive.

Why Nonlinearity is Important

Nonlinear functions like these are crucial because they allow neural networks to learn and model complex patterns. If we only used linear functions, the network wouldn't be able to learn much apart from simple straight-line relationships. Nonlinearity lets the network combine features in complex ways, enabling it to perform tasks like recognizing objects in photos or understanding spoken words.
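
A quick way to see why this matters is to stack two layers without any activation and compare them with a single layer. The sketch below uses NumPy with hypothetical random weights; it is an illustration of the argument, not a training example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function (hypothetical random weights)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

x = rng.normal(size=(5, 3))  # 5 samples, 3 features

# Passing through both linear layers...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...is identical to a single linear layer with merged weights
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_layer = x @ W_merged + b_merged

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the merged layer no longer matches
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```

However many linear layers are stacked, they can always be merged this way, so depth only adds expressive power once a nonlinear activation sits between the layers.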

Understanding these functions is key to unlocking the power of neural networks and making AI do amazing things!

2.3 Choosing the Right Activation Function

Selecting the appropriate activation function depends on the specific problem and the characteristics of your data. Here are a few guidelines:

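  • Sigmoid is commonly used in the output layer for binary classification problems, since its output between 0 and 1 can be read as a probability. Tanh is more often used in hidden layers, where its zero-centered output can help.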
  • ReLU is a popular choice for hidden layers as it helps alleviate the vanishing gradient problem and promotes sparsity in the network.
  • Softmax function is often used in the output layer for multi-class classification problems, as it outputs a probability distribution over the classes.
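
Softmax has not been shown in code so far, so here is a minimal NumPy sketch of what it computes. The raw scores are hypothetical; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```

The largest score receives the largest probability, and the probabilities always add up to 1, which is what makes softmax a natural fit for choosing one class out of several.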

Experimenting with different activation functions and observing their impact on model performance is a crucial part of designing effective neural networks.

3. Training Neural Networks

3.1 Forward Propagation

Forward propagation is the process of passing the input data through the neural network to obtain the predicted output. At each layer, the inputs are multiplied by the connection weights and summed, a bias is added, and the result is passed through the activation function. This process continues until the final output layer is reached.

Let's consider a simple example of forward propagation in a neural network with one hidden layer:

from keras.models import Sequential
from keras.layers import Dense
import numpy as np

# Create a sequential model
model = Sequential()

# Add an input layer and a hidden layer with 5 nodes and ReLU activation
model.add(Dense(5, activation='relu', input_shape=(3,)))

# Add an output layer with 1 node and sigmoid activation
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')

# Perform forward propagation
input_data = np.array([[0.5, 0.8, 0.2]])
output = model.predict(input_data)
print(output)
# Generate and print a summary of the model
model.summary()
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 59ms/step
[[0.45069867]]
Model: "sequential_2"
┌─────────────────────────────────┬────────────────────────┬───────────────┐
│ Layer (type)                    │ Output Shape           │       Param # │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 5)              │            20 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 1)              │             6 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 26 (104.00 B)
 Trainable params: 26 (104.00 B)
 Non-trainable params: 0 (0.00 B)

In this example, we create a sequential model with one hidden layer and an output layer. We perform forward propagation by calling the predict() method on the model, passing in the input data. The output is the predicted value based on the input.

This code snippet demonstrates how to set up and run a simple neural network using Keras, a popular library for building neural networks. Let's break down what we learn from this code and its output, keeping it straightforward for easier understanding:

What the Code Does:

  1. Builds a Neural Network Model:
    • Sequential Model: This means the network is made up of layers stacked one after the other. You can think of it like a line of workers in an assembly line, where each worker does one specific thing.
    • Input and Hidden Layer: The network first includes a layer (Dense(5, activation='relu', input_shape=(3,))) with 5 nodes (neurons) that uses the ReLU activation function. This layer expects input data with 3 features.
    • Output Layer: Then, it adds another layer (Dense(1, activation='sigmoid')) with just one node that uses the sigmoid activation function, making it suitable for binary classification tasks (like deciding if an image shows a cat or not).
  2. Compiles the Model:
    • The model is compiled with an optimizer ('adam') and a loss function ('binary_crossentropy'), which are tools the network uses to improve its predictions during training.
  3. Performs Forward Propagation:
    • The network processes an input (input_data = np.array([[0.5, 0.8, 0.2]])) to produce an output by calling the predict() method on the model. This step is where the network uses its layers to make a prediction based on the input data.
  4. Outputs:
    • The output ([[0.45069867]]) is the network’s prediction given the input data. Since it’s close to 0.5, it suggests uncertainty in the decision (if 0 is one category and 1 is another, like "no" or "yes"). This is expected: we haven’t trained the model, so its weights are still at their random initial values, and the exact number will differ from run to run.
    • The model summary details each layer's properties, like shape and number of parameters (weights and biases).

What We Learn from the Output:

  • Understanding Layers: The summary shows how each layer contributes to the model. For instance, the first dense layer with 5 nodes has 20 parameters (15 weights from 3 inputs to 5 nodes + 5 biases, one per node).
  • Importance of Activation Functions: ReLU helps the model avoid problems in training that occur when gradients are too small, and sigmoid is used to squeeze the output between 0 and 1, which is helpful for making binary decisions.

Why Nonlinearity is Important:

Nonlinearity, introduced by activation functions like ReLU and sigmoid, allows the model to learn more complex patterns than just straight-line relationships. Without nonlinearity, stacking layers would add nothing: a chain of linear layers collapses into a single linear transformation, so the network could not go beyond simple linear regression or linear classification.

By understanding this code, you gain insight into how neural networks are structured and function, which is fundamental to using AI to solve real-world problems. It's like learning how each part of a car works before you can drive it efficiently or fix it when something goes wrong!
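
To make the forward pass concrete, the computation that predict() performs for the 3-5-1 network above can be sketched by hand in NumPy. The weights here are hypothetical random values, so the output will not match the Keras number; only the sequence of operations and the shapes are the point.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)

# Hypothetical random parameters for a 3 -> 5 -> 1 network
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # 15 weights + 5 biases = 20
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # 5 weights + 1 bias = 6

x = np.array([[0.5, 0.8, 0.2]])   # one sample with 3 features

hidden = relu(x @ W1 + b1)          # shape (1, 5)
output = sigmoid(hidden @ W2 + b2)  # shape (1, 1), value between 0 and 1
print(hidden.shape, output.shape, output)
```

Each layer is one matrix multiplication, one bias addition, and one activation, which is exactly the per-layer recipe described at the start of this section.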

3.2 Backpropagation: Learning from Mistakes

Backpropagation is the key algorithm that enables neural networks to learn from their mistakes and improve their performance. It involves computing the gradients of the loss function with respect to the weights and biases of the network, and updating them in the direction that minimizes the loss.

The process of backpropagation can be summarized in the following steps:

  1. Forward propagation: Pass the input through the network to obtain the predicted output.
  2. Calculate the loss: Measure the discrepancy between the predicted output and the actual target using a loss function.
  3. Backward propagation: Compute the gradients (function derivatives) of the loss function with respect to the weights and biases, starting from the output layer and propagating backwards through the network.
  4. Update the weights and biases: Adjust the weights and biases using an optimization algorithm, such as gradient descent, to minimize the loss.

Here's a simplified example of training a neural network using backpropagation:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import numpy as np

# Create a sequential model
model = Sequential()

# Add an input layer and a hidden layer with 5 nodes and ReLU activation
model.add(Dense(5, activation='relu', input_shape=(3,)))

# Add an output layer with 1 node and sigmoid activation
model.add(Dense(1, activation='sigmoid'))

# Compile the model with Adam optimizer and binary cross-entropy loss
model.compile(optimizer=Adam(learning_rate=0.01), loss='binary_crossentropy')

# Prepare the training data
input_data = np.array([[0.5, 0.8, 0.2], [0.1, 0.9, 0.6], [0.3, 0.4, 0.7]])
target_data = np.array([[1], [0], [1]])

# Train the model using backpropagation
model.fit(input_data, target_data, epochs=100)
...
Epoch 91/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 79ms/step - loss: 0.3372
Epoch 92/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 84ms/step - loss: 0.3328
Epoch 93/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 66ms/step - loss: 0.3285
Epoch 94/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step - loss: 0.3243
Epoch 95/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - loss: 0.3199
Epoch 96/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step - loss: 0.3154
Epoch 97/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step - loss: 0.3113
Epoch 98/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - loss: 0.3071
Epoch 99/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - loss: 0.3029
Epoch 100/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - loss: 0.2984

In this example, we create a sequential model and compile it with the Adam optimizer and binary cross-entropy loss. We prepare the training data consisting of input samples and their corresponding target labels. Finally, we train the model using the fit() method, specifying the input data, target data, and the number of epochs (full passes over the training data). With only three samples, each epoch here is a single batch, which is why the log shows 1/1. During each epoch, the model performs forward propagation, calculates the loss, computes the gradients via backpropagation, and lets the optimizer update the weights and biases.
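
The four steps listed earlier can also be written out by hand. The sketch below is deliberately simplified: it uses a single sigmoid neuron with no hidden layer and plain gradient descent instead of Adam, so that the gradient fits on one line. The learning rate and the zero initialization are assumptions made for this toy problem.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Training data from the example above
X = np.array([[0.5, 0.8, 0.2], [0.1, 0.9, 0.6], [0.3, 0.4, 0.7]])
y = np.array([[1.0], [0.0], [1.0]])

w = np.zeros((3, 1))   # weights, started at zero for reproducibility
b = 0.0                # bias
learning_rate = 0.5    # hypothetical value chosen for this toy problem

losses = []
for epoch in range(200):
    # 1. Forward propagation
    p = sigmoid(X @ w + b)
    # 2. Binary cross-entropy loss
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    losses.append(loss)
    # 3. Backward propagation: gradients of the loss w.r.t. w and b
    dz = (p - y) / len(X)
    dw = X.T @ dz
    db = dz.sum()
    # 4. Gradient descent update
    w -= learning_rate * dw
    b -= learning_rate * db

print(losses[0], losses[-1])
```

The first loss is ln(2), about 0.693, because an untrained neuron with zero weights predicts 0.5 for everything; the loss then falls as the updates accumulate. In a multi-layer network, step 3 applies the chain rule layer by layer, which Keras does automatically.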

3.3 Optimization Algorithms

Optimization algorithms play a crucial role in training neural networks efficiently. They determine how the weights and biases are updated based on the gradients computed during backpropagation. Some commonly used optimization algorithms include:

  1. Gradient Descent: Gradient descent is a basic optimization algorithm that updates the weights and biases in the direction of the negative gradient of the loss function. It takes steps proportional to the learning rate to minimize the loss.
  2. Stochastic Gradient Descent (SGD): SGD is a variant of gradient descent that updates the weights and biases based on a single training example at a time, rather than the entire dataset. Each update is much cheaper to compute, and the noise in its updates can help it escape shallow local minima. In practice, updates are usually computed on small mini-batches rather than on single examples.
  3. Adam (Adaptive Moment Estimation): Adam is an adaptive optimization algorithm that combines the advantages of both momentum and adaptive learning rates. It maintains separate learning rates for each parameter and adapts them based on the historical gradients.

Here's an example of using the Adam optimizer in Keras:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Create a sequential model
model = Sequential()

# Add layers to the model
model.add(Dense(10, activation='relu', input_shape=(5,)))
model.add(Dense(1, activation='sigmoid'))

# Compile the model with Adam optimizer
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy')

In this example, we create a sequential model and compile it with the Adam optimizer, specifying a learning rate of 0.001.
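
To show what "momentum plus adaptive learning rates" means in practice, here is a minimal sketch of the Adam update rule applied to a one-dimensional toy problem. It is a bare-bones illustration under stated assumptions (the toy function, the learning rate of 0.1, and the step count are ours), not the Keras implementation.

```python
import math

def adam_minimize(grad_fn, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a one-dimensional function given its gradient, using Adam updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy problem (an assumption for illustration): minimize f(x) = (x - 3)^2
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # close to 3
```

Dividing by the square root of the running squared gradient is what gives each parameter its own effective step size; in a real network the same update is applied to every weight and bias independently.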

3.4 Hyperparameter Tuning

Hyperparameters are the settings that define the architecture and training process of a neural network. They include the number of layers, number of nodes in each layer, learning rate, batch size, and more. Tuning these hyperparameters is essential for achieving optimal performance.

Some common techniques for hyperparameter tuning include:

  1. Grid Search: Grid search involves specifying a list of candidate values for each hyperparameter and exhaustively trying all possible combinations. It can be computationally expensive, but it is guaranteed to find the best combination among the values listed in the grid, as measured by the chosen validation score.
  2. Random Search: Random search randomly samples hyperparameter values from a predefined distribution. It is more efficient than grid search and can often find good hyperparameter settings with fewer iterations.
  3. Bayesian Optimization: Bayesian optimization is a more advanced technique that uses a probabilistic model to guide the search for optimal hyperparameters. It balances exploration and exploitation to efficiently find the best settings.

Here's an example of performing a grid search for hyperparameter tuning using Keras together with scikit-learn's GridSearchCV and the scikeras wrapper:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Define the model architecture
def create_model(neurons, activation):
    model = Sequential()
    model.add(Input(shape=(5,)))
    model.add(Dense(neurons, activation=activation))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# Create the model
neurons = 10
activation = 'relu'

model = KerasClassifier(model=create_model, neurons=neurons, activation=activation, verbose=0)

# Define the hyperparameter grid
param_grid = {
    'batch_size': [10, 20],
    'epochs': [50, 100]
}

# Dummy training data (replace with your actual data)
X_train = np.random.rand(100, 5)
y_train = np.random.randint(2, size=(100,))

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Print the best parameters and score
print("Best parameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)
Best parameters:  {'batch_size': 20, 'epochs': 50}
Best score:  0.4806892453951277

In this example, we define a function called create_model that creates a neural network with configurable hyperparameters. We wrap the model using KerasClassifier to make it compatible with scikit-learn's GridSearchCV. We define the hyperparameter grid, specifying the values to search for each hyperparameter (here, the batch size and the number of epochs). Finally, we perform the grid search using GridSearchCV, which trains and evaluates the model for each combination of hyperparameters and returns the best settings. Because the dummy labels are random, the best score of about 0.48 is at chance level, as expected; the example only demonstrates the mechanics.
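
To see what "trying all possible combinations" amounts to, the grid above can be enumerated with the standard library, and contrasted with a random search that draws a fixed number of settings. The ranges and the number of random draws are hypothetical choices for illustration.

```python
import itertools
import random

param_grid = {
    'batch_size': [10, 20],
    'epochs': [50, 100],
}

# Grid search: every combination of the listed values
names = list(param_grid)
grid_combinations = [
    dict(zip(names, values))
    for values in itertools.product(*(param_grid[name] for name in names))
]
print(len(grid_combinations))  # 4
for combo in grid_combinations:
    print(combo)

# Random search: draw a fixed number of settings from ranges instead
# (the ranges and the number of draws below are hypothetical)
random.seed(0)
random_combinations = [
    {'batch_size': random.randint(8, 64), 'epochs': random.randint(20, 200)}
    for _ in range(3)
]
print(random_combinations)
```

With cv=3, the four grid combinations translate into 4 × 3 = 12 training runs, plus one final refit on all the data (GridSearchCV's default). Adding a third hyperparameter with two values would double that, which is why grid search becomes expensive quickly while random search keeps a fixed budget.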

4. Deep Learning Architectures

4.1 Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a specialized type of neural network designed for processing grid-like data, such as images. They have proven to be highly effective in tasks like image classification, object detection, and segmentation.

The key components of a CNN are:

  • Convolutional Layers: These layers apply a set of learnable filters to the input, capturing spatial patterns and features. Each filter “scans” across the image, and the results are propagated forward as a feature map (think of a smaller image).
Illustration of the CNN scanning the RGB image
  • Pooling Layers: Pooling layers down-sample the feature maps, reducing spatial dimensions and providing a degree of translation invariance. Below is the operation of max-pooling and average pooling illustrated in a figure.
  • Fully Connected Layers: These layers perform the final classification or regression task based on the extracted features.
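
The "scanning" and down-sampling described above can be sketched in a few lines of NumPy. This is a simplified, single-channel illustration with a hypothetical hand-written filter; real convolutional layers learn their filters and handle many channels at once.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1.0, 0.0, -1.0]] * 3)          # hypothetical 3x3 edge filter

feature_map = conv2d_valid(image, kernel)   # shape (4, 4)
pooled = max_pool2d(feature_map)            # shape (2, 2)
print(feature_map.shape, pooled.shape)
```

The 6x6 input shrinks to 4x4 after the 3x3 filter and to 2x2 after 2x2 max pooling, which is the "smaller image" effect described in the list above.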

Here's an example of building a simple CNN using Keras:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Input

# Create a sequential model
model = Sequential()

# Add an Input layer
model.add(Input(shape=(28, 28, 1)))

# Add convolutional and pooling layers
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))

# Flatten the feature maps
model.add(Flatten())

# Add fully connected layers
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In this example, we create a sequential model and add convolutional layers with 32 and 64 filters, followed by max pooling layers. We flatten the feature maps and add fully connected layers for classification. Finally, we compile the model with the Adam optimizer, categorical cross-entropy loss, and accuracy metric.
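
It can help to trace how the 28x28 input shrinks on its way to the Flatten layer. The sketch below assumes the Keras defaults used in the code above: no padding and stride 1 for Conv2D, and non-overlapping windows for MaxPooling2D. The helper names are ours.

```python
def conv_output(size, kernel):
    """Spatial size after a convolution with no padding and stride 1."""
    return size - kernel + 1

def pool_output(size, pool):
    """Spatial size after non-overlapping pooling (remainder is dropped)."""
    return size // pool

size = 28
size = conv_output(size, 3)   # 26
size = pool_output(size, 2)   # 13
size = conv_output(size, 3)   # 11
size = pool_output(size, 2)   # 5
flattened = size * size * 64  # 64 feature maps of 5x5
print(size, flattened)        # 5 1600
```

So the first fully connected layer receives a vector of 1600 values, which you can confirm by calling model.summary() on the model above.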

4.2 Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are designed to handle sequential data, such as time series or natural language. They have connections that loop back, allowing information to persist across time steps.

The most common types of RNNs are:

  • Simple RNN: It has a single hidden state that is updated at each time step based on the current input and the previous hidden state.
  • Long Short-Term Memory (LSTM): LSTMs introduce gating mechanisms to control the flow of information, enabling them to capture long-term dependencies.
  • Gated Recurrent Unit (GRU): GRUs are a simpler variant of LSTMs, with fewer parameters and often comparable performance.

RNN Unrolled Diagram: The serial (cascaded) operation of an RNN. Credit: Illustrated Guide to LSTM’s and GRU’s: A step by step explanation
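
The recurrence of a simple RNN, where the hidden state is updated from the current input and the previous hidden state, can be sketched in NumPy as follows. The sizes and the random weights are hypothetical, and tanh is assumed as the activation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden, n_steps = 1, 4, 10

# Hypothetical random parameters, shared across all time steps
W_x = rng.normal(scale=0.5, size=(n_features, n_hidden))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

sequence = rng.normal(size=(n_steps, n_features))  # toy input sequence
h = np.zeros(n_hidden)                             # initial hidden state

for x_t in sequence:
    # New state depends on the current input and the previous state
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # (4,)
```

The same weights are reused at every time step; the loop is the "unrolled" chain shown in the diagram, and the final h summarizes the whole sequence.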

Here's an example of building an LSTM-based RNN using Keras:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Input

# Create a sequential model
model = Sequential()

# Add an Input layer
model.add(Input(shape=(10, 1)))

# Add an LSTM layer
model.add(LSTM(64))

# Add a dense output layer
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

In this example, we create a sequential model and add an LSTM layer with 64 units. The input shape is specified as (10, 1), meaning 10 time steps with 1 feature per time step. Finally, we add a dense output layer and compile the model with the Adam optimizer and mean squared error loss function.
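
As with the dense layers earlier, the size of this model can be estimated by hand. The sketch below assumes the standard LSTM formulation used by the default Keras LSTM layer, which has four sets of weights (three gates plus the candidate state), each with input weights, recurrent weights, and a bias. The helper names are ours.

```python
def lstm_param_count(n_features, n_units):
    """Four gates/candidates, each with input weights, recurrent weights, and a bias."""
    return 4 * (n_features * n_units + n_units * n_units + n_units)

def dense_param_count(n_inputs, n_units):
    """Weights (one per input-to-unit connection) plus one bias per unit."""
    return n_inputs * n_units + n_units

lstm_params = lstm_param_count(1, 64)
dense_params = dense_param_count(64, 1)
print(lstm_params, dense_params, lstm_params + dense_params)  # 16896 65 16961
```

Most of the parameters come from the recurrent weights (64 x 64 per gate), which is why LSTMs grow quickly as the number of units increases. Calling model.summary() on the model above is a quick way to compare these figures.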