Welcome to the fascinating world of unsupervised learning! In this chapter, we'll explore two fundamental techniques in unsupervised learning: clustering and dimensionality reduction. Unsupervised learning is a branch of machine learning where the goal is to discover hidden patterns, structures, or relationships in unlabeled data. Unlike supervised learning, where we have labeled examples to learn from, unsupervised learning algorithms work with data without explicit target variables or known outcomes.
Clustering is a technique used to group similar data points together based on their features. It helps in identifying natural groupings within the data. Dimensionality reduction is the process of reducing the number of features or dimensions in a dataset while retaining its essential information. This is useful for visualization and simplifying the data for further analysis.
We'll cover three popular unsupervised learning techniques: K-Means clustering, Principal Component Analysis (PCA), and t-SNE for data visualization. These techniques are widely used in various domains, including customer segmentation, anomaly detection, image compression, and data exploration. For each technique, we'll explore the intuition behind it, dive into its mathematical foundations, and provide code examples to illustrate its implementation using Python and popular libraries like scikit-learn.
Real-World Introduction
Imagine you're the owner of a popular online store. Every day, you get tons of data about your customers' shopping habits, but you don't know how to make sense of it all. By using unsupervised learning techniques like clustering and dimensionality reduction, you can discover groups of similar customers and reduce the complexity of your data, making it easier to analyze and visualize.
K-Means Clustering
Intuition behind K-Means Clustering
Imagine you have a large collection of data points, and you want to group similar points together. K-Means clustering is an algorithm that helps you achieve this goal. It aims to partition the data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid). A centroid is the center of a cluster, representing the average position of all the points in the cluster.
Think of it like organizing your wardrobe. You have a bunch of clothes, and you want to group them based on their similarity. You might have clusters for shirts, pants, dresses, and so on. K-Means clustering works in a similar way, but instead of clothes, it groups data points based on their features. Features are the measurable properties or characteristics of the data.
Algorithm Steps
- Choose the number of clusters (K) you want to create.
- Randomly initialize K centroids (cluster centers) in the feature space. The feature space is the multi-dimensional space defined by the dataset's features.
- Assign each data point to the nearest centroid based on a distance metric (e.g., Euclidean distance). A distance metric measures the distance between points in the feature space.
- Update the centroids by calculating the mean of all data points assigned to each cluster.
- Repeat steps 3 and 4 until the centroids no longer change significantly or a maximum number of iterations is reached.
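To make these steps concrete, here is a minimal from-scratch sketch in NumPy. The function name kmeans_sketch is purely illustrative, and it assumes Euclidean distance and simple random initialization; in practice you would use scikit-learn's KMeans, shown in the next section.
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop early once the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids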
Code Example
Let's see how to implement K-Means clustering using scikit-learn:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=200, centers=4, random_state=42)
# Create a KMeans object with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
# Fit the model to the data
kmeans.fit(X)
# Get the cluster labels for each data point
labels = kmeans.labels_
# Get the cluster centers
centroids = kmeans.cluster_centers_
print("Cluster Labels:", labels)
print("Cluster Centers:", centroids)
Cluster Labels: [0 1 3 3 2 2 0 3 0 2 2 0 0 2 2 2 1 3 2 2 2 2 3 1 3 1 1 2 1 0 2 2 3 3 1 0 3
0 3 1 2 1 2 2 3 0 0 2 0 1 3 1 3 0 1 1 2 2 1 0 3 0 2 3 3 2 0 1 3 1 1 3 1 2
0 2 0 1 2 1 1 0 2 3 3 3 3 1 0 3 2 1 0 0 0 3 1 0 2 1 3 3 1 2 1 0 3 2 2 3 0
2 1 3 1 3 3 1 1 1 3 2 0 3 3 0 1 0 0 1 2 2 1 3 3 0 2 2 1 2 0 1 3 0 0 1 0 3
2 2 1 3 0 3 2 3 3 0 0 0 1 0 0 3 1 2 0 0 2 0 3 1 2 2 0 2 0 1 1 2 1 2 3 3 3
1 0 0 0 1 1 2 3 3 1 3 0 1 2 0]
Cluster Centers: [[ 4.58407676 2.1431444 ]
[-2.70146566 8.90287872]
[-6.75399588 -6.88944874]
[-8.74950999 7.40771124]]
In this example, we generate sample data using the make_blobs function from scikit-learn, which creates clusters of points with Gaussian distributions. We then create a KMeans object with 4 clusters and fit it to the data using the fit method. After fitting, we can obtain the cluster labels for each data point from kmeans.labels_ and the cluster centers from kmeans.cluster_centers_.
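A quick way to sanity-check the result is to plot the points colored by their cluster labels and mark the centroids on top. This short snippet is an optional addition that reuses X, labels, and centroids from the example above:
import matplotlib.pyplot as plt

# Scatter plot of the data, colored by assigned cluster label
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
# Mark the cluster centers with red crosses
plt.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='x', s=200)
plt.title('K-Means Clustering Results')
plt.show()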
Choosing the Number of Clusters (K)
One important consideration in K-Means clustering is choosing the appropriate number of clusters (K). There are various methods to determine the optimal value of K, such as the elbow method or silhouette analysis.
The elbow method involves plotting the within-cluster sum of squares (WCSS) against different values of K. WCSS measures the compactness of the clusters. As K increases, the WCSS tends to decrease. The idea is to choose the value of K at the "elbow" point, where the rate of decrease in WCSS slows down significantly.
Here's an example of using the elbow method:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, _ = make_blobs(n_samples=200, centers=4, random_state=42)
# Compute WCSS for different values of K
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plot the elbow curve
plt.plot(range(1, 11), wcss)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()
In this example, we compute the WCSS for different values of K (from 1 to 10) and plot the elbow curve. The point where the curve starts to flatten indicates a good choice for the number of clusters.
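Silhouette analysis, mentioned above, is another option: the silhouette score measures how similar each point is to its own cluster compared to the nearest other cluster, ranging from -1 to 1 (higher is better). Here is a short sketch using scikit-learn's silhouette_score, reusing X from the example above:
from sklearn.metrics import silhouette_score

# Compute the average silhouette score for different values of K
# (the score requires at least 2 clusters, so we start at K=2)
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"K={k}: silhouette score = {score:.3f}")
The value of K with the highest average silhouette score is a reasonable candidate for the number of clusters.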
Principal Component Analysis (PCA)
Intuition behind PCA
Imagine you have a dataset with a large number of features (dimensions). Each feature represents a different aspect or measurement of the data. However, some of these features might be redundant or highly correlated with each other, meaning that much of the information in one feature is already present in others, so that redundant information can be discarded. PCA is a technique that helps us reduce the dimensionality of the data while retaining most of the important information.
Think of it like packing a suitcase. You have a limited amount of space, so you want to pack the most essential items that represent your needs. PCA works similarly by identifying the most important directions (principal components) in the data that capture the maximum variance (or information). By projecting the data onto these principal components, we can reduce the dimensionality while preserving the essential structure of the data.
Mathematical Foundation
PCA is based on the concept of eigenvalues and eigenvectors. An eigenvalue is a number that indicates how much variance is captured by a corresponding eigenvector, which is a direction in the feature space. PCA finds the principal components by solving an eigenvalue problem on the covariance matrix of the data. The covariance matrix is a matrix that captures the pairwise relationships (covariances) between the features.
The steps involved in PCA are as follows:
- Standardize the data by subtracting the mean and scaling to unit variance.
- Compute the covariance matrix of the standardized data.
- Find the eigenvectors and eigenvalues of the covariance matrix.
- Sort the eigenvectors in descending order based on their corresponding eigenvalues.
- Choose the top k eigenvectors as the principal components.
- Project the data onto the selected principal components to obtain the transformed data in the reduced dimensional space.
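These steps map almost directly onto a few NumPy calls. The sketch below is for intuition only (pca_sketch is an illustrative name, and it assumes the data has already been standardized); scikit-learn's PCA, shown in the next section, computes the same result more robustly via singular value decomposition.
import numpy as np

def pca_sketch(X_scaled, k):
    # Step 2: covariance matrix of the standardized data (features in columns)
    cov = np.cov(X_scaled, rowvar=False)
    # Step 3: eigenvalues and eigenvectors of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort eigenvectors by descending eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, order]
    # Steps 5-6: keep the top k eigenvectors and project the data onto them
    components = eigenvectors[:, :k]
    return X_scaled @ components
Note that the signs of the eigenvectors are arbitrary, so the projected coordinates may be flipped relative to scikit-learn's output.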
Code Example
Let's see how to perform PCA using scikit-learn:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
X = iris.data
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create a PCA object with 2 components
pca = PCA(n_components=2)
# Fit and transform the data
X_pca = pca.fit_transform(X_scaled)
print("Original shape:", X.shape)
print("Transformed shape:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
Original shape: (150, 4)
Transformed shape: (150, 2)
Explained variance ratio: [0.72962445 0.22850762]
In this example, we load the Iris dataset and standardize the features using StandardScaler. We then create a PCA object with 2 components and fit and transform the data using the fit_transform method. The transformed data X_pca represents the data in the reduced-dimensional space. We can also obtain the explained variance ratio, which indicates the proportion of variance explained by each principal component.
Choosing the Number of Components
One important aspect of PCA is determining the number of principal components to retain. There are several approaches to choose the appropriate number of components, such as the cumulative explained variance ratio or the elbow method.
The cumulative explained variance ratio represents the proportion of the total variance explained by the selected principal components. We can plot the cumulative explained variance ratio against the number of components and choose the number of components that capture a desired level of variance (e.g., 90% or 95%).
Here's an example of using the cumulative explained variance ratio:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X = iris.data
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create a PCA object with all components
pca = PCA()
# Fit the data
pca.fit(X_scaled)
# Plot the cumulative explained variance ratio
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_.cumsum())
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Ratio')
plt.show()
In this example, we create a PCA object without specifying the number of components, allowing it to compute all possible components. We then plot the cumulative explained variance ratio against the number of components. The plot helps us choose the number of components that capture a sufficient amount of variance in the data.
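scikit-learn can also perform this selection for you: if you pass a float between 0 and 1 as n_components, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. For example, reusing X_scaled from above (pca_95 is just an illustrative variable name):
# Keep enough components to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print("Number of components kept:", pca_95.n_components_)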
t-SNE for Data Visualization
Intuition behind t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a powerful technique for visualizing high-dimensional data in a lower-dimensional space (typically 2D or 3D). It aims to preserve the local structure of the data while also revealing global patterns and clusters.
Imagine you have a complex dataset with many features, and you want to visualize it in a way that captures the relationships between the data points. t-SNE works by calculating the similarity between data points in the high-dimensional space and then finding a low-dimensional representation that preserves these similarities.
Think of it like creating a map of a city. Each data point represents a location in the city, and the relationships between the data points represent the proximity or similarity between the locations. t-SNE tries to create a simplified map that preserves the relative distances and neighborhoods of the locations, making it easier to visualize and understand the overall structure of the city.
Mathematical Foundation
t-SNE is based on the concept of probability distributions. It starts by converting the distances between data points in the high-dimensional space into conditional probabilities that represent similarities. The closer two data points are, the higher their conditional probability.
The algorithm then tries to find a low-dimensional representation of the data where the conditional probabilities in the low-dimensional space match the conditional probabilities in the high-dimensional space. It does this by minimizing a cost function called the Kullback-Leibler (KL) divergence, which measures the difference (think distance) between two probability distributions.
The optimization process is performed using gradient descent, iteratively adjusting the positions of the data points in the low-dimensional space to minimize the KL divergence. Gradient descent is an optimization algorithm that finds the minimum of a function by iteratively moving in the direction of the steepest descent.
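As a rough illustration of the cost being minimized, the KL divergence between two discrete probability distributions P and Q is the sum of p * log(p / q) over all events. The tiny NumPy sketch below uses made-up distributions rather than the actual pairwise similarities that t-SNE computes:
import numpy as np

def kl_divergence(p, q):
    # Sum of p * log(p / q) over all events; assumes p and q each sum to 1
    # and q is nonzero wherever p is nonzero
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])
print(kl_divergence(p, q))   # large: the distributions differ a lot
print(kl_divergence(p, p))   # 0.0: identical distributions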
Code Example
In the following example, we see how t-SNE transforms a 4-dimensional space into a 2-dimensional space.
The example uses the Iris dataset, which consists of measurements for 150 iris flowers from three different species. Each flower is described by four features:
- Sepal length
- Sepal width
- Petal length
- Petal width
Therefore, the original dimensionality of the data is 4, as each data point (iris flower) is represented by a 4-dimensional feature vector.
Let's see how to apply t-SNE using scikit-learn:
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Create a t-SNE object with 2 components
tsne = TSNE(n_components=2, random_state=42)
# Fit and transform the data
X_tsne = tsne.fit_transform(X)
# Plot the t-SNE visualization
plt.figure(figsize=(8, 6))
colors = ['red', 'green', 'blue']
for i in range(3):
    plt.scatter(X_tsne[y == i, 0], X_tsne[y == i, 1], color=colors[i], label=iris.target_names[i])
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE Visualization of Iris Dataset')
plt.legend()
plt.show()
In this example, we load the Iris dataset and create a t-SNE object with 2 components. We then fit and transform the data using the fit_transform method, which returns the transformed data in the low-dimensional space. Finally, we plot the t-SNE visualization using scatter plots, where each data point is colored based on its true class label.
Interpreting t-SNE Visualization
When interpreting a t-SNE visualization, it's important to keep a few things in mind:
- t-SNE preserves the local structure of the data, so data points that are close together in the high-dimensional space tend to be close together in the low-dimensional representation.
- The global structure and distances between clusters in the t-SNE visualization may not always reflect the actual distances in the high-dimensional space. t-SNE focuses more on preserving the local neighborhoods.
- The initialization of t-SNE is random, so different runs of the algorithm may produce slightly different visualizations. It's recommended to run t-SNE multiple times and compare the results.
- t-SNE is computationally expensive, especially for large datasets. It may take some time to compute the visualization.
Despite these limitations, t-SNE is a powerful tool for exploring and visualizing high-dimensional data. It can help identify clusters, outliers, and patterns in the data that may not be apparent in the original feature space.
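Because of the random initialization and t-SNE's sensitivity to hyperparameters such as perplexity (roughly, the effective number of neighbors each point considers), it is worth comparing a few runs side by side. The sketch below is an optional extension that reuses X and y from the example above and plots the embedding for three perplexity values:
# Compare t-SNE embeddings for a few perplexity values
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perplexity in zip(axes, [5, 30, 50]):
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=0).fit_transform(X)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap='viridis', s=20)
    ax.set_title(f'perplexity = {perplexity}')
plt.show()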
Homework Problems: t-SNE Dimensionality Reduction
Problem 5.1: (Solution here)
You have learned about t-SNE for visualizing high-dimensional data. Now, let's apply what you have learned.
- Modify the provided code to reduce the dimensionality of the Iris dataset to 1-D and present it in a 1-D graph.
- Modify the provided code to reduce the dimensionality of the Iris dataset to 3-D and present it in a 3-D graph.
Conclusion
In this chapter, we explored three fundamental techniques in unsupervised learning: K-Means clustering, Principal Component Analysis (PCA), and t-SNE for data visualization. We delved into the intuition behind each technique and its mathematical foundations, and provided code examples to illustrate their implementation using scikit-learn.
K-Means clustering helps us group similar data points together based on their proximity in the feature space. It's a simple and effective algorithm for discovering clusters and patterns in unlabeled data.
PCA allows us to reduce the dimensionality of the data while preserving the most important information. By identifying the principal components that capture the maximum variance, we can transform the data into a lower-dimensional space, making it easier to analyze and visualize.
t-SNE is a powerful technique for visualizing high-dimensional data in a lower-dimensional space. It preserves the local structure of the data and reveals global patterns and clusters. t-SNE is particularly useful for exploring and understanding complex datasets.
Unsupervised learning techniques are valuable tools in a data scientist's toolbox. They enable us to explore and gain insights from unlabeled data, discover hidden structures, and make informed decisions based on the underlying patterns.
As you continue your journey in machine learning, you'll encounter more advanced unsupervised learning techniques, such as hierarchical clustering, density-based clustering (DBSCAN), and autoencoders. Each technique has its own strengths and applications, and the choice of the appropriate technique depends on the specific characteristics of the data and the problem at hand.
Remember, unsupervised learning is an iterative process. It often involves experimenting with different algorithms, tuning hyperparameters, and interpreting the results in the context of the domain knowledge. Visualization plays a crucial role in unsupervised learning, as it helps us understand and communicate the findings effectively.
So, keep exploring, experimenting, and learning! The world of unsupervised learning is vast and fascinating, and there's always more to discover and apply in real-world scenarios.