Mathematics of Dimensionality Reduction: PCA, t-SNE, UMAP, and Autoencoders
Raj Shaikh
1. Principal Component Analysis (PCA): Finding the Data’s Hidden Axes 🧭
What Is PCA?
PCA is a statistical method that reduces the dimensionality of data by finding the principal components, which are directions where the data varies the most. In other words:
- It’s like pointing a flashlight at your data cloud and projecting its shadow onto a lower-dimensional surface. 🌟
How PCA Works: The Math Journey
Step 1: Mean Center the Data
Shift the data so its mean is zero. For a dataset \( X \):
\[ X_{\text{centered}} = X - \mu \]
where \( \mu \) is the mean of the dataset.
Step 2: Compute the Covariance Matrix
The covariance matrix captures how features vary together:
\[ \Sigma = \frac{1}{n} X_{\text{centered}}^T X_{\text{centered}} \]
This matrix tells us in which directions the data is most spread out.
Step 3: Eigen Decomposition
Find the eigenvectors and eigenvalues of the covariance matrix:
\[ \Sigma v = \lambda v \]
where:
- \( v \): Eigenvector (direction of maximum variance).
- \( \lambda \): Eigenvalue (magnitude of variance along \( v \)).
Step 4: Choose Principal Components
Select the top \( k \) eigenvectors corresponding to the largest eigenvalues. These eigenvectors are the principal components.
Step 5: Project Data
Project the data onto the new lower-dimensional subspace:
\[ X_{\text{reduced}} = X_{\text{centered}} \cdot V_k \]
where \( V_k \) contains the top \( k \) eigenvectors.
Numerical Example
Suppose we have a simple 2D dataset:
\[ X = \begin{bmatrix} 2 & 4 \\ 3 & 5 \\ 4 & 6 \\ 5 & 7 \end{bmatrix} \]
- Mean Center the Data:
- Compute \( \mu = [3.5, 5.5] \).
- Subtract \( \mu \): \[ X_{\text{centered}} = \begin{bmatrix} -1.5 & -1.5 \\ -0.5 & -0.5 \\ 0.5 & 0.5 \\ 1.5 & 1.5 \end{bmatrix} \]
- Covariance Matrix:
\[ \Sigma = \frac{1}{4} X_{\text{centered}}^T X_{\text{centered}} = \begin{bmatrix} 1.25 & 1.25 \\ 1.25 & 1.25 \end{bmatrix} \]
- Eigen Decomposition:
- Eigenvalues: \( \lambda_1 = 2.5, \lambda_2 = 0 \).
- Eigenvectors (normalized): \( v_1 = \tfrac{1}{\sqrt{2}}[1, 1]^T, \quad v_2 = \tfrac{1}{\sqrt{2}}[-1, 1]^T \).
- Project Data:
- Use \( v_1 \) to project the data onto 1D: \[ X_{\text{reduced}} = X_{\text{centered}} \cdot v_1 \approx \begin{bmatrix} -2.12 \\ -0.71 \\ 0.71 \\ 2.12 \end{bmatrix} \]
Why PCA Is Awesome
- Simplifies Data:
- Reduces dimensions while preserving variability.
- De-noises Data:
- Eliminates small, irrelevant variations.
- Visualization:
- Great for plotting high-dimensional data in 2D or 3D.
Challenges of PCA
- Linear Assumption:
- PCA only captures linear structure, so it can miss non-linear patterns (e.g., data lying on a curved manifold).
- Sensitive to Scaling:
- Features with different scales can skew results.
- Solution: Standardize the data before PCA.
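A minimal sketch of that fix, using scikit-learn's StandardScaler in front of PCA (the exaggerated scale of the second feature is made up purely for illustration):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
# Second feature lives on a much larger scale (illustrative data)
X_raw = np.array([[2.0, 400.0], [3.0, 500.0], [4.0, 600.0], [5.0, 700.0]])
# Standardize each feature to zero mean and unit variance, then apply PCA
pipeline = make_pipeline(StandardScaler(), PCA(n_components=1))
X_reduced_scaled = pipeline.fit_transform(X_raw)
print("Reduced (standardized) Data:\n", X_reduced_scaled)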
Code Example: PCA in Python
Here’s how to perform PCA manually with NumPy, and then verify the result with scikit-learn:
import numpy as np
from sklearn.decomposition import PCA
# Example data
X = np.array([[2, 4], [3, 5], [4, 6], [5, 7]])
# Perform PCA manually with NumPy
X_centered = X - np.mean(X, axis=0)
cov_matrix = np.cov(X_centered, rowvar=False)  # note: np.cov divides by n - 1
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# Sort eigenvectors by descending eigenvalue (np.linalg.eig does not sort them)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]
# Project onto the first principal component
principal_component = eigenvectors[:, 0]
X_reduced = np.dot(X_centered, principal_component)
print("Reduced Data:\n", X_reduced)
# Alternatively, use sklearn
pca = PCA(n_components=1)
X_sklearn_reduced = pca.fit_transform(X)
print("Reduced Data (sklearn):\n", X_sklearn_reduced)
Fun Analogy
PCA is like squishing a watermelon 🍉 into a juice box:
- The watermelon (original data) has all the information.
- You find the juiciest parts (principal components).
- You throw away the rind (low-variance features).
- The juice box (reduced data) is compact but still delicious! 🧃
Mermaid.js Diagram: PCA Workflow
graph TD
    InputData[Input Data] --> Center["Center Data (Subtract Mean)"]
    Center --> Covariance[Compute Covariance Matrix]
    Covariance --> EigenDecomposition["Eigen Decomposition (Find Eigenvectors/Values)"]
    EigenDecomposition --> SelectPCs[Select Top Principal Components]
    SelectPCs --> Project[Project Data Onto PCs]
    Project --> OutputData[Reduced Data]
2. t-SNE and UMAP: The Wizards of Non-Linear Dimensionality Reduction 🧙‍♂️✨
What Are t-SNE and UMAP?
t-SNE:
- Focuses on local structure by preserving pairwise similarities.
- Great for visualizing high-dimensional data in 2D or 3D.
- Think of it as compressing a map 🌍 but keeping cities close to their neighbors.
UMAP:
- Similar to t-SNE but faster and better at preserving global structure.
- It’s like t-SNE’s speedier, more precise sibling. 🚀
1. t-SNE: Trusting Your Neighbors in Data 👫
t-SNE tries to keep similar points close and dissimilar points far apart by creating two probability distributions:
- High-Dimensional Space:
- Compute the probability of similarity between points \( i \) and \( j \) using a Gaussian kernel: \[ P_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2 \sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2 \sigma_i^2)} \] (strictly, this is the conditional similarity \( p_{j|i} \), which t-SNE then symmetrizes; each bandwidth \( \sigma_i \) is chosen to match a target perplexity).
- Low-Dimensional Space:
- Define another similarity probability \( Q_{ij} \) using a Student-t distribution: \[ Q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \]
- KL Divergence:
- Minimize the difference between \( P_{ij} \) and \( Q_{ij} \): \[ \mathcal{L} = \sum_{i \neq j} P_{ij} \log \frac{P_{ij}}{Q_{ij}} \]
This creates a low-dimensional embedding where similar points in high-dimensional space remain close.
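To make the objective concrete, here is a minimal NumPy sketch of these three quantities. It is an illustration, not real t-SNE: the Gaussian bandwidth is fixed rather than calibrated per point via perplexity, the similarities are normalized jointly over all pairs for brevity, and the embedding Y is random instead of optimized.
import numpy as np
def tsne_objective(X, Y, sigma=1.0):
    # High-dimensional similarities P from a Gaussian kernel (fixed sigma)
    d_hi = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-d_hi / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()
    # Low-dimensional similarities Q from a Student-t kernel
    d_lo = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d_lo)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    # KL divergence between P and Q (the quantity t-SNE minimizes)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))   # toy high-dimensional points
Y = rng.normal(size=(20, 2))    # a candidate (unoptimized) 2D embedding
print("KL divergence:", tsne_objective(X, Y))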
Why t-SNE Rocks
- Intuitive Clustering:
- Groups similar points together.
- Beautiful Visualizations:
- Creates visually appealing 2D/3D representations.
Limitations:
- Computationally expensive.
- Struggles with preserving global structure.
2. UMAP: The Faster Magician 🎩
UMAP builds on t-SNE’s ideas but uses:
- Fuzzy Set Theory:
- Constructs a fuzzy graph of the high-dimensional data.
- Optimization:
- Learns a low-dimensional embedding by minimizing the cross-entropy between fuzzy sets.
The UMAP Math (Simplified):
- High-Dimensional Graph:
- Define a graph where edges represent probabilities of connectivity.
- Low-Dimensional Graph:
- Optimize the layout of a lower-dimensional graph to match the structure.
Why UMAP Shines
- Speed:
- Much faster than t-SNE, especially for large datasets.
- Preserves Global Structure:
- Maintains relationships between clusters.
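In practice, this local/global trade-off is controlled mainly by two documented umap-learn parameters, n_neighbors and min_dist; the specific values below are arbitrary choices for illustration, not recommendations:
import umap.umap_ as umap
# Small neighborhoods and tight packing: emphasizes fine local detail
local_model = umap.UMAP(n_neighbors=5, min_dist=0.1)
# Large neighborhoods and looser packing: emphasizes the global layout
global_model = umap.UMAP(n_neighbors=50, min_dist=0.5)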
Numerical Example: t-SNE in Action
Let’s visualize a dataset with three clusters.
- High-Dimensional Data:
- \( X \): 3 clusters of points in 50 dimensions.
- Compute Probabilities:
- \( P_{ij} \): Similarity in high dimensions.
- \( Q_{ij} \): Similarity in low dimensions.
- KL Divergence:
- Minimize \( \mathcal{L} \) to adjust the low-dimensional points.
Result: A 2D scatter plot where clusters are clearly separated.
Code Example: t-SNE and UMAP
Here’s how to visualize data using t-SNE and UMAP in Python:
from sklearn.manifold import TSNE
import umap.umap_ as umap
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Create synthetic data
X, y = make_blobs(n_samples=500, n_features=50, centers=3, random_state=42)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Apply UMAP
umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X)
# Plot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.title("t-SNE")
plt.subplot(1, 2, 2)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis')
plt.title("UMAP")
plt.show()
Fun Analogy
t-SNE and UMAP are like Google Maps vs. Waze 🚗:
- t-SNE: Carefully calculates local routes, ensuring neighborhoods are preserved. 🌆
- UMAP: Balances local routes and highways, giving you a broader and faster perspective. 🌍
Mermaid.js Diagram: t-SNE and UMAP Workflow
graph TD
    HighDim[High-Dimensional Data] --> Similarity[Pij: High-Dimensional Similarity]
    Similarity --> LowDim[Optimize Low-Dimensional Space]
    LowDim --> tSNE[t-SNE]
    LowDim --> UMAP[UMAP]
    tSNE --> Visualization1[Visualize Clusters]
    UMAP --> Visualization2[Visualize Clusters]
3. Autoencoders for Compression: The Marie Kondos of Data 🧳✨
What Are Autoencoders?
An Autoencoder is a type of neural network that learns to compress data into a smaller representation (latent space) and then reconstruct it back. Think of it as a zip-unzip mechanism for data:
- Encoder: Squishes data into a compact representation.
- Decoder: Reconstructs the original data from this representation.
How Autoencoders Work: The Neural Workflow
Autoencoders have two main parts, trained with a reconstruction loss:
- Encoder:
- Maps the input \( X \) into a lower-dimensional latent representation \( Z \): \[ Z = f_\theta(X) \]
- Example: A 1000-dimensional image becomes a 50-dimensional vector.
- Decoder:
- Reconstructs \( X' \) from \( Z \): \[ X' = g_\phi(Z) \]
- Loss Function:
- Measures how close \( X' \) (reconstructed data) is to \( X \) (original data): \[ \mathcal{L} = \|X - X'\|^2 \]
Types of Autoencoders
1. Basic Autoencoder
The classic version with fully connected layers, used for straightforward dimensionality reduction.
2. Convolutional Autoencoder
Specialized for image data, using convolutional layers for spatial understanding.
3. Variational Autoencoder (VAE)
Adds a probabilistic twist to the latent space, ensuring smoothness for generative tasks.
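As a rough idea of what the convolutional variant looks like, here is a minimal PyTorch sketch for 28×28 grayscale images (the MNIST-like input shape and the layer sizes are illustrative assumptions, not tuned choices):
import torch
import torch.nn as nn
class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),
        )
    def forward(self, x):
        return self.decoder(self.encoder(x))
model = ConvAutoencoder()
x = torch.rand(8, 1, 28, 28)   # a dummy batch of images
print(model(x).shape)          # torch.Size([8, 1, 28, 28])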
Numerical Example: Compressing a 2D Dataset
Imagine a 2D dataset:
\[ X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \]
Step 1: Encoder
The encoder reduces the data to 1D using weights \( W \) and biases \( b \):
\[ Z = X \cdot W + b \]
Suppose:
\[ W = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}, \quad b = 0 \]
Then:
\[ Z = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 3.5 \\ 5.5 \end{bmatrix} \]
Step 2: Decoder
The decoder reconstructs \( X' \) from \( Z \) using a weight row vector \( W' \):
\[ X' = Z \cdot W' \]
Suppose:
\[ W' = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix} \]
Then:
\[ X' = \begin{bmatrix} 1.5 \\ 3.5 \\ 5.5 \end{bmatrix} \cdot \begin{bmatrix} 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.75 & 0.75 \\ 1.75 & 1.75 \\ 2.75 & 2.75 \end{bmatrix} \]
Step 3: Loss
Compute the reconstruction loss:
\[ \mathcal{L} = \|X - X'\|^2 \]
(summing the squared differences over all entries gives \( \mathcal{L} = 23.875 \) for this example; training adjusts \( W \), \( b \), and \( W' \) to shrink it).
Why Autoencoders Are Cool
- Data Compression:
- Reduces high-dimensional data to compact representations.
- Noise Reduction:
- Cleans noisy data by reconstructing the clean version.
- Feature Extraction:
- Learns meaningful latent features for downstream tasks.
Challenges of Autoencoders
- Overfitting:
- The network may memorize instead of generalizing.
- Solution: Add regularization or use sparse autoencoders.
- Non-Interpretable Latent Space:
- Latent features aren’t always intuitive.
- Solution: Use Variational Autoencoders (VAEs).
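One common way to apply the regularization suggestion above is to add an L1 sparsity penalty on the latent activations to the reconstruction loss; a minimal sketch, where the penalty weight 1e-3 is an arbitrary assumption:
import torch
def sparse_autoencoder_loss(x, x_reconstructed, z, l1_weight=1e-3):
    reconstruction = torch.mean((x - x_reconstructed) ** 2)  # MSE reconstruction term
    sparsity = l1_weight * torch.mean(torch.abs(z))          # L1 penalty on the latent code z
    return reconstruction + sparsity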
Code Example: Autoencoder in PyTorch
Here’s a simple Autoencoder for dimensionality reduction:
import torch
import torch.nn as nn
import torch.optim as optim
# Define Autoencoder
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Linear(input_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)
        x_reconstructed = self.decoder(z)
        return x_reconstructed
# Example data
input_dim = 4
latent_dim = 2
model = Autoencoder(input_dim, latent_dim)
# Example input
data = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Training loop (1 epoch for simplicity)
model.train()
optimizer.zero_grad()
reconstructed = model(data)
loss = criterion(reconstructed, data)
loss.backward()
optimizer.step()
print("Reconstructed Data:", reconstructed.detach().numpy())
print("Loss:", loss.item())
Fun Analogy
An Autoencoder is like packing a suitcase 🧳:
- Encoder: Carefully folds your clothes (compresses data).
- Latent Space: The compact suitcase holds only the essentials.
- Decoder: Unpacks the suitcase when you reach your destination (reconstructs data). 🎒
Mermaid.js Diagram: Autoencoder Workflow
graph TD
    Input[Input Data X] --> Encoder[Encoder: Reduce Dimensions]
    Encoder --> LatentSpace[Latent Space Z]
    LatentSpace --> Decoder[Decoder: Reconstruct Data]
    Decoder --> Output[Reconstructed Data X']
    Output --> Loss[Compute Reconstruction Loss]