Mathematics of Dimensionality Reduction: PCA, t-SNE, UMAP, and Autoencoders
Raj Shaikh
1. Principal Component Analysis (PCA): Finding the Data’s Hidden Axes 🧭
What Is PCA?
PCA is a statistical method that reduces the dimensionality of data by finding the principal components, which are directions where the data varies the most. In other words:
- It’s like pointing a flashlight at your data cloud and projecting its shadow onto a lower-dimensional surface. 🌟
How PCA Works: The Math Journey
Step 1: Mean Center the Data
Shift the data so its mean is zero. For a dataset \( X \):
\[ X_{\text{centered}} = X - \mu \]
where \( \mu \) is the mean of the dataset.
Step 2: Compute the Covariance Matrix
The covariance matrix captures how features vary together:
\[ \Sigma = \frac{1}{n} X_{\text{centered}}^T X_{\text{centered}} \]
This matrix tells us in which directions the data is most spread out.
Step 3: Eigen Decomposition
Find the eigenvectors and eigenvalues of the covariance matrix:
\[ \Sigma v = \lambda v \]
where:
- \( v \): Eigenvector (direction of maximum variance).
- \( \lambda \): Eigenvalue (magnitude of variance along \( v \)).
Step 4: Choose Principal Components
Select the top \( k \) eigenvectors corresponding to the largest eigenvalues. These eigenvectors are the principal components.
Step 5: Project Data
Project the data onto the new lower-dimensional subspace:
\[ X_{\text{reduced}} = X_{\text{centered}} \cdot V_k \]
where \( V_k \) contains the top \( k \) eigenvectors.
Numerical Example
Suppose we have a simple 2D dataset:
\[ X = \begin{bmatrix} 2 & 4 \\ 3 & 5 \\ 4 & 6 \\ 5 & 7 \end{bmatrix} \]
- Mean Center the Data:
- Compute \( \mu = [3.5, 5.5] \).
- Subtract \( \mu \): \[ X_{\text{centered}} = \begin{bmatrix} -1.5 & -1.5 \\ -0.5 & -0.5 \\ 0.5 & 0.5 \\ 1.5 & 1.5 \end{bmatrix} \]
- Covariance Matrix:
\[ \Sigma = \frac{1}{4} X_{\text{centered}}^T X_{\text{centered}} = \begin{bmatrix} 1.25 & 1.25 \\ 1.25 & 1.25 \end{bmatrix} \]
- Eigen Decomposition:
- Eigenvalues: \( \lambda_1 = 2.5, \lambda_2 = 0 \).
- Eigenvectors (normalized): \( v_1 = \tfrac{1}{\sqrt{2}}[1, 1]^T, \quad v_2 = \tfrac{1}{\sqrt{2}}[-1, 1]^T \).
- Project Data:
- Use \( v_1 \) to project the data onto 1D: \[ X_{\text{reduced}} = X_{\text{centered}} \cdot v_1 \approx \begin{bmatrix} -2.12 \\ -0.71 \\ 0.71 \\ 2.12 \end{bmatrix} \]
Why PCA Is Awesome
- Simplifies Data:
- Reduces dimensions while preserving variability.
- De-noises Data:
- Eliminates small, irrelevant variations.
- Visualization:
- Great for plotting high-dimensional data in 2D or 3D.
Challenges of PCA
- Linear Assumption:
- PCA only captures linear structure, so it can miss non-linear patterns (e.g., data lying on a curved manifold).
- Sensitive to Scaling:
- Features with different scales can skew results.
- Solution: Standardize the data before PCA.
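A minimal sketch of that fix, using scikit-learn's StandardScaler in front of PCA (the exaggerated scale of the second feature is made up purely for illustration):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
# Second feature lives on a much larger scale (illustrative data)
X_raw = np.array([[2.0, 400.0], [3.0, 500.0], [4.0, 600.0], [5.0, 700.0]])
# Standardize each feature to zero mean and unit variance, then apply PCA
pipeline = make_pipeline(StandardScaler(), PCA(n_components=1))
X_reduced_scaled = pipeline.fit_transform(X_raw)
print("Reduced (standardized) Data:\n", X_reduced_scaled)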
Code Example: PCA in Python
Here’s how to perform PCA manually with NumPy, and then verify the result with scikit-learn:
import numpy as np
from sklearn.decomposition import PCA
# Example data
X = np.array([[2, 4], [3, 5], [4, 6], [5, 7]])
# Perform PCA manually with NumPy
X_centered = X - np.mean(X, axis=0)
cov_matrix = np.cov(X_centered, rowvar=False)  # note: np.cov divides by n - 1
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# Sort eigenvectors by descending eigenvalue (np.linalg.eig does not sort them)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]
# Project onto the first principal component
principal_component = eigenvectors[:, 0]
X_reduced = np.dot(X_centered, principal_component)
print("Reduced Data:\n", X_reduced)
# Alternatively, use sklearn
pca = PCA(n_components=1)
X_sklearn_reduced = pca.fit_transform(X)
print("Reduced Data (sklearn):\n", X_sklearn_reduced)
Fun Analogy
PCA is like squishing a watermelon 🍉 into a juice box:
- The watermelon (original data) has all the information.
- You find the juiciest parts (principal components).
- You throw away the rind (low-variance features).
- The juice box (reduced data) is compact but still delicious! 🧃
Mermaid.js Diagram: PCA Workflow
graph TD
    InputData[Input Data] --> Center["Center Data (Subtract Mean)"]
    Center --> Covariance[Compute Covariance Matrix]
    Covariance --> EigenDecomposition["Eigen Decomposition (Find Eigenvectors/Values)"]
    EigenDecomposition --> SelectPCs[Select Top Principal Components]
    SelectPCs --> Project[Project Data Onto PCs]
    Project --> OutputData[Reduced Data]
2. t-SNE and UMAP: The Wizards of Non-Linear Dimensionality Reduction 🧙‍♂️✨
What Are t-SNE and UMAP?
t-SNE:
- Focuses on local structure by preserving pairwise similarities.
- Great for visualizing high-dimensional data in 2D or 3D.
- Think of it as compressing a map 🌍 but keeping cities close to their neighbors.
UMAP:
- Similar to t-SNE but faster and better at preserving global structure.
- It’s like t-SNE’s speedier, more precise sibling. 🚀
1. t-SNE: Trusting Your Neighbors in Data 👫
t-SNE tries to keep similar points close and dissimilar points far apart by creating two probability distributions:
- High-Dimensional Space:
- Compute the probability of similarity between points \( i \) and \( j \) using a Gaussian kernel: \[ P_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2 \sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2 \sigma_i^2)} \] (strictly, this is the conditional similarity \( p_{j|i} \), which t-SNE then symmetrizes; each bandwidth \( \sigma_i \) is chosen to match a target perplexity).
- Low-Dimensional Space:
- Define another similarity probability \( Q_{ij} \) using a Student-t distribution: \[ Q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \]
- KL Divergence:
- Minimize the difference between \( P_{ij} \) and \( Q_{ij} \): \[ \mathcal{L} = \sum_{i \neq j} P_{ij} \log \frac{P_{ij}}{Q_{ij}} \]
This creates a low-dimensional embedding where similar points in high-dimensional space remain close.
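To make the objective concrete, here is a minimal NumPy sketch of these three quantities. It is an illustration, not real t-SNE: the Gaussian bandwidth is fixed rather than calibrated per point via perplexity, the similarities are normalized jointly over all pairs for brevity, and the embedding Y is random instead of optimized.
import numpy as np
def tsne_objective(X, Y, sigma=1.0):
    # High-dimensional similarities P from a Gaussian kernel (fixed sigma)
    d_hi = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-d_hi / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()
    # Low-dimensional similarities Q from a Student-t kernel
    d_lo = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d_lo)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    # KL divergence between P and Q (the quantity t-SNE minimizes)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))   # toy high-dimensional points
Y = rng.normal(size=(20, 2))    # a candidate (unoptimized) 2D embedding
print("KL divergence:", tsne_objective(X, Y))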
Why t-SNE Rocks
- Intuitive Clustering:
- Groups similar points together.
- Beautiful Visualizations:
- Creates visually appealing 2D/3D representations.
Limitations:
- Computationally expensive.
- Struggles with preserving global structure.
2. UMAP: The Faster Magician 🎩
UMAP builds on t-SNE’s ideas but uses:
- Fuzzy Set Theory:
- Constructs a fuzzy graph of the high-dimensional data.
- Optimization:
- Learns a low-dimensional embedding by minimizing the cross-entropy between fuzzy sets.
The UMAP Math (Simplified):
- High-Dimensional Graph:
- Define a graph where edges represent probabilities of connectivity.
- Low-Dimensional Graph:
- Optimize the layout of a lower-dimensional graph to match the structure.
Why UMAP Shines
- Speed:
- Much faster than t-SNE, especially for large datasets.
- Preserves Global Structure:
- Maintains relationships between clusters.
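In practice, this local/global trade-off is controlled mainly by two documented umap-learn parameters, n_neighbors and min_dist; the specific values below are arbitrary choices for illustration, not recommendations:
import umap.umap_ as umap
# Small neighborhoods and tight packing: emphasizes fine local detail
local_model = umap.UMAP(n_neighbors=5, min_dist=0.1)
# Large neighborhoods and looser packing: emphasizes the global layout
global_model = umap.UMAP(n_neighbors=50, min_dist=0.5)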
Numerical Example: t-SNE in Action
Let’s visualize a dataset with three clusters.
- High-Dimensional Data:
- \( X \): 3 clusters of points in 50 dimensions.
- Compute Probabilities:
- \( P_{ij} \): Similarity in high dimensions.
- \( Q_{ij} \): Similarity in low dimensions.
- KL Divergence:
- Minimize \( \mathcal{L} \) to adjust the low-dimensional points.
Result: A 2D scatter plot where clusters are clearly separated.
Code Example: t-SNE and UMAP
Here’s how to visualize data using t-SNE and UMAP in Python:
from sklearn.manifold import TSNE
import umap.umap_ as umap
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Create synthetic data
X, y = make_blobs(n_samples=500, n_features=50, centers=3, random_state=42)
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Apply UMAP
umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X)
# Plot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.title("t-SNE")
plt.subplot(1, 2, 2)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis')
plt.title("UMAP")
plt.show()
Fun Analogy
t-SNE and UMAP are like Google Maps vs. Waze 🚗:
- t-SNE: Carefully calculates local routes, ensuring neighborhoods are preserved. 🌆
- UMAP: Balances local routes and highways, giving you a broader and faster perspective. 🌍
Mermaid.js Diagram: t-SNE and UMAP Workflow
graph TD
    HighDim[High-Dimensional Data] --> Similarity[Pij: High-Dimensional Similarity]
    Similarity --> LowDim[Optimize Low-Dimensional Space]
    LowDim --> tSNE[t-SNE]
    LowDim --> UMAP[UMAP]
    tSNE --> Visualization1[Visualize Clusters]
    UMAP --> Visualization2[Visualize Clusters]
3. Autoencoders for Compression: The Marie Kondos of Data 🧳✨
What Are Autoencoders?
An Autoencoder is a type of neural network that learns to compress data into a smaller representation (latent space) and then reconstruct it back. Think of it as a zip-unzip mechanism for data:
- Encoder: Squishes data into a compact representation.
- Decoder: Reconstructs the original data from this representation.
How Autoencoders Work: The Neural Workflow
Autoencoders have two main parts, trained with a reconstruction loss:
- Encoder:
- Maps the input \( X \) into a lower-dimensional latent representation \( Z \): \[ Z = f_\theta(X) \]
- Example: A 1000-dimensional image becomes a 50-dimensional vector.
- Decoder:
- Reconstructs \( X' \) from \( Z \): \[ X' = g_\phi(Z) \]
- Loss Function:
- Measures how close \( X' \) (reconstructed data) is to \( X \) (original data): \[ \mathcal{L} = \|X - X'\|^2 \]
Types of Autoencoders
1. Basic Autoencoder
The classic version with fully connected layers, used for straightforward dimensionality reduction.
2. Convolutional Autoencoder
Specialized for image data, using convolutional layers for spatial understanding.
3. Variational Autoencoder (VAE)
Adds a probabilistic twist to the latent space, ensuring smoothness for generative tasks.
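As a rough idea of what the convolutional variant looks like, here is a minimal PyTorch sketch for 28×28 grayscale images (the MNIST-like input shape and the layer sizes are illustrative assumptions, not tuned choices):
import torch
import torch.nn as nn
class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),
        )
    def forward(self, x):
        return self.decoder(self.encoder(x))
model = ConvAutoencoder()
x = torch.rand(8, 1, 28, 28)   # a dummy batch of images
print(model(x).shape)          # torch.Size([8, 1, 28, 28])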
Numerical Example: Compressing a 2D Dataset
Imagine a 2D dataset:
\[ X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \]
Step 1: Encoder
The encoder reduces the data to 1D using weights \( W \) and biases \( b \):
\[ Z = X \cdot W + b \]
Suppose:
\[ W = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}, \quad b = 0 \]
Then:
\[ Z = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 3.5 \\ 5.5 \end{bmatrix} \]
Step 2: Decoder
The decoder reconstructs \( X' \) from \( Z \) using a weight row vector \( W' \):
\[ X' = Z \cdot W' \]
Suppose:
\[ W' = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix} \]
Then:
\[ X' = \begin{bmatrix} 1.5 \\ 3.5 \\ 5.5 \end{bmatrix} \cdot \begin{bmatrix} 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.75 & 0.75 \\ 1.75 & 1.75 \\ 2.75 & 2.75 \end{bmatrix} \]
Step 3: Loss
Compute the reconstruction loss:
\[ \mathcal{L} = \|X - X'\|^2 \]
(summing the squared differences over all entries gives \( \mathcal{L} = 23.875 \) for this example; training adjusts \( W \), \( b \), and \( W' \) to shrink it).
Why Autoencoders Are Cool
- Data Compression:
- Reduces high-dimensional data to compact representations.
- Noise Reduction:
- Cleans noisy data by reconstructing the clean version.
- Feature Extraction:
- Learns meaningful latent features for downstream tasks.
Challenges of Autoencoders
- Overfitting:
- The network may memorize instead of generalizing.
- Solution: Add regularization or use sparse autoencoders.
- Non-Interpretable Latent Space:
- Latent features aren’t always intuitive.
- Solution: Use Variational Autoencoders (VAEs).
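One common way to apply the regularization suggestion above is to add an L1 sparsity penalty on the latent activations to the reconstruction loss; a minimal sketch, where the penalty weight 1e-3 is an arbitrary assumption:
import torch
def sparse_autoencoder_loss(x, x_reconstructed, z, l1_weight=1e-3):
    reconstruction = torch.mean((x - x_reconstructed) ** 2)  # MSE reconstruction term
    sparsity = l1_weight * torch.mean(torch.abs(z))          # L1 penalty on the latent code z
    return reconstruction + sparsity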
Code Example: Autoencoder in PyTorch
Here’s a simple Autoencoder for dimensionality reduction:
import torch
import torch.nn as nn
import torch.optim as optim
# Define Autoencoder
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Linear(input_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)
        x_reconstructed = self.decoder(z)
        return x_reconstructed
# Example data
input_dim = 4
latent_dim = 2
model = Autoencoder(input_dim, latent_dim)
# Example input
data = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
# Training loop (1 epoch for simplicity)
model.train()
optimizer.zero_grad()
reconstructed = model(data)
loss = criterion(reconstructed, data)
loss.backward()
optimizer.step()
print("Reconstructed Data:", reconstructed.detach().numpy())
print("Loss:", loss.item())
Fun Analogy
An Autoencoder is like packing a suitcase 🧳:
- Encoder: Carefully folds your clothes (compresses data).
- Latent Space: The compact suitcase holds only the essentials.
- Decoder: Unpacks the suitcase when you reach your destination (reconstructs data). 🎒
Mermaid.js Diagram: Autoencoder Workflow
graph TD
    Input[Input Data X] --> Encoder[Encoder: Reduce Dimensions]
    Encoder --> LatentSpace[Latent Space Z]
    LatentSpace --> Decoder[Decoder: Reconstruct Data]
    Decoder --> Output[Reconstructed Data X']
    Output --> Loss[Compute Reconstruction Loss]