Transformers and Attention Mechanisms: Unveiling the Math Behind Modern NLP Models
Raj Shaikh · 8 min read · 1645 words

1. Dot-Product Attention: AI’s Laser Focus 🎯
What is Attention?
Attention is the mechanism that helps neural networks decide which parts of the input matter the most. It’s like saying, “Don’t just listen to the whole song, focus on the lyrics that scream heartbreak!” 🎤💔
Dot-Product Attention: The Math
At its heart, attention is about scoring how much one word (or token) relates to another. The score is calculated using the dot product of two vectors:
- Query (\( Q \)): What are we looking for?
- Key (\( K \)): What do we have in the dataset?
- Value (\( V \)): What information do we retrieve if there’s a match?
Steps:

1. Compute the scores:
\[ \text{Score}(Q, K) = Q \cdot K^T \]

2. Scale the scores:
\[ \text{Scaled Score} = \frac{\text{Score}(Q, K)}{\sqrt{d_k}} \]
where \( d_k \) is the dimensionality of the keys. Scaling prevents large dot-product values from dominating the softmax output.

3. Apply softmax:
\[ \text{Attention Weights} = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \]
This converts the scores into probabilities.

4. Compute the weighted sum: use the attention weights to combine the values into the final output:
\[ \text{Output} = \text{Attention Weights} \cdot V \]
Numerical Example
Let’s calculate attention for 3 tokens with \( d_k = 2 \):
| Token | Query (\( Q \)) | Key (\( K \)) | Value (\( V \)) |
|---|---|---|---|
| 1 | [1, 0] | [1, 1] | [10, 0] |
| 2 | [0, 1] | [1, 0] | [0, 10] |
| 3 | [1, 1] | [0, 1] | [5, 5] |
1. Dot product (score): for \( Q_1 \) against all keys:
\[ Q_1 \cdot K^T = \begin{bmatrix} 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \end{bmatrix} \]

2. Scale:
\[ \text{Scaled Score} = \frac{\text{Score}}{\sqrt{2}} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \end{bmatrix} \]

3. Softmax:
\[ \text{Attention Weights} = \text{softmax}\left(\begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \end{bmatrix}\right) \approx [0.40, 0.40, 0.20] \]

4. Weighted sum: use these weights to combine the values:
\[ \text{Output} = \text{Attention Weights} \cdot V \approx [0.40, 0.40, 0.20] \cdot \begin{bmatrix} 10 & 0 \\ 0 & 10 \\ 5 & 5 \end{bmatrix} = [5.0, 5.0] \]
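To sanity-check the arithmetic above, here is a minimal NumPy sketch of the single-query case (the full three-query version appears in the code example below; variable names here are just illustrative):

```python
import numpy as np

Q1 = np.array([1.0, 0.0])                             # query for token 1
K = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])    # keys
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])  # values

scores = Q1 @ K.T                                # -> [1, 1, 0]
scaled = scores / np.sqrt(K.shape[1])            # divide by sqrt(d_k) = sqrt(2)
weights = np.exp(scaled) / np.exp(scaled).sum()  # -> roughly [0.40, 0.40, 0.20]
output = weights @ V                             # -> roughly [5.0, 5.0]
print("Weights:", weights, "Output:", output)
```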
Why Dot-Product Attention Matters
Dot-product attention is the core mechanism in:
- Transformers: Helps process relationships between words in a sentence, regardless of distance.
- Generative Models: Focuses on relevant information to create meaningful outputs.
Code Example: Dot-Product Attention
Here’s how to compute dot-product attention in Python:
```python
import numpy as np

# Queries (Q), Keys (K), and Values (V)
Q = np.array([[1, 0], [0, 1], [1, 1]])    # 3 queries
K = np.array([[1, 1], [1, 0], [0, 1]])    # 3 keys
V = np.array([[10, 0], [0, 10], [5, 5]])  # 3 values

# Compute dot products
scores = np.dot(Q, K.T)

# Scale scores
d_k = Q.shape[1]
scaled_scores = scores / np.sqrt(d_k)

# Apply softmax (row-wise)
attention_weights = np.exp(scaled_scores) / np.sum(np.exp(scaled_scores), axis=1, keepdims=True)

# Compute output
output = np.dot(attention_weights, V)

print("Attention Weights:\n", attention_weights)
print("Output:\n", output)
```
Fun Analogy
Dot-product attention is like shopping for groceries 🛒:
- You have a query (what you need, like “milk”).
- The store has keys (labels on items).
- When the query matches a key, you grab the value (the actual milk carton). 🥛
Mermaid.js Diagram: Dot-Product Attention Flow
```mermaid
graph TD
    Queries[Queries Q] --> Scores[Compute Dot Product with Keys K]
    Scores --> Scale[Scale by sqrt d_k]
    Scale --> Softmax[Apply Softmax to Get Weights]
    Softmax --> WeightedSum[Weighted Sum with Values V]
    WeightedSum --> Output[Attention Output]
```
2. Self-Attention Matrix Calculations: When Tokens Gossip 🗣️🤔
What is Self-Attention?
Self-Attention allows every token (word, character, etc.) in a sequence to “pay attention” to all other tokens, including itself. This helps the model understand relationships and context.
Example:
In the sentence “She gave her dog a bone,” who does “her” refer to? Self-Attention helps resolve this ambiguity by analyzing the entire sentence.
How Self-Attention Works
Self-Attention uses the Queries (\( Q \)), Keys (\( K \)), and Values (\( V \)) we discussed earlier. Here’s the workflow:
1. Compute the scores:
\[ \text{Scores} = Q \cdot K^T \]
This gives a matrix showing how much each token relates to every other token.

2. Scale the scores:
\[ \text{Scaled Scores} = \frac{\text{Scores}}{\sqrt{d_k}} \]
This prevents large dot-product values from messing up the softmax.

3. Apply softmax:
\[ \text{Attention Weights} = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \]
This converts the scores into probabilities, row by row.

4. Compute the weighted sum: multiply the attention weights by the values:
\[ \text{Output} = \text{Attention Weights} \cdot V \]
Numerical Example: Self-Attention
Let’s compute self-attention for 3 tokens with \( d_k = 2 \):
| Token | Query (\( Q \)) | Key (\( K \)) | Value (\( V \)) |
|---|---|---|---|
| 1 | [1, 0] | [1, 1] | [10, 0] |
| 2 | [0, 1] | [1, 0] | [0, 10] |
| 3 | [1, 1] | [0, 1] | [5, 5] |
Step 1: Compute Scores
For \( Q \cdot K^T \):
\[ \text{Scores} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}^T = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 2 & 1 & 1 \end{bmatrix} \]

Step 2: Scale the Scores

Scale by \( \sqrt{d_k} = \sqrt{2} \):

\[ \text{Scaled Scores} = \frac{\text{Scores}}{\sqrt{2}} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ \frac{2}{\sqrt{2}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix} \]

Step 3: Apply Softmax

For each row, apply softmax:

\[ \text{Attention Weights} = \text{softmax}\left(\text{Scaled Scores}\right) \]

For row 1:

\[ \text{softmax}\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0\right) \approx \begin{bmatrix} 0.40 & 0.40 & 0.20 \end{bmatrix} \]

Step 4: Compute Weighted Sum

Multiply the attention weights by \( V \):

\[ \text{Output} = \text{Attention Weights} \cdot V \]

For row 1:

\[ [0.40, 0.40, 0.20] \cdot \begin{bmatrix} 10 & 0 \\ 0 & 10 \\ 5 & 5 \end{bmatrix} \approx [5.0, 5.0] \]

Repeating the same steps for rows 2 and 3 gives outputs of roughly \([6.0, 4.0]\) and \([6.3, 3.7]\), which the code example below reproduces. 🎉
Why Self-Attention Matters
- Contextual Understanding: each token “talks” to the others, capturing relationships across the entire input.
- Parallel Processing: unlike RNNs, Transformers process all tokens at once.
- Versatility: powers tasks like translation, summarization, and text generation.
Code Example: Self-Attention in Python
Here’s how to compute self-attention:
```python
import numpy as np

# Define Q, K, and V matrices
Q = np.array([[1, 0], [0, 1], [1, 1]])    # Queries
K = np.array([[1, 1], [1, 0], [0, 1]])    # Keys
V = np.array([[10, 0], [0, 10], [5, 5]])  # Values

# Compute Scores
scores = np.dot(Q, K.T)

# Scale Scores
d_k = Q.shape[1]
scaled_scores = scores / np.sqrt(d_k)

# Apply Softmax (row-wise)
softmax = np.exp(scaled_scores) / np.sum(np.exp(scaled_scores), axis=1, keepdims=True)

# Compute Output
output = np.dot(softmax, V)

print("Self-Attention Weights:\n", softmax)
print("Output:\n", output)
```
Fun Analogy
Self-Attention is like a team meeting 🧑🤝🧑:
- Every teammate shares their perspective (Keys).
- Everyone listens carefully to others (Queries).
- Decisions are made based on weighted input (Values). 🧠
Mermaid.js Diagram: Self-Attention Flow
```mermaid
graph TD
    Tokens[Input Tokens] --> QKV[Compute Q, K, V Matrices]
    QKV --> Scores[Compute Q * K^T]
    Scores --> Scale[Scale by sqrt d_k]
    Scale --> Softmax[Apply Softmax]
    Softmax --> WeightedSum[Compute Weighted Sum with V]
    WeightedSum --> Output[Final Self-Attention Output]
```
3. Positional Encoding: Teaching Transformers the Art of Sequence 🎨📏
What is Positional Encoding?
Positional Encoding helps Transformers understand the order of tokens in a sequence. Without it, “I love AI” and “AI love I” would mean the same thing (and no one loves that kind of chaos). 😅
It assigns each token a unique position-based embedding, which gets added to the word embedding. This way, the model knows where each word belongs in the sequence.
How Positional Encoding Works
Positional Encoding uses sinusoidal functions to encode the position of tokens. Why sinusoidal? Because it allows models to generalize to unseen sequence lengths (fancy math magic at its best).
For a token at position \( pos \) in the sequence, the encoding for the \( i \)-th dimension is:
\[ PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \]
\[ PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \]

where:
- \( pos \): Token position.
- \( i \): Dimension index.
- \( d_{\text{model}} \): Dimension of the embedding.
Why Sinusoids?
- Periodicity: Helps encode relative positions (e.g., “this token is closer to that one”); see the identity sketched after this list.
- Extrapolation: Works for sequences longer than those seen during training.
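A brief justification for the relative-position point (this is standard trigonometry, not something specific to Transformers): for a fixed offset \( k \), the angle-addition identities mean \( PE(pos + k) \) is a fixed linear combination of the sine and cosine values in \( PE(pos) \):

\[ \sin(\omega_i (pos + k)) = \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \]
\[ \cos(\omega_i (pos + k)) = \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k) \]

where \( \omega_i = 1 / 10000^{2i / d_{\text{model}}} \). The coefficients depend only on the offset \( k \), which is what lets the model reason about relative distances between tokens.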
Numerical Example
Let’s compute positional encodings for two tokens (\( pos = 0 \) and \( pos = 1 \)) with \( d_{\text{model}} = 4 \):
For \( pos = 0 \) (every dimension):

\[ PE(0, 2i) = \sin(0) = 0, \quad PE(0, 2i+1) = \cos(0) = 1 \]

so \( PE(0) = [0, 1, 0, 1] \).

For \( pos = 1 \):

\[ PE(1, 2i) = \sin\left(\frac{1}{10000^{\frac{2i}{4}}}\right), \quad PE(1, 2i+1) = \cos\left(\frac{1}{10000^{\frac{2i}{4}}}\right) \]

which works out to:

- \( PE(1, 0) = \sin(1) \approx 0.841 \)
- \( PE(1, 1) = \cos(1) \approx 0.540 \)
- \( PE(1, 2) = \sin(1/100) \approx 0.010 \)
- \( PE(1, 3) = \cos(1/100) \approx 1.000 \)

so \( PE(1) \approx [0.841, 0.540, 0.010, 1.000] \).
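As a quick NumPy check of those values (a standalone sketch; the full `positional_encoding` function appears in the code example at the end of this section):

```python
import numpy as np

d_model = 4
pos = 1
pe = np.zeros(d_model)
for i in range(d_model // 2):                    # i is the pair index from the formula
    angle = pos / (10000 ** (2 * i / d_model))
    pe[2 * i] = np.sin(angle)
    pe[2 * i + 1] = np.cos(angle)
print(np.round(pe, 3))  # approximately [0.841, 0.540, 0.010, 1.000]
```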
Adding Positional Encoding
Positional encodings are added to the word embeddings:
\[ \text{Word Embedding} + \text{Positional Encoding} = \text{Final Input to the Transformer} \]

Example (for the token at \( pos = 1 \)):

- Word Embedding: \([0.5, 0.2, 0.3, 0.1]\)
- Positional Encoding: \([0.841, 0.540, 0.010, 1.000]\)
- Final Input: \([1.341, 0.740, 0.310, 1.100]\)
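In code this step is just an element-wise addition; the embedding values below are made up purely for illustration:

```python
import numpy as np

word_embedding = np.array([0.5, 0.2, 0.3, 0.1])        # illustrative embedding for one token
pos_encoding = np.array([0.841, 0.540, 0.010, 1.000])  # PE(1) from the example above
final_input = word_embedding + pos_encoding
print(final_input)  # -> [1.341, 0.740, 0.310, 1.100]
```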
Why Positional Encoding Matters
- Order Awareness: Transformers process sequences in parallel, so they need a way to know the order of tokens.
- Long-Sequence Generalization: sinusoids enable Transformers to handle longer sequences without retraining.
Code Example: Positional Encoding
Here’s how to implement Positional Encoding in Python:
```python
import numpy as np

def positional_encoding(max_position, d_model):
    """Sinusoidal positional encodings with shape (max_position, d_model)."""
    pos_enc = np.zeros((max_position, d_model))
    for pos in range(max_position):
        for i in range(0, d_model, 2):              # i is the even dimension index, i.e. 2i in the formula
            angle = pos / (10000 ** (i / d_model))  # so the exponent 2i / d_model becomes i / d_model here
            pos_enc[pos, i] = np.sin(angle)
            if i + 1 < d_model:                     # guard against an odd d_model
                pos_enc[pos, i + 1] = np.cos(angle)
    return pos_enc

# Example usage
max_position = 5  # Sequence length
d_model = 4       # Embedding dimension
pos_enc = positional_encoding(max_position, d_model)
print("Positional Encoding:\n", pos_enc)
```
Fun Analogy
Positional Encoding is like assigning seats in a theater 🎭:
- Each token gets a unique “seat number” (positional encoding).
- This ensures the Transformer doesn’t mix up who’s saying what in the script.
Mermaid.js Diagram: Positional Encoding Flow
```mermaid
graph TD
    Tokens[Input Tokens] --> WordEmbedding[Generate Word Embeddings]
    Tokens --> Position[Assign Positions]
    Position --> SinCos[Compute Sin/Cos Values]
    SinCos --> PosEnc[Create Positional Encodings]
    WordEmbedding --> Combine[Add Word Embeddings and Positional Encodings]
    PosEnc --> Combine
    Combine --> FinalInput[Final Input to Transformer]
```