Transformers and Attention Mechanisms: Unveiling the Math Behind Modern NLP Models
Raj Shaikh · 8 min read · 1645 words

1. Dot-Product Attention: AI’s Laser Focus 🎯
What is Attention?
Attention is the mechanism that helps neural networks decide which parts of the input matter the most. It’s like saying, “Don’t just listen to the whole song, focus on the lyrics that scream heartbreak!” 🎤💔
Dot-Product Attention: The Math
At its heart, attention is about scoring how much one word (or token) relates to another. The score is calculated using the dot product of two vectors:
- Query (\( Q \)): What are we looking for?
- Key (\( K \)): What do we have in the dataset?
- Value (\( V \)): What information do we retrieve if there’s a match?
Steps:

1. Compute the scores:
\[ \text{Score}(Q, K) = Q \cdot K^T \]

2. Scale the scores:
\[ \text{Scaled Score} = \frac{\text{Score}(Q, K)}{\sqrt{d_k}} \]
where \( d_k \) is the dimensionality of the keys. Scaling prevents large dot-product values from dominating the softmax output.

3. Apply softmax:
\[ \text{Attention Weights} = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \]
This converts the scores into probabilities.

4. Compute the weighted sum: use the attention weights to combine the values into the final output:
\[ \text{Output} = \text{Attention Weights} \cdot V \]
Numerical Example
Let’s calculate attention for 3 tokens with \( d_k = 2 \):
| Token | Query (\( Q \)) | Key (\( K \)) | Value (\( V \)) |
|---|---|---|---|
| 1 | [1, 0] | [1, 1] | [10, 0] |
| 2 | [0, 1] | [1, 0] | [0, 10] |
| 3 | [1, 1] | [0, 1] | [5, 5] |
1. Dot product (score): for \( Q_1 \) against all keys:
\[ Q_1 \cdot K^T = \begin{bmatrix} 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \end{bmatrix} \]

2. Scale:
\[ \text{Scaled Score} = \frac{\text{Score}}{\sqrt{2}} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \end{bmatrix} \]

3. Softmax:
\[ \text{Attention Weights} = \text{softmax}\left(\begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \end{bmatrix}\right) \approx [0.40, 0.40, 0.20] \]

4. Weighted sum: use these weights to combine the values:
\[ \text{Output} = \text{Attention Weights} \cdot V \approx [0.40, 0.40, 0.20] \cdot \begin{bmatrix} 10 & 0 \\ 0 & 10 \\ 5 & 5 \end{bmatrix} = [5.0, 5.0] \]
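To sanity-check the arithmetic above, here is a minimal NumPy sketch of the single-query case (the full three-query version appears in the code example below; variable names here are just illustrative):

```python
import numpy as np

Q1 = np.array([1.0, 0.0])                             # query for token 1
K = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])    # keys
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])  # values

scores = Q1 @ K.T                                # -> [1, 1, 0]
scaled = scores / np.sqrt(K.shape[1])            # divide by sqrt(d_k) = sqrt(2)
weights = np.exp(scaled) / np.exp(scaled).sum()  # -> roughly [0.40, 0.40, 0.20]
output = weights @ V                             # -> roughly [5.0, 5.0]
print("Weights:", weights, "Output:", output)
```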
Why Dot-Product Attention Matters
Dot-product attention is the core mechanism in:
- Transformers: Helps process relationships between words in a sentence, regardless of distance.
- Generative Models: Focuses on relevant information to create meaningful outputs.
Code Example: Dot-Product Attention
Here’s how to compute dot-product attention in Python:
```python
import numpy as np

# Queries (Q), Keys (K), and Values (V)
Q = np.array([[1, 0], [0, 1], [1, 1]])    # 3 queries
K = np.array([[1, 1], [1, 0], [0, 1]])    # 3 keys
V = np.array([[10, 0], [0, 10], [5, 5]])  # 3 values

# Compute dot products
scores = np.dot(Q, K.T)

# Scale scores
d_k = Q.shape[1]
scaled_scores = scores / np.sqrt(d_k)

# Apply softmax (row-wise)
attention_weights = np.exp(scaled_scores) / np.sum(np.exp(scaled_scores), axis=1, keepdims=True)

# Compute output
output = np.dot(attention_weights, V)

print("Attention Weights:\n", attention_weights)
print("Output:\n", output)
```
Fun Analogy
Dot-product attention is like shopping for groceries 🛒:
- You have a query (what you need, like “milk”).
- The store has keys (labels on items).
- When the query matches a key, you grab the value (the actual milk carton). 🥛
Mermaid.js Diagram: Dot-Product Attention Flow
```mermaid
graph TD
    Queries[Queries Q] --> Scores[Compute Dot Product with Keys K]
    Scores --> Scale[Scale by sqrt d_k]
    Scale --> Softmax[Apply Softmax to Get Weights]
    Softmax --> WeightedSum[Weighted Sum with Values V]
    WeightedSum --> Output[Attention Output]
```
2. Self-Attention Matrix Calculations: When Tokens Gossip 🗣️🤔
What is Self-Attention?
Self-Attention allows every token (word, character, etc.) in a sequence to “pay attention” to all other tokens, including itself. This helps the model understand relationships and context.
Example:
In the sentence “She gave her dog a bone,” who does “her” refer to? Self-Attention helps resolve this ambiguity by analyzing the entire sentence.
How Self-Attention Works
Self-Attention uses the Queries (\( Q \)), Keys (\( K \)), and Values (\( V \)) we discussed earlier. Here’s the workflow:
1. Compute the scores:
\[ \text{Scores} = Q \cdot K^T \]
This gives a matrix showing how much each token relates to every other token.

2. Scale the scores:
\[ \text{Scaled Scores} = \frac{\text{Scores}}{\sqrt{d_k}} \]
This prevents large dot-product values from messing up the softmax.

3. Apply softmax:
\[ \text{Attention Weights} = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \]
This converts the scores into probabilities, row by row.

4. Compute the weighted sum: multiply the attention weights by the values:
\[ \text{Output} = \text{Attention Weights} \cdot V \]
Numerical Example: Self-Attention
Let’s compute self-attention for 3 tokens with \( d_k = 2 \):
| Token | Query (\( Q \)) | Key (\( K \)) | Value (\( V \)) |
|---|---|---|---|
| 1 | [1, 0] | [1, 1] | [10, 0] |
| 2 | [0, 1] | [1, 0] | [0, 10] |
| 3 | [1, 1] | [0, 1] | [5, 5] |
Step 1: Compute Scores
For \( Q \cdot K^T \):
\[ \text{Scores} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}^T = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 2 & 1 & 1 \end{bmatrix} \]

Step 2: Scale the Scores

Scale by \( \sqrt{d_k} = \sqrt{2} \):

\[ \text{Scaled Scores} = \frac{\text{Scores}}{\sqrt{2}} = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ \frac{2}{\sqrt{2}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix} \]

Step 3: Apply Softmax

For each row, apply softmax:

\[ \text{Attention Weights} = \text{softmax}\left(\text{Scaled Scores}\right) \]

For row 1:

\[ \text{softmax}\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0\right) \approx \begin{bmatrix} 0.40 & 0.40 & 0.20 \end{bmatrix} \]

Step 4: Compute Weighted Sum

Multiply the attention weights by \( V \):

\[ \text{Output} = \text{Attention Weights} \cdot V \]

For row 1:

\[ [0.40, 0.40, 0.20] \cdot \begin{bmatrix} 10 & 0 \\ 0 & 10 \\ 5 & 5 \end{bmatrix} \approx [5.0, 5.0] \]

Repeating the same steps for rows 2 and 3 gives outputs of roughly \([6.0, 4.0]\) and \([6.3, 3.7]\), which the code example below reproduces. 🎉
Why Self-Attention Matters
- Contextual Understanding: each token “talks” to the others, capturing relationships across the entire input.
- Parallel Processing: unlike RNNs, Transformers process all tokens at once.
- Versatility: powers tasks like translation, summarization, and text generation.
Code Example: Self-Attention in Python
Here’s how to compute self-attention:
```python
import numpy as np

# Define Q, K, and V matrices
Q = np.array([[1, 0], [0, 1], [1, 1]])    # Queries
K = np.array([[1, 1], [1, 0], [0, 1]])    # Keys
V = np.array([[10, 0], [0, 10], [5, 5]])  # Values

# Compute Scores
scores = np.dot(Q, K.T)

# Scale Scores
d_k = Q.shape[1]
scaled_scores = scores / np.sqrt(d_k)

# Apply Softmax (row-wise)
softmax = np.exp(scaled_scores) / np.sum(np.exp(scaled_scores), axis=1, keepdims=True)

# Compute Output
output = np.dot(softmax, V)

print("Self-Attention Weights:\n", softmax)
print("Output:\n", output)
```
Fun Analogy
Self-Attention is like a team meeting 🧑🤝🧑:
- Every teammate shares their perspective (Keys).
- Everyone listens carefully to others (Queries).
- Decisions are made based on weighted input (Values). 🧠
Mermaid.js Diagram: Self-Attention Flow
```mermaid
graph TD
    Tokens[Input Tokens] --> QKV[Compute Q, K, V Matrices]
    QKV --> Scores[Compute Q * K^T]
    Scores --> Scale[Scale by sqrt d_k]
    Scale --> Softmax[Apply Softmax]
    Softmax --> WeightedSum[Compute Weighted Sum with V]
    WeightedSum --> Output[Final Self-Attention Output]
```
3. Positional Encoding: Teaching Transformers the Art of Sequence 🎨📏
What is Positional Encoding?
Positional Encoding helps Transformers understand the order of tokens in a sequence. Without it, “I love AI” and “AI love I” would mean the same thing (and no one loves that kind of chaos). 😅
It assigns each token a unique position-based embedding, which gets added to the word embedding. This way, the model knows where each word belongs in the sequence.
How Positional Encoding Works
Positional Encoding uses sinusoidal functions to encode the position of tokens. Why sinusoidal? Because it allows models to generalize to unseen sequence lengths (fancy math magic at its best).
For a token at position \( pos \) in the sequence, the encoding for the \( i \)-th dimension is:
\[ PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \]
\[ PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \]

where:
- \( pos \): Token position.
- \( i \): Dimension index.
- \( d_{\text{model}} \): Dimension of the embedding.
Why Sinusoids?
- Periodicity: Helps encode relative positions (e.g., “this token is closer to that one”); see the identity sketched after this list.
- Extrapolation: Works for sequences longer than those seen during training.
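A brief justification for the relative-position point (this is standard trigonometry, not something specific to Transformers): for a fixed offset \( k \), the angle-addition identities mean \( PE(pos + k) \) is a fixed linear combination of the sine and cosine values in \( PE(pos) \):

\[ \sin(\omega_i (pos + k)) = \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \]
\[ \cos(\omega_i (pos + k)) = \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k) \]

where \( \omega_i = 1 / 10000^{2i / d_{\text{model}}} \). The coefficients depend only on the offset \( k \), which is what lets the model reason about relative distances between tokens.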
Numerical Example
Let’s compute positional encodings for two tokens (\( pos = 0 \) and \( pos = 1 \)) with \( d_{\text{model}} = 4 \):
For \( pos = 0 \) (every dimension):

\[ PE(0, 2i) = \sin(0) = 0, \quad PE(0, 2i+1) = \cos(0) = 1 \]

so \( PE(0) = [0, 1, 0, 1] \).

For \( pos = 1 \):

\[ PE(1, 2i) = \sin\left(\frac{1}{10000^{\frac{2i}{4}}}\right), \quad PE(1, 2i+1) = \cos\left(\frac{1}{10000^{\frac{2i}{4}}}\right) \]

which works out to:

- \( PE(1, 0) = \sin(1) \approx 0.841 \)
- \( PE(1, 1) = \cos(1) \approx 0.540 \)
- \( PE(1, 2) = \sin(1/100) \approx 0.010 \)
- \( PE(1, 3) = \cos(1/100) \approx 1.000 \)

so \( PE(1) \approx [0.841, 0.540, 0.010, 1.000] \).
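As a quick NumPy check of those values (a standalone sketch; the full `positional_encoding` function appears in the code example at the end of this section):

```python
import numpy as np

d_model = 4
pos = 1
pe = np.zeros(d_model)
for i in range(d_model // 2):                    # i is the pair index from the formula
    angle = pos / (10000 ** (2 * i / d_model))
    pe[2 * i] = np.sin(angle)
    pe[2 * i + 1] = np.cos(angle)
print(np.round(pe, 3))  # approximately [0.841, 0.540, 0.010, 1.000]
```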
Adding Positional Encoding
Positional encodings are added to the word embeddings:
\[ \text{Word Embedding} + \text{Positional Encoding} = \text{Final Input to the Transformer} \]

Example (for the token at \( pos = 1 \)):

- Word Embedding: \([0.5, 0.2, 0.3, 0.1]\)
- Positional Encoding: \([0.841, 0.540, 0.010, 1.000]\)
- Final Input: \([1.341, 0.740, 0.310, 1.100]\)
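In code this step is just an element-wise addition; the embedding values below are made up purely for illustration:

```python
import numpy as np

word_embedding = np.array([0.5, 0.2, 0.3, 0.1])        # illustrative embedding for one token
pos_encoding = np.array([0.841, 0.540, 0.010, 1.000])  # PE(1) from the example above
final_input = word_embedding + pos_encoding
print(final_input)  # -> [1.341, 0.740, 0.310, 1.100]
```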
Why Positional Encoding Matters
- Order Awareness: Transformers process sequences in parallel, so they need a way to know the order of tokens.
- Long-Sequence Generalization: sinusoids enable Transformers to handle longer sequences without retraining.
Code Example: Positional Encoding
Here’s how to implement Positional Encoding in Python:
```python
import numpy as np

def positional_encoding(max_position, d_model):
    """Sinusoidal positional encodings with shape (max_position, d_model)."""
    pos_enc = np.zeros((max_position, d_model))
    for pos in range(max_position):
        for i in range(0, d_model, 2):              # i is the even dimension index, i.e. 2i in the formula
            angle = pos / (10000 ** (i / d_model))  # so the exponent 2i / d_model becomes i / d_model here
            pos_enc[pos, i] = np.sin(angle)
            if i + 1 < d_model:                     # guard against an odd d_model
                pos_enc[pos, i + 1] = np.cos(angle)
    return pos_enc

# Example usage
max_position = 5  # Sequence length
d_model = 4       # Embedding dimension
pos_enc = positional_encoding(max_position, d_model)
print("Positional Encoding:\n", pos_enc)
```
Fun Analogy
Positional Encoding is like assigning seats in a theater 🎭:
- Each token gets a unique “seat number” (positional encoding).
- This ensures the Transformer doesn’t mix up who’s saying what in the script.
Mermaid.js Diagram: Positional Encoding Flow
```mermaid
graph TD
    Tokens[Input Tokens] --> WordEmbedding[Generate Word Embeddings]
    Tokens --> Position[Assign Positions]
    Position --> SinCos[Compute Sin/Cos Values]
    SinCos --> PosEnc[Create Positional Encodings]
    WordEmbedding --> Combine[Add Word Embeddings and Positional Encodings]
    PosEnc --> Combine
    Combine --> FinalInput[Final Input to Transformer]
```