Deep Learning Theory: Foundations and Applications
Raj Shaikh

1. Artificial Neural Networks (ANNs)
Artificial Neural Networks (ANNs) are a class of machine learning models inspired by the structure and functioning of the human brain. ANNs are designed to recognize patterns, make decisions, and predict outcomes by learning from data. They form the backbone of deep learning and are widely used in tasks like image recognition, natural language processing, and time-series forecasting.
Sub-Contents
- Biological Inspiration
- Structure of ANNs
- Forward Propagation
- Activation Functions
- Loss Function
- Backpropagation and Gradient Descent
- Mathematical Representation of Training
Biological Inspiration

The design of ANNs is inspired by the way neurons in the human brain work. A neuron receives signals, processes them, and transmits the output to other neurons. Similarly, an artificial neuron processes input data and passes it to subsequent layers in the network.

Structure of ANNs

An ANN consists of:
- Input Layer: Receives the input data.
- Hidden Layers: Intermediate layers where computations occur.
- Output Layer: Produces the final output.
Each layer comprises several nodes (neurons), and the connections between these nodes have weights that influence the learning process.
Forward Propagation

Forward propagation is the process where input data passes through the network to produce an output.
- Weighted Sum Calculation

  Each neuron computes a weighted sum of its inputs:

  \[ z = \sum_{i=1}^{n} w_i x_i + b \]

  where:
  - \( w_i \): Weight of the \(i\)-th input
  - \( x_i \): \(i\)-th input value
  - \( b \): Bias term
  - \( z \): Weighted sum (pre-activation value)
- Activation Function

  The weighted sum is passed through an activation function to introduce non-linearity:

  \[ a = f(z) \]

  where \(f\) is the activation function, such as ReLU, sigmoid, or tanh.
Activation Functions

Activation functions determine the output of a neuron. Common functions include:

- Sigmoid:

  \[ f(z) = \frac{1}{1 + e^{-z}} \]

  Outputs values between 0 and 1; commonly used for binary classification.

- Tanh:

  \[ f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \]

  Outputs values between -1 and 1.

- ReLU (Rectified Linear Unit):

  \[ f(z) = \max(0, z) \]

  Introduces sparsity in the network and speeds up convergence.
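As a quick illustration, the three activations above can be written directly in NumPy (a minimal sketch; the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + e^{-z}); squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # f(z) = (e^z - e^{-z}) / (e^z + e^{-z}); squashes values into (-1, 1)
    return np.tanh(z)

def relu(z):
    # f(z) = max(0, z); zeroes out negative pre-activations
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```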
Loss Function

The loss function quantifies the difference between predicted and actual outputs. For example:

- Mean Squared Error (MSE):

  \[ L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \]

  where:
  - \( y_i \): True value
  - \( \hat{y}_i \): Predicted value

- Binary Cross-Entropy Loss (for classification):

  \[ L = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \]
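The two losses above translate directly into NumPy (a minimal sketch; the clipping constant is an added numerical-stability assumption, and the sample arrays are arbitrary):

```python
import numpy as np

def mse(y_true, y_pred):
    # L = (1/N) * sum_i (y_i - y_hat_i)^2
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clamp predictions away from 0 and 1 so log() stays finite (stability assumption)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # L = -(1/N) * sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse(y_true, y_pred), binary_cross_entropy(y_true, y_pred))
```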
Backpropagation and Gradient Descent

Backpropagation adjusts the weights to minimize the loss using gradient descent. It works as follows:

- Calculate Gradients

  Compute the gradient of the loss \(L\) with respect to each weight \(w\):

  \[ \frac{\partial L}{\partial w} \]

- Update Weights

  Adjust the weights using the gradients:

  \[ w \leftarrow w - \eta \frac{\partial L}{\partial w} \]

  where \(\eta\) is the learning rate.
Mathematics of Backpropagation:

For a single neuron, the chain rule decomposes the weight gradient as

\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} \]

where:
- \(\frac{\partial L}{\partial a}\): Gradient of the loss w.r.t. the activation
- \(\frac{\partial a}{\partial z}\): Gradient of the activation w.r.t. the weighted sum
- \(\frac{\partial z}{\partial w}\): Gradient of the weighted sum w.r.t. the weight
Mathematical Representation of Training
- Initialize Weights and Biases: Randomly initialize \(w\) and \(b\).
- Forward Propagation: Compute predictions \(\hat{y}\).
- Compute Loss: Evaluate the loss function \(L\).
- Backpropagation: Calculate gradients of \(L\) w.r.t. \(w\) and \(b\).
- Update Weights: Adjust weights and biases using gradient descent.
- Iterate: Repeat until convergence or a specified number of epochs.
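The six steps above can be put together for a single sigmoid neuron trained with MSE. This is a minimal NumPy sketch under assumed toy data, layer size, and learning rate, not a full framework implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 4 samples, 2 features, binary targets (illustrative values)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])

w = rng.normal(size=2)  # 1. Initialize weights and bias
b = 0.0
eta = 0.5               # learning rate (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):             # 6. Iterate over epochs
    z = X @ w + b                     # 2. Forward propagation: weighted sum
    y_hat = sigmoid(z)                #    ... and activation
    loss = np.mean((y - y_hat) ** 2)  # 3. Compute the MSE loss
    # 4. Backpropagation: dL/dw = dL/da * da/dz * dz/dw (chain rule)
    delta = (2.0 / len(y)) * (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_w = X.T @ delta
    grad_b = np.sum(delta)
    w -= eta * grad_w                 # 5. Gradient descent update
    b -= eta * grad_b

print(loss, w, b)
```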
2. Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs) are a specialized type of artificial neural network designed to process data with a grid-like topology, such as images. They are widely used in computer vision tasks like image classification, object detection, and segmentation. CNNs exploit spatial hierarchies in data, making them highly efficient for visual recognition tasks.
Sub-Contents
- Introduction to CNNs
- Convolution Operation
- Filters/Kernels
- Pooling Layers
- Fully Connected Layers
- Mathematical Formulation of CNN Training
Introduction to CNNs

CNNs mimic the way humans visually perceive the world by breaking down images into patterns and features. Instead of treating each pixel independently (as in traditional ANNs), CNNs capture spatial relationships between pixels to recognize edges, textures, and higher-level structures.

Convolution Operation

The convolution operation is the core building block of CNNs. It involves sliding a filter (kernel) over an input image to extract features.

- Mathematics of Convolution

  For an input matrix \( I \) and a kernel \( K \), the convolution output \( O(i, j) \) at position \((i, j)\) is given by:

  \[ O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, j+n) \cdot K(m, n) \]

  where:
  - \( k \): Size of the kernel
  - \( O(i, j) \): Output value at position \((i, j)\)
This operation highlights features like edges or textures by amplifying specific patterns in the image.
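A minimal NumPy sketch of this sliding-window computation, assuming stride 1 and no padding ("valid" output size); note that, as in most deep learning libraries, the kernel is applied without flipping:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a k x k kernel over a 2D image with stride 1 and no padding."""
    k = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # O(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n)
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])  # simple vertical-edge detector (illustrative)
print(conv2d_valid(image, edge_kernel))
```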
Filters/Kernels
- Filters (Kernels): Small matrices (e.g., \(3 \times 3\)) that are used to detect specific features in the input, such as edges or textures.
- Stride: The step size at which the filter moves across the image.
- Padding: Adds borders to the input to control the spatial dimensions of the output. Types include:
- Valid Padding: No padding; reduces output size.
- Same Padding: Preserves the input size.
Pooling Layers

Pooling layers reduce the spatial dimensions of feature maps, making computations faster and introducing translation invariance.

- Max Pooling: Takes the maximum value in a region:

  \[ P(i, j) = \max_{m, n} \{F(i+m, j+n)\} \]

  where \( F(i, j) \) is the feature map.

- Average Pooling: Takes the average value in a region:

  \[ P(i, j) = \frac{1}{k^2} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} F(i+m, j+n) \]
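A minimal NumPy sketch of non-overlapping max pooling (the 2×2 window and a stride equal to the window size are assumptions):

```python
import numpy as np

def max_pool2d(feature_map, k=2):
    """Non-overlapping k x k max pooling (stride = k assumed)."""
    H, W = feature_map.shape
    out = np.zeros((H // k, W // k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # P(i, j) = max over the k x k region of F
            out[i, j] = np.max(feature_map[i * k:(i + 1) * k, j * k:(j + 1) * k])
    return out

F = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 0.],
              [4., 8., 3., 1.]])
print(max_pool2d(F))  # [[6. 4.]
                      #  [8. 9.]]
```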
Fully Connected Layers

After convolution and pooling layers, the output feature maps are flattened into a vector and passed to fully connected layers. These layers perform high-level reasoning for classification.

- Weighted Sum:

  \[ z = \sum_{i=1}^{n} w_i x_i + b \]

  where \( w_i \) and \( b \) are weights and biases.

- Activation Function: Applies non-linearity to the output.
Mathematical Formulation of CNN Training
- Forward Propagation:
  - Convolution extracts features.
  - Pooling reduces spatial dimensions.
  - Fully connected layers generate predictions.

- Loss Function:
  - For classification tasks, use Cross-Entropy Loss:

    \[ L = - \sum_{i=1}^{C} y_i \log(\hat{y}_i) \]

    where \( C \) is the number of classes, \( y_i \) is the true label, and \( \hat{y}_i \) is the predicted probability.

- Backpropagation:
  - Gradients are computed for convolution and pooling layers by propagating errors backward through the network.
  - For convolutional layers:

    \[ \frac{\partial L}{\partial K(m, n)} = \sum_{i, j} \frac{\partial L}{\partial O(i, j)} \cdot I(i+m, j+n) \]

    where \( K(m, n) \) is the kernel.

- Optimization:
  - Weights and biases are updated using gradient descent:

    \[ w \leftarrow w - \eta \frac{\partial L}{\partial w} \]

    where \( \eta \) is the learning rate.
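If a framework such as PyTorch is available, the whole pipeline (convolution, pooling, a fully connected classifier, cross-entropy loss, and a gradient descent step) can be sketched as below. The layer sizes and the 28×28 single-channel input are illustrative assumptions, not values from the text:

```python
import torch
import torch.nn as nn

# Illustrative CNN: one conv + pool block followed by a fully connected classifier.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution extracts features
    nn.ReLU(),
    nn.MaxPool2d(2),                            # pooling halves the spatial dimensions
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # fully connected layer -> class scores
)

criterion = nn.CrossEntropyLoss()               # cross-entropy loss for classification
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step on a random batch (stand-in for real image data)
x = torch.randn(16, 1, 28, 28)
y = torch.randint(0, 10, (16,))

optimizer.zero_grad()
loss = criterion(model(x), y)   # forward propagation + loss
loss.backward()                 # backpropagation computes the gradients
optimizer.step()                # gradient descent update
print(loss.item())
```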
3. Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to process sequential data, such as time-series data, text, or audio. Unlike traditional feedforward networks, RNNs have a memory mechanism that allows them to capture dependencies across time or sequences. This makes RNNs particularly useful for tasks where context or order matters, such as language modeling, speech recognition, and forecasting.
Sub-Contents
- Introduction to RNNs
- Structure of an RNN
- Mathematical Representation of RNNs
- Backpropagation Through Time (BPTT)
- Variants of RNNs (LSTM, GRU)
Introduction to RNNs

RNNs differ from feedforward networks by introducing loops in their architecture, enabling them to retain information from previous inputs. This sequential processing is critical for tasks where the meaning of the current input depends on previous inputs, such as predicting the next word in a sentence or the next value in a stock price series.

Structure of an RNN

An RNN processes data step-by-step, maintaining a hidden state that captures the sequence’s historical context.

- Inputs and Outputs:
  - Input: A sequence of data \(\{x_1, x_2, \dots, x_T\}\)
  - Output: A sequence of predictions \(\{y_1, y_2, \dots, y_T\}\) or a single prediction, depending on the task.

- Recurrent Structure:

  At each time step \(t\), the RNN computes:

  \[ h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \]

  where:
  - \(x_t\): Input at time \(t\)
  - \(h_t\): Hidden state at time \(t\)
  - \(W_{xh}\): Weight matrix from input to hidden state
  - \(W_{hh}\): Weight matrix for the hidden-to-hidden connection
  - \(b_h\): Bias term
  - \(f\): Activation function (commonly tanh or ReLU)

  The output is computed as:

  \[ y_t = g(W_{hy} h_t + b_y) \]

  where:
  - \(W_{hy}\): Weight matrix from hidden state to output
  - \(b_y\): Bias for the output layer
  - \(g\): Activation function for the output layer (e.g., softmax for classification)
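A minimal NumPy sketch of this recurrence, with arbitrary dimensions and random weights (the identity output activation is an assumption for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, T = 3, 4, 2, 5   # illustrative sizes

W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hidden))
b_h = np.zeros(d_hidden)
b_y = np.zeros(d_out)

xs = rng.normal(size=(T, d_in))   # input sequence x_1 ... x_T
h = np.zeros(d_hidden)            # initial hidden state h_0

outputs = []
for x_t in xs:
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    # y_t = g(W_hy h_t + b_y), with g = identity here
    outputs.append(W_hy @ h + b_y)

print(np.stack(outputs).shape)  # (T, d_out)
```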
Mathematical Representation of RNNs

The hidden state updates recursively:

\[ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \]

This recursive definition allows RNNs to model dependencies across time.

For a sequence of length \(T\), the joint output is:

\[ Y = \{y_1, y_2, \dots, y_T\} \]

Backpropagation Through Time (BPTT)

Training an RNN involves minimizing the loss function over the entire sequence. The process of backpropagating errors through time steps is known as Backpropagation Through Time (BPTT).
- Loss Function:

  For a sequence of length \(T\), the total loss is:

  \[ L = \sum_{t=1}^{T} \mathcal{L}(y_t, \hat{y}_t) \]

  where \(\mathcal{L}\) is the loss function (e.g., Cross-Entropy or Mean Squared Error).

- Gradient Computation:

  Gradients are computed for each parameter over all time steps:

  \[ \frac{\partial L}{\partial W_{xh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{xh}} \]

- Challenges in BPTT:
  - Vanishing Gradients: Gradients shrink exponentially, making it difficult for the model to learn long-term dependencies.
  - Exploding Gradients: Gradients grow uncontrollably, destabilizing training.
Variants of RNNs

To address the limitations of basic RNNs, advanced architectures like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) were developed.

- LSTM (Long Short-Term Memory):

  LSTMs use a gating mechanism to control the flow of information, enabling them to capture long-term dependencies effectively.

  Key components:
  - Forget gate (\(f_t\)): Decides what information to discard.
  - Input gate (\(i_t\)): Decides what information to update.
  - Output gate (\(o_t\)): Decides the final output.

  Equations:

  \[ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \]
  \[ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \]
  \[ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \]
  \[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \]
  \[ h_t = o_t \odot \tanh(c_t) \]

- GRU (Gated Recurrent Unit):

  A simplified version of LSTM with fewer parameters:
  - Combines forget and input gates into an update gate.
  - Simplifies the computation, making it computationally efficient.
4. Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are a specialized form of Recurrent Neural Networks (RNNs) designed to address the limitations of traditional RNNs, particularly the vanishing gradient problem. LSTMs are capable of learning long-term dependencies in sequential data, making them highly effective for tasks like time-series forecasting, speech recognition, and natural language processing.
Sub-Contents
- Introduction to LSTMs
- Structure of an LSTM Cell
- Mathematical Equations of LSTM
- Key Components of LSTMs
- Training LSTMs with Backpropagation Through Time (BPTT)
Introduction to LSTMs

LSTMs are designed to retain information over long sequences. They use a gating mechanism to selectively remember or forget information, enabling them to model both short-term and long-term dependencies in data. This is achieved through the cell state, a memory structure that flows through the network, and gates that regulate the flow of information.

Structure of an LSTM Cell

An LSTM cell consists of three main gates and a cell state:
- Forget Gate: Decides what information to discard.
- Input Gate: Decides what new information to store.
- Output Gate: Decides what part of the cell state to output.
The cell state is the core of the LSTM, allowing information to flow relatively unchanged unless explicitly modified by the gates.
Mathematical Equations of LSTM

At each time step \(t\), the LSTM cell processes the input \(x_t\), the previous hidden state \(h_{t-1}\), and the previous cell state \(c_{t-1}\). The following equations define its behavior:

- Forget Gate:

  \[ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \]

  Determines what information from \(c_{t-1}\) should be forgotten.

- Input Gate:

  \[ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \]

  Determines what information to add to the cell state. Candidate values to update the cell state:

  \[ \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \]

- Cell State Update:

  \[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \]

  Combines the retained information from \(c_{t-1}\) and the new information \(\tilde{c}_t\).

- Output Gate:

  \[ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \]

  Determines what part of the cell state is used for the output.

- Hidden State Update:

  \[ h_t = o_t \odot \tanh(c_t) \]

  Produces the hidden state \(h_t\), which is used as input for the next time step or as the output of the sequence.
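The five equations above map directly onto one NumPy time step (a minimal sketch; the dimensions and random weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds the weight matrices W_*, U_* and biases b_*."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])       # input gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])       # output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                                 # cell state update
    h_t = o_t * np.tanh(c_t)                                           # hidden state update
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 3, 4   # illustrative sizes
p = {}
for g in ["f", "i", "o", "c"]:
    p[f"W_{g}"] = rng.normal(scale=0.1, size=(d_h, d_in))
    p[f"U_{g}"] = rng.normal(scale=0.1, size=(d_h, d_h))
    p[f"b_{g}"] = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a short random input sequence
    h, c = lstm_step(x_t, h, c, p)
print(h)
```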
Key Components of LSTMs
- Gates:
  - Forget Gate (\(f_t\)): Controls the flow of information from the previous cell state.
  - Input Gate (\(i_t\)): Regulates the amount of new information added to the cell state.
  - Output Gate (\(o_t\)): Controls the output to the next hidden state.

- Cell State (\(c_t\)): The memory component that carries forward relevant information over time.

- Activation Functions:
  - Sigmoid (\(\sigma\)) for gates, outputting values between 0 and 1 to scale the information flow.
  - Tanh for cell state updates, outputting values between -1 and 1.
Training LSTMs with BPTT

LSTMs are trained using Backpropagation Through Time (BPTT), which adjusts the weights to minimize the loss.

- Loss Function:

  For a sequence of length \(T\), the total loss is:

  \[ L = \sum_{t=1}^{T} \mathcal{L}(y_t, \hat{y}_t) \]

  where \(y_t\) is the true value and \(\hat{y}_t\) is the predicted value.

- Gradient Computation:

  Gradients are computed for each weight matrix using the chain rule. LSTMs effectively mitigate vanishing gradients due to their gating mechanisms.

- Optimization:

  Update weights using methods like stochastic gradient descent (SGD) or Adam:

  \[ w \leftarrow w - \eta \frac{\partial L}{\partial w} \]

  where \(\eta\) is the learning rate.
5. Gated Recurrent Unit (GRU)
Gated Recurrent Units (GRUs) are a simplified version of Long Short-Term Memory (LSTM) networks designed for sequential data. They achieve similar performance to LSTMs while being computationally more efficient due to fewer parameters. GRUs effectively capture long-term dependencies in sequences, making them suitable for tasks like time-series forecasting, language modeling, and speech recognition.
Sub-Contents
- Introduction to GRUs
- Structure of a GRU Cell
- Mathematical Equations of GRUs
- Key Components of GRUs
- Comparison with LSTMs
Introduction to GRUs

GRUs were introduced to address the complexity of LSTMs while retaining their ability to model long-term dependencies. They combine the forget and input gates of an LSTM into a single update gate, simplifying the architecture and reducing computational cost.

Structure of a GRU Cell

A GRU cell processes sequential data by maintaining a hidden state that evolves over time. Unlike LSTMs, GRUs do not have a separate cell state; instead, they directly update the hidden state.
The GRU uses two gates:
- Update Gate: Controls the flow of information to update the hidden state.
- Reset Gate: Controls how much of the previous information to forget.
Mathematical Equations of GRUs

At each time step \(t\), the GRU processes the input \(x_t\) and the previous hidden state \(h_{t-1}\) to produce the current hidden state \(h_t\).

- Reset Gate:

  \[ r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \]

  Controls how much of the past information to ignore.

- Update Gate:

  \[ z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \]

  Determines how much of the previous hidden state to retain and how much of the new information to incorporate.

- Candidate Hidden State:

  \[ \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \]

  Computes the new candidate state, incorporating the reset gate.

- Hidden State Update:

  \[ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \]

  Combines the previous hidden state and the new candidate state based on the update gate.
Where:
- \(W_*\), \(U_*\): Weight matrices
- \(b_*\): Bias terms
- \(\sigma\): Sigmoid activation function
- \(\tanh\): Hyperbolic tangent activation function
- \(\odot\): Element-wise multiplication
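A minimal NumPy sketch of one GRU time step, following the equations and the interpolation convention used above (dimensions and random weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p holds W_*, U_*, b_* for the reset, update, and candidate."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])              # reset gate
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])              # update gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])  # candidate
    # h_t = z_t * h_{t-1} + (1 - z_t) * h_tilde  (convention used in the text)
    return z_t * h_prev + (1.0 - z_t) * h_tilde

rng = np.random.default_rng(0)
d_in, d_h = 3, 4   # illustrative sizes
p = {}
for g in ["r", "z", "h"]:
    p[f"W_{g}"] = rng.normal(scale=0.1, size=(d_h, d_in))
    p[f"U_{g}"] = rng.normal(scale=0.1, size=(d_h, d_h))
    p[f"b_{g}"] = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # a short random input sequence
    h = gru_step(x_t, h, p)
print(h)
```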
Key Components of GRUs
- Update Gate (\(z_t\)): Decides how much of the past information to retain.
- Reset Gate (\(r_t\)): Determines how much of the past hidden state to forget.
- Simpler Architecture: GRUs combine the functionality of LSTM’s forget and input gates into the update gate, reducing complexity.
- Shared Hidden State: Unlike LSTMs, GRUs do not use a separate cell state, streamlining computations.
Comparison with LSTMs
Feature | LSTM | GRU |
---|---|---|
Number of Gates | Three (forget, input, output) | Two (reset, update) |
Cell State | Separate cell and hidden states | Single hidden state |
Parameters | More parameters | Fewer parameters |
Efficiency | Computationally intensive | More efficient |
Performance | Effective for long sequences | Comparable, but may perform better for shorter sequences |
6. GPT
6.1. Attention Mechanism
Sub-Contents:
- What is the Attention Mechanism?
- The Context: From Sequence Models to Attention
- The Core Idea of Attention
- Key Mathematical Formulation
- Types of Attention Mechanisms
- Real-World Analogies
What is the Attention Mechanism?
The attention mechanism is a concept in machine learning that helps models focus on the most relevant parts of input data when making decisions. Originally introduced in natural language processing (NLP), attention has revolutionized how machines understand and process sequential data, such as text, speech, and time-series data.
The Context: From Sequence Models to Attention
Traditional sequence models like recurrent neural networks (RNNs) and their variants (e.g., LSTMs, GRUs) faced challenges in handling long-range dependencies in sequences. As sequences became longer, these models struggled to retain information from earlier parts of the sequence due to the “vanishing gradient problem.”
Attention mechanisms emerged to address this by dynamically assigning weights to different parts of the input sequence, allowing the model to focus on the most important elements.
The Core Idea of Attention
Think of attention as how humans read a book. When trying to understand a story, you don’t read every word with the same focus. You instinctively pay more attention to critical words or sentences that contribute most to the story’s meaning. Similarly, the attention mechanism enables models to decide which parts of the input are most important for producing an output.
Key Mathematical Formulation
At its core, attention computes a weighted sum of the input representations, with weights dynamically learned during training. Here’s how it works step by step:
- Input Representations: Assume we have a sequence of inputs \(\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n]\), where each \(\mathbf{x}_i\) is a vector.

- Query, Key, Value Transformation:
  - Query vector (\(\mathbf{q}\)): Represents the element seeking information.
  - Key vector (\(\mathbf{k}\)): Represents the indexing of the information.
  - Value vector (\(\mathbf{v}\)): Represents the information itself.

  These are computed as:

  \[ \mathbf{q}_i = \mathbf{W}_q \mathbf{x}_i, \quad \mathbf{k}_i = \mathbf{W}_k \mathbf{x}_i, \quad \mathbf{v}_i = \mathbf{W}_v \mathbf{x}_i \]

  where \(\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v\) are learnable weight matrices.

- Attention Score: The relevance of each key to the query is measured using a similarity function, typically the dot product:

  \[ \text{score}_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j \]

- Softmax Normalization: The scores are normalized using the softmax function so they sum to 1:

  \[ \alpha_{ij} = \frac{\exp(\text{score}_{ij})}{\sum_{j'=1}^n \exp(\text{score}_{ij'})} \]

  Here, \(\alpha_{ij}\) represents the attention weight for input \(\mathbf{x}_j\) when considering query \(\mathbf{q}_i\).

- Weighted Sum: The output is a weighted sum of the value vectors:

  \[ \mathbf{z}_i = \sum_{j=1}^n \alpha_{ij} \mathbf{v}_j \]
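The five steps above amount to a few matrix operations. Below is a minimal NumPy sketch of (unscaled) dot-product self-attention, with arbitrary dimensions and random projections; the Transformer variant additionally divides the scores by \(\sqrt{d_k}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8        # sequence length and dimensions (illustrative)

X = rng.normal(size=(n, d_model))        # input representations x_1 ... x_n
W_q = rng.normal(scale=0.1, size=(d_model, d_k))
W_k = rng.normal(scale=0.1, size=(d_model, d_k))
W_v = rng.normal(scale=0.1, size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # query, key, value transformations

scores = Q @ K.T                         # score_ij = q_i . k_j
# (scaled dot-product attention would use: scores /= np.sqrt(d_k))

# Softmax normalization so each row of attention weights sums to 1
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

Z = weights @ V                          # z_i = sum_j alpha_ij v_j
print(Z.shape)                           # (n, d_k)
```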
Types of Attention Mechanisms
- Self-Attention (or Intra-Attention): Each input focuses on other parts of the same sequence to capture dependencies.
- Cross-Attention: Used in encoder-decoder architectures (e.g., translation tasks), where the decoder attends to the encoder’s outputs.
- Multi-Head Attention: Instead of one attention operation, multiple “heads” focus on different aspects of the input. This is central to the Transformer model.
Real-World Analogies
- Highlighting Text in a Document: When reading a document, we highlight important sentences. Similarly, attention assigns weights to important parts of the sequence.
- Search Engines: When you type a query in a search engine, it ranks results based on relevance. Attention works similarly, assigning relevance scores to input elements.
- Detecting Faces in Photos: Just as your eyes focus on a face in a crowded photo, attention mechanisms help models focus on significant parts of the data.
6.2. Transformer Architecture
Sub-Contents:
- The Need for Transformers
- High-Level Structure of the Transformer
- Encoder and Decoder Blocks
- Positional Encoding
- Feedforward Layers
- Residual Connections and Layer Normalization
The Need for Transformers
Traditional models like RNNs and LSTMs processed sequences token by token, making them computationally expensive and prone to losing context over long sequences. Attention mechanisms addressed these shortcomings, but they still needed an efficient structure to scale effectively.
The Transformer architecture, introduced in the paper “Attention is All You Need”, completely removed recurrence, relying solely on attention mechanisms to process input sequences in parallel. This innovation enabled massive improvements in speed, scalability, and effectiveness.
High-Level Structure of the Transformer
The Transformer is composed of two main components:
- Encoder: Processes the input sequence and produces a contextual representation.
- Decoder: Uses the encoder’s output and generates the desired output sequence (e.g., translations).
Key innovation: Both components leverage self-attention and feedforward layers, working in tandem.
Encoder and Decoder Blocks
Encoder Block: The encoder processes input sequences and is composed of:
- Multi-Head Self-Attention: Captures relationships between all tokens in the input.
- Feedforward Layer: Applies a fully connected neural network to enhance non-linear transformations.
- Residual Connections and Layer Normalization: Maintains gradient flow and ensures stable learning.
Decoder Block: The decoder generates output sequences by combining:
- Masked Multi-Head Self-Attention: Prevents the model from “peeking” at future tokens during generation.
- Cross-Attention: Attends to the encoder’s output to incorporate information from the input sequence.
- Feedforward Layer, Residual Connections, and Layer Normalization: Similar to the encoder, ensuring consistent training.
Positional Encoding
Transformers lack inherent sequence-awareness because they process tokens in parallel. Positional encoding solves this by embedding position-specific information into input embeddings.
The encoding is defined as:
\[ PE_{\text{pos}, 2i} = \sin(\text{pos}/10000^{2i/d_{\text{model}}}) \]
\[ PE_{\text{pos}, 2i+1} = \cos(\text{pos}/10000^{2i/d_{\text{model}}}) \]

Here:
- \(\text{pos}\): Position in the sequence
- \(i\): Dimension index
- \(d_{\text{model}}\): Dimensionality of the embedding
This periodic function ensures that positions are encoded uniquely and relative distances can be inferred.
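A minimal NumPy sketch of this encoding (the sequence length and embedding size are illustrative, and an even \(d_{\text{model}}\) is assumed):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len - 1
    i = np.arange(d_model // 2)[None, :]       # index of each (sin, cos) dimension pair
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)               # PE(pos, 2i+1)
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16); added element-wise to the token embeddings
```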
Feedforward Layers
Each Transformer layer includes a feedforward neural network that operates on each token independently:
\[ \text{FFN}(\mathbf{x}) = \max(0, \mathbf{x} \mathbf{W}_1 + \mathbf{b}_1) \mathbf{W}_2 + \mathbf{b}_2 \]

This layer introduces non-linearity and expands the model’s capacity to capture complex relationships.
Residual Connections and Layer Normalization
To improve gradient flow and stabilize training, Transformers employ:
- Residual Connections: Adds input directly to the output of each sublayer: \[ \text{Output} = \text{LayerNorm}(\text{Input} + \text{Sublayer}(\text{Input})) \]
- Layer Normalization: Normalizes activations to enhance stability and convergence.
6.2.1. Encoder
The Transformer encoder is a stack of layers designed to process an input sequence and generate contextualized representations that capture the relationships between all elements of the sequence. Its purpose is to create embeddings enriched with global information about the sequence.
Components of an Encoder Layer
- Input Embedding + Positional Encoding:
  - Inputs (e.g., words) are converted to dense vectors (embeddings).
  - Positional encodings are added to retain sequence order information.

- Multi-Head Self-Attention:
  - Computes relationships between all tokens in the input sequence.
  - Outputs a weighted representation for each token based on its context.

- Feedforward Neural Network:
  - Applies transformations independently to each token’s representation.
  - Captures complex patterns and relationships.

- Residual Connections + Layer Normalization:
  - Maintains stable gradient flow and speeds up convergence.
  - Ensures that outputs are numerically stable and standardized.
Detailed Workflow of a Single Encoder Layer
Each encoder layer processes the input as follows:
- Input Vector (Token Embedding + Positional Encoding):

  \[ \mathbf{X}_{\text{input}} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n] \]

- Multi-Head Self-Attention:
  - Computes attention scores for each token with every other token.
  - Outputs contextually weighted representations.

- Add & Normalize:
  - Adds the input to the self-attention output:

    \[ \mathbf{X}_{\text{attn}} = \text{LayerNorm}(\mathbf{X}_{\text{input}} + \text{SelfAttention}(\mathbf{X}_{\text{input}})) \]

- Feedforward Neural Network:
  - Applies a two-layer perceptron with non-linearity:

    \[ \mathbf{X}_{\text{ffn}} = \text{LayerNorm}(\mathbf{X}_{\text{attn}} + \text{FFN}(\mathbf{X}_{\text{attn}})) \]

- Output Vector:
  - The output is passed to the next encoder layer or to the decoder in cross-attention tasks.
Encoder Structure: Mermaid.js Diagram
Here is a visual representation of a single encoder layer using Mermaid.js:
graph TD
  A[Input Embedding + Positional Encoding] --> B[Multi-Head Self-Attention]
  B --> C[Add & Normalize]
  C --> D[Feedforward Neural Network]
  D --> E[Add & Normalize]
  E --> F[Output for Next Layer or Decoder]
Intuition Behind the Encoder
Imagine reading a sentence: “The cat sat on the mat.” Each word carries meaning on its own, but its role in the sentence depends on its relationship with other words (e.g., “on” connects “sat” and “mat”). The encoder builds a rich representation of each word by considering its global context in the sentence.
6.2.2. Decoder
The decoder in the Transformer architecture is responsible for generating output sequences (e.g., translated sentences) by attending to both its own previously generated outputs and the contextualized representations produced by the encoder. It consists of multiple layers that combine self-attention, cross-attention, and feedforward networks.
Components of a Decoder Layer
- Masked Multi-Head Self-Attention:
  - Processes previously generated tokens.
  - Masks future positions to prevent the model from “peeking” ahead during training.

- Cross-Attention (Encoder-Decoder Attention):
  - Attends to the encoder’s output to incorporate context from the input sequence.

- Feedforward Neural Network:
  - Applies non-linear transformations to the combined representation from attention mechanisms.

- Residual Connections + Layer Normalization:
  - Ensures stability and efficient gradient flow.
Workflow of a Single Decoder Layer
The decoder layer processes its input in the following stages:
- Masked Multi-Head Self-Attention:
  - Operates only on the tokens generated so far.
  - Prevents future information leakage by applying a mask:

    \[ \mathbf{Z}_{\text{self-attn}} = \text{MaskedSelfAttention}(\mathbf{Y}_{\text{input}}) \]

- Add & Normalize:
  - Adds the self-attention output to the input and normalizes:

    \[ \mathbf{Z}_{\text{add1}} = \text{LayerNorm}(\mathbf{Y}_{\text{input}} + \mathbf{Z}_{\text{self-attn}}) \]

- Cross-Attention:
  - Attends to the encoder’s output:

    \[ \mathbf{Z}_{\text{cross-attn}} = \text{CrossAttention}(\mathbf{Z}_{\text{add1}}, \mathbf{X}_{\text{encoder-output}}) \]

- Add & Normalize:
  - Adds the cross-attention output and normalizes:

    \[ \mathbf{Z}_{\text{add2}} = \text{LayerNorm}(\mathbf{Z}_{\text{add1}} + \mathbf{Z}_{\text{cross-attn}}) \]

- Feedforward Neural Network:
  - Applies a non-linear transformation:

    \[ \mathbf{Z}_{\text{ffn}} = \text{FeedForward}(\mathbf{Z}_{\text{add2}}) \]

- Add & Normalize:
  - Produces the final output of the layer:

    \[ \mathbf{Z}_{\text{output}} = \text{LayerNorm}(\mathbf{Z}_{\text{add2}} + \mathbf{Z}_{\text{ffn}}) \]
Decoder Structure: Mermaid.js Diagram
Here is a visual representation of a single decoder layer using Mermaid.js:
graph TD
  A[Previous Token Embedding + Positional Encoding] --> B[Masked Multi-Head Self-Attention]
  B --> C[Add & Normalize]
  C --> D["Cross-Attention (Encoder Output)"]
  D --> E[Add & Normalize]
  E --> F[Feedforward Neural Network]
  F --> G[Add & Normalize]
  G --> H[Output for Next Layer or Final Prediction]
Key Innovations in the Decoder
- Masked Self-Attention:
  - Prevents future information leakage by masking upper triangular elements of the attention matrix.

- Cross-Attention:
  - Enables the decoder to leverage contextual information from the encoder, crucial for tasks like machine translation.

- Stacked Structure:
  - Multiple decoder layers build hierarchical and refined representations of the output sequence.
Intuition Behind the Decoder
Imagine you’re translating a sentence word by word. You base your next word choice on:
- The words you’ve already translated.
- The overall meaning of the source sentence.
The decoder replicates this process mathematically, ensuring coherence and relevance in its output.
6.3. GPT Architecture
The GPT (Generative Pretrained Transformer) architecture is a variant of the Transformer model tailored for generative tasks, such as text completion, summarization, and creative writing. It builds on the Transformer decoder architecture but is optimized for unidirectional (causal) text generation.
Sub-Contents:
- The Core Idea Behind GPT
- Architectural Overview
- Differences Between GPT and Standard Transformers
- Detailed Workflow of GPT
- Mathematical Formulation
- Mermaid.js Diagram for GPT Workflow
The Core Idea Behind GPT
GPT models are designed to predict the next token in a sequence, given the context of previous tokens. This makes them powerful for generative tasks where coherent and contextually relevant output is critical. The architecture leverages large-scale pretraining on diverse text data, followed by fine-tuning for specific tasks.
Architectural Overview
- Transformer Decoder-Based Architecture:
  - GPT is essentially a stack of Transformer decoder layers.
  - It uses masked self-attention to process tokens sequentially.

- Unidirectional Context:
  - Unlike bidirectional models like BERT, GPT only considers preceding tokens when generating outputs.

- Layer Components:
  - Masked Multi-Head Self-Attention: Ensures that only prior tokens influence predictions.
  - Feedforward Neural Networks: Applies non-linear transformations for better representation learning.
  - Residual Connections and Layer Normalization: Stabilizes training and ensures smooth gradient flow.

- Output Layer:
  - A softmax layer maps the final token representation to a probability distribution over the vocabulary.
Differences Between GPT and Standard Transformers
Feature | GPT | Standard Transformers |
---|---|---|
Context Direction | Unidirectional (causal) | Bidirectional (encoder-decoder) |
Masking | Applies causal masking | Masking depends on task |
Application | Generative tasks | Encoder-decoder tasks |
Cross-Attention | Not used | Present in decoders |
Detailed Workflow of GPT
- Input Tokenization:
  - Text input is tokenized into subword units using a tokenizer like Byte-Pair Encoding (BPE).

- Input Embeddings:
  - Each token is converted into a dense vector representation.

- Positional Encoding:
  - Since GPT processes sequences in parallel, positional encoding is added to the embeddings to retain sequence order.

- Stacked Transformer Decoder Layers:
  - Masked Multi-Head Self-Attention: Computes attention scores only for previous tokens.
  - Feedforward Layers: Applies non-linear transformations for richer token representations.
  - Residual Connections and Layer Normalization: Ensures stability during training.

- Output Projection:
  - The final representation is passed through a softmax layer to predict the next token.

- Token Generation:
  - The model generates tokens iteratively, appending one token at a time to the sequence.
Mathematical Formulation
- Input Representation:

  \[ \mathbf{X}_{\text{input}} = \text{Embed}(\text{Tokens}) + \text{Positional Encoding} \]

- Masked Self-Attention: The attention weights are computed as:

  \[ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} + \mathbf{M}\right) \mathbf{V} \]

  Here:
  - \(\mathbf{Q}, \mathbf{K}, \mathbf{V}\) are the query, key, and value matrices.
  - \(\mathbf{M}\) is the causal mask, ensuring only past tokens are attended to.

- Output Probability: The probability of the next token is:

  \[ P(\text{next token} \mid \text{previous tokens}) = \text{softmax}(\mathbf{W} \mathbf{z}_n + \mathbf{b}) \]

  where \(\mathbf{z}_n\) is the final token representation.
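A minimal NumPy sketch of causal masked attention and the next-token distribution (the vocabulary size, dimensions, and random weights are illustrative; a single attention head and no learned embeddings are assumed):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, vocab = 5, 8, 100          # sequence length, head size, vocab size (illustrative)

Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

# Causal mask M: 0 on and below the diagonal, -inf above it,
# so position i can only attend to positions j <= i.
M = np.triu(np.full((n, n), -np.inf), k=1)

Z = softmax(Q @ K.T / np.sqrt(d_k) + M) @ V   # masked self-attention

# Project the final token representation z_n to next-token probabilities
W, b = rng.normal(scale=0.1, size=(vocab, d_k)), np.zeros(vocab)
p_next = softmax(W @ Z[-1] + b)
print(p_next.argmax(), p_next.sum())   # most likely next token id; probabilities sum to 1
```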
Mermaid.js Diagram for GPT Workflow
graph TD
  A[Input Tokens] --> B[Token Embeddings]
  B --> C[Add Positional Encoding]
  C --> D[Stack of Transformer Decoder Layers]
  D --> E[Masked Multi-Head Self-Attention]
  E --> F[Feedforward Neural Network]
  F --> G[Add & Normalize]
  G --> H["Output Layer (Softmax)"]
  H --> I[Predicted Next Token]
  I --> J[Iterative Token Generation]