Mathematical Foundations of Natural Language Processing: Word Embeddings, Tokenization, and Vectorization Techniques
1. Word Embeddings: From Words to Numbers 🧮
What Are Word Embeddings?
Imagine if words were superheroes 🦸♀️. Each has unique powers (meanings) but needs a costume (vector) to represent it in AI’s world of numbers. Word embeddings turn words into dense numerical vectors that capture their meanings and relationships.
How Do Word Embeddings Work?
Word embeddings are dense, low-dimensional vectors where similar words (in context) are close together in the vector space. For example:
- “King” and “Queen” are neighbors.
- “Apple” and “Banana” are buddies, but far from “Laptop.”
Techniques for Creating Word Embeddings
1. Word2Vec: The Context Wizard 🔮
Word2Vec uses a neural network to learn word embeddings. It comes in two flavors:
- Skip-Gram: Predicts the context (surrounding words) given a target word.
- CBOW (Continuous Bag of Words): Predicts the target word given its context.
The key idea? Words that appear in similar contexts have similar meanings.
Objective for Skip-Gram:
\[ \text{Maximize } \prod_{t=1}^T \prod_{-c \leq j \leq c,\; j \neq 0} P(w_{t+j} \mid w_t) \]
Where:
- \( w_t \): Target word.
- \( w_{t+j} \): Context word.
- \( c \): Context window size.
- \( T \): Number of words in the training corpus.
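To make the two flavors concrete, here is a minimal training sketch with gensim (assuming gensim 4.x is installed; the toy corpus and hyperparameters below are illustrative, not tuned):
from gensim.models import Word2Vec
# Toy corpus: a list of already-tokenized sentences (illustrative only)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["apples", "and", "bananas", "are", "fruit"],
]
# sg=1 selects Skip-Gram, sg=0 selects CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
# Both models map each word to a dense vector
print(skipgram.wv["king"][:5])              # first 5 dimensions of the "king" vector
print(cbow.wv.similarity("king", "queen"))  # cosine similarity of two learned vectors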
2. GloVe: The Co-Occurrence Guru 🤝
GloVe (Global Vectors for Word Representation) uses word co-occurrence statistics to learn embeddings. It captures the relationships between words by analyzing how often they appear together in a large corpus.
Loss Function:
\[ J = \sum_{i,j=1}^{n} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \]
Where:
- \( X_{ij} \): Co-occurrence count of word \( i \) and \( j \).
- \( w_i \): Word embedding for word \( i \).
- \( \tilde{w}_j \): Context embedding for word \( j \).
- \( b_i, \tilde{b}_j \): Bias terms for word \( i \) and context word \( j \).
- \( f(X_{ij}) \): Weighting function that down-weights very frequent co-occurrences.
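As a rough sketch, the weighting function proposed in the GloVe paper and a single term of the loss can be written in plain Python like this (the toy embeddings, biases, and count below are made up for illustration):
import numpy as np
def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X_ij): grows with the count, then caps at 1 for very frequent pairs
    return (x / x_max) ** alpha if x < x_max else 1.0
def glove_loss_term(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    # One summand of J: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
    error = w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)
    return glove_weight(x_ij) * error ** 2
# Toy example: two random 5-dimensional embeddings and a co-occurrence count of 12
rng = np.random.default_rng(0)
w_i, w_j_tilde = rng.normal(size=5), rng.normal(size=5)
print(glove_loss_term(w_i, w_j_tilde, b_i=0.1, b_j_tilde=-0.2, x_ij=12.0))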
Why Word Embeddings Matter
- Semantic Understanding:
  - Embeddings capture nuanced relationships like “Paris − France ≈ London − England.”
- Efficiency:
  - Dense vectors are far more compact than one-hot encodings.
- Transfer Learning:
  - Pre-trained embeddings can be reused across NLP tasks.
Numerical Example
Let’s say Word2Vec learns the following 2D embeddings:
- King: \([2, 5]\)
- Queen: \([3, 5]\)
- Man: \([1, 4]\)
- Woman: \([2, 4]\)
What’s the relationship between “King” and “Queen”?
Difference Vector:
\[ \text{King} - \text{Man} = [2, 5] - [1, 4] = [1, 1] \]
\[ \text{Woman} + (\text{King} - \text{Man}) = [2, 4] + [1, 1] = [3, 5] \quad (\text{Queen!}) \]
Word embeddings magically capture analogies like “King is to Man as Queen is to Woman.” 🤩
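You can verify this arithmetic directly with NumPy, using the toy 2D vectors above:
import numpy as np
king, queen = np.array([2, 5]), np.array([3, 5])
man, woman = np.array([1, 4]), np.array([2, 4])
# Woman + (King - Man) should land on Queen
analogy = woman + (king - man)
print(analogy)                         # [3 5]
print(np.array_equal(analogy, queen))  # True for this toy example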
Code Example: Using Pre-Trained Word2Vec
Let’s load and explore pre-trained Word2Vec embeddings:
from gensim.models import KeyedVectors
# Load pre-trained Word2Vec embeddings
word_vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# Check similarity
similarity = word_vectors.similarity("king", "queen")
print("Similarity between 'king' and 'queen':", similarity)
# Find analogy: King - Man + Woman = ?
result = word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print("Result of analogy (King - Man + Woman):", result)
Fun Analogy
Word embeddings are like a digital neighborhood map 🗺️:
- King and Queen live in the same castle district.
- Apple and Banana are roommates in the fruit dorm.
- Laptop? It’s chilling in Tech Town. 💻
Mermaid.js Diagram: Word Embedding Flow
graph TD
    Corpus[Input Text Corpus] --> Tokenization[Tokenization]
    Tokenization --> WordEmbeddings[Learn Word Embeddings]
    WordEmbeddings --> Word2Vec[Word2Vec]
    WordEmbeddings --> GloVe[GloVe]
    Word2Vec --> Context[Capture Context]
    GloVe --> CoOccurrence[Capture Co-Occurrence]
    Context --> FinalEmbeddings[Word Embeddings]
    CoOccurrence --> FinalEmbeddings
2. Tokenization and Vectorization: Slicing Text Like a Chef 🧑🍳🔪
What is Tokenization?
Tokenization is the process of breaking down text into smaller units, called tokens. These tokens can be:
- Words: “I love AI” → [“I”, “love”, “AI”]
- Subwords: “Unbreakable” → [“Un”, “break”, “able”]
- Characters: “AI” → [“A”, “I”]
- Custom chunks: Depends on the task!
It’s like turning a loaf of bread into slices—easier to handle and digest. 🍞
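As a quick illustration of the two simplest cases, plain Python already gets you word-level and character-level tokens (real tokenizers handle punctuation, casing, and subwords far more carefully):
text = "I love AI"
# Word-level: split on whitespace
print(text.split())   # ['I', 'love', 'AI']
# Character-level: every character becomes a token
print(list("AI"))     # ['A', 'I']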
Types of Tokenization
- Word-Level Tokenization:
  - Splits text into individual words.
  - Example: “ChatGPT rocks!” → [“ChatGPT”, “rocks!”]
- Subword Tokenization:
  - Breaks words into meaningful subunits.
  - Example: “unbreakable” → [“un”, “break”, “able”]
  - Why? Handles unknown words like “unicorns” by splitting them into known subparts.
- Character-Level Tokenization:
  - Treats every character as a token.
  - Example: “AI” → [“A”, “I”]
  - Why? Great for languages with complex scripts (e.g., Chinese).
- Byte-Pair Encoding (BPE):
  - Iteratively merges the most frequent adjacent pair of characters or subwords.
  - Example: Start with [“H”, “e”, “l”, “l”, “o”] → [“He”, “l”, “l”, “o”] → [“He”, “ll”, “o”] → [“Hell”, “o”] (a minimal sketch of one merge step follows this list).
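Here is a minimal sketch of a single BPE-style merge step: count adjacent pairs and merge the most frequent one. Real BPE learns its merges from pair counts over a whole corpus and repeats this many times; this toy function only shows the core idea:
from collections import Counter
def merge_most_frequent_pair(tokens):
    # Count adjacent pairs, e.g. ('l', 'l') in ['H', 'e', 'l', 'l', 'o']
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    best = max(pairs, key=pairs.get)  # ties broken by first occurrence
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])  # merge the winning pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
tokens = ["H", "e", "l", "l", "o"]
print(merge_most_frequent_pair(tokens))  # ['He', 'l', 'l', 'o']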
What is Vectorization?
After tokenization, we need to convert tokens into numerical form (because AI doesn’t speak English, it speaks numbers). This process is called Vectorization.
Methods of Vectorization
- One-Hot Encoding:
  - Represents each token as a binary vector.
  - Example: [“cat”, “dog”, “fish”]: \[ \text{cat} = [1, 0, 0], \, \text{dog} = [0, 1, 0], \, \text{fish} = [0, 0, 1] \]
  - Problem: Inefficient for large vocabularies (sparse vectors).
- Frequency-Based Encoding:
  - Counts how often each word appears.
  - Example: “I love AI. AI is amazing.” → {“I”: 1, “love”: 1, “AI”: 2, “is”: 1, “amazing”: 1}.
- TF-IDF (Term Frequency–Inverse Document Frequency):
  - Adjusts word importance based on how often it appears across documents (one common formula is shown after this list).
  - Example: Common words like “the” get low scores, while distinctive words like “ChatGPT” get high scores.
- Word Embeddings:
  - Dense vectors learned from data (e.g., Word2Vec, GloVe).
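For TF-IDF specifically, one common formulation looks like this (scikit-learn’s TfidfVectorizer, used in the code below, applies a smoothed, normalized variant of the same idea):
\[ \text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)} \]
Where:
- \( \text{tf}(t, d) \): How often term \( t \) appears in document \( d \).
- \( \text{df}(t) \): Number of documents that contain term \( t \).
- \( N \): Total number of documents in the corpus.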
Numerical Example: One-Hot Encoding
Text: “AI is fun.”
- Unique Tokens:
  - Vocabulary: [“AI”, “is”, “fun”]
- One-Hot Encoding:
  - AI → [1, 0, 0]
  - is → [0, 1, 0]
  - fun → [0, 0, 1]
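A few lines of plain Python make that mapping explicit (the vocabulary ordering here simply follows the order in which tokens appear; it is illustrative, not a standard):
tokens = ["AI", "is", "fun"]
# Map each unique token to an index in the vocabulary
vocab = {token: idx for idx, token in enumerate(tokens)}
# Build a one-hot vector for every token
for token, idx in vocab.items():
    vector = [0] * len(vocab)
    vector[idx] = 1
    print(token, "->", vector)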
Why Tokenization and Vectorization Matter
- Bridge the Gap:
  - Text → Numbers → Machine Learning.
- Efficient Processing:
  - Converts messy, variable-length text into fixed-size representations.
- Foundation of NLP:
  - Everything from chatbots to language models starts here.
Code Example: Tokenization and Vectorization
Let’s tokenize and vectorize a sentence using Python:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Sample text
texts = ["I love AI", "AI is amazing", "Love AI forever"]
# Tokenization and binary bag-of-words (a document-level one-hot over the vocabulary)
vectorizer = CountVectorizer(binary=True)
one_hot = vectorizer.fit_transform(texts)
print("One-Hot Encoded Vectors:\n", one_hot.toarray())
print("Vocabulary:\n", vectorizer.vocabulary_)
# Tokenization and TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(texts)
print("\nTF-IDF Vectors:\n", tfidf.toarray())
print("TF-IDF Vocabulary:\n", tfidf_vectorizer.vocabulary_)
Fun Analogy
Tokenization is like slicing a pizza 🍕:
- Word Tokenization: “I love pizza” → Each slice (word) is a token.
- Subword Tokenization: “Pepperoni” → Break it into smaller pieces (“pepper,” “oni”).
- Character Tokenization: “AI” → You obsessively cut each letter into tiny bites.
Vectorization? That’s like assigning each slice a score:
- “Pepperoni: 10/10, Pineapple: -1/10 (just kidding… or am I?)”
Mermaid.js Diagram: Tokenization and Vectorization Flow
graph TD
    Text[Input Text] --> Tokenization[Tokenization: Break into Tokens]
    Tokenization --> Vectorization[Vectorization: Convert Tokens to Numbers]
    Vectorization --> OneHot[One-Hot Encoding]
    Vectorization --> TFIDF[TF-IDF]
    Vectorization --> Embeddings[Word Embeddings]
    OneHot --> Processed[Processed Data]
    TFIDF --> Processed
    Embeddings --> Processed
3. Cosine Similarity: Finding the Perfect Vector Match 💘📏
What is Cosine Similarity?
Cosine Similarity measures the similarity between two vectors by calculating the cosine of the angle between them. The smaller the angle, the more similar the vectors are.
Why Cosine?
The cosine ignores the magnitude of the vectors, focusing only on their direction. Perfect for text data where frequency counts or embeddings can vary in scale.
Formula
\[ \text{Cosine Similarity} = \cos \theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} \]
Where:
- \( \mathbf{A} \cdot \mathbf{B} \): Dot product of vectors \( \mathbf{A} \) and \( \mathbf{B} \).
- \( \|\mathbf{A}\| \): Magnitude (length) of vector \( \mathbf{A} \).
- \( \|\mathbf{B}\| \): Magnitude of vector \( \mathbf{B} \).
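The formula translates directly into a few lines of NumPy. Here is a minimal sketch (scikit-learn, used later in this section, additionally handles batches, sparse inputs, and zero vectors):
import numpy as np
def cosine_sim(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
print(cosine_sim(A, B))  # ~0.9746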
Why Cosine Similarity is Awesome
- Range:
  - Cosine similarity ranges from \( -1 \) (opposite) to \( 1 \) (identical).
  - \( 0 \): No similarity (orthogonal vectors).
- Normalized:
  - Focuses on direction, not magnitude—great for text data.
- Applications:
  - Document Similarity: Compare news articles, reviews, etc. (see the sketch after this list).
  - Word Embedding Matching: Find related words or phrases.
  - Search Engines: Rank results based on similarity.
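To make the document-similarity application concrete, here is a small sketch that ranks a few made-up documents against a made-up query by TF-IDF cosine similarity (the texts are purely illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents = [
    "Word embeddings capture word meaning",
    "Cosine similarity compares two vectors",
    "Pizza toppings are a controversial topic",
]
query = "how do embeddings represent meaning"
# Vectorize documents and query in the same TF-IDF space
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])
# Rank documents by similarity to the query
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")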
Numerical Example
Let’s calculate Cosine Similarity for two sentences represented as vectors:
\[ \mathbf{A} = [1, 2, 3], \quad \mathbf{B} = [4, 5, 6] \]
- Dot Product:
  \[ \mathbf{A} \cdot \mathbf{B} = (1 \cdot 4) + (2 \cdot 5) + (3 \cdot 6) = 4 + 10 + 18 = 32 \]
- Magnitudes:
  \[ \|\mathbf{A}\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{1 + 4 + 9} = \sqrt{14} \]
  \[ \|\mathbf{B}\| = \sqrt{4^2 + 5^2 + 6^2} = \sqrt{16 + 25 + 36} = \sqrt{77} \]
- Cosine Similarity:
  \[ \cos \theta = \frac{32}{\sqrt{14} \cdot \sqrt{77}} \approx \frac{32}{32.83} \approx 0.975 \]
The vectors are highly similar! 💕✨
Code Example: Calculating Cosine Similarity
Here’s how to compute Cosine Similarity in Python:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Vectors
A = np.array([[1, 2, 3]])
B = np.array([[4, 5, 6]])
# Compute Cosine Similarity
similarity = cosine_similarity(A, B)
print("Cosine Similarity:", similarity[0][0])
For multiple vectors:
# Set of vectors
vectors = np.array([[1, 2, 3], [4, 5, 6], [1, 0, 0]])
# Pairwise Cosine Similarity
pairwise_sim = cosine_similarity(vectors)
print("Pairwise Cosine Similarity:\n", pairwise_sim)
Why It Matters in NLP
- Semantic Search:
  - Search engines rank results by similarity to the query.
- Recommendation Systems:
  - Recommend similar articles, books, or songs based on user preferences.
- Plagiarism Detection:
  - Compare documents to detect overlaps.
Fun Analogy
Cosine Similarity is like judging how well two people align on a dating app 💘:
- If their life goals (vectors) point in the same direction, it’s a match! 🎉
- If one loves skydiving and the other loves naps, their vectors are far apart. 🛌🪂
Mermaid.js Diagram: Cosine Similarity Flow
graph TD
    VectorA[Vector A] --> DotProduct[Compute Dot Product A · B]
    VectorB[Vector B] --> DotProduct
    VectorA --> MagnitudeA[Compute Magnitude of A]
    VectorB --> MagnitudeB[Compute Magnitude of B]
    DotProduct --> CosineFormula[Apply Cosine Formula]
    MagnitudeA --> CosineFormula
    MagnitudeB --> CosineFormula
    CosineFormula --> SimilarityScore[Cosine Similarity Score]