Exploring Various Word Embedding Techniques in NLP
Raj Shaikh

In the grand universe of Natural Language Processing (NLP), words are our stars, and understanding their relationships is like charting constellations. But how do we make machines understand words? Machines, unfortunately, don’t “get” language like we do. They need words to be translated into numbers. Enter word embeddings: numerical representations of words in a multi-dimensional space. These techniques are foundational to NLP, powering chatbots, search engines, and even your favorite autocomplete on messaging apps.
This blog will take you on a journey to explore various word embedding techniques. We’ll demystify the jargon, explain the magic behind the math, and even sprinkle in some code and diagrams to bring the concepts to life. Buckle up, because we’re diving into the language of the machines!
1. What are Word Embeddings? Why Do We Need Them?
Before diving into techniques, let’s grasp the core idea of word embeddings. In simple terms, embeddings map words to vectors (a list of numbers) in such a way that similar words have similar vectors. The goal? Capture not just the meaning of words but also their relationships and context.
Analogy:
Think of a world map. Cities close to each other often share cultural similarities—like Paris and London. Similarly, words with similar meanings or usage should “live” close to each other in this multi-dimensional space. For instance, the word “king” might be close to “queen,” and “dog” close to “cat.”
Why Not Just Use Plain Numbers?
Imagine encoding words with unique integers: “cat” = 1, “dog” = 2. Does the fact that “cat” is closer to “dog” on this scale mean they are similar in meaning? Nope. That’s why we need embeddings—so that numbers reflect relationships and meanings.
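To make this concrete, here is a tiny illustration with hand-picked, purely hypothetical 3-dimensional vectors: once words are vectors, relatedness becomes something we can measure numerically, which integer IDs never allow.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means similar direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-dimensional embeddings, chosen by hand purely for illustration.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

print("cat vs dog:", cosine_similarity(cat, dog))  # high: related meanings
print("cat vs car:", cosine_similarity(cat, car))  # low: unrelated meanings
```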
Next up, let’s look at how we started with simpler techniques like One-Hot Encoding and Frequency-Based Methods.
2. Classical Techniques: One-Hot Encoding and Frequency-Based Methods
Let’s start with the humble beginnings of embeddings: simple yet powerful ways to represent text before the rise of neural networks.
One-Hot Encoding
One-hot encoding represents each word as a binary vector with a dimension equal to the size of the vocabulary. If your vocabulary has 10,000 words, each word gets a vector of length 10,000, where one position is “1” (the word itself) and all others are “0.”
Example:
For a vocabulary containing [“cat”, “dog”, “mouse”], the vectors are:
- “cat”: [1, 0, 0]
- “dog”: [0, 1, 0]
- “mouse”: [0, 0, 1]
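A minimal NumPy sketch of constructing these vectors for the same toy vocabulary:

```python
import numpy as np

vocabulary = ["cat", "dog", "mouse"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary position.
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))
```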
Why It’s Problematic:
- High Dimensionality: Imagine a vocabulary of 50,000 words. That’s 50,000-dimensional vectors!
- No Semantic Meaning: There’s no indication that “cat” and “dog” are more related than “cat” and “mouse.”
Frequency-Based Methods
Next came frequency-based representations like TF-IDF (Term Frequency-Inverse Document Frequency). Instead of just presence (like in one-hot encoding), these methods factor in the importance of a word within a document.
Formula for TF-IDF:
\[ \text{TF-IDF}(w, d) = \text{TF}(w, d) \times \text{IDF}(w) \]

Where:
- TF: Term frequency (how often word \(w\) appears in document \(d\)).
- IDF: Inverse document frequency (\(\text{log} \frac{N}{n_w}\)), where \(N\) is the total number of documents and \(n_w\) is the number of documents containing \(w\).
Drawbacks:
- Still lacks semantic relationships.
- Sparse representations lead to inefficiency in larger vocabularies.
Code Example for TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
documents = ["The cat sat on the mat.", "The dog barked loudly."]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
In the next section, we’ll dive into predictive models, including the revolutionary Word2Vec and its two flavors: CBOW and Skip-Gram. These models changed the NLP game forever, so stay tuned! 🚀
3. Predictive Models: Word2Vec (CBOW and Skip-Gram)
While frequency-based methods and one-hot encoding were useful, they lacked one critical feature: understanding relationships between words. Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013, was a game-changer. It popularized the use of shallow neural networks to learn dense, low-dimensional word vectors that capture semantic relationships.
Word2Vec Overview
Word2Vec comes in two flavors:
- Continuous Bag of Words (CBOW): Predicts a target word based on its surrounding context.
- Skip-Gram: Predicts context words given a target word.
Think of these as the two sides of a coin, each with its unique approach to extracting relationships.
CBOW: Predicting the Missing Word
In CBOW, the model uses the words surrounding a target word to predict the target itself. It’s like playing a game of “Guess the Word” where the context words are the clues.
Example:
Sentence: “The cat sat on the mat.”
- Context words for “sat” (with a window of 2): [“The”, “cat”, “on”, “the”]
- CBOW learns to predict “sat” from the surrounding words.
Mathematical Formulation:
CBOW maximizes the probability:
\[ P(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}) \]

Where:
- \(w_t\) is the target word.
- \(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\) are context words.
Skip-Gram: Predicting Context Words
Skip-Gram flips the problem. It predicts the context words given a target word. It’s like saying, “I know ‘sat’; what words are likely to surround it?”
Example:
Target word: “sat”
- Predicts (with a window of 2): [“The”, “cat”, “on”, “the”]
Mathematical Formulation:
Skip-Gram maximizes the probability:
\[ P(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} \mid w_t) \]

Training Word2Vec
Both CBOW and Skip-Gram rely on a shallow neural network with three layers:
- Input Layer: Encodes words as one-hot vectors.
- Hidden Layer: Captures word relationships in a dense, low-dimensional space.
- Output Layer: Predicts the probabilities of target/context words.
Loss Function:
Word2Vec minimizes the negative log-likelihood of predicted probabilities:
\[ L = - \sum_{t} \sum_{w_c \in \text{context}(w_t)} \log P(w_c \mid w_t) \]

Optimization Techniques:
To make computation efficient, Word2Vec uses:
- Hierarchical Softmax: Approximates probabilities using a binary tree.
- Negative Sampling: Simplifies the loss by focusing on a small subset of sampled “negative” examples (a minimal sketch follows below).
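To make this concrete, here is a minimal NumPy sketch of the skip-gram negative-sampling loss for a single (target, context) pair, using randomly initialized toy vectors rather than trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy vectors: an "input" vector for the target word, an "output" vector for the
# true context word, and output vectors for a few sampled negative words.
target = rng.normal(size=dim)
context = rng.normal(size=dim)
negatives = rng.normal(size=(5, dim))  # 5 negative samples

# Negative-sampling loss for this single pair: push the true pair's score up,
# push the scores of the negative pairs down.
loss = -np.log(sigmoid(context @ target)) - np.sum(np.log(sigmoid(-negatives @ target)))
print("Negative-sampling loss for one pair:", loss)
```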
Code Example: Word2Vec with Gensim
from gensim.models import Word2Vec
# Sample sentences
sentences = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "barked", "loudly"]
]
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0) # sg=0 for CBOW, sg=1 for Skip-Gram
# Check word embeddings
print("Vector for 'cat':", model.wv['cat'])
print("Similarity between 'cat' and 'dog':", model.wv.similarity('cat', 'dog'))
Visualizing Word2Vec Relationships
In the learned vector space, relationships like “king - man + woman ≈ queen” emerge naturally. It’s as if Word2Vec intuitively understands grammar and semantics!
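You can try this analogy with pre-trained vectors via gensim’s downloader. The snippet below uses the small glove-wiki-gigaword-50 model (the classic word2vec-google-news-300 vectors work the same way but are a much larger download):

```python
import gensim.downloader as api

# Downloads a small set of pre-trained vectors (~66 MB) on first use.
vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```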
4. GloVe: A Statistical Approach to Word Embeddings
While Word2Vec focuses on predicting relationships between words using neural networks, GloVe (Global Vectors for Word Representation) takes a different route. Introduced by researchers at Stanford in 2014, GloVe is a statistical approach that leverages the co-occurrence of words across a large corpus to create embeddings. Think of it as the data nerd in the word-embedding family: it loves counting word pairs and uncovering hidden patterns.
The Core Idea Behind GloVe
GloVe is built on the principle that the meaning of a word can be derived from its distributional properties in a corpus. Specifically, it analyzes how often pairs of words appear together (co-occur) across all documents and uses this information to generate embeddings.
Mathematical Foundation
The primary insight of GloVe is that the ratio of co-occurrence probabilities reveals word relationships. For two words, \(w_i\) and \(w_j\), their co-occurrence is captured by:
\[ P_{ij} = \frac{X_{ij}}{\sum_k X_{ik}} \]

Where:
- \(X_{ij}\) is the number of times \(w_j\) appears in the context of \(w_i\).
- \(P_{ij}\) represents the probability of \(w_j\) occurring in the context of \(w_i\).
GloVe’s Objective Function:
The embeddings are trained to satisfy the following equation:
\[ w_i^T w_j + b_i + b_j = \log(X_{ij}) \]

Where:
- \(w_i\) and \(w_j\) are the embeddings of words \(i\) and \(j\).
- \(b_i\) and \(b_j\) are bias terms.
- \(\log(X_{ij})\) represents the logarithm of co-occurrence counts.
To minimize the error in this equation, GloVe uses a weighted least-squares objective:

\[ J = \sum_{i,j} f(X_{ij}) \cdot (w_i^T w_j + b_i + b_j - \log(X_{ij}))^2 \]

The weighting function \(f(X_{ij})\) down-weights rare, noisy co-occurrences and caps the influence of extremely frequent pairs so that neither dominates training.
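The GloVe paper uses \( f(x) = (x / x_{\max})^{\alpha} \) for \( x < x_{\max} \) and 1 otherwise (with defaults \( x_{\max} = 100 \), \( \alpha = 0.75 \)). Here is a minimal sketch of a single term of the objective with toy values:

```python
import numpy as np

def glove_weight(x, x_max=100, alpha=0.75):
    # Down-weights rare co-occurrences; caps the influence of very frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

rng = np.random.default_rng(1)
dim = 10

# Toy embeddings and biases for a single word pair (i, j).
w_i, w_j = rng.normal(size=dim), rng.normal(size=dim)
b_i, b_j = 0.0, 0.0
X_ij = 12  # toy co-occurrence count

# One term of GloVe's weighted least-squares objective.
error = w_i @ w_j + b_i + b_j - np.log(X_ij)
print("Loss contribution for this pair:", glove_weight(X_ij) * error ** 2)
```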
Why GloVe Is Special
- Global Context: GloVe leverages the entire co-occurrence matrix, unlike Word2Vec, which only considers local context windows.
- Efficient Representation: By using matrix factorization, GloVe provides dense, low-dimensional embeddings.
- Semantic Properties: GloVe embeddings capture relationships like:
- Linear substructures: king - man + woman ≈ queen
- Clusters: Words with similar meanings group together in vector space.
Example: Word Co-occurrence Matrix
Let’s consider a tiny corpus:
- “The cat sat on the mat.”
- “The dog barked loudly.”
The co-occurrence matrix might look like this:
|        | the | cat | sat | on | mat | dog | barked | loudly |
|--------|-----|-----|-----|----|-----|-----|--------|--------|
| the    | 0   | 2   | 1   | 1  | 1   | 1   | 1      | 1      |
| cat    | 2   | 0   | 1   | 0  | 1   | 0   | 0      | 0      |
| sat    | 1   | 1   | 0   | 1  | 1   | 0   | 0      | 0      |
GloVe uses this matrix to learn embeddings, preserving co-occurrence relationships in a compressed, vectorized form.
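A minimal sketch of how such a matrix can be built with a simple sliding window (window size 2 here; the exact counts depend on tokenization and window choices):

```python
from collections import defaultdict

sentences = [
    "the cat sat on the mat".split(),
    "the dog barked loudly".split(),
]

window = 2
cooccurrence = defaultdict(int)

for sentence in sentences:
    for i, word in enumerate(sentence):
        # Count every neighbour within `window` positions on either side.
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooccurrence[(word, sentence[j])] += 1

print(cooccurrence[("cat", "the")])  # how often "the" appears near "cat"
print(cooccurrence[("cat", "sat")])
```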
Code Example: Training GloVe with Prebuilt Tools
While training GloVe from scratch is computationally expensive, prebuilt implementations make life easier. Let’s use the glove-python-binary library:
from glove import Corpus, Glove
# Prepare a sample corpus
sentences = [
"the cat sat on the mat".split(),
"the dog barked loudly".split()
]
# Build the co-occurrence matrix
corpus = Corpus()
corpus.fit(sentences, window=2)
# Train GloVe model
glove = Glove(no_components=50, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=20, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
# Get vector for a word
print("Vector for 'cat':", glove.word_vectors[glove.dictionary['cat']])
Mermaid.js Diagram of GloVe Workflow
```mermaid
graph LR
    A[Input Corpus] --> B[Co-occurrence Matrix]
    B --> C[Logarithmic Transformation]
    C --> D[Loss Function Minimization]
    D --> E[Word Embeddings]
```
Challenges in Implementing GloVe
- Memory Constraints:
  - The co-occurrence matrix can become massive for large corpora.
  - Solution: Use sparse matrix representations and optimized libraries like SciPy (scipy.sparse).
- Slow Convergence:
  - Training can take a long time for large datasets.
  - Solution: Increase the number of threads or use GPUs for parallel computation.
- Domain-Specific Vocabulary:
  - GloVe struggles with rare or domain-specific words.
  - Solution: Combine it with subword techniques like FastText.
5. Contextual Embeddings: ELMo, BERT, and the Era of Dynamic Word Representations
As impressive as Word2Vec and GloVe are, they fall short in one key area: context. Both assign a single vector to each word, regardless of its meaning in different contexts. For instance, the word “bank” could mean a financial institution or the side of a river, but traditional embeddings like Word2Vec treat it the same in all sentences. This is where contextual embeddings shine.
What Are Contextual Embeddings?
Contextual embeddings dynamically generate word representations based on the context in which the word appears. Instead of a fixed embedding, the vector adapts to the sentence. These embeddings are created using deep neural networks that analyze entire sequences of words, capturing subtle nuances and relationships.
Key Innovations in Contextual Embeddings
- Dynamic Representations: Words like “bank” have different embeddings depending on the sentence.
- Deep Learning: Instead of shallow architectures, these models use deep networks like LSTMs and Transformers.
- Bidirectional Understanding: Unlike earlier methods, contextual embeddings consider both the left and right contexts of a word.
ELMo (Embeddings from Language Models)
Introduced by the Allen Institute for AI in 2018, ELMo was a trailblazer in contextual embeddings. It uses a deep bidirectional LSTM (Long Short-Term Memory) network to generate word representations.
How ELMo Works:
- Trains a language model on a large corpus.
- Generates embeddings by considering the word in both forward and backward contexts.
- Combines these representations into a single, dynamic vector for each word.
Mathematical Formulation:
For a sequence of words \( w_1, w_2, \dots, w_n \):
\[ ELMo(w_k) = \gamma \sum_{j=1}^L s_j h_{j,k} \]

Where:
- \(h_{j,k}\): Hidden state of the \(j^{th}\) LSTM layer for word \(w_k\).
- \(s_j\): Learned scalar weight for each layer.
- \(\gamma\): Scaling parameter.
Code Example with ELMo (Using TensorFlow Hub):
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained ELMo module from TensorFlow Hub
elmo = hub.load("https://tfhub.dev/google/elmo/3")

# Sample sentence (one string per example)
sentence = ["The cat sat on the mat."]

# The "default" signature returns one contextual vector per token under the "elmo" key
embeddings = elmo.signatures["default"](tf.constant(sentence))["elmo"]
print("Shape of ELMo embeddings:", embeddings.shape)
BERT (Bidirectional Encoder Representations from Transformers)
If ELMo was revolutionary, BERT took the world by storm. Released by Google in 2018, BERT introduced the power of Transformers and a bidirectional training approach that understands the full context of a word in a sentence.
How BERT Works:
- Masked Language Model (MLM):
  - Randomly masks some words in the input and trains the model to predict them.
  - Forces the model to consider the entire context, not just a single direction (see the fill-mask example after this list).
- Next Sentence Prediction (NSP):
  - Trains the model to predict whether two sentences follow each other in a document.
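To see the MLM objective in action, Hugging Face’s fill-mask pipeline lets BERT predict a masked word using context from both directions (this downloads the pre-trained bert-base-uncased weights on first use):

```python
from transformers import pipeline

# BERT fills in the [MASK] token using both the left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The cat [MASK] on the mat."):
    print(prediction["token_str"], round(prediction["score"], 3))
```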
Key Features of BERT:
- Transformer Architecture: Uses attention mechanisms to capture relationships across the entire input sequence.
- Bidirectionality: Considers both left and right contexts simultaneously.
Mathematical Formulation:
BERT uses self-attention to compute contextualized embeddings:
\[ Attention(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V \]

Where:
- \(Q, K, V\): Query, Key, and Value matrices derived from the input.
- \(d_k\): Dimensionality of the keys.
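Here is a minimal NumPy sketch of scaled dot-product attention with random toy matrices, just to show the mechanics of the formula:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # attention distribution over positions
    return weights @ V                  # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```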
Code Example with BERT (Using Hugging Face):
from transformers import BertTokenizer, BertModel
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
# Encode a sentence
sentence = "The bank of the river was calm."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)
# Extract embeddings
embeddings = outputs.last_hidden_state
print("Shape of BERT embeddings:", embeddings.shape)
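To revisit the “bank” example, the sketch below (reusing the model and tokenizer loaded above) compares the embedding of “bank” in two sentences; because the vectors are context-dependent, their cosine similarity is noticeably below 1.0:

```python
import torch

def bank_embedding(sentence):
    # Return the hidden state of the "bank" token in the given sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_embedding("The bank of the river was calm.")
money = bank_embedding("She deposited cash at the bank.")
print("Cosine similarity:", torch.cosine_similarity(river, money, dim=0).item())
```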
Comparison of ELMo and BERT
| Feature | ELMo | BERT |
|---|---|---|
| Architecture | BiLSTM | Transformer |
| Context | Left-to-right and right-to-left (separate) | Bidirectional (joint) |
| Training Tasks | Language modeling | MLM and NSP |
| Flexibility | Fixed for downstream tasks | Tunable for downstream tasks (fine-tuning) |
Mermaid.js Diagram of BERT Workflow
```mermaid
graph TD
    A[Input Sentence] --> B[Tokenization]
    B --> C[Transformer Layers]
    C --> D[Self-Attention Mechanism]
    D --> E[Contextual Word Embeddings]
```
Challenges with Contextual Embeddings
- Computational Cost:
  - Models like BERT are resource-intensive.
  - Solution: Use smaller variants like DistilBERT or pre-computed embeddings for lightweight applications.
- Fine-Tuning Complexity:
  - Adapting BERT to specific tasks requires careful tuning.
  - Solution: Experiment with learning rates and batch sizes, or use task-specific adapters.
- Out-of-Vocabulary (OOV) Handling:
  - BERT uses subword tokenization (WordPiece), which can fragment rare words.
  - Solution: Use larger vocabularies or refine tokenization strategies.
6. Subword Embeddings: FastText and Byte-Pair Encoding (BPE)
While techniques like Word2Vec, GloVe, and even contextual embeddings like BERT work wonders, they often struggle with rare words or misspellings. For instance, how should a model handle “catt” instead of “cat” or understand new words like “bioluminescence”? Enter subword embeddings, which break words into smaller, meaningful units to overcome these challenges.
Why Subword Embeddings?
- Handling Rare Words: Traditional embeddings assign rare or unseen words random vectors. Subword embeddings create representations by combining smaller parts like character n-grams or subword tokens.
- Dealing with Misspellings: Even small changes in spelling, like “cattt” or “dogg”, result in dramatically different embeddings in older techniques. Subword methods reduce this problem by focusing on overlapping character sequences.
- Understanding Morphology: Subwords naturally capture relationships between words with shared roots or prefixes (e.g., “run” → “running”).
FastText: Subword-Level Word Embeddings
Developed by Facebook, FastText extends Word2Vec by incorporating subword information. Instead of learning a single vector per word, FastText represents words as combinations of character n-grams.
How FastText Works
- Breaking Words into N-Grams:
  - Words are split into overlapping sequences of characters, with boundary markers added.
  - Example: for “cat” with \(n = 3\), the n-grams are `<ca`, `cat`, `at>`.
- Embedding the N-Grams:
  - Each n-gram gets its own embedding.
  - The word embedding is the sum of its n-gram embeddings.
Mathematical Formulation
If a word \( w \) is represented by a set of its character n-grams \( G(w) \), the word vector \( \mathbf{v}_w \) is computed as:
\[ \mathbf{v}_w = \sum_{g \in G(w)} \mathbf{v}_g \]

Where:
- \( G(w) \): Set of n-grams for word \( w \).
- \( \mathbf{v}_g \): Embedding of the n-gram \( g \).
Example: FastText in Action
Imagine the word “biology”:
- N-grams (with \(n = 3\)): `<bi`, `bio`, `iol`, `olo`, `log`, `ogy`, `gy>`
- FastText combines the embeddings of these n-grams to form the embedding for “biology”.
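A small helper reproduces this decomposition (FastText additionally treats the whole word with boundary markers as its own token; this sketch only lists the character trigrams):

```python
def char_ngrams(word, n=3):
    # Add boundary markers so prefixes and suffixes get distinct n-grams.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("cat"))      # ['<ca', 'cat', 'at>']
print(char_ngrams("biology"))  # ['<bi', 'bio', 'iol', 'olo', 'log', 'ogy', 'gy>']
```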
Code Example Using FastText:
from gensim.models import FastText
# Sample corpus
sentences = [
["biology", "is", "the", "study", "of", "life"],
["fasttext", "uses", "subwords", "for", "embeddings"]
]
# Train FastText model
model = FastText(sentences, vector_size=50, window=3, min_count=1)
# Get embedding for a word
print("Vector for 'biology':", model.wv['biology'])
# Handle out-of-vocabulary word
print("Vector for 'biologee':", model.wv['biologee']) # Similar to 'biology'
Byte-Pair Encoding (BPE): Subword Tokenization
While FastText focuses on character n-grams, BPE is a popular tokenization technique used in modern models such as GPT (BERT uses the closely related WordPiece scheme). BPE breaks words into smaller, frequently occurring units (subwords), allowing embeddings to be shared across similar words.
How BPE Works
- Start with Characters:
  - Each word is initially represented as a sequence of characters.
- Merge Frequent Pairs:
  - The most frequent adjacent pairs of symbols are merged into subwords, iteratively.
- Form a Vocabulary:
  - After several iterations, a vocabulary of subwords is created.
Example: Applying BPE to “uncommon”
- Start: `["u", "n", "c", "o", "m", "m", "o", "n"]`
- Merge frequent pairs: `["un", "c", "o", "m", "m", "o", "n"]`
- Merge again: `["un", "com", "m", "o", "n"]`
- After further merges, final subwords: `["un", "common"]`
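Here is a compact from-scratch sketch of the merge loop on a toy corpus with made-up word frequencies; real implementations also record the learned merge rules so that new text can be segmented consistently:

```python
from collections import Counter

# Toy corpus: each word as a tuple of symbols, with a made-up frequency.
corpus = {
    ("u", "n", "c", "o", "m", "m", "o", "n"): 5,
    ("c", "o", "m", "m", "o", "n"): 10,
    ("u", "n", "d", "o"): 8,
}

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    # Rewrite every word, replacing occurrences of `pair` with a merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(6):  # six merge iterations
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair, "->", list(corpus))
```

After six merges on this toy corpus, “uncommon” ends up segmented as ("un", "common"), matching the example above.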
Mathematical Advantage
By reusing subword tokens, BPE can handle unseen words by decomposing them into familiar subwords. For instance, “bioluminescence” might be split as:
\[ \text{"bio"} + \text{"luminescence"} \]

Both subwords likely exist in the model’s vocabulary, providing meaningful embeddings.
Code Example Using BPE (with Hugging Face):
from tokenizers import ByteLevelBPETokenizer

# Train a BPE tokenizer on in-memory text; train() expects file paths,
# so train_from_iterator() is used for raw strings
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    ["biology is amazing", "fasttext is robust"],
    vocab_size=500,  # byte-level BPE already starts from 256 base byte tokens
)

# Tokenize a word
tokens = tokenizer.encode("bioluminescence")
print("Tokens:", tokens.tokens)
Comparison: FastText vs. BPE
| Feature | FastText | BPE |
|---|---|---|
| Basis | Character n-grams | Subword tokens |
| Handling Rare Words | Combines n-grams to approximate | Decomposes into frequent subwords |
| Computational Complexity | Lightweight | Requires pretraining a vocabulary |
| Use Case | Out-of-vocabulary words | Tokenization for transformers |
Mermaid.js Diagram: Subword Embedding Workflow
```mermaid
graph TD
    A[Input Word] --> B[Character Splitting]
    B --> C[N-Grams : FastText]
    B --> D[Frequent Pair Merging : BPE]
    C --> E[Subword Embeddings]
    D --> E
    E --> F[Final Word Embedding]
```
Challenges and Solutions
- Misspellings and Noisy Data:
  - Challenge: Excessive noise in the corpus can lead to meaningless n-grams.
  - Solution: Preprocess the data to clean text and standardize spelling.
- Memory Usage for Large Corpora:
  - Challenge: Storing embeddings for all n-grams or subwords can be memory-intensive.
  - Solution: Use compression techniques or limit the vocabulary size.
- Choosing Optimal Parameters:
  - Challenge: Selecting the right n-gram size or number of BPE iterations is non-trivial.
  - Solution: Experiment with hyperparameters using validation data.
7. Grand Finale: Comparing Word Embedding Techniques and Emerging Trends
Now that we’ve explored the major word embedding techniques, let’s wrap up by comparing them, understanding their strengths and weaknesses, and discussing how to choose the right approach for your NLP tasks. We’ll also peek into the future of embeddings and how they’re evolving with the rise of advanced AI models.
Comparative Analysis of Word Embedding Techniques
| Technique | Key Idea | Strengths | Weaknesses |
|---|---|---|---|
| One-Hot Encoding | Binary vector for each word | Simple and intuitive | High dimensionality, no semantic meaning |
| TF-IDF | Frequency-based weighting | Captures importance of words in documents | Sparse, lacks context |
| Word2Vec (CBOW) | Predict target word from context | Efficient, captures semantic relationships | Context-independent embeddings |
| Word2Vec (Skip-Gram) | Predict context words from target word | Works well for rare words | Requires more computation |
| GloVe | Factorizes co-occurrence matrix | Captures global statistical relationships | Memory-intensive, context-independent |
| ELMo | Contextual embeddings via BiLSTMs | Handles polysemy (words with multiple meanings), dynamic representations | Computationally expensive, largely superseded by Transformers |
| BERT | Contextual embeddings via Transformers | Bidirectional context understanding, fine-tunable for tasks | Resource-heavy, large model sizes |
| FastText | Subword-level n-grams | Handles rare words, morphology-aware | Limited scalability for very large datasets |
| BPE | Tokenization into subword units | Efficient for rare and compound words | Requires careful pretraining and vocabulary management |
Choosing the Right Technique
Selecting an embedding technique depends on your application, computational resources, and data characteristics:
- Simple Applications: Use TF-IDF or Word2Vec (CBOW) for tasks like text classification or clustering where context isn’t critical.
- Context-Dependent Tasks: For tasks like sentiment analysis or machine translation, opt for BERT or ELMo to capture word context dynamically.
- Dealing with Rare Words: If your data contains many rare or unseen words, FastText or BPE is the way to go.
- Large-Scale Applications: Use pre-trained models like BERT, GPT, or FastText to save training time and leverage robust embeddings.
Emerging Trends in Word Embeddings
The field of NLP is evolving rapidly, and embeddings are at the forefront of innovation. Here are some exciting developments:
1. Multilingual Embeddings
- Models like mBERT and XLM-R create embeddings for multiple languages in the same vector space, enabling cross-lingual NLP tasks.
2. Knowledge-Infused Embeddings
- Techniques are emerging to incorporate domain knowledge (e.g., medical or legal knowledge) into embeddings, enhancing their relevance and accuracy.
3. Efficient Transformers
- Lightweight models like DistilBERT and ALBERT reduce computational overhead while retaining high performance, making contextual embeddings accessible for resource-constrained applications.
4. Graph-Based Embeddings
- Graph embeddings represent entities and their relationships in a graph structure, offering powerful tools for tasks like knowledge graph completion.
5. Beyond Words: Multimodal Embeddings
- Combining text with other modalities like images or audio is an exciting frontier. For instance, models like CLIP create embeddings that connect text and visual data.
Final Words of Advice
Word embeddings are the unsung heroes of NLP, quietly powering applications from chatbots to search engines. As you choose a technique for your project, consider:
- The scale of your data and compute resources.
- The importance of context in your task.
- The need for handling rare words or misspellings.
And remember: pre-trained embeddings are often a great starting point, saving you from the heavy lifting of training from scratch.
Reference Section
- Word2Vec Explained
- GloVe: Global Vectors for Word Representation
- BERT: Pre-training of Deep Bidirectional Transformers
- FastText: Efficient Text Classification and Representation Learning
- ELMo: Deep Contextualized Word Representations
A Light-Humored Closing
Think of embeddings as the translator between humans and machines, a sort of multilingual diplomat. They don’t just understand “cat” and “dog,” but can also intuitively grasp that “cat memes” belong in the sacred hall of internet humor, while “dog barking” is more of a neighborhood annoyance. And isn’t that what machine learning is all about—bridging worlds, one embedding at a time? 😄