NLP Fundamentals: Essential Concepts and Techniques in Natural Language Processing
Raj Shaikh
1. Text Preprocessing & Cleaning
1.1. Understanding Text Normalization in NLP
Text normalization is a crucial preprocessing step in Natural Language Processing (NLP). It involves transforming raw text into a standard and uniform format to make it suitable for further processing. The purpose is to reduce noise and variations in the data while preserving the meaning of the text. This step ensures that downstream tasks like text classification, sentiment analysis, or machine translation perform effectively.
Sub-Contents:
- Lowercasing or Case Normalization
- Removing Punctuation, Special Characters, and Numbers
- Handling Contractions
- Spelling Corrections
A Comprehensive Guide to Text Normalization
1. Lowercasing or Case Normalization Lowercasing converts all text into lowercase to treat words like “Apple” and “apple” as the same. This reduces variations and simplifies text analysis.
Example:
text = "Natural Language Processing is AMAZING!"
normalized_text = text.lower()
print(normalized_text)  # Output: "natural language processing is amazing!"
Key Point:
Lowercasing is often used in tasks where case sensitivity does not add value, such as sentiment analysis. However, in tasks like Named Entity Recognition (NER), case information might be important.
2. Removing Punctuation, Special Characters, and Numbers Unnecessary characters like punctuation, special symbols, and numbers can introduce noise into the text. Removing them can improve the quality of features derived from text.
Example:
import re
text = "Hello, World! Welcome to 2023. Let's normalize this text."
normalized_text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
normalized_text = re.sub(r'\d+', '', normalized_text)  # Remove numbers
print(normalized_text)  # Output (approximately): "Hello World Welcome to Lets normalize this text"
Key Point:
Be cautious when removing characters. For example, removing apostrophes (') might affect contractions, and removing numbers could be problematic if numeric data is relevant (e.g., in financial text analysis).
3. Handling Contractions Contractions like “can’t” and “won’t” are common in informal text. Expanding them into their full forms can improve text analysis and model performance by providing a uniform representation.
Example:
from contractions import fix
text = "I can't believe it's not butter!"
expanded_text = fix(text)
print(expanded_text)  # Output: "I cannot believe it is not butter!"
Key Point:
Expanding contractions is particularly useful in tasks like sentiment analysis and chatbots, where informal text is prevalent.
4. Spelling Corrections Misspelled words can introduce noise into the data. Correcting spelling ensures consistent representation of words.
Example:
from spellchecker import SpellChecker
spell = SpellChecker()
text = "This is a beutiful example of corecting speling."
corrected_text = " ".join([spell.correction(word) for word in text.split()])
print(corrected_text) Output: "This is a beautiful example of correcting spelling."
Key Point:
Spelling correction is helpful when working with user-generated content, such as social media posts or reviews.
Real-World Applications:
- Search Engines: Improving query matching by normalizing variations in user input.
- Chatbots: Ensuring informal user input is interpreted accurately.
- Sentiment Analysis: Enhancing model accuracy by reducing noise in text.
By implementing these normalization techniques, NLP models can focus on the semantic essence of text, leading to better performance across various tasks.
1.2. Exploring Tokenization in NLP
Tokenization is the process of breaking down text into smaller units, called tokens. These tokens can be words, subwords, or even characters, depending on the level of granularity required. Tokenization is a foundational step in NLP, enabling models to work with textual data efficiently by transforming raw text into a structured format.
Sub-Contents:
- Word-Level Tokenization (e.g., Splitting on Whitespace)
- Subword Tokenization (Byte Pair Encoding, WordPiece, SentencePiece)
- Language-Specific Challenges (e.g., Chinese Segmentation)
A Deep Dive into Tokenization Techniques in NLP
1. Word-Level Tokenization Word-level tokenization splits text into individual words, often based on whitespace or punctuation. It’s a straightforward approach and works well for languages like English, where words are clearly separated by spaces.
Example:
text = "Natural Language Processing is fascinating!"
tokens = text.split()  # Split on whitespace
print(tokens)  # Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating!']
Key Point:
While simple and intuitive, word-level tokenization has challenges, such as handling compound words, contractions, and punctuation. For instance, “fascinating!” may need further cleaning.
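One common remedy, sketched here assuming NLTK and its punkt tokenizer data are installed, is to use a tokenizer that separates trailing punctuation instead of splitting only on whitespace:
from nltk.tokenize import word_tokenize
print(word_tokenize("Natural Language Processing is fascinating!"))
# Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']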
2. Subword Tokenization Subword tokenization breaks words into smaller units, enabling models to handle rare or unseen words more effectively. It is widely used in modern NLP models like BERT and GPT.
2.1 Byte Pair Encoding (BPE) BPE starts by treating each character as a token and iteratively merging the most frequent pairs of tokens to form subwords.
Example:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=50, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["natural language processing"], trainer)
tokens = tokenizer.encode("processing")
print(tokens.tokens)  # Tokens depend on the learned merges, e.g. ['pro', 'cess', 'ing']
Key Point:
BPE handles rare words by splitting them into meaningful subword units, reducing the vocabulary size and addressing issues of out-of-vocabulary (OOV) words.
2.2 WordPiece WordPiece is similar to BPE but differs in how it selects which pairs of tokens to merge, focusing on maximizing the likelihood of the training data.
Example:
Used in BERT, WordPiece helps in compressing vocabulary while preserving the ability to represent rare words.
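Although WordPiece training itself is rarely run by hand, its effect is easy to observe with a pre-trained WordPiece vocabulary. A minimal sketch, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint are available:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("tokenization"))
# Example output: ['token', '##ization'] (the '##' prefix marks a continuation subword)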
2.3 SentencePiece SentencePiece treats text as a continuous stream, eliminating the need for pre-tokenization (e.g., splitting by whitespace). It uses BPE or unigram language models for subword tokenization.
Example:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load("m.model") Assuming a pre-trained SentencePiece model
tokens = sp.encode_as_pieces("Natural Language Processing")
print(tokens) Output: ['▁Natural', '▁Language', '▁Processing']
Key Point:
SentencePiece is especially useful for languages without clear word boundaries, like Chinese or Japanese.
3. Language-Specific Challenges Tokenization becomes complex for languages with unique structures or lack of clear word boundaries.
Chinese Segmentation Chinese text lacks explicit spaces between words, making tokenization challenging. Tools like Jieba or statistical models are often used.
Example:
import jieba
text = "自然语言处理很有趣"
tokens = jieba.lcut(text)  # Use Jieba for Chinese word segmentation
print(tokens)  # Output: ['自然语言处理', '很', '有趣']
Key Challenges in Other Languages:
- Arabic: Handling diacritics and rich morphology.
- German: Splitting compound words like “Donaudampfschifffahrtsgesellschaftskapitän.”
- Japanese: Similar to Chinese, requiring specialized tokenizers like MeCab.
Real-World Applications:
- Machine Translation: Ensuring token consistency for better alignment across languages.
- Search Engines: Matching queries with documents effectively using subword tokenization.
- Chatbots: Handling user inputs across languages with different tokenization needs.
Tokenization forms the backbone of most NLP tasks. By understanding and applying the appropriate technique, we can ensure that models handle diverse languages and contexts efficiently, leading to better performance and adaptability.
1.3. The Role of Stop Words in NLP
Stop words are common words in a language, such as “the,” “and,” “of,” that typically carry little semantic meaning on their own. Removing them during text preprocessing can help reduce the dimensionality of data and improve computational efficiency. However, their removal depends on the specific NLP task and domain context.
Sub-Contents:
- Common English Stop Words (e.g., “the,” “and,” “of”)
- Domain-Specific Stop Words (Finance, Legal, etc.)
- When and Why to Remove Stop Words (or Not)
Stop Words Removal in NLP: A Practical Guide
1. Common English Stop Words Stop words in English include articles, prepositions, conjunctions, and other frequently used words that do not significantly contribute to the meaning of a sentence.
Example of Common Stop Words:
- Articles: “the,” “a,” “an”
- Conjunctions: “and,” “or,” “but”
- Prepositions: “in,” “on,” “at”
- Others: “is,” “was,” “were”
Example Code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog."
stop_words = set(stopwords.words('english'))  # Load English stop words
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word.isalpha()]  # Drop stop words and punctuation tokens
print(filtered_tokens)  # Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
Key Point:
While removing common stop words reduces noise, it might also discard meaningful context in certain applications, like sentiment analysis.
2. Domain-Specific Stop Words In specialized domains, certain words may be irrelevant and treated as stop words, even if they carry meaning in general language.
Examples:
- Finance: “stock,” “price,” “market” (depending on the task)
- Legal: “hereby,” “therefore,” “pursuant”
- Healthcare: “patient,” “symptom,” “treatment”
Example Code:
custom_stop_words = {"market", "stock", "price"} Example for finance
text = "The stock price of the company increased significantly."
tokens = text.split()
filtered_tokens = [word for word in tokens if word.lower() not in custom_stop_words]
print(filtered_tokens) Output: ['The', 'of', 'the', 'company', 'increased', 'significantly.']
Key Point:
Domain-specific stop words should be defined based on the context and task. For example, in stock sentiment analysis, “stock” might not be removed as it can provide crucial context.
3. When and Why to Remove Stop Words (or Not)
When to Remove Stop Words:
- Text Classification: Stop words can add noise without contributing to class differentiation.
- Topic Modeling: Removing them helps identify more meaningful topics.
- Information Retrieval: Enhances search efficiency by ignoring irrelevant terms.
When Not to Remove Stop Words:
- Sentiment Analysis: Words like “not,” “but,” and “however” may convey critical sentiment information.
- Named Entity Recognition (NER): Stop words might provide context for identifying entities.
- Language-Specific Tasks: In languages like Chinese or Japanese, frequent words might hold significance and shouldn’t always be removed.
Illustration of Impact:
Consider the sentence:
“This movie is not good.”
If “not” is removed, the sentiment flips, leading to incorrect conclusions.
Real-World Applications:
- Search Engines: Enhancing retrieval speed by ignoring stop words in queries.
- Chatbots: Optimizing intent recognition by removing irrelevant words.
- Summarization: Focusing on key phrases by filtering out stop words.
Stop word removal is a strategic choice in NLP preprocessing. Understanding the task and context ensures that this preprocessing step enhances rather than hinders model performance.
1.4. Simplifying Text with Stemming and Lemmatization
Stemming and lemmatization are two techniques used in Natural Language Processing (NLP) to reduce words to their root or base form. The goal is to standardize words for efficient processing, particularly in tasks like text classification, sentiment analysis, and search systems. While both techniques serve a similar purpose, they differ in their approaches and outcomes.
Sub-Contents:
- Stemming (Porter, Snowball, Lancaster Stemmer)
- Lemmatization (Using Part-of-Speech Tags for More Accurate Root Forms)
- Trade-Offs: Simplicity vs. Accuracy
A Practical Guide to Stemming and Lemmatization
1. Stemming Stemming is a rule-based process that removes suffixes or prefixes from words to derive their base form, often without regard to the word’s meaning. It is a simpler and faster approach compared to lemmatization.
Popular Stemmer Algorithms:
- Porter Stemmer: Uses heuristic rules for suffix removal.
- Snowball Stemmer: An improved version of Porter Stemmer, supporting multiple languages.
- Lancaster Stemmer: A more aggressive stemmer that often produces shorter stems.
Example Code:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
text = ["running", "jumps", "easily", "studies"]
# Porter Stemmer
porter = PorterStemmer()
porter_stems = [porter.stem(word) for word in text]
print(porter_stems)  # Output: ['run', 'jump', 'easili', 'studi']
# Snowball Stemmer
snowball = SnowballStemmer("english")
snowball_stems = [snowball.stem(word) for word in text]
print(snowball_stems)  # Output: ['run', 'jump', 'easili', 'studi']
# Lancaster Stemmer
lancaster = LancasterStemmer()
lancaster_stems = [lancaster.stem(word) for word in text]
print(lancaster_stems)  # Output: ['run', 'jump', 'easy', 'study']
Key Point:
Stemming is computationally inexpensive but often produces results that are not actual words, which can lead to misinterpretation.
2. Lemmatization Lemmatization reduces words to their dictionary (lemma) form, taking into account the word’s meaning and context. It uses part-of-speech (POS) tags to ensure accuracy.
Example Code:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag
# Helper function to map Penn Treebank POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
text = ["running", "jumps", "easily", "studies"]
lemmatizer = WordNetLemmatizer()
# Lemmatize with POS tagging
tokens_pos = pos_tag(text)  # Add POS tags
lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in tokens_pos]
print(lemmatized)  # Output: ['run', 'jump', 'easily', 'study']
Key Point:
Lemmatization is more accurate than stemming but requires additional resources like POS tagging and word dictionaries, making it computationally more expensive.
3. Trade-Offs: Simplicity vs. Accuracy
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Approach | Rule-based suffix removal | Dictionary-based reduction |
| Speed | Fast | Slower |
| Output | May not be real words | Always produces valid words |
| Accuracy | Less accurate | Highly accurate |
| Use Cases | Quick preprocessing | Context-sensitive tasks |
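A quick side-by-side on a couple of words makes the difference tangible (a sketch, assuming the required NLTK data is downloaded):
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word, pos in [("studies", "n"), ("better", "a")]:
    print(word, "->", stemmer.stem(word), "(stem) vs.", lemmatizer.lemmatize(word, pos=pos), "(lemma)")
# studies -> studi (stem) vs. study (lemma)
# better -> better (stem) vs. good (lemma)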
When to Use Stemming:
- When computational resources are limited.
- When precision is not critical, e.g., in exploratory data analysis.
When to Use Lemmatization:
- For applications requiring context and linguistic accuracy.
- Tasks like machine translation or question answering.
Real-World Applications:
- Search Engines: Improving search relevance by normalizing word forms.
- Chatbots: Understanding variations of user queries (e.g., “run,” “running”).
- Sentiment Analysis: Ensuring accurate word representation for sentiment-bearing terms.
By understanding the strengths and limitations of stemming and lemmatization, practitioners can choose the right technique based on task requirements and resource availability.
1.5. Handling Special Cases in Text Preprocessing
Special cases such as URLs, email addresses, hashtags, mentions, emojis, emoticons, accents, and diacritics present unique challenges in text preprocessing. Proper handling of these elements ensures that critical information is preserved or appropriately transformed for downstream tasks.
Sub-Contents:
- URLs, Email Addresses, Hashtags, Mentions
- Emojis and Emoticons
- Accents and Diacritics
Addressing Special Cases in Text Data
1. URLs, Email Addresses, Hashtags, Mentions These elements often appear in social media data, emails, or web content. Depending on the application, they may be removed, replaced, or extracted as features.
Handling URLs:
URLs can either be removed to reduce noise or replaced with a placeholder like <URL>.
Example Code:
import re
text = "Check out https://example.com for more details!"
text_without_urls = re.sub(r'http\S+', '<URL>', text)  # Replace URLs with <URL>
print(text_without_urls)  # Output: "Check out <URL> for more details!"
Handling Email Addresses: Emails can similarly be replaced or extracted.
Example Code:
text = "Contact us at info@example.com for inquiries."
text_without_emails = re.sub(r'\S+@\S+\.\S+', '<EMAIL>', text)
print(text_without_emails)  # Output: "Contact us at <EMAIL> for inquiries."
Handling Hashtags and Mentions:
Hashtags (#) and mentions (@) are key elements in social media and can be removed, tokenized, or retained for specific analyses.
Example Code:
text = "Follow us @OpenAI and check AI trends!"
text_without_hashtags_mentions = re.sub(r'[@]\w+', '', text)
print(text_without_hashtags_mentions) Output: "Follow us and check trends!"
Key Point:
Extracting hashtags and mentions can provide metadata for social media sentiment or trend analysis.
2. Emojis and Emoticons Emojis and emoticons often convey sentiment or context in informal text. They can be converted to descriptive text or removed, depending on the use case.
Handling Emojis:
Libraries like emoji can be used to translate emojis into textual descriptions.
Example Code:
import emoji
text = "I love NLP! 😊🔥"
text_with_emoji_replaced = emoji.demojize(text)
print(text_with_emoji_replaced)  # Output (approximately): "I love NLP! :smiling_face_with_smiling_eyes::fire:"
Handling Emoticons: Regex can identify common emoticons for removal or replacement.
Example Code:
text = "Great work! :) Keep it up! :("
text_without_emoticons = re.sub(r'[:;][\-]?[)(D]', '<EMOTICON>', text)
print(text_without_emoticons)  # Output: "Great work! <EMOTICON> Keep it up! <EMOTICON>"
Key Point:
Retaining emojis and emoticons in a structured format can enhance sentiment analysis in informal communication.
3. Accents and Diacritics Accents and diacritics are common in non-English text and can affect text normalization. Depending on the context, these can be removed for uniformity or retained for linguistic integrity.
Removing Accents and Diacritics:
Libraries like unicodedata can normalize text to remove accents.
Example Code:
import unicodedata
text = "Résumé and naïve are commonly accented words."
text_without_accents = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
print(text_without_accents)  # Output: "Resume and naive are commonly accented words."
Key Point:
For applications like machine translation, retaining accents is important, while for general NLP tasks like search, removal can reduce noise.
Real-World Applications:
- Social Media Analysis: Extracting and analyzing hashtags, mentions, and emojis for trend tracking or sentiment analysis.
- Email Processing: Preprocessing emails for spam detection or customer service automation.
- Multilingual NLP: Handling accents and diacritics for better tokenization and text matching.
By effectively handling these special cases, text preprocessing can ensure that data is cleaner, more structured, and ready for efficient processing across various NLP tasks.
1.6. Addressing Unicode and Encoding Issues in Text Processing
Unicode and encoding issues arise when text data contains diverse character sets, special symbols, or diacritics, which may not be uniformly represented. Ensuring consistent encoding and normalization is essential for robust and error-free text processing, especially in multilingual or heterogeneous datasets.
Sub-Contents:
- Ensuring Consistent Character Encoding (UTF-8)
- Normalizing Diacritic Forms (NFC vs. NFD)
Handling Unicode and Encoding Issues in Text Data
1. Ensuring Consistent Character Encoding (UTF-8) Character encoding defines how characters are stored and represented. Using a consistent encoding format, like UTF-8, avoids issues such as garbled text or decoding errors.
Common Challenges:
- Mixed encodings in datasets (e.g., UTF-8 and ISO-8859-1).
- Non-standard characters causing errors during processing.
Example Code:
# Ensure consistent encoding when reading files
with open("example.txt", "r", encoding="utf-8") as file:
    text = file.read()
# Encode and decode to ensure consistency
text = text.encode("utf-8").decode("utf-8")
print(text)
Key Point:
Always specify the encoding explicitly when reading or writing files to avoid relying on platform-specific defaults.
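A small illustration of what goes wrong when the declared encoding does not match the bytes (a sketch with hard-coded strings):
data = "café".encode("utf-8")
print(data.decode("latin-1"))  # 'cafÃ©' (mojibake: UTF-8 bytes read as Latin-1)
print(data.decode("utf-8"))    # 'café'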
2. Normalizing Diacritic Forms (NFC vs. NFD)
Unicode provides multiple ways to represent certain characters. For example, é can be represented as:
- NFC (Composed Form): A single precomposed character (é).
- NFD (Decomposed Form): A base character (e) followed by a combining diacritic (´).
Normalizing these forms ensures consistency across text processing pipelines.
NFC vs. NFD Example:
import unicodedata
text = "é"
nfc_form = unicodedata.normalize("NFC", text)
nfd_form = unicodedata.normalize("NFD", text)
print(f"NFC: {nfc_form} | Code points: {[ord(c) for c in nfc_form]}")
# Output: NFC: é | Code points: [233]
print(f"NFD: {nfd_form} | Code points: {[ord(c) for c in nfd_form]}")
# Output: NFD: é | Code points: [101, 769]
Key Considerations:
- Use NFC for compatibility with most file systems, search engines, and databases.
- Use NFD when further processing requires separating base characters and diacritics (e.g., phonetic analysis).
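Whichever form you standardize on, the key point is that the two forms are different strings until they are normalized, as this small check shows:
import unicodedata
nfc = "\u00e9"        # é as one precomposed code point
nfd = "e\u0301"       # e followed by a combining acute accent
print(nfc == nfd)                                # False
print(unicodedata.normalize("NFC", nfd) == nfc)  # True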
Converting Between Forms:
text = "Café"
nfc_text = unicodedata.normalize("NFC", text)
nfd_text = unicodedata.normalize("NFD", text)
print(nfc_text)  # Output: Café
print(nfd_text)  # Output: Café (renders identically, but uses different underlying code points)
Real-World Applications:
- Data Integration: Ensuring consistent encoding across data sources, especially in multilingual datasets.
- Search Engines: Handling diacritics uniformly to improve search results (e.g., treating “résumé” and “resume” equivalently).
- Natural Language Processing: Avoiding errors in tokenization or stemming caused by inconsistent Unicode forms.
By addressing Unicode and encoding issues, you ensure that text data remains consistent, reliable, and ready for processing in diverse NLP workflows.
2. Text Representation Basics
2.1. Understanding the Bag-of-Words (BoW) Model in NLP
The Bag-of-Words (BoW) model is a foundational technique in text representation, where a text is represented as a set of word frequencies or occurrences, disregarding grammar, order, or structure. It is widely used for tasks like text classification, clustering, and feature extraction due to its simplicity and effectiveness.
Sub-Contents:
- Concept and Construction (Count Matrix)
- Vocabulary Building, Handling Rare or Frequent Terms
- Advantages (Simplicity, Interpretability) and Limitations (Ignores Word Order)
The Bag-of-Words (BoW) Model: A Practical Guide
1. Concept and Construction (Count Matrix) In the BoW model:
- Each document is represented as a vector of word counts or binary occurrences.
- The model uses a vocabulary (a set of unique words) derived from the entire corpus.
Steps to Construct BoW:
- Tokenize the text into words.
- Build a vocabulary of unique words.
- Create a count matrix where each row represents a document, and each column represents a word’s frequency or presence (binary).
Example Code:
from sklearn.feature_extraction.text import CountVectorizer
# Example corpus
corpus = [
    "Natural Language Processing is amazing",
    "Bag of Words model is simple and effective",
    "Words are the basic building blocks"
]
# Construct BoW
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)
# Vocabulary and BoW matrix
print("Vocabulary:", vectorizer.get_feature_names_out())
# Output: Vocabulary: ['amazing', 'and', 'are', 'bag', 'basic', 'blocks', ...]
print("BoW Matrix:\n", bow_matrix.toarray())
# Output: [[1 0 0 0 ...], [0 1 0 1 ...], ...]
2. Vocabulary Building, Handling Rare or Frequent Terms The vocabulary is a critical component in the BoW model. Decisions regarding rare or frequent terms impact the performance of downstream tasks.
Handling Rare Terms:
- Rare terms may increase dimensionality without adding significant value.
- Use a minimum document frequency threshold to exclude rare terms.
Handling Frequent Terms:
- Frequent terms like stop words can dominate the representation and reduce effectiveness.
- Use a maximum document frequency threshold to exclude overly common words.
Example Code:
vectorizer = CountVectorizer(min_df=2, max_df=0.8, stop_words="english")
bow_matrix = vectorizer.fit_transform(corpus)
print("Filtered Vocabulary:", vectorizer.get_feature_names_out())
3. Advantages and Limitations
Advantages:
- Simplicity: Easy to implement and understand.
- Interpretability: Word counts are intuitive to analyze.
- Effectiveness: Works well for text classification and clustering when combined with simple models like Naive Bayes.
Limitations:
- Ignores Word Order: Fails to capture the context or sequence of words (e.g., “not good” vs. “good”).
- Sparse Representation: Large vocabularies lead to sparse matrices with high dimensionality.
- Semantic Loss: Words with similar meanings are treated as independent (e.g., “happy” and “joyful”).
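The first limitation is easy to demonstrate: two sentences with opposite meanings can receive identical BoW vectors, as in this short sketch:
from sklearn.feature_extraction.text import CountVectorizer
docs = ["the dog bit the man", "the man bit the dog"]
X = CountVectorizer().fit_transform(docs)
print((X[0] != X[1]).nnz == 0)  # True: both rows contain exactly the same counts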
Real-World Applications:
- Text Classification: Spam detection, sentiment analysis, or topic categorization.
- Information Retrieval: Building document-term matrices for search engines.
- Clustering: Grouping similar documents using clustering algorithms.
The Bag-of-Words model remains a popular choice for introductory NLP tasks due to its simplicity and effectiveness. However, for more nuanced text representations, models like TF-IDF or word embeddings can be used to address its limitations.
2.2. Exploring N-Grams in Text Representation
N-grams are contiguous sequences of words or characters in a text, often used in Natural Language Processing (NLP) to capture local context. They serve as an extension of the Bag-of-Words model by incorporating word order and contextual information, balancing simplicity and expressiveness.
Sub-Contents:
- Unigrams, Bigrams, Trigrams, etc.
- Impact on Dimensionality and Capturing Local Context
- Trade-Off Between Capturing More Context vs. Data Sparsity
N-Grams: A Practical Guide to Contextual Text Representation
1. Unigrams, Bigrams, Trigrams, etc.
An n-gram is a sequence of n words or characters:
- Unigrams: Single words (e.g., “I”, “love”, “NLP”).
- Bigrams: Pairs of consecutive words (e.g., “I love”, “love NLP”).
- Trigrams: Triplets of consecutive words (e.g., “I love NLP”).
Example Code:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"I love Natural Language Processing",
"N-grams are very useful in text representation"
]
# Generate unigrams and bigrams (token_pattern keeps single-character words like "i")
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
ngram_matrix = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
# Output: ['are', 'are very', ..., 'i', 'i love', 'love', 'love natural', 'natural', ...]
print("N-gram Matrix:\n", ngram_matrix.toarray())
# Each row is a document; each column counts one unigram or bigram
2. Impact on Dimensionality and Capturing Local Context Dimensionality:
- Adding n-grams increases the size of the vocabulary exponentially with n.
- Higher n leads to a more detailed representation but risks sparsity.
Local Context:
- Unigrams: Capture word-level frequency but lose sequence information.
- Bigrams: Add immediate contextual relationships.
- Trigrams and Beyond: Capture more context but often lead to sparsity and overfitting.
Example: For the sentence “I love NLP”:
- Unigrams: ['I', 'love', 'NLP']
- Bigrams: ['I love', 'love NLP']
- Trigrams: ['I love NLP']
3. Trade-Off Between Capturing More Context vs. Data Sparsity
Capturing More Context:
- Higher-order n-grams (e.g., trigrams, 4-grams) capture richer syntactic and semantic relationships.
- Useful in tasks like machine translation or text generation.
Data Sparsity:
- Larger n-grams exponentially increase the number of possible combinations, leading to sparse matrices and increased computational complexity.
- Sparse representations require more data to ensure that n-grams appear frequently enough to be meaningful.
Balancing Trade-Offs:
- Choose n-gram ranges based on task requirements and dataset size.
- Use smoothing techniques or feature selection to handle sparsity.
- Combine n-grams with dimensionality reduction techniques like PCA or Latent Semantic Analysis (LSA).
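As an illustration of the last point, one common recipe is to build an n-gram representation and then compress it with truncated SVD (the core of LSA). A sketch with a made-up three-document corpus:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
docs = ["I love NLP", "I love machine learning", "NLP is fun"]  # toy corpus (assumption)
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)     # unigrams + bigrams
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)       # dense, low-dimensional vectors
print(X.shape, "->", X_reduced.shape)  # e.g. (3, 11) -> (3, 2); exact width depends on tokenization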
Real-World Applications:
- Text Classification: Bigrams and trigrams often improve performance by capturing context (e.g., “not good” vs. “good”).
- Machine Translation: Higher-order n-grams model phrase-level semantics.
- Text Generation: Generating coherent sequences in language models.
N-grams bridge the gap between simplicity and context awareness, making them a versatile tool in NLP. Careful selection of n and preprocessing techniques can maximize their effectiveness while mitigating issues of sparsity and dimensionality.
2.3. TF-IDF (Term Frequency–Inverse Document Frequency)
TF-IDF (Term Frequency–Inverse Document Frequency) is a numerical statistic used to represent the importance of a word in a document relative to a collection of documents (corpus). It enhances simple word frequency models by accounting for the rarity or commonness of words, making it especially useful in tasks like information retrieval and text classification.
Sub-Contents:
- Equation and Interpretation of TF, IDF, and TF-IDF
- Why IDF Helps Weigh Down Common Words (Like Stop Words)
- Practical Considerations (Log Scaling, Smoothing)
A Comprehensive Guide to TF-IDF
1. Equation and Interpretation of TF, IDF, and TF-IDF
Term Frequency (TF): Measures how often a term appears in a document.
\[ TF(t, d) = \frac{\text{Count of } t \text{ in } d}{\text{Total terms in } d} \]
Inverse Document Frequency (IDF): Measures how unique a term is across the corpus.
\[ IDF(t, D) = \log\left(\frac{N}{1 + \text{DF}(t)}\right) \]
Where:
- \( N \): Total number of documents.
- \( \text{DF}(t) \): Number of documents containing the term \( t \).
- Adding \( 1 \) in the denominator avoids division by zero.
TF-IDF: Combines TF and IDF to give a weighted importance of a term in a document.
\[ TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D) \]
Interpretation:
- High TF-IDF: Indicates a term is frequent in a document but rare in the corpus.
- Low TF-IDF: Indicates a term is either common across the corpus or infrequent in the document.
Example Code:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"Natural Language Processing is amazing",
"TF-IDF helps highlight unique terms",
"TF-IDF is widely used in NLP"
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
# Vocabulary and TF-IDF scores
print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
2. Why IDF Helps Weigh Down Common Words (Like Stop Words)
Problem with Common Words: Words like “the,” “is,” or “and” occur frequently across documents but carry little discriminative power.
Role of IDF:
- IDF assigns lower weights to terms that appear in many documents.
- This diminishes the impact of common words and amplifies the significance of rare, informative terms.
Example:
If a term \( t \) appears in all documents, \( \text{DF}(t) = N \), so \( IDF(t, D) = \log\left(\frac{N}{N + 1}\right) \approx 0 \).
Thus, the TF-IDF score for such terms is close to zero, no matter how often they occur in a single document.
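Plugging small numbers into the IDF formula above shows the effect (a toy calculation with N = 4 documents):
import math
N = 4
print(round(math.log(N / (1 + 1)), 3))  # rare term, DF = 1  -> 0.693
print(round(math.log(N / (1 + 4)), 3))  # term in every document, DF = 4 -> -0.223 (effectively down-weighted)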
3. Practical Considerations (Log Scaling, Smoothing)
Log Scaling:
- Helps mitigate the impact of very high term frequencies.
- Without scaling, terms with high TF can disproportionately dominate.
Smoothing IDF:
- Adding \( 1 \) to \( \text{DF}(t) \) avoids division by zero for terms that do not appear in any document.
- This ensures numerical stability and prevents undefined values.
Sublinear TF Scaling:
- Applies log scaling to TF values: \[ TF(t, d) = 1 + \log(\text{Raw count of } t \text{ in } d) \]
- Useful for reducing the impact of highly repetitive words in a single document.
Example Code with Smoothing:
vectorizer = TfidfVectorizer(smooth_idf=True, sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(corpus)
print("Vocabulary with smoothing:", vectorizer.get_feature_names_out())
print("Smoothed TF-IDF Matrix:\n", tfidf_matrix.toarray())
Real-World Applications:
- Information Retrieval: Ranking documents based on relevance to a query.
- Text Classification: Creating feature vectors for machine learning models.
- Keyword Extraction: Identifying important terms in documents.
TF-IDF remains one of the most effective and interpretable methods for text representation. By balancing term frequency with rarity, it provides a robust way to highlight meaningful words in a corpus while downplaying common terms.
2.4. Word Embeddings
Word embeddings are dense vector representations of words that capture their semantic meanings based on context. Unlike traditional methods like Bag-of-Words or TF-IDF, embeddings encode relationships between words in a continuous vector space, enabling models to understand linguistic and semantic nuances.
Sub-Contents:
- Word2Vec (Skip-gram, CBOW): How They Learn Context-Based Embeddings
- GloVe (Global Vectors for Word Representation): Using Global Co-occurrence Statistics
- fastText: Subword Embeddings for Out-of-Vocabulary Handling
- Dimensionality, Semantic Relationships, and Analogy Tasks
Word Embeddings: Contextual Representations for NLP
1. Word2Vec Word2Vec, introduced by Mikolov et al., uses neural networks to learn word embeddings based on context. There are two main architectures:
1.1 Skip-gram Model:
- Predicts the context words given a target word.
- Objective: Maximize the probability of context words around a target word.
1.2 Continuous Bag of Words (CBOW):
- Predicts the target word given its context words.
- Objective: Maximize the probability of the target word based on surrounding words.
Example Code:
from gensim.models import Word2Vec
# Example corpus (each sentence is a list of tokens)
sentences = [["I", "love", "natural", "language", "processing"],
             ["word", "embeddings", "are", "powerful", "tools"]]
# Train Word2Vec using Skip-gram (min_count=1 keeps the words of this tiny corpus)
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)  # sg=1 for Skip-gram, sg=0 for CBOW
# Retrieve the vector for a word
vector = model.wv["love"]
print("Vector for 'love':", vector)
2. GloVe (Global Vectors for Word Representation) GloVe is a matrix factorization-based model that creates embeddings by leveraging global co-occurrence statistics of words in a corpus.
Core Idea:
- Words appearing in similar contexts have similar embeddings.
- Co-occurrence matrix \( X_{ij} \): Counts of word \( j \) appearing in the context of word \( i \).
- Objective function minimizes the difference between the dot product of word embeddings and the logarithm of co-occurrence counts:
\[ J = \sum_{i, j = 1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \]
Where \( f(X_{ij}) \) is a weighting function to handle sparsity, \( V \) is the vocabulary size, and \( b_i, \tilde{b}_j \) are bias terms.
Example Code:
# Example: Using pre-trained GloVe vectors via gensim's downloader
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-50")  # Load 50-dimensional GloVe vectors
vector = model["love"]
print("Vector for 'love':", vector)
3. fastText fastText, developed by Facebook, extends Word2Vec by representing words as the sum of their subword (character n-gram) embeddings. This approach:
- Handles out-of-vocabulary (OOV) words by composing embeddings from subwords.
- Improves performance for morphologically rich languages.
Example Code:
from gensim.models import FastText
# Example corpus (each sentence is a list of tokens)
sentences = [["I", "love", "natural", "language", "processing"],
             ["word", "embeddings", "are", "powerful", "tools"]]
# Train fastText model with character n-grams of length 3 to 6
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=6)
# Get vector for an OOV word (composed from its subword n-grams)
vector = model.wv["unknownword"]
print("Vector for 'unknownword':", vector)
4. Dimensionality, Semantic Relationships, and Analogy Tasks Word embeddings are typically low-dimensional (e.g., 50–300 dimensions), capturing:
- Semantic relationships: Words with similar meanings are closer in the vector space (e.g., “dog” and “cat”).
- Syntactic relationships: Words with similar grammatical roles are near each other (e.g., “walk” and “walked”).
Analogy Tasks: Word embeddings encode arithmetic relationships. For example:
\[ \text{"king"} - \text{"man"} + \text{"woman"} \approx \text{"queen"} \]Example Code for Analogies:
Using GloVe model
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print("Result for analogy (king - man + woman):", result)
Output: [('queen', 0.8)]
Advantages and Applications:
- Dimensionality Reduction: Dense vectors significantly reduce the space required compared to one-hot encoding.
- Semantic Insights: Enable understanding of word meanings and relationships.
- Downstream Tasks: Used in text classification, clustering, and similarity computations.
By capturing rich semantic and syntactic relationships, word embeddings revolutionized NLP, paving the way for contextual embeddings like BERT and GPT.
2.5. Out-of-Vocabulary Words
Out-of-Vocabulary (OOV) words are terms that do not appear in the training vocabulary of a model, posing challenges for traditional word embedding methods. Addressing OOV words is critical for handling rare words, new terms, or morphologically rich languages.
Sub-Contents:
- How Classic Embeddings Handle Unknown or Rare Words
- Using Subword Approaches to Mitigate OOV Issues
Managing Out-of-Vocabulary Words in Word Embeddings
1. How Classic Embeddings Handle Unknown or Rare Words
Classic word embedding methods like Word2Vec and GloVe require a fixed vocabulary built from the training corpus. Words outside this vocabulary (OOV words) are problematic because they have no precomputed embeddings.
Challenges with Classic Embeddings:
- Unknown Token: OOV words are often replaced with a generic placeholder like <UNK>, leading to loss of semantic information.
- Sparse Vocabulary: Rare or infrequent words may not be included in the vocabulary due to frequency thresholds.
- No Morphological Generalization: These embeddings do not capture relationships between morphologically similar words (e.g., “run” vs. “running”).
Example of OOV Issue:
from gensim.models import Word2Vec
# Example corpus and model
sentences = [["I", "love", "NLP"], ["Word2Vec", "is", "awesome"]]
model = Word2Vec(sentences, vector_size=50, min_count=2)  # min_count excludes rare words
# Accessing an OOV word raises a KeyError
try:
    vector = model.wv["unknown"]
except KeyError:
    print("Word is out-of-vocabulary!")  # Output: Word is out-of-vocabulary!
2. Using Subword Approaches to Mitigate OOV Issues
Subword-based models address the limitations of classic embeddings by breaking words into smaller components, such as character n-grams. This allows the model to generate embeddings for OOV words by composing their subword representations.
2.1 fastText: Subword Embeddings
- fastText extends Word2Vec by learning embeddings for character n-grams and composing them to represent words.
- Handles OOV words by summing the embeddings of their constituent n-grams.
Example Code:
from gensim.models import FastText
# Train fastText model
sentences = [["I", "love", "NLP"], ["word", "embeddings", "are", "useful"]]
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=6)
# Vector for an OOV word
vector = model.wv["unseenword"]
print("Embedding for 'unseenword':", vector)
2.2 Byte Pair Encoding (BPE)
- Used in subword tokenizers like SentencePiece or Byte-Pair Encoding (BPE).
- Breaks words into smaller, frequent subunits, enabling embeddings for rare or compound words.
Example Using BPE Tokenization:
import sentencepiece as spm
# Assumes a SentencePiece model has already been trained
sp = spm.SentencePieceProcessor(model_file="bpe_model.model")
# Tokenize an OOV word
tokens = sp.encode("unseenword", out_type=str)
print("Subword tokens:", tokens)  # Example output: ['un', 'seen', 'word']
2.3 Contextual Embeddings:
- Models like BERT and GPT embed words in context, dynamically generating embeddings for OOV words based on their surrounding text.
Example with BERT:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Tokenize and compute contextual embeddings
inputs = tokenizer("unseenword in context", return_tensors="pt")
outputs = model(**inputs)
print("Embedding for 'unseenword':", outputs.last_hidden_state)
Advantages of Subword Approaches:
- Handles OOV Words: Enables representation for previously unseen terms.
- Morphological Awareness: Captures relationships between root words and their variants (e.g., “run” and “running”).
- Reduced Vocabulary Size: Limits the need for massive vocabularies.
Real-World Applications:
- Social Media Analysis: Handling hashtags, abbreviations, and new terms.
- Language Translation: Managing rare or compound words in low-resource languages.
- Text Classification: Ensuring robust feature representation in noisy datasets.
Subword-based approaches like fastText, BPE, and contextual embeddings have significantly mitigated the OOV problem, allowing NLP systems to handle diverse and evolving text data effectively.
3. Key Action Points & Best Practices
3.1. Choosing the Right Preprocessing Steps
Text preprocessing is a critical phase in NLP, as the choices made here can significantly impact the performance of downstream tasks. Deciding which preprocessing steps to use depends on the specific requirements and goals of the task. Among the common questions are when to use stemming versus lemmatization and whether to remove stop words.
Sub-Contents:
- When Do You Need Stemming vs. Lemmatization?
- Should You Remove Stop Words for All Tasks (Like Sentiment Analysis)?
Making Informed Choices in Text Preprocessing
1. When Do You Need Stemming vs. Lemmatization?
Both stemming and lemmatization aim to reduce words to their base or root forms, but the choice depends on the task’s sensitivity to linguistic correctness.
Stemming:
- Use When:
- Computational efficiency is critical.
- The task tolerates approximate root forms (e.g., topic modeling or search engines).
- Example Use Case: In document retrieval, stemming helps match variations of a word without requiring perfect grammatical correctness (e.g., “run” matches “running” or “runner”).
Example Code for Stemming:
from nltk.stem import PorterStemmer
text = ["running", "ran", "runs"]
stemmer = PorterStemmer()
print([stemmer.stem(word) for word in text])  # Output: ['run', 'ran', 'run']
Lemmatization:
- Use When:
- Accuracy and linguistic precision matter.
- Tasks require distinguishing between syntactically different words (e.g., sentiment analysis or machine translation).
- Example Use Case: In sentiment analysis, lemmatization ensures correct interpretation of words like “better” (as the comparative form of “good”) instead of stemming it to an unrelated base.
Example Code for Lemmatization:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
text = [("running", "v"), ("better", "a")]
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(word, pos=pos) for word, pos in text])
# Output: ['run', 'good']
2. Should You Remove Stop Words for All Tasks (Like Sentiment Analysis)?
Stop word removal is a common preprocessing step but is not universally beneficial. Its applicability depends on the task’s reliance on the contextual or syntactic significance of stop words.
When to Remove Stop Words:
- Text Classification/Topic Modeling: Stop words often add noise without providing discriminatory value.
- Example Use Case: In spam detection, removing “the,” “is,” or “and” reduces noise and focuses on domain-specific keywords.
Example Code for Removing Stop Words:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "This is an example sentence for preprocessing."
stop_words = set(stopwords.words("english"))
filtered_text = [word for word in word_tokenize(text) if word.lower() not in stop_words and word.isalpha()]
print(filtered_text)  # Output: ['example', 'sentence', 'preprocessing']
When to Keep Stop Words:
- Sentiment Analysis: Stop words like “not,” “but,” and “although” often carry significant sentiment-modifying information.
- Example Use Case: In the sentence “This movie is not good,” removing “not” would invert the sentiment.
Hybrid Approach:
- For certain tasks, remove non-sentiment-modifying stop words while retaining crucial ones like “not” or “very.”
- Example: Define a custom stop word list based on task requirements.
Example Code for Hybrid Stop Word Removal:
custom_stop_words = set(stopwords.words("english")) - {"not", "very"}
filtered_text = [word for word in word_tokenize(text) if word.lower() not in custom_stop_words]
print(filtered_text)
Real-World Applications:
- Search Engines: Stemming improves query matching by reducing word variations.
- Sentiment Analysis: Retaining stop words ensures nuanced sentiment expressions are preserved.
- Topic Modeling: Removing stop words highlights meaningful keywords for clustering.
Summary:
- Stemming is ideal for speed and approximate matches, while lemmatization ensures linguistic precision.
- Stop word removal is task-dependent: use it for reducing noise in classification tasks, but retain important modifiers for sentiment or contextual understanding.
By tailoring preprocessing steps to the specific task, you ensure a balance between computational efficiency and the preservation of meaningful information.
3.2. Balancing Vocabulary Size & Coverage
Building a vocabulary is a critical step in NLP, as it determines how text is represented and processed. However, there is a trade-off between having a large vocabulary to maximize coverage and the risk of data sparsity, which can negatively impact model performance. Understanding and managing these trade-offs ensures efficient and effective text processing.
Sub-Contents:
- Trade-Offs in Building a Big Vocabulary vs. Risk of Data Sparsity
- Use of Frequency Thresholds and Handling Rare Tokens
Strategies for Balancing Vocabulary Size and Coverage
1. Trade-Offs in Building a Big Vocabulary vs. Risk of Data Sparsity
Big Vocabulary:
- Advantages:
- High coverage: Captures more unique words and rare terms.
- Useful for tasks requiring domain-specific or nuanced terms (e.g., legal or medical NLP).
- Disadvantages:
- Increased dimensionality: Leads to sparse representations, making models computationally expensive.
- Overfitting risk: Rare or noisy words can introduce irrelevant features.
Small Vocabulary:
- Advantages:
- Lower dimensionality: Reduces computational cost and sparsity.
- Generalization: Focuses on frequent terms, reducing noise.
- Disadvantages:
- Loss of information: Excludes rare but meaningful words.
- Limited flexibility: Fails to handle OOV words effectively.
Illustration of Vocabulary Size Trade-Offs:
- Big Vocabulary: Includes terms like “genomics,” “enzymes,” but risks sparsity in a general corpus.
- Small Vocabulary: Focuses on common terms like “health,” “study,” but loses domain-specific context.
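One simple way to see the trade-off is to cap the vocabulary size directly; scikit-learn's CountVectorizer exposes this via max_features. A sketch with a made-up corpus:
from sklearn.feature_extraction.text import CountVectorizer
docs = [
    "genomics and enzymes in health studies",
    "health study results published",
    "new health study announced"
]  # toy corpus (assumption)
full = CountVectorizer().fit(docs)
capped = CountVectorizer(max_features=3).fit(docs)  # keep only the 3 most frequent terms
print(len(full.get_feature_names_out()), "->", len(capped.get_feature_names_out()))
print(capped.get_feature_names_out())  # e.g. ['and', 'health', 'study']; rare domain terms like 'genomics' are lost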
2. Use of Frequency Thresholds and Handling Rare Tokens
Frequency Thresholds:
- Set minimum and maximum document frequencies to filter out rare and overly frequent terms.
Example:
- Minimum Frequency: Remove terms appearing in fewer than 2 documents.
- Maximum Frequency: Remove terms appearing in over 80% of documents (e.g., stop words).
Example Code for Frequency Filtering:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"Natural language processing is fascinating",
"Language models are improving daily",
"Deep learning and language are interconnected"
]
vectorizer = CountVectorizer(min_df=2, max_df=0.8)  # Filter rare and overly frequent terms
bow_matrix = vectorizer.fit_transform(corpus)
print("Filtered Vocabulary:", vectorizer.get_feature_names_out())
Handling Rare Tokens:
- Use Special Tokens: Replace rare terms with a generic <UNK> token to limit vocabulary size. This helps models generalize by grouping unseen words under a single token.
Example Code:
rare_words = {"genomics", "enzymes"}
text = "The study of genomics and enzymes is complex."
processed_text = " ".join([word if word not in rare_words else "<UNK>" for word in text.split()])
print(processed_text)  # Output: "The study of <UNK> and <UNK> is complex."
- Subword Approaches: Methods like Byte Pair Encoding (BPE) and fastText decompose words into subwords, enabling embeddings for rare or unseen words and reducing the need for large vocabularies while maintaining coverage.
Example Using SentencePiece:
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="bpe_model.model")  # Assumes a trained SentencePiece model
tokens = sp.encode("genomics", out_type=str)
print("Subword tokens:", tokens)  # Example output: ['gen', 'omics']
- Cluster Rare Words: Cluster rare words based on semantic similarity or context to reduce vocabulary size without losing meaning.
Real-World Applications:
- Search Engines: Use small vocabularies with frequency thresholds to improve query matching efficiency.
- Domain-Specific NLP: Employ large vocabularies with subword methods for fields like medical or legal text processing.
- Text Classification: Balance vocabulary size to avoid sparsity in feature representations.
Summary:
- Big Vocabulary: Ensures high coverage but risks sparsity and inefficiency.
- Small Vocabulary: Reduces noise and computational cost but may lose domain-specific nuances.
- Optimal Strategy: Combine frequency thresholds, subword methods, and clustering to balance size and coverage effectively.
Tailoring vocabulary size to the task and dataset ensures robust text representations while maintaining computational efficiency.
3.3. Evaluating Representations
The choice of text representation significantly impacts the performance of NLP tasks. Techniques like Bag-of-Words (BoW), TF-IDF, and word embeddings each have strengths and weaknesses. Evaluating these representations for a specific task involves comparing their effectiveness based on interpretability, computational efficiency, and model performance.
Sub-Contents:
- How to Compare BoW, TF-IDF, and Word Embeddings for a Given Task
- Potential for Combining Multiple Approaches (e.g., TF-IDF + Embeddings)
Evaluating and Combining Text Representations in NLP
1. How to Compare BoW, TF-IDF, and Word Embeddings for a Given Task
Criteria for Comparison:
| Criteria | BoW | TF-IDF | Word Embeddings |
|---|---|---|---|
| Dimensionality | High (sparse vectors) | High (sparse vectors) | Low (dense vectors) |
| Context Sensitivity | None | None | Context-aware (with pretrained embeddings) |
| Semantic Awareness | None | Partial (term importance) | Strong (captures word relationships) |
| Interpretability | High | High | Moderate |
| Computational Cost | Low | Moderate | High (especially for training embeddings) |
Task-Specific Comparison:
- Text Classification (e.g., spam detection):
- BoW and TF-IDF: Perform well for simple tasks where the exact term occurrence or importance is critical.
- Word Embeddings: Provide richer semantic information, especially for nuanced classification (e.g., sarcasm detection).
- Comparison:
- Use TF-IDF when the importance of rare terms matters.
- Use embeddings for tasks requiring understanding of synonyms or semantic relationships.
- Sentiment Analysis:
- BoW and TF-IDF: Struggle to capture negations or nuanced expressions (e.g., “not bad”).
- Word Embeddings: Context-aware embeddings excel due to their ability to consider relationships and surrounding context.
- Example: Pretrained models like BERT outperform traditional methods in sentiment analysis.
- Topic Modeling:
- BoW and TF-IDF: Effective in clustering documents into topics, especially in models like LDA.
- Word Embeddings: Can be combined with dimensionality reduction (e.g., PCA) to create dense topic vectors.
- Recommendation:
- Start with TF-IDF for topic modeling and consider embeddings for deeper semantic insights.
Evaluation Metrics:
- Classification Tasks: Accuracy, precision, recall, F1-score.
- Clustering/Topic Modeling: Coherence score, silhouette score.
- Semantic Similarity Tasks: Cosine similarity, Spearman/Pearson correlation with human-annotated data.
Example Evaluation Pipeline for Text Classification:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
# Example dataset
corpus = ["I love this product", "This is bad", "Amazing quality", "Not great"]
labels = [1, 0, 1, 0]  # Sentiment labels: 1 = positive, 0 = negative
# Split data
X_train, X_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2)
# BoW representation
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)
# TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
# Train and evaluate a model
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
predictions = model.predict(X_test_tfidf)
print("Accuracy with TF-IDF:", accuracy_score(y_test, predictions))
2. Potential for Combining Multiple Approaches
Why Combine Approaches?
- Different representations capture different aspects of text (e.g., term frequency vs. semantics).
- Combining features can enhance model performance.
Example: TF-IDF + Embeddings
- Concatenate TF-IDF vectors with word embeddings to create hybrid representations.
- This approach leverages the interpretability of TF-IDF and the semantic richness of embeddings.
Example Code for Combining Representations:
import numpy as np
# Generate TF-IDF features
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus).toarray()
# Generate Word2Vec embeddings and average them per document
word2vec = Word2Vec([doc.split() for doc in corpus], vector_size=50, window=2, min_count=1)
X_embeddings = np.array([np.mean([word2vec.wv[word] for word in doc.split() if word in word2vec.wv] or [np.zeros(50)], axis=0) for doc in corpus])
# Combine features
X_combined = np.hstack([X_tfidf, X_embeddings])
print("Shape of combined representation:", X_combined.shape)
Real-World Applications:
- Text Classification: Combine TF-IDF for term frequency and embeddings for semantic context.
- Sentiment Analysis: Use embeddings to handle nuances and combine with TF-IDF for rare word importance.
- Topic Modeling: Pair TF-IDF with embedding-based clustering for dense topic representations.
Summary:
- BoW: Simple and interpretable, suitable for basic tasks.
- TF-IDF: Highlights important terms, ideal for tasks where frequency matters.
- Word Embeddings: Capture rich semantics, best for complex or nuanced tasks.
- Hybrid Approaches: Combine methods to balance simplicity, interpretability, and semantic richness.
The best representation depends on task requirements, dataset characteristics, and available computational resources.
3.4. Practical Considerations
When building or selecting text representations, various practical challenges come into play, such as memory constraints, the use of pre-trained embeddings versus custom training, and domain adaptation for specialized tasks. Careful planning can help mitigate these challenges while optimizing for accuracy and efficiency.
Sub-Contents:
- Memory Constraints for Large Vocabularies
- Pre-Trained Embeddings (e.g., GloVe, Word2Vec) vs. Custom Training
- Domain Adaptation (Specialized Text in Medical or Financial Domains)
Addressing Practical Challenges in Text Representation
1. Memory Constraints for Large Vocabularies
Challenges:
- Large vocabularies increase memory usage and computational complexity.
- Sparse representations like BoW and TF-IDF can lead to inefficient storage.
Solutions:
- Use Subword Models: Reduce vocabulary size by breaking words into subword units using methods like Byte Pair Encoding (BPE) or SentencePiece.
Example:
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="bpe.model")  # Assumes a trained SentencePiece model
tokens = sp.encode("unseenword", out_type=str)
print("Subword tokens:", tokens)  # Example output: ['un', 'seen', 'word']
- Apply Frequency Thresholds: Exclude rare or overly common terms to limit vocabulary size.
Example Code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=2, max_df=0.8)  # Adjust thresholds to the corpus
- Use Dimensionality Reduction: Apply techniques like Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) to reduce feature dimensions.
2. Pre-Trained Embeddings (e.g., GloVe, Word2Vec) vs. Custom Training
Pre-Trained Embeddings:
- Advantages:
- Save time and resources.
- Provide robust semantic representations for general language use.
- Well-suited for tasks in domains where text resembles general corpora (e.g., social media or news).
- Disadvantages:
- May not capture domain-specific nuances (e.g., medical or legal terms).
Custom Training:
- Advantages:
- Tailored to the specific corpus, capturing specialized vocabulary and context.
- Enables domain-specific applications.
- Disadvantages:
- Computationally expensive and requires large, high-quality corpora.
- Risks overfitting on small datasets.
Recommendation:
- Use pre-trained embeddings as a baseline and fine-tune them on the domain-specific corpus if necessary.
Example Code:
from gensim.models import Word2Vec, KeyedVectors

# Load pre-trained Word2Vec vectors
pretrained = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Train a model on the (toy) custom corpus; real fine-tuning needs a much larger domain corpus
custom_corpus = [["domain", "specific", "text"]]
custom_model = Word2Vec(custom_corpus, vector_size=300, window=5, min_count=1)

# Add the pre-trained vocabulary, seed overlapping words with the pre-trained vectors,
# then continue training on the custom corpus.
# Note: the exact API (vocab attributes, intersect_word2vec_format location, lockf handling)
# differs between gensim 3.x and 4.x; treat this as a sketch of the workflow.
custom_model.build_vocab([list(pretrained.key_to_index)], update=True)
custom_model.wv.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, lockf=1.0)
custom_model.train(custom_corpus, total_examples=len(custom_corpus), epochs=custom_model.epochs)
3. Domain Adaptation (Specialized Text in Medical or Financial Domains)
Challenges:
- General embeddings fail to capture the domain-specific meanings of terms (e.g., “depression” in mental health vs. economics).
- Rare terms and unique phrases dominate specialized text.
Solutions:
- Fine-Tuning Pre-Trained Models: Fine-tune general embeddings on a domain-specific corpus to incorporate specialized knowledge.
Example: Fine-tune BERT on medical literature for tasks like clinical diagnosis; a sketch follows below.
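As a rough illustration of that workflow, here is a minimal, hedged sketch of continued masked-language-model pretraining of BERT on a domain corpus using the Hugging Face Trainer; the file name medical_corpus.txt and all hyperparameters are illustrative assumptions, not values from this guide:
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# medical_corpus.txt is a placeholder: one domain sentence (e.g., a clinical note) per line
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Randomly mask 15% of tokens so the model keeps learning domain vocabulary in context
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-domain", num_train_epochs=1, per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator).train()
The adapted checkpoint saved in bert-domain can then be fine-tuned further for a downstream task such as clinical classification.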
- Domain-Specific Pre-Trained Models: Use pre-trained models built on domain-specific corpora (e.g., BioBERT for biomedical texts, FinBERT for financial texts).
Example Code with Domain-Specific BERT:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
inputs = tokenizer("Patient shows symptoms of hypertension", return_tensors="pt")
outputs = model(**inputs)
- Building a Custom Corpus: Collect and preprocess domain-specific data for training custom embeddings or fine-tuning existing models.
- Embedding Augmentation: Combine domain-specific embeddings with general embeddings for tasks requiring broad and specialized knowledge, e.g. by concatenating the two vectors as sketched below.
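For the embedding-augmentation idea, here is a minimal sketch assuming two already-trained gensim KeyedVectors objects; general_kv and domain_kv are hypothetical names for a general-purpose and a domain-specific embedding space, and words missing from either space fall back to zero vectors:
import numpy as np

def augmented_vector(word, general_kv, domain_kv):
    # Look the word up in each embedding space, falling back to zeros if absent
    general = general_kv[word] if word in general_kv else np.zeros(general_kv.vector_size)
    domain = domain_kv[word] if word in domain_kv else np.zeros(domain_kv.vector_size)
    # Concatenate the general and domain-specific views of the same word
    return np.concatenate([general, domain])

# Usage (assumes general_kv and domain_kv were loaded or trained elsewhere):
# vec = augmented_vector("hypertension", general_kv, domain_kv)
# print(vec.shape)  # (general_kv.vector_size + domain_kv.vector_size,)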
Real-World Applications:
- Medical NLP: Use domain-specific embeddings (e.g., BioBERT) for tasks like entity recognition or relation extraction in clinical data.
- Financial NLP: Fine-tune embeddings on financial news or reports for sentiment analysis or risk prediction.
- Legal NLP: Build custom embeddings for contract analysis or legal document classification.
Summary:
- Memory Constraints: Address with subword models, frequency thresholds, and dimensionality reduction.
- Pre-Trained vs. Custom Training: Use pre-trained embeddings for general tasks and fine-tune for domain-specific contexts.
- Domain Adaptation: Leverage domain-specific models or fine-tuning to handle specialized vocabulary and semantics effectively.
By strategically managing vocabulary size, leveraging pre-trained embeddings, and fine-tuning for domain adaptation, you can optimize NLP workflows for diverse applications.
4. Beyond the Fundamentals (Optional Extensions)
4.1. Subword Tokenization for Deep Learning
Subword tokenization is a powerful method used in NLP to address the challenges of out-of-vocabulary (OOV) words and capture morphological features effectively. Techniques like Byte Pair Encoding (BPE), WordPiece, and SentencePiece enable models to break words into smaller units, such as prefixes, suffixes, or character n-grams, ensuring efficient and flexible text representation.
Sub-Contents:
- Byte Pair Encoding (BPE), WordPiece, SentencePiece
- Reducing Out-of-Vocabulary Issues and Capturing Morphological Features
Subword Tokenization Techniques for Deep Learning
1. Byte Pair Encoding (BPE)
Concept:
- BPE is a data compression algorithm adapted for subword tokenization.
- Initially treats each character as a token, then iteratively merges the most frequent pairs of tokens to form new subwords.
Advantages:
- Captures common prefixes, suffixes, and roots (e.g., “un-”, “-ing”).
- Reduces vocabulary size while retaining the ability to represent rare words.
Steps in BPE:
- Start with characters as the initial vocabulary.
- Merge the most frequent adjacent pairs of symbols.
- Repeat until the desired vocabulary size is reached.
Example Code:
from tokenizers import Tokenizer, models, trainers
# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=1000)
# Train on a sample corpus
corpus = ["subword tokenization improves representation"]
tokenizer.train_from_iterator(corpus, trainer)
output = tokenizer.encode("tokenization")
print("Subword tokens:", output.tokens)  # Example: ['token', 'ization']
2. WordPiece
Concept:
- Used by models like BERT, WordPiece is similar to BPE but differs in how it selects token pairs to merge.
- Optimizes the likelihood of the training corpus rather than just frequency.
Advantages:
- Balances vocabulary size and coverage.
- Captures meaningful subword units better for semantic tasks.
Example Code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("tokenization")
print("WordPiece tokens:", tokens) Example: ['token', 'ization']
Key Feature:
- Uses the "##" prefix to denote subwords that are not the beginning of a word.
3. SentencePiece
Concept:
- SentencePiece is a subword tokenization library that treats input text as a continuous stream, avoiding the need for pre-tokenization (e.g., splitting by spaces).
- Supports models like GPT and T5.
Advantages:
- Works well for languages without clear word boundaries (e.g., Chinese, Japanese).
- Flexible: Supports BPE and unigram language models.
Example Code:
import sentencepiece as spm
# Train a SentencePiece model
spm.SentencePieceTrainer.train(input='data.txt', model_prefix='sp', vocab_size=1000)
sp = spm.SentencePieceProcessor(model_file='sp.model')
# Tokenize text
tokens = sp.encode("tokenization is crucial", out_type=str)
print("SentencePiece tokens:", tokens)  # Example: ['token', 'ization', 'is', 'crucial']
4. Reducing Out-of-Vocabulary Issues and Capturing Morphological Features
Reducing OOV Issues:
- Subword tokenization splits rare or unseen words into smaller units that are likely part of the vocabulary.
- Example: The word “tokenization” may be split into “token” and “ization,” ensuring it is not treated as OOV.
Capturing Morphological Features:
- Subword models retain meaningful prefixes, suffixes, and roots.
- Example: In “unhappiness,” subwords like “un-”, “happi-”, and “-ness” capture morphological structure.
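As a quick sanity check of this behavior, the BERT WordPiece tokenizer from the example above can be applied to a derived word; the exact split depends on the model's learned vocabulary, so the outputs noted in the comments are indicative only:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Derived or rare words are decomposed into pieces that expose prefixes and suffixes
print(tokenizer.tokenize("unhappiness"))   # pieces such as 'un' plus '##...' continuations, vocabulary-dependent
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']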
Why Subword Tokenization is Effective:
- Handling Rare Words: Allows even rare or new words to be represented as a combination of subwords.
- Morphological Awareness: Encodes semantic relationships between morphologically similar words.
- Reduced Vocabulary Size: Achieves compact and flexible representations without sacrificing expressiveness.
Real-World Applications:
- Machine Translation: Handles rare or unseen words in target languages.
- Language Modeling: Improves performance in morphologically rich languages.
- Named Entity Recognition (NER): Captures partial matches for rare or compound names.
Summary:
- Byte Pair Encoding (BPE): Merges frequent character pairs, focusing on frequency.
- WordPiece: Optimizes subword vocabulary for semantic tasks using likelihood maximization.
- SentencePiece: Offers flexibility and eliminates the need for pre-tokenization.
- Subword tokenization significantly reduces OOV issues and captures morphological features, making it essential for robust NLP models.
4.2. Contextual Embeddings
Contextual embeddings represent a major leap forward in NLP by capturing the meaning of words based on their context. Unlike static embeddings like Word2Vec and GloVe, which assign a single vector to each word, contextual embeddings provide dynamic representations that vary depending on the surrounding text.
Sub-Contents:
- Brief Mention of ELMo, BERT, GPT as Next-Level NLP Approaches
- Difference Between Static Embeddings (Word2Vec, GloVe) and Contextual Embeddings
Contextual Embeddings: A Preview of Advanced NLP
1. Brief Mention of ELMo, BERT, GPT as Next-Level NLP Approaches
ELMo (Embeddings from Language Models):
- Developed at the Allen Institute for AI (and released via the AllenNLP toolkit), ELMo generates word embeddings based on entire sentences.
- Outputs different embeddings for the same word in different contexts.
- Example: The word “bank” in “river bank” and “financial bank” gets distinct embeddings.
BERT (Bidirectional Encoder Representations from Transformers):
- Developed by Google, BERT uses a bidirectional Transformer to capture the context of a word from both its left and right sides.
- Excels in tasks requiring nuanced understanding, such as question answering and sentiment analysis.
- Pre-trained on large corpora and fine-tuned for specific tasks.
GPT (Generative Pre-trained Transformer):
- Focuses on text generation by predicting the next token in a sequence.
- Autoregressive model (unidirectional), but later versions like GPT-3 capture broader contexts effectively.
- Well-suited for creative and conversational tasks.
2. Difference Between Static Embeddings (Word2Vec, GloVe) and Contextual Embeddings
| Aspect | Static Embeddings (Word2Vec, GloVe) | Contextual Embeddings (ELMo, BERT, GPT) |
|---|---|---|
| Word Representation | Single vector per word | Dynamic vector based on context |
| Context Sensitivity | None | High (captures sentence-level nuances) |
| Training Approach | Trained on word co-occurrence | Trained on sentences or sequences |
| Use Cases | Basic tasks (e.g., text classification) | Advanced tasks (e.g., QA, summarization, NER) |
| Computational Cost | Low | High (requires powerful hardware for training) |
Example of Contextual Difference: Static embeddings:
- “bank” always has the same vector regardless of context.
Contextual embeddings:
- In ELMo/BERT:
- “bank” in “river bank” → Embedding represents the geographical meaning.
- “bank” in “financial bank” → Embedding represents the financial meaning.
Code Comparison: Static Embedding (Word2Vec):
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
vector = model["bank"]  # Same vector for all contexts
Contextual Embedding (BERT):
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The river bank is beautiful.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state)  # Context-dependent token embeddings
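To make the contrast concrete, here is a small sketch (assuming PyTorch and the same bert-base-uncased checkpoint) that extracts the embedding of "bank" from two different sentences and compares them with cosine similarity; for a static embedding the similarity would be exactly 1.0, whereas here it is typically well below that:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    # Return the contextual embedding of the token `word` within `sentence`
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

river = embed_word("The river bank is beautiful.", "bank")
finance = embed_word("The bank approved my loan.", "bank")
similarity = torch.cosine_similarity(river, finance, dim=0)
print("Cosine similarity between the two 'bank' embeddings:", similarity.item())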
Why Contextual Embeddings Matter:
- Dynamic Representations: Adjust to the specific context, enabling more accurate understanding.
- Sentence-Level Nuances: Capture relationships and meanings within a sentence.
- Advanced Applications: Power state-of-the-art models in tasks like machine translation, summarization, and conversational AI.
Summary:
- Static Embeddings (Word2Vec, GloVe): Assign fixed vectors to words, limited to capturing overall semantic similarity.
- Contextual Embeddings (ELMo, BERT, GPT): Generate context-sensitive vectors, capturing dynamic meanings and relationships.
- Contextual embeddings represent the evolution of NLP, enabling models to understand language more deeply and flexibly.
4.3. Sparse vs. Dense Representations
Representing text as vectors is fundamental to NLP. Sparse representations like Bag-of-Words (BoW) and TF-IDF focus on frequency-based features, while dense representations, such as neural embeddings, use compact continuous vectors to encode semantic information. Choosing between these depends on the task, interpretability needs, and computational constraints.
Sub-Contents:
- Sparsity of Bag-of-Words and TF-IDF Vectors
- Dense Continuous Vectors in Neural Embeddings
- Pros and Cons for Interpretability and Performance
Comparing Sparse and Dense Representations in NLP
1. Sparsity of Bag-of-Words and TF-IDF Vectors
Sparse Representations:
- Represent text as high-dimensional vectors where most elements are zero.
- Bag-of-Words (BoW):
- A vector indicating the frequency of each word in a vocabulary.
- Example: “I love NLP” → [1, 1, 1, 0, 0, …]
- TF-IDF:
- A weighted version of BoW, emphasizing rare words and reducing the impact of common words.
Sparsity Example:
- Vocabulary size: 10,000 words.
- Document: 10 unique words.
- BoW vector: 10 non-zero values, 9,990 zeros.
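The same effect can be seen on a toy corpus with scikit-learn's CountVectorizer; the vocabulary here is tiny, but the zero-counting logic scales to the 10,000-word case above:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

corpus = ["I love NLP", "NLP is amazing"]
bow = CountVectorizer().fit_transform(corpus).toarray()
print("BoW matrix:\n", bow)                      # one row per document
print("Zeros per document:", (bow == 0).sum(axis=1))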
Code Example for TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["I love NLP", "NLP is amazing"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
2. Dense Continuous Vectors in Neural Embeddings
Dense Representations:
- Represent words as low-dimensional vectors, typically 50–300 dimensions.
- Neural Embeddings:
- Word2Vec, GloVe, fastText, and contextual embeddings (e.g., BERT, GPT).
- Encode semantic relationships and contextual meanings.
Dense Vector Example:
- Word “love” → [0.5, -0.2, 0.8, …]
Code Example for Word Embeddings:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
vector = model["love"]
print("Dense vector for 'love':", vector)
3. Pros and Cons for Interpretability and Performance
| Aspect | Sparse Representations | Dense Representations |
|---|---|---|
| Dimensionality | High (equal to vocabulary size) | Low (fixed dimensions, e.g., 300) |
| Sparsity | Most elements are zero | Compact and fully populated vectors |
| Interpretability | High (easily map to specific words) | Low (vectors are abstract representations) |
| Semantic Information | Minimal (no semantic relationships) | Rich (captures similarity and context) |
| Performance | Computationally efficient for simple models | Higher performance in complex tasks |
| Use Cases | Text classification, topic modeling | Sentiment analysis, language understanding |
Choosing the Right Representation:
- When to Use Sparse Representations: BoW or TF-IDF is suitable for:
  - Small datasets where overfitting is a concern.
  - Applications requiring interpretability (e.g., understanding keyword relevance).
  - Example: Topic modeling using Latent Dirichlet Allocation (LDA).
- When to Use Dense Representations: Neural embeddings excel for:
  - Large datasets requiring deep semantic understanding.
  - Complex tasks like sentiment analysis, named entity recognition, or machine translation.
  - Example: Pretrained BERT for question answering.
- Hybrid Approaches: Combine sparse (TF-IDF) and dense (embeddings) features for enhanced performance.
- Example Code for Combining Representations (reusing tfidf_matrix and the pre-trained Word2Vec model from the snippets above):
import numpy as np

# Generate TF-IDF and dense features
tfidf_vectors = tfidf_matrix.toarray()
dense_vector = model["NLP"]  # example dense word vector
# Repeat the dense vector per document so the shapes align; in practice you would
# average each document's word vectors instead, as shown earlier
dense_vectors = np.tile(dense_vector, (tfidf_vectors.shape[0], 1))

# Combine features
combined_features = np.hstack([tfidf_vectors, dense_vectors])
print("Combined features shape:", combined_features.shape)
Real-World Applications:
- Search Engines: Sparse representations like TF-IDF for quick and interpretable results.
- Chatbots: Dense embeddings for understanding user intent in conversations.
- Document Classification: Hybrid approaches to balance interpretability and performance.
Summary:
- Sparse Representations: High interpretability, suitable for simple and interpretable models.
- Dense Representations: Rich semantic capture, enabling state-of-the-art performance in advanced tasks.
- Balancing interpretability and performance often requires careful selection or hybridization of these methods, depending on the task and dataset.
4.4. Domain-Specific Vocabularies
Domain-specific vocabularies enhance NLP systems by tailoring text preprocessing and representation to the unique characteristics of specialized fields like finance, legal, or healthcare. Customizing stop words, synonyms, and dictionaries ensures that the system captures relevant context and semantics while filtering out noise.
Sub-Contents:
- Tailoring Stop Words
- Incorporating Domain-Specific Synonyms
- Using Specialized Dictionaries
Customizing Vocabularies for Domain-Specific NLP
1. Tailoring Stop Words
In domain-specific contexts, generic stop word lists (e.g., “the,” “and,” “is”) may need customization:
- Remove domain-relevant terms from stop word lists to preserve their meaning.
- Add high-frequency, contextually irrelevant words as domain-specific stop words.
Example:
- Generic Stop Words: “the,” “and,” “is.”
- Finance-Specific Additions: “company,” “shareholder,” “market.”
- Legal-Specific Additions: “hereby,” “whereas,” “therefore.”
Code Example:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")  # one-time download of the stop word list

# Customize stop words for finance
generic_stop_words = set(stopwords.words("english"))
finance_stop_words = generic_stop_words.union({"market", "stock", "price"})

# Filter stop words from text
text = "The stock price of the company is rising."
tokens = text.split()
filtered_tokens = [word for word in tokens if word.lower() not in finance_stop_words]
print(filtered_tokens)  # Output: ['company', 'rising.']
2. Incorporating Domain-Specific Synonyms
Synonyms play a significant role in NLP by aligning variations of terms to a consistent representation.
Example:
- Healthcare Synonyms: “myocardial infarction” ↔ “heart attack.”
- Legal Synonyms: “plaintiff” ↔ “complainant.”
- Finance Synonyms: “equities” ↔ “stocks.”
Applications:
- Synonym normalization ensures that different terms for the same concept are treated uniformly.
- Helps in search engines, chatbots, and information retrieval.
Code Example:
# Define a synonym mapping (multi-word phrases map to a canonical term)
synonym_dict = {"heart attack": "myocardial infarction", "stocks": "equities"}

# Normalize synonyms in text via phrase-level replacement
text = "The patient suffered a heart attack."
normalized_text = text
for phrase, canonical in synonym_dict.items():
    normalized_text = normalized_text.replace(phrase, canonical)
print(normalized_text)  # Output: "The patient suffered a myocardial infarction."
3. Using Specialized Dictionaries
Specialized dictionaries capture domain-specific terminology, jargon, and abbreviations. These are crucial for:
- Accurate entity recognition.
- Domain-specific classification and clustering.
- Context-aware sentiment analysis.
Examples of Specialized Dictionaries:
- Healthcare: ICD-10 codes, SNOMED-CT terms.
- Legal: Black’s Law Dictionary for legal definitions.
- Finance: Stock ticker symbols, industry-specific terms.
Code Example for Healthcare Dictionary:
import re

# Example dictionary for medical abbreviations
medical_dict = {"MI": "myocardial infarction", "HTN": "hypertension"}

# Replace abbreviations in text (word boundaries keep surrounding punctuation intact)
text = "The patient has HTN and MI."
expanded_text = text
for abbr, full_term in medical_dict.items():
    expanded_text = re.sub(rf"\b{abbr}\b", full_term, expanded_text)
print(expanded_text)  # Output: "The patient has hypertension and myocardial infarction."
Benefits of Domain-Specific Vocabularies:
- Improved Relevance: Captures domain-specific nuances and avoids losing critical terms.
- Enhanced Accuracy: Reduces noise by filtering irrelevant terms.
- Consistency: Normalizes synonymous and abbreviated terms, aligning them to consistent representations.
Real-World Applications:
- Healthcare NLP:
- Medical entity recognition (e.g., diseases, symptoms, treatments).
- Clinical note summarization and diagnosis prediction.
- Finance NLP:
- Sentiment analysis of earnings reports or financial news.
- Stock price prediction based on news or social media.
- Legal NLP:
- Contract analysis for key clause extraction.
- Case law retrieval and summarization.
Summary:
- Tailored Stop Words: Refine stop word lists to remove irrelevant yet frequent terms specific to the domain.
- Domain Synonyms: Align synonymous terms for consistent and accurate processing.
- Specialized Dictionaries: Use domain-specific resources to enhance context-aware text representation.
- Custom vocabularies significantly improve the accuracy, relevance, and interpretability of NLP systems in specialized fields.
5. NLP Text Representation Cheat Sheet
1. Text Normalization
- Lowercasing: Convert text to lowercase for uniformity.
- Remove Punctuation/Numbers: Remove non-textual elements for noise reduction.
- Handling Contractions: Expand “can’t” → “cannot” for consistency.
- Spelling Correction: Fix typos using libraries like spellchecker.
2. Tokenization
- Word-Level: Split text by spaces/punctuation (e.g., “I love NLP”).
- Subword Tokenization:
- BPE: Merge frequent character pairs (e.g., “tokenization” → [“token”, “ization”]).
- WordPiece: Optimizes likelihood (used in BERT).
- SentencePiece: No pre-tokenization; handles languages without spaces.
- Language-Specific Challenges: Use tools like Jieba for Chinese segmentation.
3. Stop Words Removal
- Generic Stop Words: “the,” “and,” “is.”
- Domain-Specific Stop Words: Add/remove based on domain (e.g., “market” in finance).
- When to Keep: Tasks like sentiment analysis (e.g., “not” impacts meaning).
4. Stemming & Lemmatization
- Stemming: Reduces words to approximate root (e.g., “running” → “run”). Faster but less accurate.
- Lemmatization: Converts to dictionary root using POS tags (e.g., “better” → “good”). Slower but precise.
5. Bag-of-Words (BoW)
- Represents text as a vector of word counts.
- Advantages: Simple, interpretable.
- Limitations: Ignores word order, high dimensionality.
6. N-Grams
- Sequences of n words:
  - Unigrams: [“I”, “love”].
  - Bigrams: [“I love”].
  - Trigrams: [“I love NLP”].
- Trade-Off: Higher n captures more context but increases sparsity.
7. TF-IDF
- Formula (worked example below):
- TF: \( \mathrm{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d} \).
- IDF: \( \mathrm{IDF}(t) = \log\left(\frac{\text{total docs}}{\text{docs containing } t}\right) \).
- TF-IDF score: \( \mathrm{TF}(t, d) \times \mathrm{IDF}(t) \).
- Use Case: Highlights rare but important words.
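A quick worked example (numbers chosen purely for illustration): suppose a term appears 3 times in a 100-word document and occurs in 10 of 1,000 documents.
\( \mathrm{TF} = \frac{3}{100} = 0.03, \qquad \mathrm{IDF} = \log\!\left(\frac{1000}{10}\right) \approx 4.61 \ (\text{natural log}), \qquad \mathrm{TF\text{-}IDF} \approx 0.03 \times 4.61 \approx 0.14 \)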
8. Word Embeddings
- Word2Vec: Predicts context (Skip-gram) or word (CBOW).
- GloVe: Captures global co-occurrence statistics.
- fastText: Embeds subwords for OOV handling.
- Analogy Task: “king” - “man” + “woman” = “queen.”
9. Out-of-Vocabulary (OOV) Handling
- Classic Methods: Replace unknown words with an <UNK> token.
- Subword Models: BPE, SentencePiece, fastText split rare words into subunits.
10. Sparse vs. Dense Representations
- Sparse Representations:
- BoW, TF-IDF.
- High dimensionality, interpretable.
- Dense Representations:
- Word embeddings, BERT.
- Low dimensionality, captures semantics.
11. Domain-Specific Vocabularies
- Tailored Stop Words: Add/remove based on domain (e.g., “market” in finance).
- Synonyms: Align terms (e.g., “equities” ↔ “stocks”).
- Specialized Dictionaries: Use resources like medical ontologies (e.g., SNOMED-CT).
12. Subword Tokenization
- BPE: Merges frequent character pairs.
- WordPiece: Used in BERT, optimizes likelihood.
- SentencePiece: Handles languages without spaces.
13. Contextual Embeddings
- ELMo: Contextualized word embeddings based on sentences.
- BERT: Bi-directional context for NLP tasks.
- GPT: Unidirectional model for text generation.
- Difference from Static Embeddings:
- Contextual embeddings vary based on usage (e.g., “bank” in “river bank” vs. “financial bank”).
14. Practical Considerations
- Memory Constraints: Use subword models, frequency thresholds.
- Pre-Trained vs. Custom Embeddings:
- Pre-trained saves time but lacks domain-specific nuances.
- Fine-tune for domain adaptation.
- Domain Adaptation:
- Use BioBERT (healthcare), FinBERT (finance).
- Expand vocabularies with specialized terms.
Quick Tips
- Use lemmatization over stemming for accuracy.
- Combine TF-IDF + embeddings for hybrid representations.
- Choose dense embeddings for semantic-rich tasks (e.g., sentiment analysis).
- Fine-tune contextual embeddings for domain-specific NLP.
This cheat sheet provides a concise overview of fundamental text preprocessing and representation techniques to optimize NLP workflows!