LLM Mini Projects: Hands-On Applications of Large Language Models
Raj Shaikh

1. RAG System
1.1. Encoder
Let’s start with the Encoder module for a Retrieval-Augmented Generation (RAG) system. The Encoder is responsible for converting text data (documents or queries) into vector representations (embeddings) that are used for efficient similarity matching during retrieval.
Here’s a step-by-step plan for the Encoder module:
Key Components of the Encoder:
- Pre-trained Language Model: Use a model like `sentence-transformers`, HuggingFace Transformers, or similar, to encode documents into embeddings.
- Input Handling: Manage inputs (e.g., document text, query text).
- Vectorization: Generate fixed-size dense embeddings.
- Storage Integration: Optionally save embeddings for later retrieval using vector databases like Pinecone, Weaviate, or FAISS.
Let’s write the code for this module.
Code: Encoder Module
# Import required libraries
from sentence_transformers import SentenceTransformer
import numpy as np
import os
import json
class Encoder:
"""
Encoder class for converting text data into dense vector representations.
This module uses a pre-trained SentenceTransformer model to generate embeddings for
documents or queries, enabling efficient similarity-based retrieval.
"""
def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
"""
Initialize the Encoder with a pre-trained model.
Parameters:
- model_name (str): The name of the pre-trained model to load from SentenceTransformers.
"""
self.model_name = model_name
self.model = SentenceTransformer(model_name)
print(f"Encoder initialized with model: {self.model_name}")
def encode(self, texts: list, batch_size: int = 32) -> np.ndarray:
"""
Convert a list of text documents into dense vector embeddings.
Parameters:
- texts (list): A list of text strings to encode.
- batch_size (int): Batch size for encoding to optimize speed and memory usage.
Returns:
- np.ndarray: A 2D array where each row corresponds to the embedding of a text.
"""
if not isinstance(texts, list) or not all(isinstance(text, str) for text in texts):
raise ValueError("Input 'texts' must be a list of strings.")
embeddings = self.model.encode(texts, batch_size=batch_size, show_progress_bar=True)
return np.array(embeddings)
def save_embeddings(self, embeddings: np.ndarray, output_path: str):
"""
Save embeddings to a file for later retrieval.
Parameters:
- embeddings (np.ndarray): The embeddings to save.
- output_path (str): The path to save the embeddings file.
"""
os.makedirs(os.path.dirname(output_path), exist_ok=True)
np.save(output_path, embeddings)
print(f"Embeddings saved to: {output_path}")
def load_embeddings(self, input_path: str) -> np.ndarray:
"""
Load embeddings from a saved file.
Parameters:
- input_path (str): The path of the saved embeddings file.
Returns:
- np.ndarray: The loaded embeddings.
"""
if not os.path.exists(input_path):
raise FileNotFoundError(f"Embeddings file not found: {input_path}")
embeddings = np.load(input_path)
print(f"Embeddings loaded from: {input_path}")
return embeddings
def save_texts(self, texts: list, output_path: str):
"""
Save the original texts corresponding to the embeddings for reference.
Parameters:
- texts (list): The list of text documents.
- output_path (str): The path to save the texts file (as JSON).
"""
os.makedirs(os.path.dirname(output_path), exist_ok=True)
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(texts, f, ensure_ascii=False, indent=4)
print(f"Texts saved to: {output_path}")
def load_texts(self, input_path: str) -> list:
"""
Load the original texts from a saved file.
Parameters:
- input_path (str): The path of the saved texts file.
Returns:
- list: The list of text documents.
"""
if not os.path.exists(input_path):
raise FileNotFoundError(f"Texts file not found: {input_path}")
with open(input_path, 'r', encoding='utf-8') as f:
texts = json.load(f)
print(f"Texts loaded from: {input_path}")
return texts
# Example usage
if __name__ == "__main__":
    # Initialize encoder
    encoder = Encoder()

    # Sample data
    documents = [
        "Machine learning is fascinating.",
        "Natural Language Processing is a key area of AI.",
        "RAG combines retrieval and generation for better results."
    ]

    # Generate embeddings
    embeddings = encoder.encode(documents)

    # Save embeddings and texts
    encoder.save_embeddings(embeddings, "data/embeddings.npy")
    encoder.save_texts(documents, "data/texts.json")

    # Load embeddings and texts
    loaded_embeddings = encoder.load_embeddings("data/embeddings.npy")
    loaded_texts = encoder.load_texts("data/texts.json")

    print("Loaded embeddings shape:", loaded_embeddings.shape)
    print("Loaded texts:", loaded_texts)
Explanation of the Code
- Initialization:
  - The `Encoder` class initializes with a pre-trained model from `sentence-transformers`.
  - Default model: `'all-MiniLM-L6-v2'` (lightweight and efficient).
- Encoding:
  - The `encode` method accepts a list of strings and generates embeddings using the pre-trained model.
  - Batch processing ensures memory-efficient encoding.
- Saving and Loading:
  - Embeddings and their corresponding texts are saved to disk for later use. Embeddings are saved in `.npy` format (NumPy arrays), while texts are saved as JSON.
- Error Handling:
  - Input validation ensures that the input is a list of strings.
  - File existence checks handle cases where the user tries to load missing files.
- Example Workflow:
  - Sample documents are encoded, saved, and reloaded to demonstrate the module's functionality.
1.2. Vector DB
Now that the Encoder module is ready, let’s move forward and integrate it with a vector database for efficient similarity search. We’ll use FAISS (Facebook AI Similarity Search), which is a popular open-source library for efficient similarity search and clustering of dense embeddings.
Key Steps for Vector Database Integration
- FAISS Setup:
  - Use FAISS to store and search embeddings.
  - Configure an index for similarity search (e.g., `IndexFlatL2`, which uses L2/Euclidean distance; for cosine similarity, normalize the embeddings and use an inner-product index, as sketched after this list).
- Index Management:
  - Add encoded embeddings to the FAISS index.
  - Perform similarity queries on the index.
- Persistence:
  - Save the FAISS index to disk for reuse.
  - Load the index when needed.
- Integration:
  - Combine FAISS with the Encoder module to create a seamless retrieval pipeline.
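If you want cosine similarity rather than raw L2 distance, a minimal sketch (illustrative only; the random data and dimensions here are placeholders, not part of the pipeline) is to L2-normalize the embeddings and use an inner-product index:

import faiss
import numpy as np

# Illustrative only: 5 random "document" vectors of dimension 384
embeddings = np.random.rand(5, 384).astype("float32")

# Normalize in place so that inner product equals cosine similarity
faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner-product index
index.add(embeddings)

query = np.random.rand(1, 384).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=3)  # higher score = more similar
print(scores, ids)

The rest of this section keeps `IndexFlatL2` for simplicity.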
Code: FAISS Integration
import os

import faiss
import numpy as np
class VectorDatabase:
"""
Vector Database using FAISS for similarity search.
"""
def __init__(self, embedding_dim: int):
"""
Initialize the FAISS vector database.
Parameters:
- embedding_dim (int): The dimensionality of the embeddings.
"""
self.embedding_dim = embedding_dim
        self.index = faiss.IndexFlatL2(embedding_dim)  # L2 distance for similarity search
print(f"FAISS index initialized with embedding dimension: {self.embedding_dim}")
def add_embeddings(self, embeddings: np.ndarray):
"""
Add embeddings to the FAISS index.
Parameters:
- embeddings (np.ndarray): A 2D array of embeddings to add.
"""
if embeddings.shape[1] != self.embedding_dim:
raise ValueError("Embedding dimension does not match the index configuration.")
self.index.add(embeddings)
print(f"Added {embeddings.shape[0]} embeddings to the index.")
def search(self, query_embeddings: np.ndarray, k: int = 5):
"""
Search for the top-k similar embeddings.
Parameters:
- query_embeddings (np.ndarray): A 2D array of query embeddings.
- k (int): Number of nearest neighbors to retrieve.
Returns:
- distances (np.ndarray): Distances of the top-k neighbors.
- indices (np.ndarray): Indices of the top-k neighbors.
"""
if query_embeddings.shape[1] != self.embedding_dim:
raise ValueError("Query embedding dimension does not match the index configuration.")
distances, indices = self.index.search(query_embeddings, k)
return distances, indices
def save_index(self, file_path: str):
"""
Save the FAISS index to a file.
Parameters:
- file_path (str): Path to save the FAISS index.
"""
faiss.write_index(self.index, file_path)
print(f"FAISS index saved to: {file_path}")
def load_index(self, file_path: str):
"""
Load a FAISS index from a file.
Parameters:
- file_path (str): Path to load the FAISS index from.
"""
if not os.path.exists(file_path):
raise FileNotFoundError(f"FAISS index file not found: {file_path}")
self.index = faiss.read_index(file_path)
print(f"FAISS index loaded from: {file_path}")
# Example usage
if __name__ == "__main__":
    from sentence_transformers import SentenceTransformer

    # Initialize encoder and database
    encoder = Encoder()
    documents = [
        "Machine learning is fascinating.",
        "Natural Language Processing is a key area of AI.",
        "RAG combines retrieval and generation for better results.",
        "Deep learning is a subset of machine learning.",
        "Artificial Intelligence is transforming industries."
    ]

    # Encode documents
    embeddings = encoder.encode(documents)

    # Initialize FAISS database
    vector_db = VectorDatabase(embedding_dim=embeddings.shape[1])
    vector_db.add_embeddings(embeddings)

    # Save index
    vector_db.save_index("data/faiss_index.bin")

    # Load index
    vector_db.load_index("data/faiss_index.bin")

    # Perform a query
    query = ["What is machine learning?"]
    query_embeddings = encoder.encode(query)
    distances, indices = vector_db.search(query_embeddings, k=3)

    print("\nQuery:", query[0])
    print("\nTop 3 similar documents:")
    for i, idx in enumerate(indices[0]):
        print(f"{i + 1}. {documents[idx]} (Distance: {distances[0][i]:.4f})")
Explanation of the Code
- Initialization:
  - The `VectorDatabase` class initializes a FAISS index using `IndexFlatL2`, which computes the L2 (Euclidean) distance for similarity.
- Adding Embeddings:
  - The `add_embeddings` method accepts a NumPy array of embeddings and adds them to the index.
- Querying:
  - The `search` method takes query embeddings and returns the top-`k` similar embeddings from the index, along with their distances and indices.
- Persistence:
  - `save_index` and `load_index` handle saving and loading of the FAISS index to/from disk.
- Example Workflow:
  - Embeddings for sample documents are encoded and added to the FAISS index.
  - The index is saved and reloaded for demonstration.
  - A sample query is encoded, and the most similar documents are retrieved.
Output Example
For the query "What is machine learning?", the output might look like this:
Query: What is machine learning?
Top 3 similar documents:
1. Machine learning is fascinating. (Distance: 0.2374)
2. Deep learning is a subset of machine learning. (Distance: 0.3498)
3. Artificial Intelligence is transforming industries. (Distance: 0.5821)
1.3. Retrieval
Now, let’s integrate the retrieval process, where we use the FAISS index to fetch documents most relevant to a query. This module will bridge the Encoder and the Generator components in a RAG system.
Retrieval Integration Plan
- Retrieve Documents:
  - Use the FAISS index to find the most relevant documents for a given query.
  - Return both the text of the documents and the similarity scores.
- Document Storage:
  - Use a simple mechanism to store and retrieve documents (e.g., JSON or a database).
- Combine Retrieval and Encoding:
  - Encode the query using the Encoder module.
  - Use the FAISS index to retrieve relevant documents based on embeddings.
- Output:
  - Return a structured response containing relevant documents and their similarity scores.
Code: Retrieval Integration
class Retriever:
"""
Retriever class for fetching the most relevant documents using FAISS.
"""
def __init__(self, encoder: Encoder, vector_db: VectorDatabase, document_store_path: str):
"""
Initialize the Retriever with an Encoder, VectorDatabase, and Document Store.
Parameters:
- encoder (Encoder): An instance of the Encoder class.
- vector_db (VectorDatabase): An instance of the VectorDatabase class.
- document_store_path (str): Path to the stored documents (JSON file).
"""
self.encoder = encoder
self.vector_db = vector_db
self.document_store_path = document_store_path
self.documents = self.load_documents()
def load_documents(self):
"""
Load documents from the document store.
Returns:
- list: A list of documents loaded from the JSON file.
"""
if not os.path.exists(self.document_store_path):
raise FileNotFoundError(f"Document store file not found: {self.document_store_path}")
with open(self.document_store_path, 'r', encoding='utf-8') as f:
documents = json.load(f)
print(f"Loaded {len(documents)} documents from: {self.document_store_path}")
return documents
def retrieve(self, query: str, k: int = 5):
"""
Retrieve the top-k relevant documents for a query.
Parameters:
- query (str): The query string.
- k (int): Number of top documents to retrieve.
Returns:
- list of dict: A list of dictionaries containing 'document' and 'score'.
"""
        # Encode the query
        query_embedding = self.encoder.encode([query])

        # Search the vector database
        distances, indices = self.vector_db.search(query_embedding, k)

        # Collect the results
        results = []
        for i, idx in enumerate(indices[0]):
            document = self.documents[idx]
            score = distances[0][i]
            results.append({"document": document, "score": score})
        return results
# Example usage
if __name__ == "__main__":
    # Initialize encoder and FAISS vector database
    encoder = Encoder()
    vector_db = VectorDatabase(embedding_dim=384)  # Must match the encoder's embedding dimension
    vector_db.load_index("data/faiss_index.bin")   # Load the index built earlier so searches return results

    # Initialize retriever with a document store
    retriever = Retriever(encoder, vector_db, "data/texts.json")

    # Perform retrieval for a query
    query = "Explain machine learning concepts."
    top_documents = retriever.retrieve(query, k=3)

    print("\nQuery:", query)
    print("\nTop 3 retrieved documents:")
    for i, result in enumerate(top_documents):
        print(f"{i + 1}. Document: {result['document']} (Score: {result['score']:.4f})")
Explanation of the Code
- Retriever Initialization:
  - Combines the `Encoder` and `VectorDatabase` modules.
  - Loads documents from a JSON file (`document_store_path`) to map retrieved indices to the actual document texts.
- Document Loading:
  - Loads a pre-saved document list in JSON format.
  - This document list corresponds to the embeddings stored in the FAISS index.
- Retrieval:
  - Encodes the query using the Encoder.
  - Performs a FAISS search to find the top-`k` relevant embeddings.
  - Maps the retrieved indices to the actual document texts.
- Structured Output:
  - Returns a list of dictionaries, each containing a document and its similarity score.
Example Workflow
1. Store Documents: Ensure documents are saved in `data/texts.json`:

[
    "Machine learning is fascinating.",
    "Natural Language Processing is a key area of AI.",
    "RAG combines retrieval and generation for better results.",
    "Deep learning is a subset of machine learning.",
    "Artificial Intelligence is transforming industries."
]

2. Perform Retrieval: For the query "Explain machine learning concepts.", the output might look like this:

Query: Explain machine learning concepts.

Top 3 retrieved documents:
1. Document: Machine learning is fascinating. (Score: 0.2374)
2. Document: Deep learning is a subset of machine learning. (Score: 0.3498)
3. Document: Artificial Intelligence is transforming industries. (Score: 0.5821)
1.4. Generator
Now, we will integrate the Generator module, the final component of the Retrieval-Augmented Generation (RAG) system. The Generator takes the retrieved documents and the query as input to generate a context-aware, relevant response.
Key Steps for the Generator Module
- Language Model:
  - Use a pre-trained model like OpenAI's GPT or a HuggingFace Transformers model (e.g., `gpt-neo`, `t5`).
  - Fine-tuning is optional and not required for a basic RAG system.
- Input Preparation:
  - Combine the query and retrieved documents into a format suitable for the model (e.g., concatenated text).
- Response Generation:
  - Use the language model to generate a response based on the input context.
- Integration:
  - Combine the Generator with the Retriever to form the end-to-end pipeline.
Code: Generator Module
from transformers import AutoModelForCausalLM, AutoTokenizer
class Generator:
"""
Generator class for producing context-aware responses using a language model.
"""
def __init__(self, model_name: str = 'gpt2'):
"""
Initialize the Generator with a pre-trained language model.
Parameters:
- model_name (str): The name of the pre-trained model to load.
"""
self.model_name = model_name
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
print(f"Generator initialized with model: {self.model_name}")
def generate_response(self, query: str, retrieved_docs: list, max_length: int = 150) -> str:
"""
Generate a response using the query and retrieved documents as context.
Parameters:
- query (str): The input query.
- retrieved_docs (list): List of retrieved documents (strings).
- max_length (int): Maximum length of the generated response.
Returns:
- str: The generated response.
"""
        # Prepare the input text by combining query and context
        context = "\n".join(retrieved_docs)
        input_text = f"Context:\n{context}\n\nQuery: {query}\n\nAnswer:"

        # Tokenize input
        inputs = self.tokenizer.encode(input_text, return_tensors='pt', truncation=True)

        # Generate response (max_new_tokens bounds the generated answer, independent of prompt length)
        outputs = self.model.generate(
            inputs,
            max_new_tokens=max_length,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
        )
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response
# Example usage
if __name__ == "__main__":
    # Initialize retriever and generator
    encoder = Encoder()
    vector_db = VectorDatabase(embedding_dim=384)
    vector_db.load_index("data/faiss_index.bin")  # Load the previously built index
    retriever = Retriever(encoder, vector_db, "data/texts.json")
    generator = Generator(model_name='gpt2')

    # Perform retrieval
    query = "Explain the importance of machine learning."
    top_documents = retriever.retrieve(query, k=3)

    # Extract document texts
    retrieved_docs = [result["document"] for result in top_documents]

    # Generate a response
    response = generator.generate_response(query, retrieved_docs)

    print("\nGenerated Response:")
    print(response)
Explanation of the Code
- Initialization:
  - The `Generator` class uses a pre-trained language model from HuggingFace's Transformers library (default: `gpt2`).
  - The tokenizer and model are loaded during initialization.
- Input Preparation:
  - Retrieved documents are concatenated with the query to form a context for the generator.
  - The format is designed to guide the model in generating a coherent response.
- Response Generation:
  - The `generate_response` method tokenizes the input, feeds it into the model, and decodes the output.
  - Parameters like `max_new_tokens` and `no_repeat_ngram_size` keep the generated answer bounded and reduce repetition.
- Pipeline Integration:
  - Combined with the Retriever, the Generator produces responses using real-time retrieval as context.
Example Workflow
Query
"Explain the importance of machine learning."
Retrieved Documents
1. Machine learning is fascinating.
2. Deep learning is a subset of machine learning.
3. Artificial Intelligence is transforming industries.
Generated Response
Machine learning is a critical area of artificial intelligence that allows systems to learn and adapt from data without being explicitly programmed. It underpins advancements in deep learning and drives transformative applications across industries.
End-to-End Pipeline Assembly
To create the full RAG pipeline, you can integrate the Encoder, Retriever, and Generator into a single class or script. Here’s a basic structure:
class RAGPipeline:
"""
End-to-End Retrieval-Augmented Generation (RAG) pipeline.
"""
def __init__(self, encoder: Encoder, vector_db: VectorDatabase, retriever: Retriever, generator: Generator):
"""
Initialize the RAG pipeline with all components.
"""
self.encoder = encoder
self.vector_db = vector_db
self.retriever = retriever
self.generator = generator
def answer_query(self, query: str, top_k: int = 3, max_length: int = 150):
"""
Generate an answer to a query using the RAG pipeline.
Parameters:
- query (str): The input query.
- top_k (int): Number of top documents to retrieve.
- max_length (int): Maximum length of the generated response.
Returns:
- str: The generated response.
"""
        # Retrieve relevant documents
        top_documents = self.retriever.retrieve(query, k=top_k)
        retrieved_docs = [result["document"] for result in top_documents]

        # Generate response
        response = self.generator.generate_response(query, retrieved_docs, max_length=max_length)
        return response
# Example usage
if __name__ == "__main__":
    # Initialize components
    encoder = Encoder()
    vector_db = VectorDatabase(embedding_dim=384)
    vector_db.load_index("data/faiss_index.bin")  # Load the previously built index
    retriever = Retriever(encoder, vector_db, "data/texts.json")
    generator = Generator(model_name='gpt2')

    # Create RAG pipeline
    rag_pipeline = RAGPipeline(encoder, vector_db, retriever, generator)

    # Answer a query
    query = "What is the impact of AI on industries?"
    response = rag_pipeline.answer_query(query)

    print("\nFinal Response:")
    print(response)
1.5. Deployment
Let’s proceed with deploying the RAG pipeline. The goal is to make the pipeline accessible via a web API, where users can submit queries and receive responses in real time.
Deployment Plan
- Framework Selection:
  - Use a lightweight, efficient web framework like FastAPI for creating and serving the API.
- API Design:
  - Define an endpoint to accept user queries and return generated responses.
  - Optionally add endpoints for health checks or additional metadata.
- Integration:
  - Load and initialize the Encoder, Retriever, Vector Database, and Generator within the API.
- Run Server:
  - Serve the API locally or deploy it to a platform like AWS, Google Cloud, or Heroku.
Code: RAG API with FastAPI
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
# Initialize FastAPI app
app = FastAPI()

# Define input and output schemas
class QueryRequest(BaseModel):
    query: str
    top_k: int = 3  # Number of documents to retrieve

class QueryResponse(BaseModel):
    query: str
    response: str
    retrieved_documents: list

# Initialize RAG components
encoder = Encoder()
vector_db = VectorDatabase(embedding_dim=384)
vector_db.load_index("data/faiss_index.bin")  # Load the index built earlier so the API can retrieve documents
retriever = Retriever(encoder, vector_db, "data/texts.json")
generator = Generator(model_name="gpt2")
rag_pipeline = RAGPipeline(encoder, vector_db, retriever, generator)
@app.get("/")
def health_check():
"""
Health check endpoint to verify the API is running.
"""
return {"status": "OK", "message": "RAG API is running."}
@app.post("/query", response_model=QueryResponse)
def answer_query(request: QueryRequest):
"""
Endpoint to handle user queries and return RAG responses.
Parameters:
- query (str): The input query.
- top_k (int): Number of top documents to retrieve.
Returns:
- QueryResponse: Contains the query, generated response, and retrieved documents.
"""
try:
        # Retrieve top documents
        top_documents = rag_pipeline.retriever.retrieve(request.query, k=request.top_k)
        retrieved_docs = [result["document"] for result in top_documents]

        # Generate response
        response = rag_pipeline.generator.generate_response(request.query, retrieved_docs)
return QueryResponse(
query=request.query,
response=response,
retrieved_documents=retrieved_docs,
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
# Run the server with: uvicorn rag_api:app --reload
Explanation of the Code
- FastAPI Initialization:
  - The `FastAPI` app is initialized, and routes are defined to handle requests.
- Endpoint Definitions:
  - Health Check (`/`): Confirms the API is up and running.
  - Query Endpoint (`/query`):
    - Accepts a query and retrieves top documents.
    - Generates a response using the RAG pipeline.
    - Returns the query, generated response, and retrieved documents.
- Input and Output Validation:
  - Pydantic models (`QueryRequest` and `QueryResponse`) ensure structured input and output, enabling easier debugging and testing.
- Pipeline Integration:
  - The RAG pipeline components (`Encoder`, `Retriever`, `VectorDatabase`, and `Generator`) are initialized and used within the API.
- Error Handling:
  - Exceptions are caught and returned with a `500` status code, making the API more robust.
Running the Server
- Save the script as `rag_api.py`.
- Run the server locally using Uvicorn:

uvicorn rag_api:app --reload

- Test the API:
  - Open the interactive API documentation at http://127.0.0.1:8000/docs
  - Submit a query through the `/query` endpoint.
Example Request and Response
Request:
{
"query": "What is machine learning?",
"top_k": 3
}
Response:
{
"query": "What is machine learning?",
"response": "Machine learning is a method of data analysis that automates analytical model building...",
"retrieved_documents": [
"Machine learning is fascinating.",
"Deep learning is a subset of machine learning.",
"Artificial Intelligence is transforming industries."
]
}
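You can also call the endpoint programmatically. Here is a minimal client sketch using the `requests` library (assumes the server is running locally on port 8000, as in the Uvicorn command above):

import requests

# Send a query to the running RAG API and print the structured response
payload = {"query": "What is machine learning?", "top_k": 3}
resp = requests.post("http://127.0.0.1:8000/query", json=payload, timeout=60)
resp.raise_for_status()

data = resp.json()
print("Response:", data["response"])
print("Retrieved documents:", data["retrieved_documents"])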
Deployment Options
- Local Deployment:
  - Test the API locally using Uvicorn.
- Cloud Deployment:
  - Deploy to platforms like AWS, Google Cloud, Azure, or Heroku.
  - Use Docker for containerization if needed.
- Scaling:
  - Integrate a load balancer for high traffic (a sample launch command follows this list).
  - Use GPU instances for faster inference in production.
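For example, one illustrative way (not a full deployment recipe) to serve the API with several Uvicorn worker processes is:

uvicorn rag_api:app --host 0.0.0.0 --port 8000 --workers 4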
2. Information Retriever System
2.1. Project Setup
Here’s the code for Part 1: Project Setup and Requirements, with in-depth documentation and explanations.
Code for Project Setup and Requirements
"""
Information Retrieval System - Project Setup
--------------------------------------------
This script initializes the project environment and installs necessary dependencies.
It also sets up the basic folder structure for the project.
"""
import os
import subprocess
import sys
# Function to install required packages
def install_packages(packages):
"""
Installs the given Python packages using pip.
Args:
packages (list): A list of package names to install.
Returns:
None
"""
for package in packages:
subprocess.check_call([sys.executable, "-m", "pip", "install", package])
# Step 1: Install required Python packages
REQUIRED_PACKAGES = [
    "numpy",          # For numerical computations
    "pandas",         # For data manipulation
    "nltk",           # For text preprocessing
    "spacy",          # For advanced NLP processing
    "transformers",   # For working with pre-trained models (Hugging Face)
    "faiss-cpu",      # For efficient vector search
    "elasticsearch",  # For full-text search and indexing
]
print("Installing required packages...")
install_packages(REQUIRED_PACKAGES)
print("All packages installed successfully!")
# Step 2: Set up project directory structure
PROJECT_DIRS = [
    "data",          # For storing raw datasets
    "preprocessed",  # For storing preprocessed data
    "models",        # For storing trained or pre-trained models
    "notebooks",     # For experimentation and EDA
    "scripts",       # For project-related scripts
    "outputs",       # For storing results and retrieved documents
    "configs",       # For configuration files
    "logs",          # For storing logs
]
print("\nSetting up project directories...")
for directory in PROJECT_DIRS:
os.makedirs(directory, exist_ok=True)
print(f"Created directory: {directory}")
print("Project directories created successfully!")
# Step 3: Download NLP models and datasets
def download_nlp_resources():
"""
Downloads necessary NLP models and resources for preprocessing.
Returns:
None
"""
import nltk
import spacy
print("\nDownloading NLTK resources...")
    nltk.download('punkt')      # Tokenizer
    nltk.download('stopwords')  # Stopword lists
    nltk.download('wordnet')    # WordNet for lemmatization

    print("Downloading SpaCy language model...")
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])  # English model
print("All NLP resources downloaded successfully!")
download_nlp_resources()
# Final message
print("\nProject setup completed successfully! You are ready to proceed.")
Explanation
Step 1: Install Required Packages
- The script installs essential Python libraries using `pip`:
  - NumPy and Pandas for handling data.
  - NLTK and SpaCy for text preprocessing.
  - Transformers for integrating pre-trained LLMs.
  - FAISS for efficient similarity search.
  - Elasticsearch for full-text indexing and search.
Step 2: Create Project Directories
- Creates a standard project directory structure:
- data: Store raw datasets.
- preprocessed: Store cleaned/preprocessed datasets.
- models: Save model files.
- notebooks: Use Jupyter notebooks for exploratory data analysis (EDA).
- scripts: Store Python scripts for modular development.
- outputs: Store retrieval results and logs.
- configs: Store configuration files like model parameters.
- logs: Maintain logs for debugging.
Step 3: Download NLP Resources
- Downloads essential NLP resources for preprocessing:
  - NLTK: `punkt` tokenizer, `stopwords`, and `wordnet`.
  - SpaCy: Pre-trained English model (`en_core_web_sm`) for advanced text processing.
How to Run the Script
- Save this script as `setup.py`.
- Run it in your terminal or Python environment:

python setup.py

- Verify that:
  - Required packages are installed.
  - Project directories are created.
  - NLP models and datasets are downloaded.
2.2. Dataset Preparation
Here’s the code for Part 2: Dataset Preparation, including detailed documentation and explanations.
Code for Dataset Preparation
"""
Information Retrieval System - Dataset Preparation
---------------------------------------------------
This script loads, cleans, and preprocesses text data for use in the Information Retrieval System.
It includes:
1. Loading raw data
2. Text cleaning (lowercasing, removing punctuation, etc.)
3. Tokenization and stopword removal
4. Lemmatization
"""
import os
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import json
# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Constants
RAW_DATA_PATH = "data/raw_dataset.json"                # Path to the raw dataset
PROCESSED_DATA_PATH = "preprocessed/cleaned_data.csv"  # Output path for cleaned data

# Step 1: Load raw dataset
def load_dataset(file_path):
"""
Loads a dataset from a JSON file.
Args:
file_path (str): Path to the dataset file.
Returns:
list: A list of documents (text).
"""
print("Loading dataset...")
with open(file_path, 'r', encoding='utf-8') as file:
data = json.load(file)
print(f"Loaded {len(data)} documents.")
return data
# Step 2: Clean text data
def clean_text(text):
"""
Cleans the input text by removing special characters, URLs, and converting to lowercase.
Args:
text (str): The raw text.
Returns:
str: The cleaned text.
"""
    text = re.sub(r"http\S+", "", text)      # Remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # Remove non-alphabetic characters
    text = text.lower()                      # Convert to lowercase
return text
# Step 3: Preprocess text (tokenization, stopword removal, lemmatization)
def preprocess_text(text):
"""
Preprocesses the input text:
1. Tokenizes the text
2. Removes stopwords
3. Lemmatizes tokens
Args:
text (str): The cleaned text.
Returns:
str: Preprocessed text as a single string.
"""
    tokens = word_tokenize(text)  # Tokenization
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]  # Lemmatization
return " ".join(lemmatized_tokens)
# Step 4: Process dataset and save
def process_and_save(data, output_path):
"""
Cleans and preprocesses a list of documents and saves them to a CSV file.
Args:
data (list): A list of documents.
output_path (str): Path to save the processed data.
Returns:
None
"""
processed_data = []
for i, document in enumerate(data):
print(f"Processing document {i+1}/{len(data)}...")
cleaned_text = clean_text(document['text'])
preprocessed_text = preprocess_text(cleaned_text)
processed_data.append({"id": document['id'], "processed_text": preprocessed_text})
df = pd.DataFrame(processed_data)
df.to_csv(output_path, index=False)
print(f"Processed data saved to {output_path}.")
# Main execution
if __name__ == "__main__":
    # Step 1: Load dataset
    dataset = load_dataset(RAW_DATA_PATH)

    # Step 2: Process and save dataset
    process_and_save(dataset, PROCESSED_DATA_PATH)

    print("Dataset preparation completed successfully!")
Explanation
Step 1: Load Raw Dataset
- Assumes the dataset is a JSON file where each document is a dictionary with keys like `id` and `text`.
- Loads the data into memory.
Step 2: Text Cleaning
- Removes unnecessary components from text:
  - URLs: Using regular expressions (`re.sub`).
  - Non-alphabetic characters: To keep only meaningful words.
- Converts text to lowercase for uniformity.
Step 3: Text Preprocessing
- Tokenization: Splits text into individual words using NLTK's `word_tokenize`.
- Stopword Removal: Removes common, non-informative words (e.g., "and", "the").
- Lemmatization: Converts words to their base form (e.g., "running" → "run") using the WordNet lemmatizer.
Step 4: Process Dataset and Save
- Iterates through all documents in the dataset.
- Applies the cleaning and preprocessing pipeline.
- Saves the results into a CSV file for further use.
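As a quick sanity check, here is a minimal sketch (assuming the `clean_text` and `preprocess_text` functions above are in scope) showing roughly what one sentence looks like after each stage; the exact lemmatized output may vary slightly:

# Run one sentence through the cleaning and preprocessing steps defined above
sample = "Information Retrieval (IR) helps users find documents: https://example.com!"

cleaned = clean_text(sample)
processed = preprocess_text(cleaned)

print(cleaned)    # roughly: "information retrieval ir helps users find documents "
print(processed)  # roughly: "information retrieval ir help user find document"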
How to Run the Script
- Place your raw dataset in the `data` folder with the name `raw_dataset.json`. Format example:

[
    {"id": "1", "text": "The quick brown fox jumps over the lazy dog."},
    {"id": "2", "text": "Information retrieval is a fascinating field of study."}
]

- Save the script as `prepare_dataset.py`.
- Run the script:

python prepare_dataset.py

- Verify the output in the `preprocessed/cleaned_data.csv` file.
2.3. Indexing System
Here’s the code for Part 3: Indexing System, with detailed documentation and explanations.
Code for Indexing System
"""
Information Retrieval System - Indexing System
-----------------------------------------------
This script creates an indexing mechanism to store document embeddings
for efficient semantic search. It uses FAISS for similarity search and
SentenceTransformers for generating embeddings.
"""
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import os
import pickle
# Constants
PROCESSED_DATA_PATH = "preprocessed/cleaned_data.csv"   # Path to preprocessed data
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"               # Hugging Face embedding model
INDEX_SAVE_PATH = "models/faiss_index.pkl"              # Path to save the FAISS index
EMBEDDINGS_SAVE_PATH = "models/document_embeddings.pkl" # Path to save embeddings

# Step 1: Load preprocessed data
def load_preprocessed_data(file_path):
"""
Loads preprocessed data from a CSV file.
Args:
file_path (str): Path to the preprocessed data.
Returns:
pd.DataFrame: DataFrame containing the preprocessed data.
"""
print("Loading preprocessed data...")
data = pd.read_csv(file_path)
print(f"Loaded {len(data)} documents.")
return data
# Step 2: Generate embeddings for documents
def generate_embeddings(documents, model_name):
"""
Generates embeddings for the given documents using a pre-trained model.
Args:
documents (list): List of text documents.
model_name (str): Name of the Hugging Face model for embedding generation.
Returns:
list: List of embeddings.
"""
print(f"Loading embedding model: {model_name}...")
model = SentenceTransformer(model_name)
print("Generating embeddings...")
embeddings = model.encode(documents, show_progress_bar=True)
print("Embeddings generated successfully!")
return embeddings
# Step 3: Create FAISS index
def create_faiss_index(embeddings):
"""
Creates a FAISS index for the given embeddings.
Args:
embeddings (list): List of embeddings.
Returns:
faiss.IndexFlatL2: A FAISS index for similarity search.
"""
print("Creating FAISS index...")
    dimension = embeddings.shape[1]       # Get the dimension of embeddings
    index = faiss.IndexFlatL2(dimension)  # L2 similarity index
    index.add(embeddings)                 # Add embeddings to the index
print(f"FAISS index created with {index.ntotal} documents.")
return index
# Step 4: Save index and embeddings
def save_index_and_embeddings(index, embeddings, index_path, embeddings_path):
"""
Saves the FAISS index and embeddings to disk.
Args:
index (faiss.IndexFlatL2): The FAISS index.
embeddings (list): The embeddings.
index_path (str): Path to save the FAISS index.
embeddings_path (str): Path to save the embeddings.
Returns:
None
"""
print("Saving FAISS index and embeddings...")
faiss.write_index(index, index_path)
with open(embeddings_path, 'wb') as f:
pickle.dump(embeddings, f)
print("FAISS index and embeddings saved successfully!")
# Main execution
if __name__ == "__main__":
    # Step 1: Load preprocessed data
    data = load_preprocessed_data(PROCESSED_DATA_PATH)

    # Step 2: Generate embeddings
    documents = data['processed_text'].tolist()
    embeddings = generate_embeddings(documents, EMBEDDING_MODEL_NAME)

    # Convert embeddings to NumPy array
    import numpy as np
    embeddings_array = np.array(embeddings)

    # Step 3: Create FAISS index
    faiss_index = create_faiss_index(embeddings_array)

    # Step 4: Save index and embeddings
    save_index_and_embeddings(faiss_index, embeddings_array, INDEX_SAVE_PATH, EMBEDDINGS_SAVE_PATH)
print("Indexing system completed successfully!")
Explanation
Step 1: Load Preprocessed Data
- Reads the cleaned dataset from a CSV file into a Pandas DataFrame.
- The `processed_text` column contains the preprocessed documents.
Step 2: Generate Embeddings
- Uses SentenceTransformers to convert documents into numerical embeddings.
- Embeddings capture semantic meaning and are required for efficient search.
- Model used: `all-MiniLM-L6-v2` (a lightweight and efficient embedding model).
Step 3: Create FAISS Index
- FAISS is a library for similarity search.
- The script creates an L2 similarity index:
  - Measures similarity between query and document embeddings using Euclidean distance.
- Adds embeddings to the FAISS index.
Step 4: Save Index and Embeddings
- Saves the FAISS index to disk for future use using `faiss.write_index`.
- Saves document embeddings separately as a backup using Python's `pickle`.
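To confirm the artifacts were written correctly, a small sketch (using the same paths as above) that loads them back:

import pickle
import faiss

# Reload the FAISS index and the pickled embeddings saved by indexing.py
index = faiss.read_index("models/faiss_index.pkl")
with open("models/document_embeddings.pkl", "rb") as f:
    embeddings = pickle.load(f)

print("Vectors in index:", index.ntotal)
print("Embeddings shape:", embeddings.shape)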
How to Run the Script
- Ensure the preprocessed data is available at `preprocessed/cleaned_data.csv`.
- Save the script as `indexing.py`.
- Run the script:

python indexing.py

- Verify:
  - The FAISS index is saved at `models/faiss_index.pkl`.
  - Document embeddings are saved at `models/document_embeddings.pkl`.
2.4. Query Processing
Here’s the code for Part 4: Query Processing, with detailed documentation and explanations.
Code for Query Processing
"""
Information Retrieval System - Query Processing
------------------------------------------------
This script handles user queries by:
1. Preprocessing the query to match the document pipeline.
2. Generating embeddings for the query.
3. Searching the FAISS index for the most relevant documents.
"""
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import pickle
# Constants
PROCESSED_DATA_PATH = "preprocessed/cleaned_data.csv"  # Path to preprocessed data
INDEX_PATH = "models/faiss_index.pkl"                  # Path to FAISS index
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"              # Embedding model
TOP_K = 5                                              # Number of top results to return

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Step 1: Load FAISS index and embeddings
def load_faiss_index(index_path):
"""
Loads the FAISS index from disk.
Args:
index_path (str): Path to the FAISS index file.
Returns:
faiss.IndexFlatL2: The loaded FAISS index.
"""
print("Loading FAISS index...")
index = faiss.read_index(index_path)
print("FAISS index loaded successfully!")
return index
# Step 2: Load preprocessed data
def load_preprocessed_data(file_path):
"""
Loads the preprocessed dataset for mapping document IDs to text.
Args:
file_path (str): Path to the preprocessed data.
Returns:
pd.DataFrame: DataFrame containing document mappings.
"""
print("Loading preprocessed data...")
data = pd.read_csv(file_path)
return data
# Step 3: Preprocess the query
def preprocess_query(query):
"""
Preprocesses the query:
1. Tokenizes the text.
2. Removes stopwords.
3. Lemmatizes tokens.
Args:
query (str): The raw query.
Returns:
str: The preprocessed query.
"""
print("Preprocessing query...")
    query = re.sub(r"[^a-zA-Z\s]", "", query).lower()  # Clean and lowercase
    tokens = word_tokenize(query)                      # Tokenize
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]  # Lemmatize
return " ".join(lemmatized_tokens)
# Step 4: Generate query embedding
def generate_query_embedding(query, model_name):
"""
Generates an embedding for the query using a pre-trained model.
Args:
query (str): The preprocessed query.
model_name (str): Name of the embedding model.
Returns:
np.ndarray: The query embedding.
"""
print(f"Loading embedding model: {model_name}...")
model = SentenceTransformer(model_name)
print("Generating query embedding...")
embedding = model.encode([query])
return np.array(embedding)
# Step 5: Search FAISS index
def search_index(faiss_index, query_embedding, top_k=5):
"""
Searches the FAISS index for the most similar documents.
Args:
faiss_index (faiss.IndexFlatL2): The FAISS index.
query_embedding (np.ndarray): The query embedding.
top_k (int): Number of top results to return.
Returns:
list: List of document indices and distances.
"""
print(f"Searching FAISS index for top {top_k} results...")
distances, indices = faiss_index.search(query_embedding, top_k)
return indices[0], distances[0]
# Step 6: Map results to documents
def map_results_to_documents(indices, data):
"""
Maps FAISS indices to document text using the preprocessed dataset.
Args:
indices (list): List of FAISS indices.
data (pd.DataFrame): Preprocessed dataset.
Returns:
list: List of matching documents.
"""
print("Mapping results to documents...")
results = []
for idx in indices:
document = data.iloc[idx]["processed_text"]
results.append(document)
return results
# Main execution
if __name__ == "__main__":
    # Step 1: Load FAISS index and preprocessed data
    faiss_index = load_faiss_index(INDEX_PATH)
    data = load_preprocessed_data(PROCESSED_DATA_PATH)

    # Step 2: Accept user query
    user_query = input("Enter your query: ")

    # Step 3: Preprocess the query
    preprocessed_query = preprocess_query(user_query)

    # Step 4: Generate query embedding
    query_embedding = generate_query_embedding(preprocessed_query, EMBEDDING_MODEL_NAME)

    # Step 5: Search the index
    indices, distances = search_index(faiss_index, query_embedding, TOP_K)

    # Step 6: Map results to documents
    results = map_results_to_documents(indices, data)

    # Display results
print("\nTop Results:")
for i, (doc, dist) in enumerate(zip(results, distances), 1):
print(f"{i}. Document: {doc}\n Distance: {dist:.4f}")
Explanation
Step 1: Load FAISS Index
- Loads the FAISS index from disk using `faiss.read_index`.
Step 2: Load Preprocessed Data
- Loads the CSV file containing the preprocessed text.
- Used to map document indices from FAISS to actual text.
Step 3: Preprocess Query
- Cleans the user query to ensure it matches the preprocessing applied to documents:
  - Removes special characters and stopwords.
  - Tokenizes and lemmatizes the text.
Step 4: Generate Query Embedding
- Converts the preprocessed query into an embedding using the same model as the indexing step.
Step 5: Search FAISS Index
- Searches the FAISS index for the top-`k` most similar embeddings.
- Returns indices and distances for the closest matches.
Step 6: Map Results to Documents
- Maps the FAISS indices back to the document text for display.
How to Run the Script
- Save the script as `query_processing.py`.
- Ensure the FAISS index (`models/faiss_index.pkl`) and preprocessed data (`preprocessed/cleaned_data.csv`) are available.
- Run the script:

python query_processing.py

- Enter a query when prompted, e.g., "What is information retrieval?".
2.5. Test the full pipeline
To test the full pipeline of the Information Retrieval System, here’s how you can proceed:
Test Plan for the Full Pipeline
- Prepare a Sample Dataset:
  - Use a small, manageable dataset to validate the pipeline.
  - Example dataset format:

[
    {"id": "1", "text": "Information retrieval is about finding information from large datasets."},
    {"id": "2", "text": "Semantic search improves search accuracy by understanding the query's meaning."},
    {"id": "3", "text": "Machine learning is a core technology behind modern search engines."},
    {"id": "4", "text": "Natural Language Processing enables computers to understand human language."},
    {"id": "5", "text": "Deep learning techniques are widely used in image and text analysis."}
]

- Load the Dataset:
  - Place the dataset in `data/raw_dataset.json`.
- Run Each Script Sequentially:
  - Step 1: Run `prepare_dataset.py` to clean and preprocess the data.
  - Step 2: Run `indexing.py` to build the FAISS index and generate embeddings.
  - Step 3: Run `query_processing.py` to process a query and retrieve results.
- Enter Test Queries:
  - Example queries:
    - "What is semantic search?"
    - "Explain machine learning in search engines."
    - "How does deep learning relate to text analysis?"
Step-by-Step Testing
1. Prepare the Sample Dataset
Save the following JSON content as `data/raw_dataset.json`:
[
{"id": "1", "text": "Information retrieval is about finding information from large datasets."},
{"id": "2", "text": "Semantic search improves search accuracy by understanding the query's meaning."},
{"id": "3", "text": "Machine learning is a core technology behind modern search engines."},
{"id": "4", "text": "Natural Language Processing enables computers to understand human language."},
{"id": "5", "text": "Deep learning techniques are widely used in image and text analysis."}
]
2. Run Dataset Preparation: Run the following command to clean and preprocess the dataset:
python prepare_dataset.py
- Expected Output: `preprocessed/cleaned_data.csv` should contain cleaned and tokenized versions of the dataset.
3. Run Indexing: Run the following command to create the FAISS index:
python indexing.py
- Expected Output:
  - FAISS index saved as `models/faiss_index.pkl`.
  - Document embeddings saved as `models/document_embeddings.pkl`.
4. Test Query Processing: Run the query processing script:
python query_processing.py
- Enter example queries, such as:
  - "What is semantic search?"
  - "Explain machine learning in search engines."
- Expected Output: The system should return the most relevant documents, sorted by similarity. Example:
Enter your query: What is semantic search?
Top Results:
1. Document: Semantic search improves search accuracy by understanding the query's meaning.
Distance: 0.3542
2. Document: Machine learning is a core technology behind modern search engines.
Distance: 0.5628
Debugging Tips
- If no results are returned: Ensure that embeddings are generated and saved correctly.
- If results are inaccurate: Check that the same embedding model was used for indexing and querying.
- If FAISS index fails to load: Verify the file paths (`models/faiss_index.pkl`).
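Related to the second tip, one quick check (a minimal sketch, assuming the paths and model name used above) is to confirm that the index dimension matches the embedding model's output dimension:

import faiss
from sentence_transformers import SentenceTransformer

# Compare the stored index dimension with the model's embedding dimension
index = faiss.read_index("models/faiss_index.pkl")
model = SentenceTransformer("all-MiniLM-L6-v2")

print("Index dimension:", index.d)
print("Model dimension:", model.get_sentence_embedding_dimension())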
2.6. Evaluation and testing
Here’s how to evaluate and test the Information Retrieval System using performance metrics like Precision, Recall, and nDCG (Normalized Discounted Cumulative Gain).
Evaluation Plan
- Define Relevance Judgments:
  - Create a set of sample queries and manually mark relevant documents from the dataset.
- Metrics for Evaluation:
  - Precision: Proportion of retrieved documents that are relevant.
    \[ \text{Precision} = \frac{\text{Relevant Documents Retrieved}}{\text{Total Retrieved Documents}} \]
  - Recall: Proportion of relevant documents retrieved out of all relevant documents in the dataset.
    \[ \text{Recall} = \frac{\text{Relevant Documents Retrieved}}{\text{Total Relevant Documents}} \]
  - nDCG: Measures ranking quality by rewarding higher ranks for relevant documents (a short worked example follows this list).
    \[ \text{DCG} = \sum_{i=1}^{n} \frac{\text{relevance}_i}{\log_2(i + 1)} \qquad \text{nDCG} = \frac{\text{DCG}}{\text{IDCG}} \]
- Test Framework:
  - Use a list of predefined queries with relevance judgments.
  - Measure the system's performance for each query and compute the average metrics.
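For intuition, a small worked example: suppose the system returns three documents, the first and third are relevant (relevance 1), the second is not (relevance 0), and the collection contains two relevant documents in total. Then

\[ \text{DCG} = \frac{1}{\log_2 2} + \frac{0}{\log_2 3} + \frac{1}{\log_2 4} = 1 + 0 + 0.5 = 1.5 \]
\[ \text{IDCG} = \frac{1}{\log_2 2} + \frac{1}{\log_2 3} \approx 1.63, \qquad \text{nDCG} = \frac{1.5}{1.63} \approx 0.92 \]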
Code for Evaluation and Testing
"""
Information Retrieval System - Evaluation and Testing
------------------------------------------------------
This script evaluates the retrieval system using Precision, Recall, and nDCG metrics.
"""
import numpy as np
from sentence_transformers import SentenceTransformer

# Reuse the helpers and constants defined in query_processing.py (Part 2.4)
from query_processing import (
    load_faiss_index,
    load_preprocessed_data,
    preprocess_query,
    generate_query_embedding,
    search_index,
    INDEX_PATH,
    PROCESSED_DATA_PATH,
    EMBEDDING_MODEL_NAME,
)
# Constants
TOP_K = 5  # Number of top results to evaluate

# Step 1: Define test queries and relevance judgments
queries = [
    # NOTE: relevant_docs are 0-based positions in the indexed corpus (the row indices
    # returned by FAISS), not the 1-based document IDs. Adjust these to match your dataset.
    {"query": "What is semantic search?", "relevant_docs": [1]},
    {"query": "Explain machine learning in search engines.", "relevant_docs": [2]},
    {"query": "How does deep learning relate to text analysis?", "relevant_docs": [4]},
]
# Step 2: Compute Precision
def compute_precision(retrieved_indices, relevant_docs):
"""
Computes Precision for a single query.
Args:
retrieved_indices (list): Indices of documents retrieved by the system.
relevant_docs (list): Indices of relevant documents.
Returns:
float: Precision score.
"""
relevant_retrieved = len(set(retrieved_indices).intersection(set(relevant_docs)))
return relevant_retrieved / len(retrieved_indices)
# Step 3: Compute Recall
def compute_recall(retrieved_indices, relevant_docs):
"""
Computes Recall for a single query.
Args:
retrieved_indices (list): Indices of documents retrieved by the system.
relevant_docs (list): Indices of relevant documents.
Returns:
float: Recall score.
"""
relevant_retrieved = len(set(retrieved_indices).intersection(set(relevant_docs)))
return relevant_retrieved / len(relevant_docs)
# Step 4: Compute nDCG
def compute_ndcg(retrieved_indices, relevant_docs):
"""
Computes nDCG for a single query.
Args:
retrieved_indices (list): Indices of documents retrieved by the system.
relevant_docs (list): Indices of relevant documents.
Returns:
float: nDCG score.
"""
dcg = 0
for i, idx in enumerate(retrieved_indices):
if idx in relevant_docs:
            dcg += 1 / np.log2(i + 2)  # i + 2 because enumeration starts at 0 and ranks start at 1

    # Compute ideal DCG
idcg = sum([1 / np.log2(i + 2) for i in range(min(len(relevant_docs), TOP_K))])
return dcg / idcg if idcg > 0 else 0
# Step 5: Evaluate the system
def evaluate_system(faiss_index, embedding_model, data):
"""
Evaluates the system using predefined queries and metrics.
Args:
faiss_index (faiss.IndexFlatL2): The FAISS index.
embedding_model (SentenceTransformer): The embedding model.
data (pd.DataFrame): Preprocessed data.
Returns:
None
"""
    precisions, recalls, ndcgs = [], [], []
    for test_case in queries:
        query = test_case["query"]
        relevant_docs = test_case["relevant_docs"]

        # Preprocess and embed the query
        preprocessed_query = preprocess_query(query)
        query_embedding = generate_query_embedding(preprocessed_query, EMBEDDING_MODEL_NAME)

        # Retrieve results
        indices, _ = search_index(faiss_index, query_embedding, TOP_K)
        retrieved_indices = indices.tolist()

        # Compute metrics
        precision = compute_precision(retrieved_indices, relevant_docs)
        recall = compute_recall(retrieved_indices, relevant_docs)
        ndcg = compute_ndcg(retrieved_indices, relevant_docs)

        precisions.append(precision)
        recalls.append(recall)
        ndcgs.append(ndcg)

        # Print results for each query
        print(f"\nQuery: {query}")
        print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, nDCG: {ndcg:.4f}")

    # Compute average metrics
    print("\nOverall Evaluation:")
    print(f"Average Precision: {np.mean(precisions):.4f}")
    print(f"Average Recall: {np.mean(recalls):.4f}")
    print(f"Average nDCG: {np.mean(ndcgs):.4f}")
# Main execution
if __name__ == "__main__":
    # Load FAISS index and preprocessed data
    faiss_index = load_faiss_index(INDEX_PATH)
    data = load_preprocessed_data(PROCESSED_DATA_PATH)

    # Load the embedding model
    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)

    # Evaluate the system
    evaluate_system(faiss_index, embedding_model, data)
Explanation
Step 1: Define Test Queries
- List of queries with relevance judgments (indices of relevant documents).
Step 2: Compute Metrics
- Precision: Proportion of relevant documents among the retrieved results.
- Recall: Proportion of relevant documents retrieved out of all relevant ones.
- nDCG: Measures ranking quality, rewarding relevant documents at higher ranks.
Step 3: Evaluate the System
- For each query:
- Preprocess the query and embed it.
- Search the FAISS index.
- Compute metrics.
- Print metrics for individual queries and their averages.
How to Run the Script
- Save this as `evaluation.py`.
- Ensure `models/faiss_index.pkl` and `preprocessed/cleaned_data.csv` are available.
- Run the script:

python evaluation.py