LLM Mini Projects: Hands-On Applications of Large Language Models



Raj Shaikh

1. RAG System

1.1. Encoder

Let’s start with the Encoder module for a Retrieval-Augmented Generation (RAG) system. The Encoder is responsible for converting text data (documents or queries) into vector representations (embeddings) that are used for efficient similarity matching during retrieval.

Here’s a step-by-step plan for the Encoder module:

Key Components of the Encoder:

  1. Pre-trained Language Model: Use a pre-trained model from a library such as Sentence-Transformers or Hugging Face Transformers to encode documents into embeddings.
  2. Input Handling: Manage inputs (e.g., document text, query text).
  3. Vectorization: Generate fixed-size dense embeddings.
  4. Storage Integration: Optionally save embeddings for later retrieval using vector databases like Pinecone, Weaviate, or FAISS.

Let’s write the code for this module.


Code: Encoder Module

# Import required libraries
from sentence_transformers import SentenceTransformer
import numpy as np
import os
import json


class Encoder:
    """
    Encoder class for converting text data into dense vector representations.
    
    This module uses a pre-trained SentenceTransformer model to generate embeddings for
    documents or queries, enabling efficient similarity-based retrieval.
    """

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """
        Initialize the Encoder with a pre-trained model.

        Parameters:
        - model_name (str): The name of the pre-trained model to load from SentenceTransformers.
        """
        self.model_name = model_name
        self.model = SentenceTransformer(model_name)
        print(f"Encoder initialized with model: {self.model_name}")

    def encode(self, texts: list, batch_size: int = 32) -> np.ndarray:
        """
        Convert a list of text documents into dense vector embeddings.

        Parameters:
        - texts (list): A list of text strings to encode.
        - batch_size (int): Batch size for encoding to optimize speed and memory usage.

        Returns:
        - np.ndarray: A 2D array where each row corresponds to the embedding of a text.
        """
        if not isinstance(texts, list) or not all(isinstance(text, str) for text in texts):
            raise ValueError("Input 'texts' must be a list of strings.")
        
        embeddings = self.model.encode(texts, batch_size=batch_size, show_progress_bar=True)
        return np.array(embeddings)

    def save_embeddings(self, embeddings: np.ndarray, output_path: str):
        """
        Save embeddings to a file for later retrieval.

        Parameters:
        - embeddings (np.ndarray): The embeddings to save.
        - output_path (str): The path to save the embeddings file.
        """
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        np.save(output_path, embeddings)
        print(f"Embeddings saved to: {output_path}")

    def load_embeddings(self, input_path: str) -> np.ndarray:
        """
        Load embeddings from a saved file.

        Parameters:
        - input_path (str): The path of the saved embeddings file.

        Returns:
        - np.ndarray: The loaded embeddings.
        """
        if not os.path.exists(input_path):
            raise FileNotFoundError(f"Embeddings file not found: {input_path}")
        
        embeddings = np.load(input_path)
        print(f"Embeddings loaded from: {input_path}")
        return embeddings

    def save_texts(self, texts: list, output_path: str):
        """
        Save the original texts corresponding to the embeddings for reference.

        Parameters:
        - texts (list): The list of text documents.
        - output_path (str): The path to save the texts file (as JSON).
        """
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(texts, f, ensure_ascii=False, indent=4)
        print(f"Texts saved to: {output_path}")

    def load_texts(self, input_path: str) -> list:
        """
        Load the original texts from a saved file.

        Parameters:
        - input_path (str): The path of the saved texts file.

        Returns:
        - list: The list of text documents.
        """
        if not os.path.exists(input_path):
            raise FileNotFoundError(f"Texts file not found: {input_path}")
        
        with open(input_path, 'r', encoding='utf-8') as f:
            texts = json.load(f)
        print(f"Texts loaded from: {input_path}")
        return texts


# Example usage
if __name__ == "__main__":
    # Initialize encoder
    encoder = Encoder()

    # Sample data
    documents = [
        "Machine learning is fascinating.",
        "Natural Language Processing is a key area of AI.",
        "RAG combines retrieval and generation for better results."
    ]

    # Generate embeddings
    embeddings = encoder.encode(documents)

    # Save embeddings and texts
    encoder.save_embeddings(embeddings, "data/embeddings.npy")
    encoder.save_texts(documents, "data/texts.json")

    # Load embeddings and texts
    loaded_embeddings = encoder.load_embeddings("data/embeddings.npy")
    loaded_texts = encoder.load_texts("data/texts.json")

    print("Loaded embeddings shape:", loaded_embeddings.shape)
    print("Loaded texts:", loaded_texts)

Explanation of the Code

  1. Initialization:

    • The Encoder class initializes with a pre-trained model from sentence-transformers.
    • Default model: 'all-MiniLM-L6-v2' (lightweight and efficient).
  2. Encoding:

    • The encode method accepts a list of strings and generates embeddings using the pre-trained model.
    • Batch processing ensures memory-efficient encoding.
  3. Saving and Loading:

    • Embeddings and their corresponding texts are saved to disk for later use. Embeddings are saved in .npy format (NumPy arrays), while texts are saved as JSON.
  4. Error Handling:

    • Input validation ensures that the input is a list of strings.
    • File existence checks handle cases where the user tries to load missing files.
  5. Example Workflow:

    • Sample documents are encoded, saved, and reloaded to demonstrate the module’s functionality.
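
Before wiring the encoder into a vector database, it helps to sanity-check the embeddings directly. The short sketch below (assuming the Encoder class above is in scope) compares two encoded sentences with cosine similarity:

import numpy as np

encoder = Encoder()
emb = encoder.encode([
    "Machine learning is fascinating.",
    "What is machine learning?"
])

# Cosine similarity between the two embeddings (values near 1.0 indicate similar meaning)
cos_sim = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(f"Cosine similarity: {cos_sim:.4f}")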

1.2. Vector DB

Now that the Encoder module is ready, let’s move forward and integrate it with a vector database for efficient similarity search. We’ll use FAISS (Facebook AI Similarity Search), which is a popular open-source library for efficient similarity search and clustering of dense embeddings.


Key Steps for Vector Database Integration

  1. FAISS Setup:

    • Use FAISS to store and search embeddings.
    • Configure an index for similarity search (e.g., IndexFlatL2 for L2/Euclidean distance).
  2. Index Management:

    • Add encoded embeddings to the FAISS index.
    • Perform similarity queries on the index.
  3. Persistence:

    • Save the FAISS index to disk for reuse.
    • Load the index when needed.
  4. Integration:

    • Combine FAISS with the Encoder module to create a seamless retrieval pipeline.

Code: FAISS Integration

import faiss
import numpy as np
import os

class VectorDatabase:
    """
    Vector Database using FAISS for similarity search.
    """

    def __init__(self, embedding_dim: int):
        """
        Initialize the FAISS vector database.

        Parameters:
        - embedding_dim (int): The dimensionality of the embeddings.
        """
        self.embedding_dim = embedding_dim
        self.index = faiss.IndexFlatL2(embedding_dim)  # L2 distance for similarity search
        print(f"FAISS index initialized with embedding dimension: {self.embedding_dim}")

    def add_embeddings(self, embeddings: np.ndarray):
        """
        Add embeddings to the FAISS index.

        Parameters:
        - embeddings (np.ndarray): A 2D array of embeddings to add.
        """
        if embeddings.shape[1] != self.embedding_dim:
            raise ValueError("Embedding dimension does not match the index configuration.")
        
        self.index.add(embeddings)
        print(f"Added {embeddings.shape[0]} embeddings to the index.")

    def search(self, query_embeddings: np.ndarray, k: int = 5):
        """
        Search for the top-k similar embeddings.

        Parameters:
        - query_embeddings (np.ndarray): A 2D array of query embeddings.
        - k (int): Number of nearest neighbors to retrieve.

        Returns:
        - distances (np.ndarray): Distances of the top-k neighbors.
        - indices (np.ndarray): Indices of the top-k neighbors.
        """
        if query_embeddings.shape[1] != self.embedding_dim:
            raise ValueError("Query embedding dimension does not match the index configuration.")
        
        distances, indices = self.index.search(query_embeddings, k)
        return distances, indices

    def save_index(self, file_path: str):
        """
        Save the FAISS index to a file.

        Parameters:
        - file_path (str): Path to save the FAISS index.
        """
        faiss.write_index(self.index, file_path)
        print(f"FAISS index saved to: {file_path}")

    def load_index(self, file_path: str):
        """
        Load a FAISS index from a file.

        Parameters:
        - file_path (str): Path to load the FAISS index from.
        """
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"FAISS index file not found: {file_path}")
        
        self.index = faiss.read_index(file_path)
        print(f"FAISS index loaded from: {file_path}")


# Example usage
if __name__ == "__main__":
    from sentence_transformers import SentenceTransformer

    # Initialize encoder and database
    encoder = Encoder()
    documents = [
        "Machine learning is fascinating.",
        "Natural Language Processing is a key area of AI.",
        "RAG combines retrieval and generation for better results.",
        "Deep learning is a subset of machine learning.",
        "Artificial Intelligence is transforming industries."
    ]

    # Encode documents
    embeddings = encoder.encode(documents)

    # Initialize FAISS database
    vector_db = VectorDatabase(embedding_dim=embeddings.shape[1])
    vector_db.add_embeddings(embeddings)

    # Save index
    vector_db.save_index("data/faiss_index.bin")

    # Load index
    vector_db.load_index("data/faiss_index.bin")

    # Perform a query
    query = ["What is machine learning?"]
    query_embeddings = encoder.encode(query)
    distances, indices = vector_db.search(query_embeddings, k=3)

    print("\nQuery:", query[0])
    print("\nTop 3 similar documents:")
    for i, idx in enumerate(indices[0]):
        print(f"{i + 1}. {documents[idx]} (Distance: {distances[0][i]:.4f})")

Explanation of the Code

  1. Initialization:

    • The VectorDatabase class initializes a FAISS index using IndexFlatL2, which computes the L2 (Euclidean) distance for similarity.
  2. Adding Embeddings:

    • The add_embeddings method accepts a NumPy array of embeddings and adds them to the index.
  3. Querying:

    • The search method takes query embeddings and returns the top-k similar embeddings from the index, along with their distances and indices.
  4. Persistence:

    • save_index and load_index handle saving and loading of the FAISS index to/from disk.
  5. Example Workflow:

    • Embeddings for sample documents are encoded and added to the FAISS index.
    • The index is saved and reloaded for demonstration.
    • A sample query is encoded, and the most similar documents are retrieved.

Output Example

For the query "What is machine learning?", the output might look like this:

Query: What is machine learning?

Top 3 similar documents:
1. Machine learning is fascinating. (Distance: 0.2374)
2. Deep learning is a subset of machine learning. (Distance: 0.3498)
3. Artificial Intelligence is transforming industries. (Distance: 0.5821)
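
Note that IndexFlatL2 ranks by raw L2 distance, so smaller values mean closer matches. If you would rather rank by cosine similarity, a common pattern is to L2-normalize the embeddings and use an inner-product index instead. A minimal sketch (not part of the class above; it reuses the Encoder from section 1.1):

import faiss

encoder = Encoder()
embeddings = encoder.encode([
    "Machine learning is fascinating.",
    "Deep learning is a subset of machine learning."
]).astype('float32')

faiss.normalize_L2(embeddings)                   # In-place row-wise L2 normalization
index = faiss.IndexFlatIP(embeddings.shape[1])   # Inner product equals cosine similarity on unit vectors
index.add(embeddings)

query = encoder.encode(["What is machine learning?"]).astype('float32')
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)           # Higher score means more similar
print(scores, ids)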

1.3. Retrieval

Now, let’s integrate the retrieval process, where we use the FAISS index to fetch documents most relevant to a query. This module will bridge the Encoder and the Generator components in a RAG system.


Retrieval Integration Plan

  1. Retrieve Documents:

    • Use the FAISS index to find the most relevant documents for a given query.
    • Return both the text of the documents and the similarity scores.
  2. Document Storage:

    • Use a simple mechanism to store and retrieve documents (e.g., JSON or a database).
  3. Combine Retrieval and Encoding:

    • Encode the query using the Encoder module.
    • Use the FAISS index to retrieve relevant documents based on embeddings.
  4. Output:

    • Return a structured response containing relevant documents and their similarity scores.

Code: Retrieval Integration

import os
import json

class Retriever:
    """
    Retriever class for fetching the most relevant documents using FAISS.
    """

    def __init__(self, encoder: Encoder, vector_db: VectorDatabase, document_store_path: str):
        """
        Initialize the Retriever with an Encoder, VectorDatabase, and Document Store.

        Parameters:
        - encoder (Encoder): An instance of the Encoder class.
        - vector_db (VectorDatabase): An instance of the VectorDatabase class.
        - document_store_path (str): Path to the stored documents (JSON file).
        """
        self.encoder = encoder
        self.vector_db = vector_db
        self.document_store_path = document_store_path
        self.documents = self.load_documents()

    def load_documents(self):
        """
        Load documents from the document store.

        Returns:
        - list: A list of documents loaded from the JSON file.
        """
        if not os.path.exists(self.document_store_path):
            raise FileNotFoundError(f"Document store file not found: {self.document_store_path}")
        
        with open(self.document_store_path, 'r', encoding='utf-8') as f:
            documents = json.load(f)
        print(f"Loaded {len(documents)} documents from: {self.document_store_path}")
        return documents

    def retrieve(self, query: str, k: int = 5):
        """
        Retrieve the top-k relevant documents for a query.

        Parameters:
        - query (str): The query string.
        - k (int): Number of top documents to retrieve.

        Returns:
        - list of dict: A list of dictionaries containing 'document' and 'score'.
        """
        # Encode the query
        query_embedding = self.encoder.encode([query])

        # Search the vector database
        distances, indices = self.vector_db.search(query_embedding, k)

        # Collect the results
        results = []
        for i, idx in enumerate(indices[0]):
            document = self.documents[idx]
            score = distances[0][i]
            results.append({"document": document, "score": score})

        return results


# Example usage
if __name__ == "__main__":
    # Initialize encoder and FAISS vector database
    encoder = Encoder()
    vector_db = VectorDatabase(embedding_dim=384)  # Must match the encoder's embedding dimension
    vector_db.load_index("data/faiss_index.bin")   # Load the index built in the previous step

    # Initialize retriever with a document store
    retriever = Retriever(encoder, vector_db, "data/texts.json")

    # Perform retrieval for a query
    query = "Explain machine learning concepts."
    top_documents = retriever.retrieve(query, k=3)

    print("\nQuery:", query)
    print("\nTop 3 retrieved documents:")
    for i, result in enumerate(top_documents):
        print(f"{i + 1}. Document: {result['document']} (Score: {result['score']:.4f})")

Explanation of the Code

  1. Retriever Initialization:

    • Combines the Encoder and VectorDatabase modules.
    • Loads documents from a JSON file (document_store_path) to map retrieved indices to the actual document texts.
  2. Document Loading:

    • Loads a pre-saved document list in JSON format.
    • This document list corresponds to the embeddings stored in the FAISS index.
  3. Retrieval:

    • Encodes the query using the Encoder.
    • Performs a FAISS search to find the top-k relevant embeddings.
    • Maps the retrieved indices to the actual document texts.
  4. Structured Output:

    • Returns a list of dictionaries, each containing a document and its similarity score.

Example Workflow

  1. Store Documents: Ensure documents are saved in data/texts.json:

    [
        "Machine learning is fascinating.",
        "Natural Language Processing is a key area of AI.",
        "RAG combines retrieval and generation for better results.",
        "Deep learning is a subset of machine learning.",
        "Artificial Intelligence is transforming industries."
    ]
  2. Perform Retrieval: For the query "Explain machine learning concepts.", the output might look like this:

    Query: Explain machine learning concepts.
    
    Top 3 retrieved documents:
    1. Document: Machine learning is fascinating. (Score: 0.2374)
    2. Document: Deep learning is a subset of machine learning. (Score: 0.3498)
    3. Document: Artificial Intelligence is transforming industries. (Score: 0.5821)

1.4. Generator

Now, we will integrate the Generator module, the final component of the Retrieval-Augmented Generation (RAG) system. The Generator takes the retrieved documents and the query as input to generate a context-aware, relevant response.

Key Steps for the Generator Module

  1. Language Model:

    • Use a pre-trained model like OpenAI’s GPT, HuggingFace’s Transformers (e.g., gpt-neo, t5), or similar.
    • Fine-tuning is possible but not necessary for a basic RAG system.
  2. Input Preparation:

    • Combine the query and retrieved documents into a format suitable for the model (e.g., concatenated text).
  3. Response Generation:

    • Use the language model to generate a response based on the input context.
  4. Integration:

    • Combine the Generator with the Retriever to form the end-to-end pipeline.

Code: Generator Module

from transformers import AutoModelForCausalLM, AutoTokenizer

class Generator:
    """
    Generator class for producing context-aware responses using a language model.
    """

    def __init__(self, model_name: str = 'gpt2'):
        """
        Initialize the Generator with a pre-trained language model.

        Parameters:
        - model_name (str): The name of the pre-trained model to load.
        """
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        print(f"Generator initialized with model: {self.model_name}")

    def generate_response(self, query: str, retrieved_docs: list, max_length: int = 150) -> str:
        """
        Generate a response using the query and retrieved documents as context.

        Parameters:
        - query (str): The input query.
        - retrieved_docs (list): List of retrieved documents (strings).
        - max_length (int): Maximum number of new tokens to generate for the response.

        Returns:
        - str: The generated response.
        """
        # Prepare the input text by combining query and context
        context = "\n".join(retrieved_docs)
        input_text = f"Context:\n{context}\n\nQuery: {query}\n\nAnswer:"

        # Tokenize input
        inputs = self.tokenizer.encode(input_text, return_tensors='pt', truncation=True)

        # Generate response (max_new_tokens caps only the generated text, not the prompt)
        outputs = self.model.generate(
            inputs,
            max_new_tokens=max_length,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            pad_token_id=self.tokenizer.eos_token_id
        )
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response


# Example usage
if __name__ == "__main__":
    # Initialize retriever and generator
    encoder = Encoder()
    vector_db = VectorDatabase(embedding_dim=384)
    vector_db.load_index("data/faiss_index.bin")  # Load the index built earlier
    retriever = Retriever(encoder, vector_db, "data/texts.json")
    generator = Generator(model_name='gpt2')

    # Perform retrieval
    query = "Explain the importance of machine learning."
    top_documents = retriever.retrieve(query, k=3)

    # Extract document texts
    retrieved_docs = [result["document"] for result in top_documents]

    # Generate a response
    response = generator.generate_response(query, retrieved_docs)
    print("\nGenerated Response:")
    print(response)

Explanation of the Code

  1. Initialization:

    • The Generator class uses a pre-trained language model from HuggingFace’s Transformers library (default: gpt2).
    • The tokenizer and model are loaded during initialization.
  2. Input Preparation:

    • Retrieved documents are concatenated with the query to form a context for the generator.
    • The format is designed to guide the model in generating a coherent response.
  3. Response Generation:

    • The generate_response method tokenizes the input, feeds it into the model, and decodes the output.
    • Parameters like max_new_tokens and no_repeat_ngram_size bound the response length and reduce repetition.
  4. Pipeline Integration:

    • Combined with the Retriever, the Generator produces responses using real-time retrieval as context.

Example Workflow

Query

"Explain the importance of machine learning."

Retrieved Documents

1. Machine learning is fascinating.
2. Deep learning is a subset of machine learning.
3. Artificial Intelligence is transforming industries.

Generated Response

Machine learning is a critical area of artificial intelligence that allows systems to learn and adapt from data without being explicitly programmed. It underpins advancements in deep learning and drives transformative applications across industries.

End-to-End Pipeline Assembly

To create the full RAG pipeline, you can integrate the Encoder, Retriever, and Generator into a single class or script. Here’s a basic structure:

class RAGPipeline:
    """
    End-to-End Retrieval-Augmented Generation (RAG) pipeline.
    """

    def __init__(self, encoder: Encoder, vector_db: VectorDatabase, retriever: Retriever, generator: Generator):
        """
        Initialize the RAG pipeline with all components.
        """
        self.encoder = encoder
        self.vector_db = vector_db
        self.retriever = retriever
        self.generator = generator

    def answer_query(self, query: str, top_k: int = 3, max_length: int = 150):
        """
        Generate an answer to a query using the RAG pipeline.

        Parameters:
        - query (str): The input query.
        - top_k (int): Number of top documents to retrieve.
        - max_length (int): Maximum length of the generated response.

        Returns:
        - str: The generated response.
        """
        # Retrieve relevant documents
        top_documents = self.retriever.retrieve(query, k=top_k)
        retrieved_docs = [result["document"] for result in top_documents]

        # Generate response
        response = self.generator.generate_response(query, retrieved_docs, max_length=max_length)
        return response


# Example usage
if __name__ == "__main__":
    # Initialize components
    encoder = Encoder()
    vector_db = VectorDatabase(embedding_dim=384)
    vector_db.load_index("data/faiss_index.bin")  # Load the index built earlier
    retriever = Retriever(encoder, vector_db, "data/texts.json")
    generator = Generator(model_name='gpt2')

    # Create RAG pipeline
    rag_pipeline = RAGPipeline(encoder, vector_db, retriever, generator)

    # Answer a query
    query = "What is the impact of AI on industries?"
    response = rag_pipeline.answer_query(query)
    print("\nFinal Response:")
    print(response)

1.5. Deployment

Let’s proceed with deploying the RAG pipeline. The goal is to make the pipeline accessible via a web API, where users can submit queries and receive responses in real time.

Deployment Plan

  1. Framework Selection:

    • Use a lightweight, efficient web framework like FastAPI for creating and serving the API.
  2. API Design:

    • Define an endpoint to accept user queries and return generated responses.
    • Optionally add endpoints for health checks or additional metadata.
  3. Integration:

    • Load and initialize the Encoder, Retriever, Vector Database, and Generator within the API.
  4. Run Server:

    • Serve the API locally or deploy it to a platform like AWS, Google Cloud, or Heroku.

Code: RAG API with FastAPI

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Initialize FastAPI app
app = FastAPI()

# Define input and output schemas
class QueryRequest(BaseModel):
    query: str
    top_k: int = 3  # Number of documents to retrieve

class QueryResponse(BaseModel):
    query: str
    response: str
    retrieved_documents: list

# Initialize RAG components
encoder = Encoder()
vector_db = VectorDatabase(embedding_dim=384)
vector_db.load_index("data/faiss_index.bin")  # Load the index built earlier
retriever = Retriever(encoder, vector_db, "data/texts.json")
generator = Generator(model_name="gpt2")
rag_pipeline = RAGPipeline(encoder, vector_db, retriever, generator)

@app.get("/")
def health_check():
    """
    Health check endpoint to verify the API is running.
    """
    return {"status": "OK", "message": "RAG API is running."}

@app.post("/query", response_model=QueryResponse)
def answer_query(request: QueryRequest):
    """
    Endpoint to handle user queries and return RAG responses.

    Parameters:
    - query (str): The input query.
    - top_k (int): Number of top documents to retrieve.

    Returns:
    - QueryResponse: Contains the query, generated response, and retrieved documents.
    """
    try:
        # Retrieve top documents
        top_documents = rag_pipeline.retriever.retrieve(request.query, k=request.top_k)
        retrieved_docs = [result["document"] for result in top_documents]

        # Generate response
        response = rag_pipeline.generator.generate_response(request.query, retrieved_docs)

        return QueryResponse(
            query=request.query,
            response=response,
            retrieved_documents=retrieved_docs,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


# Run the server with: uvicorn rag_api:app --reload

Explanation of the Code

  1. FastAPI Initialization:

    • The FastAPI app is initialized, and routes are defined to handle requests.
  2. Endpoint Definitions:

    • Health Check (/):
      • Confirms the API is up and running.
    • Query Endpoint (/query):
      • Accepts a query and retrieves top documents.
      • Generates a response using the RAG pipeline.
      • Returns the query, generated response, and retrieved documents.
  3. Input and Output Validation:

    • Pydantic models (QueryRequest and QueryResponse) ensure structured input and output, enabling easier debugging and testing.
  4. Pipeline Integration:

    • The RAG pipeline components (Encoder, Retriever, VectorDatabase, and Generator) are initialized and used within the API.
  5. Error Handling:

    • Exceptions are caught and returned with a 500 status code, making the API more robust.

Running the Server

  1. Save the script as rag_api.py.

  2. Run the server locally using Uvicorn:

    uvicorn rag_api:app --reload
  3. Test the API from the interactive docs at http://127.0.0.1:8000/docs, or by sending a request programmatically as sketched below:
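
A minimal client-side check, assuming the requests package is installed (it is not one of the API's own dependencies) and the server is running on the default port 8000:

import requests

payload = {"query": "What is machine learning?", "top_k": 3}
resp = requests.post("http://127.0.0.1:8000/query", json=payload)
print(resp.status_code)
print(resp.json())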


Example Request and Response

Request:

{
    "query": "What is machine learning?",
    "top_k": 3
}

Response:

{
    "query": "What is machine learning?",
    "response": "Machine learning is a method of data analysis that automates analytical model building...",
    "retrieved_documents": [
        "Machine learning is fascinating.",
        "Deep learning is a subset of machine learning.",
        "Artificial Intelligence is transforming industries."
    ]
}

Deployment Options

  1. Local Deployment:

    • Test the API locally using Uvicorn.
  2. Cloud Deployment:

    • Deploy to platforms like AWS, Google Cloud, Azure, or Heroku.
    • Use Docker for containerization if needed (a sample Dockerfile is sketched after this list).
  3. Scaling:

    • Integrate a load balancer for high traffic.
    • Use GPU instances for faster inference in production.
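
If you containerize the service, a minimal Dockerfile might look like the sketch below. It assumes the API code from this section is saved as rag_api.py and that the supporting files (data/texts.json, data/faiss_index.bin) sit next to it; adjust the base image and package versions to your setup:

FROM python:3.10-slim

WORKDIR /app

# Install the libraries the RAG API imports (unpinned here; pin versions for production)
RUN pip install --no-cache-dir fastapi uvicorn sentence-transformers transformers faiss-cpu

# Copy the application code and supporting files (rag_api.py, data/texts.json, data/faiss_index.bin)
COPY . .

EXPOSE 8000
CMD ["uvicorn", "rag_api:app", "--host", "0.0.0.0", "--port", "8000"]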

2. Information Retriever System

2.1. Project Setup

Here’s the code for Part 1: Project Setup and Requirements, with in-depth documentation and explanations.


Code for Project Setup and Requirements

"""
Information Retrieval System - Project Setup
--------------------------------------------
This script initializes the project environment and installs necessary dependencies.
It also sets up the basic folder structure for the project.
"""

import os
import subprocess
import sys

# Function to install required packages
def install_packages(packages):
    """
    Installs the given Python packages using pip.

    Args:
    packages (list): A list of package names to install.

    Returns:
    None
    """
    for package in packages:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Step 1: Install required Python packages
REQUIRED_PACKAGES = [
    "numpy",                  # For numerical computations
    "pandas",                 # For data manipulation
    "nltk",                   # For text preprocessing
    "spacy",                  # For advanced NLP processing
    "transformers",           # For working with pre-trained models (Hugging Face)
    "sentence-transformers",  # For generating document embeddings
    "faiss-cpu",              # For efficient vector search
    "elasticsearch",          # For full-text search and indexing
]

print("Installing required packages...")
install_packages(REQUIRED_PACKAGES)
print("All packages installed successfully!")

# Step 2: Set up project directory structure
PROJECT_DIRS = [
    "data",          # For storing raw datasets
    "preprocessed",  # For storing preprocessed data
    "models",        # For storing trained or pre-trained models
    "notebooks",     # For experimentation and EDA
    "scripts",       # For project-related scripts
    "outputs",       # For storing results and retrieved documents
    "configs",       # For configuration files
    "logs",          # For storing logs
]

print("\nSetting up project directories...")
for directory in PROJECT_DIRS:
    os.makedirs(directory, exist_ok=True)
    print(f"Created directory: {directory}")
print("Project directories created successfully!")

# Step 3: Download NLP models and datasets
def download_nlp_resources():
    """
    Downloads necessary NLP models and resources for preprocessing.

    Returns:
    None
    """
    import nltk
    import spacy

    print("\nDownloading NLTK resources...")
    nltk.download('punkt')       # Tokenizer
    nltk.download('stopwords')   # Stopword lists
    nltk.download('wordnet')     # WordNet for lemmatization

    print("Downloading SpaCy language model...")
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])  # English model
    print("All NLP resources downloaded successfully!")

download_nlp_resources()

# Final message
print("\nProject setup completed successfully! You are ready to proceed.")

Explanation

Step 1: Install Required Packages

  • The script installs essential Python libraries using pip.
    • NumPy and Pandas for handling data.
    • NLTK and SpaCy for text preprocessing.
    • Transformers for integrating pre-trained LLMs.
    • FAISS for efficient similarity search.
    • ElasticSearch for full-text indexing and search.

Step 2: Create Project Directories

  • Creates a standard project directory structure:
    • data: Store raw datasets.
    • preprocessed: Store cleaned/preprocessed datasets.
    • models: Save model files.
    • notebooks: Use Jupyter notebooks for exploratory data analysis (EDA).
    • scripts: Store Python scripts for modular development.
    • outputs: Store retrieval results and logs.
    • configs: Store configuration files like model parameters.
    • logs: Maintain logs for debugging.

Step 3: Download NLP Resources

  • Downloads essential NLP resources for preprocessing:
    • NLTK: punkt tokenizer, stopwords, and wordnet.
    • SpaCy: Pre-trained English model for advanced text processing.

How to Run the Script

  1. Save this script as setup.py.
  2. Run it in your terminal or Python environment:
    python setup.py
  3. Verify that:
    • Required packages are installed.
    • Project directories are created.
    • NLP models and datasets are downloaded.

2.2. Dataset Preparation

Here’s the code for Part 2: Dataset Preparation, including detailed documentation and explanations.


Code for Dataset Preparation

"""
Information Retrieval System - Dataset Preparation
---------------------------------------------------
This script loads, cleans, and preprocesses text data for use in the Information Retrieval System.
It includes:
1. Loading raw data
2. Text cleaning (lowercasing, removing punctuation, etc.)
3. Tokenization and stopword removal
4. Lemmatization
"""

import os
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import json

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Constants
RAW_DATA_PATH = "data/raw_dataset.json"                 # Path to the raw dataset
PROCESSED_DATA_PATH = "preprocessed/cleaned_data.csv"   # Output path for cleaned data

# Step 1: Load raw dataset
def load_dataset(file_path):
    """
    Loads a dataset from a JSON file.

    Args:
    file_path (str): Path to the dataset file.

    Returns:
    list: A list of documents (text).
    """
    print("Loading dataset...")
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    print(f"Loaded {len(data)} documents.")
    return data

# Step 2: Clean text data
def clean_text(text):
    """
    Cleans the input text by removing special characters, URLs, and converting to lowercase.

    Args:
    text (str): The raw text.

    Returns:
    str: The cleaned text.
    """
    text = re.sub(r"http\S+", "", text)   Remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)   Remove non-alphabetic characters
    text = text.lower()   Convert to lowercase
    return text

# Step 3: Preprocess text (tokenization, stopword removal, lemmatization)
def preprocess_text(text):
    """
    Preprocesses the input text:
    1. Tokenizes the text
    2. Removes stopwords
    3. Lemmatizes tokens

    Args:
    text (str): The cleaned text.

    Returns:
    str: Preprocessed text as a single string.
    """
    tokens = word_tokenize(text)  # Tokenization
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]  # Lemmatization
    return " ".join(lemmatized_tokens)

# Step 4: Process dataset and save
def process_and_save(data, output_path):
    """
    Cleans and preprocesses a list of documents and saves them to a CSV file.

    Args:
    data (list): A list of documents.
    output_path (str): Path to save the processed data.

    Returns:
    None
    """
    processed_data = []
    for i, document in enumerate(data):
        print(f"Processing document {i+1}/{len(data)}...")
        cleaned_text = clean_text(document['text'])
        preprocessed_text = preprocess_text(cleaned_text)
        processed_data.append({"id": document['id'], "processed_text": preprocessed_text})
    
    df = pd.DataFrame(processed_data)
    df.to_csv(output_path, index=False)
    print(f"Processed data saved to {output_path}.")

# Main execution
if __name__ == "__main__":
    # Step 1: Load dataset
    dataset = load_dataset(RAW_DATA_PATH)

    # Step 2: Process and save dataset
    process_and_save(dataset, PROCESSED_DATA_PATH)

    print("Dataset preparation completed successfully!")

Explanation

Step 1: Load Raw Dataset

  • Assumes the dataset is in a JSON file where each document is a dictionary with keys like id and text.
  • Loads the data into memory.

Step 2: Text Cleaning

  • Removes unnecessary components from text:
    • URLs: Using regular expressions (re.sub).
    • Non-alphabetic characters: To keep only meaningful words.
    • Converts text to lowercase for uniformity.

Step 3: Text Preprocessing

  • Tokenization: Splits text into individual words using NLTK’s word_tokenize.
  • Stopword Removal: Removes common, non-informative words (e.g., “and”, “the”).
  • Lemmatization: Converts words to their base form (e.g., “running” → “run”) using WordNet Lemmatizer.

Step 4: Process Dataset and Save

  • Iterates through all documents in the dataset.
  • Applies the cleaning and preprocessing pipeline.
  • Saves the results into a CSV file for further use.

How to Run the Script

  1. Place your raw dataset in the data folder with the name raw_dataset.json. Format example:
    [
        {"id": "1", "text": "The quick brown fox jumps over the lazy dog."},
        {"id": "2", "text": "Information retrieval is a fascinating field of study."}
    ]
  2. Save the script as prepare_dataset.py.
  3. Run the script:
    python prepare_dataset.py
  4. Verify the output in the preprocessed/cleaned_data.csv file.

2.3. Indexing System

Here’s the code for Part 3: Indexing System, with detailed documentation and explanations.


Code for Indexing System

"""
Information Retrieval System - Indexing System
-----------------------------------------------
This script creates an indexing mechanism to store document embeddings
for efficient semantic search. It uses FAISS for similarity search and 
SentenceTransformers for generating embeddings.
"""

import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import os
import pickle

# Constants
PROCESSED_DATA_PATH = "preprocessed/cleaned_data.csv"     # Path to preprocessed data
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"                 # Hugging Face embedding model
INDEX_SAVE_PATH = "models/faiss_index.pkl"                # Path to save the FAISS index
EMBEDDINGS_SAVE_PATH = "models/document_embeddings.pkl"   # Path to save embeddings

# Step 1: Load preprocessed data
def load_preprocessed_data(file_path):
    """
    Loads preprocessed data from a CSV file.

    Args:
    file_path (str): Path to the preprocessed data.

    Returns:
    pd.DataFrame: DataFrame containing the preprocessed data.
    """
    print("Loading preprocessed data...")
    data = pd.read_csv(file_path)
    print(f"Loaded {len(data)} documents.")
    return data

# Step 2: Generate embeddings for documents
def generate_embeddings(documents, model_name):
    """
    Generates embeddings for the given documents using a pre-trained model.

    Args:
    documents (list): List of text documents.
    model_name (str): Name of the Hugging Face model for embedding generation.

    Returns:
    list: List of embeddings.
    """
    print(f"Loading embedding model: {model_name}...")
    model = SentenceTransformer(model_name)
    print("Generating embeddings...")
    embeddings = model.encode(documents, show_progress_bar=True)
    print("Embeddings generated successfully!")
    return embeddings

# Step 3: Create FAISS index
def create_faiss_index(embeddings):
    """
    Creates a FAISS index for the given embeddings.

    Args:
    embeddings (list): List of embeddings.

    Returns:
    faiss.IndexFlatL2: A FAISS index for similarity search.
    """
    print("Creating FAISS index...")
    dimension = embeddings.shape[1]        # Get the dimension of embeddings
    index = faiss.IndexFlatL2(dimension)   # L2 (Euclidean) distance index
    index.add(embeddings)                  # Add embeddings to the index
    print(f"FAISS index created with {index.ntotal} documents.")
    return index

# Step 4: Save index and embeddings
def save_index_and_embeddings(index, embeddings, index_path, embeddings_path):
    """
    Saves the FAISS index and embeddings to disk.

    Args:
    index (faiss.IndexFlatL2): The FAISS index.
    embeddings (list): The embeddings.
    index_path (str): Path to save the FAISS index.
    embeddings_path (str): Path to save the embeddings.

    Returns:
    None
    """
    print("Saving FAISS index and embeddings...")
    faiss.write_index(index, index_path)
    with open(embeddings_path, 'wb') as f:
        pickle.dump(embeddings, f)
    print("FAISS index and embeddings saved successfully!")

# Main execution
if __name__ == "__main__":
    # Step 1: Load preprocessed data
    data = load_preprocessed_data(PROCESSED_DATA_PATH)

    # Step 2: Generate embeddings
    documents = data['processed_text'].tolist()
    embeddings = generate_embeddings(documents, EMBEDDING_MODEL_NAME)

    # Convert embeddings to a NumPy array
    embeddings_array = np.array(embeddings)

    # Step 3: Create FAISS index
    faiss_index = create_faiss_index(embeddings_array)

    # Step 4: Save index and embeddings
    save_index_and_embeddings(faiss_index, embeddings_array, INDEX_SAVE_PATH, EMBEDDINGS_SAVE_PATH)
    
    print("Indexing system completed successfully!")

Explanation

Step 1: Load Preprocessed Data

  • Reads the cleaned dataset from a CSV file into a Pandas DataFrame.
  • The processed_text column contains the preprocessed documents.

Step 2: Generate Embeddings

  • Uses SentenceTransformers to convert documents into numerical embeddings.
  • Embeddings capture semantic meaning and are required for efficient search.
  • Model used: all-MiniLM-L6-v2 (a lightweight and efficient embedding model).

Step 3: Create FAISS Index

  • FAISS is a library for similarity search.
  • The script creates an L2 similarity index:
    • Measures similarity between query and document embeddings using Euclidean distance.
  • Adds embeddings to the FAISS index.

Step 4: Save Index and Embeddings

  • Saves the FAISS index to disk for future use using faiss.write_index.
  • Saves document embeddings separately as a backup using Python’s pickle.

How to Run the Script

  1. Ensure the preprocessed data is available at preprocessed/cleaned_data.csv.
  2. Save the script as indexing.py.
  3. Run the script:
    python indexing.py
  4. Verify:
    • The FAISS index is saved at models/faiss_index.pkl.
    • Document embeddings are saved at models/document_embeddings.pkl.

2.4. Query Processing

Here’s the code for Part 4: Query Processing, with detailed documentation and explanations.


Code for Query Processing

"""
Information Retrieval System - Query Processing
------------------------------------------------
This script handles user queries by:
1. Preprocessing the query to match the document pipeline.
2. Generating embeddings for the query.
3. Searching the FAISS index for the most relevant documents.
"""

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import pickle

# Constants
PROCESSED_DATA_PATH = "preprocessed/cleaned_data.csv"   # Path to preprocessed data
INDEX_PATH = "models/faiss_index.pkl"                   # Path to FAISS index
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"               # Embedding model
TOP_K = 5                                               # Number of top results to return

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Step 1: Load FAISS index and embeddings
def load_faiss_index(index_path):
    """
    Loads the FAISS index from disk.

    Args:
    index_path (str): Path to the FAISS index file.

    Returns:
    faiss.IndexFlatL2: The loaded FAISS index.
    """
    print("Loading FAISS index...")
    index = faiss.read_index(index_path)
    print("FAISS index loaded successfully!")
    return index

# Step 2: Load preprocessed data
def load_preprocessed_data(file_path):
    """
    Loads the preprocessed dataset for mapping document IDs to text.

    Args:
    file_path (str): Path to the preprocessed data.

    Returns:
    pd.DataFrame: DataFrame containing document mappings.
    """
    print("Loading preprocessed data...")
    data = pd.read_csv(file_path)
    return data

# Step 3: Preprocess the query
def preprocess_query(query):
    """
    Preprocesses the query:
    1. Tokenizes the text.
    2. Removes stopwords.
    3. Lemmatizes tokens.

    Args:
    query (str): The raw query.

    Returns:
    str: The preprocessed query.
    """
    print("Preprocessing query...")
    query = re.sub(r"[^a-zA-Z\s]", "", query).lower()  # Clean and lowercase
    tokens = word_tokenize(query)                      # Tokenize
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]  # Lemmatize
    return " ".join(lemmatized_tokens)

# Step 4: Generate query embedding
def generate_query_embedding(query, model_name):
    """
    Generates an embedding for the query using a pre-trained model.

    Args:
    query (str): The preprocessed query.
    model_name (str): Name of the embedding model.

    Returns:
    np.ndarray: The query embedding.
    """
    print(f"Loading embedding model: {model_name}...")
    model = SentenceTransformer(model_name)
    print("Generating query embedding...")
    embedding = model.encode([query])
    return np.array(embedding)

# Step 5: Search FAISS index
def search_index(faiss_index, query_embedding, top_k=5):
    """
    Searches the FAISS index for the most similar documents.

    Args:
    faiss_index (faiss.IndexFlatL2): The FAISS index.
    query_embedding (np.ndarray): The query embedding.
    top_k (int): Number of top results to return.

    Returns:
    list: List of document indices and distances.
    """
    print(f"Searching FAISS index for top {top_k} results...")
    distances, indices = faiss_index.search(query_embedding, top_k)
    return indices[0], distances[0]

# Step 6: Map results to documents
def map_results_to_documents(indices, data):
    """
    Maps FAISS indices to document text using the preprocessed dataset.

    Args:
    indices (list): List of FAISS indices.
    data (pd.DataFrame): Preprocessed dataset.

    Returns:
    list: List of matching documents.
    """
    print("Mapping results to documents...")
    results = []
    for idx in indices:
        document = data.iloc[idx]["processed_text"]
        results.append(document)
    return results

# Main execution
if __name__ == "__main__":
    # Step 1: Load FAISS index and preprocessed data
    faiss_index = load_faiss_index(INDEX_PATH)
    data = load_preprocessed_data(PROCESSED_DATA_PATH)

    # Step 2: Accept user query
    user_query = input("Enter your query: ")

    # Step 3: Preprocess the query
    preprocessed_query = preprocess_query(user_query)

    # Step 4: Generate query embedding
    query_embedding = generate_query_embedding(preprocessed_query, EMBEDDING_MODEL_NAME)

    # Step 5: Search the index
    indices, distances = search_index(faiss_index, query_embedding, TOP_K)

    # Step 6: Map results to documents
    results = map_results_to_documents(indices, data)

    # Display results
    print("\nTop Results:")
    for i, (doc, dist) in enumerate(zip(results, distances), 1):
        print(f"{i}. Document: {doc}\n   Distance: {dist:.4f}")

Explanation

Step 1: Load FAISS Index

  • Loads the FAISS index from disk using faiss.read_index.

Step 2: Load Preprocessed Data

  • Loads the CSV file containing the preprocessed text.
  • Used to map document indices from FAISS to actual text.

Step 3: Preprocess Query

  • Cleans the user query to ensure it matches the preprocessing applied to documents:
    • Removes special characters and stopwords.
    • Tokenizes and lemmatizes the text.

Step 4: Generate Query Embedding

  • Converts the preprocessed query into an embedding using the same model as the indexing step.

Step 5: Search FAISS Index

  • Searches the FAISS index for the top k most similar embeddings.
  • Returns indices and distances for the closest matches.

Step 6: Map Results to Documents

  • Maps the FAISS indices back to the document text for display.

How to Run the Script

  1. Save the script as query_processing.py.
  2. Ensure the FAISS index (faiss_index.pkl) and preprocessed data (cleaned_data.csv) are available.
  3. Run the script:
    python query_processing.py
  4. Enter a query when prompted, e.g., "What is information retrieval?".

2.5. Test the full pipeline

To test the full pipeline of the Information Retrieval System, here’s how you can proceed:


Test Plan for the Full Pipeline

  1. Prepare a Sample Dataset:

    • Use a small, manageable dataset to validate the pipeline.
    • Example dataset format:
      [
          {"id": "1", "text": "Information retrieval is about finding information from large datasets."},
          {"id": "2", "text": "Semantic search improves search accuracy by understanding the query's meaning."},
          {"id": "3", "text": "Machine learning is a core technology behind modern search engines."},
          {"id": "4", "text": "Natural Language Processing enables computers to understand human language."},
          {"id": "5", "text": "Deep learning techniques are widely used in image and text analysis."}
      ]
  2. Load the Dataset:

    • Place the dataset in data/raw_dataset.json.
  3. Run Each Script Sequentially:

    • Step 1: Run prepare_dataset.py to clean and preprocess the data.
    • Step 2: Run indexing.py to build the FAISS index and generate embeddings.
    • Step 3: Run query_processing.py to process a query and retrieve results.
  4. Enter Test Queries:

    • Example queries:
      • "What is semantic search?"
      • "Explain machine learning in search engines."
      • "How does deep learning relate to text analysis?"

Step-by-Step Testing

1. Prepare the Sample Dataset Save the following JSON content as data/raw_dataset.json:

[
    {"id": "1", "text": "Information retrieval is about finding information from large datasets."},
    {"id": "2", "text": "Semantic search improves search accuracy by understanding the query's meaning."},
    {"id": "3", "text": "Machine learning is a core technology behind modern search engines."},
    {"id": "4", "text": "Natural Language Processing enables computers to understand human language."},
    {"id": "5", "text": "Deep learning techniques are widely used in image and text analysis."}
]

2. Run Dataset Preparation Run the following command to clean and preprocess the dataset:

python prepare_dataset.py
  • Expected Output: preprocessed/cleaned_data.csv should contain cleaned and tokenized versions of the dataset.

3. Run Indexing Run the following command to create the FAISS index:

python indexing.py
  • Expected Output:
    • FAISS index saved as models/faiss_index.pkl.
    • Document embeddings saved as models/document_embeddings.pkl.

4. Test Query Processing Run the query processing script:

python query_processing.py
  • Enter example queries, such as:

    • "What is semantic search?"
    • "Explain machine learning in search engines."
  • Expected Output: The system should return the most relevant documents, sorted by similarity.

Example:

Enter your query: What is semantic search?

Top Results:
1. Document: Semantic search improves search accuracy by understanding the query's meaning.
   Distance: 0.3542

2. Document: Machine learning is a core technology behind modern search engines.
   Distance: 0.5628

Debugging Tips

  • If no results are returned: Ensure that embeddings are generated and saved correctly.
  • If results are inaccurate: Check that the same embedding model was used for indexing and querying.
  • If FAISS index fails to load: Verify the file paths (models/faiss_index.pkl).
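
For deeper debugging, a quick inspection of the saved index can confirm that it was written and that its dimensionality matches the embedding model. A small sketch, using the paths from the scripts above:

import faiss

index = faiss.read_index("models/faiss_index.pkl")
print("Documents in index:", index.ntotal)  # Should equal the number of indexed documents
print("Embedding dimension:", index.d)      # 384 for all-MiniLM-L6-v2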

2.6. Evaluation and testing

Here’s how to evaluate and test the Information Retrieval System using performance metrics like Precision, Recall, and nDCG (Normalized Discounted Cumulative Gain).


Evaluation Plan

  1. Define Relevance Judgments:

    • Create a set of sample queries and manually mark relevant documents from the dataset.
  2. Metrics for Evaluation:

    • Precision: Proportion of retrieved documents that are relevant. \[ \text{Precision} = \frac{\text{Relevant Documents Retrieved}}{\text{Total Retrieved Documents}} \]
    • Recall: Proportion of relevant documents retrieved out of all relevant documents in the dataset. \[ \text{Recall} = \frac{\text{Relevant Documents Retrieved}}{\text{Total Relevant Documents}} \]
    • nDCG: Measures ranking quality by rewarding higher ranks for relevant documents. \[ \text{DCG} = \sum_{i=1}^{n} \frac{\text{relevance}_i}{\log_2(i + 1)} \] \[ \text{nDCG} = \frac{\text{DCG}}{\text{IDCG}} \]
  3. Test Framework:

    • Use a list of predefined queries with relevance judgments.
    • Measure the system’s performance for each query and compute the average metrics.
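
As a quick worked example: suppose a query has two relevant documents in the dataset and the system retrieves five results, with one relevant document at rank 2 and the other not retrieved. Then Precision = 1/5 = 0.2, Recall = 1/2 = 0.5, DCG = 1/log₂(3) ≈ 0.63, IDCG = 1/log₂(2) + 1/log₂(3) ≈ 1.63, and nDCG ≈ 0.63/1.63 ≈ 0.39.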

Code for Evaluation and Testing

"""
Information Retrieval System - Evaluation and Testing
------------------------------------------------------
This script evaluates the retrieval system using Precision, Recall, and nDCG metrics.
"""

import numpy as np
from sentence_transformers import SentenceTransformer

# Reuse helpers and paths from the query processing script (query_processing.py)
from query_processing import (
    load_faiss_index,
    load_preprocessed_data,
    preprocess_query,
    generate_query_embedding,
    search_index,
    INDEX_PATH,
    PROCESSED_DATA_PATH,
    EMBEDDING_MODEL_NAME,
)

# Constants
TOP_K = 5  # Number of top results to evaluate

# Step 1: Define test queries and relevance judgments
# Relevance judgments are zero-based row positions in the indexed dataset
# (i.e., the values FAISS returns), not the document "id" fields.
queries = [
    {"query": "What is semantic search?", "relevant_docs": [1]},
    {"query": "Explain machine learning in search engines.", "relevant_docs": [2]},
    {"query": "How does deep learning relate to text analysis?", "relevant_docs": [4]},
]

# Step 2: Compute Precision
def compute_precision(retrieved_indices, relevant_docs):
    """
    Computes Precision for a single query.

    Args:
    retrieved_indices (list): Indices of documents retrieved by the system.
    relevant_docs (list): Indices of relevant documents.

    Returns:
    float: Precision score.
    """
    relevant_retrieved = len(set(retrieved_indices).intersection(set(relevant_docs)))
    return relevant_retrieved / len(retrieved_indices)

# Step 3: Compute Recall
def compute_recall(retrieved_indices, relevant_docs):
    """
    Computes Recall for a single query.

    Args:
    retrieved_indices (list): Indices of documents retrieved by the system.
    relevant_docs (list): Indices of relevant documents.

    Returns:
    float: Recall score.
    """
    relevant_retrieved = len(set(retrieved_indices).intersection(set(relevant_docs)))
    return relevant_retrieved / len(relevant_docs)

# Step 4: Compute nDCG
def compute_ndcg(retrieved_indices, relevant_docs):
    """
    Computes nDCG for a single query.

    Args:
    retrieved_indices (list): Indices of documents retrieved by the system.
    relevant_docs (list): Indices of relevant documents.

    Returns:
    float: nDCG score.
    """
    dcg = 0
    for i, idx in enumerate(retrieved_indices):
        if idx in relevant_docs:
            dcg += 1 / np.log2(i + 2)  # i + 2 because enumerate starts at 0 and ranks start at 1

    # Compute ideal DCG
    idcg = sum([1 / np.log2(i + 2) for i in range(min(len(relevant_docs), TOP_K))])
    return dcg / idcg if idcg > 0 else 0

# Step 5: Evaluate the system
def evaluate_system(faiss_index, embedding_model, data):
    """
    Evaluates the system using predefined queries and metrics.

    Args:
    faiss_index (faiss.IndexFlatL2): The FAISS index.
    embedding_model (SentenceTransformer): The embedding model.
    data (pd.DataFrame): Preprocessed data.

    Returns:
    None
    """
    precisions, recalls, ndcgs = [], [], []

    for test_case in queries:
        query = test_case["query"]
        relevant_docs = test_case["relevant_docs"]

        # Preprocess and embed the query
        preprocessed_query = preprocess_query(query)
        query_embedding = generate_query_embedding(preprocessed_query, EMBEDDING_MODEL_NAME)

        # Retrieve results
        indices, _ = search_index(faiss_index, query_embedding, TOP_K)
        retrieved_indices = indices.tolist()

        # Compute metrics
        precision = compute_precision(retrieved_indices, relevant_docs)
        recall = compute_recall(retrieved_indices, relevant_docs)
        ndcg = compute_ndcg(retrieved_indices, relevant_docs)

        precisions.append(precision)
        recalls.append(recall)
        ndcgs.append(ndcg)

        # Print results for each query
        print(f"\nQuery: {query}")
        print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, nDCG: {ndcg:.4f}")

    # Compute average metrics
    print("\nOverall Evaluation:")
    print(f"Average Precision: {np.mean(precisions):.4f}")
    print(f"Average Recall: {np.mean(recalls):.4f}")
    print(f"Average nDCG: {np.mean(ndcgs):.4f}")

# Main execution
if __name__ == "__main__":
    # Load FAISS index and preprocessed data
    faiss_index = load_faiss_index(INDEX_PATH)
    data = load_preprocessed_data(PROCESSED_DATA_PATH)

    # Load the embedding model
    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)

    # Evaluate the system
    evaluate_system(faiss_index, embedding_model, data)

Explanation

Step 1: Define Test Queries

  • List of queries with relevance judgments (indices of relevant documents).

Step 2: Compute Metrics

  • Precision: Proportion of relevant documents among the retrieved results.
  • Recall: Proportion of relevant documents retrieved out of all relevant ones.
  • nDCG: Measures ranking quality, rewarding relevant documents at higher ranks.

Step 3: Evaluate the System

  • For each query:
    • Preprocess the query and embed it.
    • Search the FAISS index.
    • Compute metrics.
  • Print metrics for individual queries and their averages.

How to Run the Script

  1. Save this as evaluation.py.
  2. Ensure faiss_index.pkl and cleaned_data.csv are available.
  3. Run the script:
    python evaluation.py
