varsh2001/IR_Project_final

📌 Important Note on Execution Environment

This project was designed and executed in Google Colab.

Due to the high computational requirements of Neural Vector Embedding (Sentence-Transformers) and the I/O intensity of Web Crawling, this system utilizes Google Colab's cloud resources.

  • Why Colab? The system builds a dense vector index (FAISS) and issues thousands of concurrent network requests. Running this on a standard laptop may result in memory overflows or network throttling.
  • Data Storage: The system automatically mounts Google Drive to persist the crawled corpus and index artifacts, preventing data loss during runtime disconnects (a setup sketch follows this list).
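
A minimal sketch of this Colab setup step, assuming the artifacts live in a Drive folder named IR_Project (the folder name is illustrative; the notebook's own CONFIG block defines the real path):

    # Colab-only setup: mount Google Drive so the crawl and index survive disconnects.
    from google.colab import drive

    drive.mount('/content/drive')

    # Illustrative base path; the notebook's CONFIG block defines the actual one.
    BASE_PATH = '/content/drive/MyDrive/IR_Project'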

🚀 Project Architecture

This is not a basic keyword matching engine. It is a Hybrid Neural Search Engine that combines symbolic processing with dense vector retrieval.

1. Ingestion Layer (Scrapy)

  • Strategy: Breadth-First Search (BFS) to prioritize high-level topic pages.
  • Politeness: Implements AutoThrottle and respects robots.txt.
  • Integrity: Uses URL Canonicalization to prevent duplicate indexing (a spider sketch follows this list).
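
An illustrative spider skeleton matching the behaviour above (the spider name, seed URL, and page limit are placeholders, not the project's exact code):

    # Illustrative Scrapy spider: BFS ordering, politeness settings, URL canonicalization.
    import scrapy
    from w3lib.url import canonicalize_url

    class WikiSpider(scrapy.Spider):
        name = "wiki"
        start_urls = ["https://en.wikipedia.org/wiki/Information_retrieval"]  # placeholder seed

        custom_settings = {
            "ROBOTSTXT_OBEY": True,        # politeness: honour robots.txt
            "AUTOTHROTTLE_ENABLED": True,  # politeness: adapt the request rate
            "DEPTH_PRIORITY": 1,           # with the FIFO queues below, gives BFS ordering
            "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
            "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
            "CLOSESPIDER_PAGECOUNT": 500,  # placeholder crawl-size limit
        }

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.seen = set()              # canonical URLs already scheduled

        def parse(self, response):
            yield {"url": response.url, "html": response.text}
            for href in response.css("a::attr(href)").getall():
                url = canonicalize_url(response.urljoin(href))
                if url not in self.seen:   # integrity: skip duplicate URLs
                    self.seen.add(url)
                    yield response.follow(url, callback=self.parse)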

2. Indexing Layer (Dual-Mode)

  • Positional Index (JSON): Stores exact term positions for phrase verification (Academic Requirement).
  • Neural Vector Index (FAISS): Encodes documents into 384-dimensional vectors using all-MiniLM-L6-v2 for semantic similarity search (Advanced Implementation; see the sketch after this list).
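
A sketch of both index builds, assuming the libraries named above (document texts and file names are illustrative):

    # Dual-mode index build: dense FAISS vectors plus a positional JSON index.
    import json
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = ["Information retrieval ranks documents by relevance.",
            "A dog is a domesticated canine."]

    # Neural vector index: 384-dimensional MiniLM embeddings in a flat inner-product index.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(docs, normalize_embeddings=True)   # unit vectors -> cosine via dot product
    index = faiss.IndexFlatIP(vectors.shape[1])               # 384 dimensions
    index.add(np.asarray(vectors, dtype="float32"))

    # Positional index: term -> {doc_id: [positions]} for exact phrase verification.
    positional = {}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            positional.setdefault(term, {}).setdefault(doc_id, []).append(pos)

    with open("index.json", "w") as f:
        json.dump(positional, f)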

3. Processing Layer (Hybrid Pipeline)

  • Step 1: Correction: Automated spell-checking via pyspellchecker.
  • Step 2: Expansion: Symbolic query expansion using WordNet synonyms.
  • Step 3: Retrieval: k-Nearest Neighbor (k-NN) search using FAISS (a pipeline sketch follows this list).
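
A hedged sketch of the three steps, reusing the model and index objects from the indexing sketch above (function and variable names are illustrative):

    # Hybrid query pipeline: spell-correct, expand with WordNet, then dense k-NN retrieval.
    import numpy as np
    from spellchecker import SpellChecker
    from nltk.corpus import wordnet   # requires a one-time nltk.download("wordnet")

    def process_query(query, model, index, k=5):
        # Step 1: correction via pyspellchecker (fall back to the original token).
        sp = SpellChecker()
        tokens = [sp.correction(t) or t for t in query.lower().split()]

        # Step 2: symbolic expansion with WordNet synonyms (a couple per token).
        expanded = list(tokens)
        for t in tokens:
            for syn in wordnet.synsets(t)[:2]:
                expanded.extend(l.name().replace("_", " ") for l in syn.lemmas()[:2])

        # Step 3: k-NN search over the FAISS index with the expanded query embedding.
        qvec = model.encode([" ".join(expanded)], normalize_embeddings=True)
        scores, doc_ids = index.search(np.asarray(qvec, dtype="float32"), k)
        return list(zip(doc_ids[0].tolist(), scores[0].tolist()))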

🛠️ How to Run

Method 1: Google Colab (Recommended)

This is the intended method for viewing the report and executing the code.

  1. Upload the IR_Project_Report.ipynb file to your Google Drive.
  2. Open with Google Colab.
  3. Run All Cells:
    • The system will request permission to mount Google Drive.
    • It will automatically install dependencies (faiss-cpu, scrapy, etc.).
    • It will crawl Wikipedia, build the index, and launch the Web UI.

Method 2: Local Machine (Advanced)

If you wish to run this locally, your machine must meet the prerequisites below.

Prerequisites:

  • Python 3.8+
  • 16GB RAM (Recommended for Vector Embedding)
  • Stable Internet Connection

Setup:

  1. Install dependencies:
    pip install -r requirements.txt
  2. Modify the Notebook (a sketch of the edited cell follows these steps):
    • Locate the CONFIG block in the "Setup" cell.
    • Change BASE_PATH from /content/drive/... to a local path (e.g., ./data).
    • Remove the drive.mount() call.
  3. Run via Jupyter Lab.
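
A hedged sketch of what the edited "Setup" cell might look like for a local run (the exact CONFIG keys in the notebook may differ):

    # Local-run variant of the Setup cell: no Drive mount, local paths only.
    # Key names are illustrative; adapt them to the notebook's actual CONFIG block.
    import os

    CONFIG = {
        "BASE_PATH": "./data",   # replaces the /content/drive/... path used in Colab
    }
    os.makedirs(CONFIG["BASE_PATH"], exist_ok=True)
    # drive.mount() removed: Google Drive is not available outside Colab.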

📂 Repository Contents

  • IR_Project_Report.ipynb: The complete source code, documentation, and execution log.
  • requirements.txt: List of Python libraries required.
  • crawl_data: (Optional) Pre-crawled HTML documents to save time.
  • index_data: (Optional) Pre-built FAISS and JSON indices.

📊 Artifacts Generated

Upon execution, the system generates a submission folder containing:

  • queries.csv: Sample queries with top-K ranked results.
  • results.csv: Detailed relevance scores and DocIDs.
  • index.json: A partial export of the Positional Inverted Index.
  • sample.html: Representative raw HTML files from the crawl.
  • url.txt: The seed URL used for the crawl.

📜 System capabilities

  • Semantic Matching: High semantic recall (can match "canine" to "dog"), not just exact keyword overlap.
  • Scalability: Uses Scrapy's CloseSpider limits to bound the crawl and prevent infinite crawling loops.
  • Interface: Includes a Flask-based Web UI with Keyword-In-Context (KWIC) snippets (a KWIC sketch follows this list).
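
A minimal sketch of how a KWIC snippet can be produced, independent of the actual Flask UI code (the function name is illustrative):

    # Illustrative KWIC helper: show each hit of a query term with surrounding words.
    def kwic(text, term, window=5):
        tokens = text.split()
        snippets = []
        for i, tok in enumerate(tokens):
            if tok.lower().strip(".,;:") == term.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                snippets.append(f"... {left} [{tok}] {right} ...")
        return snippets

    # Example: kwic("The dog is a domesticated canine kept as a pet", "canine")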

About

IR Project, Fall 2025 (A20563861)
