This project was designed and executed in Google Colab.
Due to the high computational requirements of Neural Vector Embedding (Sentence-Transformers) and the I/O intensity of Web Crawling, this system utilizes Google Colab's cloud resources.
- Why Colab? The system builds a dense vector index (FAISS) and processes thousands of concurrent network requests. Running this on a standard laptop may result in memory overflows or network throttling.
- Data Storage: The system automatically mounts Google Drive to persist the crawled corpus and index artifacts, preventing data loss during runtime disconnects.
This is not a basic keyword matching engine. It is a Hybrid Neural Search Engine that combines symbolic processing with dense vector retrieval.
- Strategy: Breadth-First Search (BFS) to prioritize high-level topic pages.
- Politeness: Implements `AutoThrottle` and respects `robots.txt`.
- Integrity: Uses URL canonicalization to prevent duplicate indexing (see the settings sketch below).
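A minimal sketch of the politeness and canonicalization pieces, assuming a standard Scrapy crawler; the setting values and the `canonicalize` helper are illustrative, not the notebook's exact code:

```python
# Politeness / BFS-leaning Scrapy settings (illustrative values)
from urllib.parse import urldefrag, urlparse, urlunparse

CUSTOM_SETTINGS = {
    "ROBOTSTXT_OBEY": True,         # respect robots.txt
    "AUTOTHROTTLE_ENABLED": True,   # adapt request rate to server latency
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "CLOSESPIDER_PAGECOUNT": 1000,  # hard cap to avoid infinite crawls
    "DEPTH_PRIORITY": 1,            # together with FIFO queues, crawl breadth-first
    "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
    "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
}

def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different forms map to one index entry."""
    url, _ = urldefrag(url)                 # drop the #fragment
    parts = urlparse(url)
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",      # trim trailing slash
        "", parts.query, "",
    ))
```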
- Positional Index (JSON): Stores exact term positions for phrase verification (Academic Requirement).
- Neural Vector Index (FAISS): Encodes documents into 384-dimensional vectors using `all-MiniLM-L6-v2` for semantic similarity search (Advanced Implementation); see the indexing sketch below.
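The following sketch shows how the two indices can be built, assuming `sentence-transformers` and `faiss-cpu`; the `docs` dictionary and output file names are placeholders, not the notebook's actual variables:

```python
# Sketch of the dual-index build (illustrative, not the notebook's verbatim code)
import json
import faiss
from sentence_transformers import SentenceTransformer

docs = {"d1": "a dog barked at the mail carrier", "d2": "canine training basics"}

# 1) Positional inverted index: term -> {doc_id: [positions]}
positional = {}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.lower().split()):
        positional.setdefault(term, {}).setdefault(doc_id, []).append(pos)
with open("index.json", "w") as f:
    json.dump(positional, f)

# 2) Dense vector index: 384-dim embeddings from all-MiniLM-L6-v2
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(list(docs.values()), normalize_embeddings=True)
vec_index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on unit vectors
vec_index.add(vectors)
faiss.write_index(vec_index, "vectors.faiss")
```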
- Step 1: Correction: Automated spell-checking via `pyspellchecker` (the full pipeline is sketched after this list).
- Step 2: Expansion: Symbolic query expansion using WordNet synonyms.
- Step 3: Retrieval: k-Nearest Neighbor (k-NN) search using FAISS.
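A condensed sketch of the three-step pipeline, reusing the model and FAISS index from the indexing sketch above; `process_query` and its parameters are hypothetical names, not the notebook's API:

```python
# Query pipeline sketch: correct -> expand -> retrieve
from spellchecker import SpellChecker
from nltk.corpus import wordnet  # requires nltk.download("wordnet") once

def process_query(query: str, model, vec_index, k: int = 5):
    # Step 1: spell correction
    sp = SpellChecker()
    corrected = " ".join(sp.correction(t) or t for t in query.lower().split())

    # Step 2: symbolic expansion with WordNet synonyms
    expanded = set(corrected.split())
    for term in corrected.split():
        for syn in wordnet.synsets(term):
            expanded.update(l.name().replace("_", " ") for l in syn.lemmas())

    # Step 3: dense k-NN retrieval over the FAISS index
    q_vec = model.encode([" ".join(expanded)], normalize_embeddings=True)
    scores, doc_ids = vec_index.search(q_vec, k)
    return corrected, expanded, list(zip(doc_ids[0], scores[0]))
```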
This is the intended method for viewing the report and executing the code.
- Upload the `IR_Project_Report.ipynb` file to your Google Drive.
- Open it with Google Colab.
- Run All Cells:
- The system will request permission to mount Google Drive.
- It will automatically install dependencies (`faiss-cpu`, `scrapy`, etc.); an illustrative setup cell is sketched below.
- It will crawl Wikipedia, build the index, and launch the Web UI.
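For reference, a setup cell along these lines handles the mount and installation; the exact package list and the `BASE_PATH` value are assumptions, not the notebook's verbatim cell:

```python
# Illustrative first cell of the notebook (package list and paths may differ)
from google.colab import drive
drive.mount("/content/drive")   # persist corpus and indices across runtime disconnects

!pip install -q faiss-cpu scrapy sentence-transformers pyspellchecker flask nltk

BASE_PATH = "/content/drive/MyDrive/ir_project"  # assumed location; adjust as needed
```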
If you wish to run this locally, you will need a reasonably powerful environment.
Prerequisites:
- Python 3.8+
- 16GB RAM (Recommended for Vector Embedding)
- Stable Internet Connection
Setup:
- Install dependencies: `pip install -r requirements.txt`
- Modify the Notebook (an example of the edited cell is sketched below):
  - Locate the `CONFIG` block in the "Setup" cell.
  - Change `BASE_PATH` from `/content/drive/...` to a local path (e.g., `./data`).
  - Remove `drive.mount()`.
- Run via Jupyter Lab.
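An example of what the edited "Setup" cell might look like after these changes; the `CONFIG` keys other than `BASE_PATH` are illustrative assumptions:

```python
# Example local configuration (values are illustrative)
import os

CONFIG = {
    "BASE_PATH": "./data",                            # was "/content/drive/..."
    "CRAWL_DIR": os.path.join("./data", "crawl_data"),
    "INDEX_DIR": os.path.join("./data", "index_data"),
}
os.makedirs(CONFIG["CRAWL_DIR"], exist_ok=True)
os.makedirs(CONFIG["INDEX_DIR"], exist_ok=True)
# drive.mount("/content/drive")  # removed: not available outside Colab
```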
- `IR_Project_Report.ipynb`: The complete source code, documentation, and execution log.
- `requirements.txt`: List of Python libraries required.
- `crawl_data`: (Optional) Pre-crawled HTML documents to save time.
- `index_data`: (Optional) Pre-built FAISS and JSON indices.
Upon execution, the system generates a submission folder containing:
- `queries.csv`: Sample queries with top-K ranked results.
- `results.csv`: Detailed relevance scores and DocIDs.
- `index.json`: A partial export of the Positional Inverted Index.
- `sample.html`: Representative raw HTML files from the crawl.
- `url.txt`: The seed URL used for the crawl.
- Semantic Matching: High semantic recall (can match "canine" to "dog").
- Scalability: Designed with `CloseSpider` limits to prevent infinite crawling loops.
- Interface: Includes a Flask-based Web UI with Keyword-In-Context (KWIC) snippets (a minimal KWIC sketch follows).
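A minimal sketch of how a KWIC snippet can be generated for the Web UI; the function name and window size are illustrative, not the notebook's implementation:

```python
# Keyword-In-Context (KWIC) snippet sketch: show the query term with surrounding words
def kwic_snippet(text: str, term: str, window: int = 5) -> str:
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?") == term.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            return f"... {left} **{tok}** {right} ..."
    return " ".join(tokens[:2 * window]) + " ..."   # fallback: document prefix

print(kwic_snippet("The quick brown fox jumps over the lazy dog near the river", "lazy"))
```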