This project was designed and executed in Google Colab.
Due to the high computational requirements of Neural Vector Embedding (Sentence-Transformers) and the I/O intensity of Web Crawling, this system utilizes Google Colab's cloud resources.
- Why Colab? The system builds a dense vector index (FAISS) and processes thousands of concurrent network requests. Running this on a standard laptop may result in memory overflows or network throttling.
- Data Storage: The system automatically mounts Google Drive to persist the crawled corpus and index artifacts, preventing data loss during runtime disconnects.
This is not a basic keyword matching engine. It is a Hybrid Neural Search Engine that combines symbolic processing with dense vector retrieval.
- Strategy: Breadth-First Search (BFS) to prioritize high-level topic pages.
- Politeness: Implements `AutoThrottle` and respects `robots.txt`.
- Integrity: Uses URL canonicalization to prevent duplicate indexing (see the settings sketch below).
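A minimal sketch of the politeness and canonicalization pieces, assuming a standard Scrapy crawler; the setting values and the `canonicalize` helper are illustrative, not the notebook's exact code:

```python
# Politeness / BFS-leaning Scrapy settings (illustrative values)
from urllib.parse import urldefrag, urlparse, urlunparse

CUSTOM_SETTINGS = {
    "ROBOTSTXT_OBEY": True,         # respect robots.txt
    "AUTOTHROTTLE_ENABLED": True,   # adapt request rate to server latency
    "AUTOTHROTTLE_START_DELAY": 1.0,
    "CLOSESPIDER_PAGECOUNT": 1000,  # hard cap to avoid infinite crawls
    "DEPTH_PRIORITY": 1,            # together with FIFO queues, crawl breadth-first
    "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
    "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
}

def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different forms map to one index entry."""
    url, _ = urldefrag(url)                 # drop the #fragment
    parts = urlparse(url)
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",      # trim trailing slash
        "", parts.query, "",
    ))
```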
- Positional Index (JSON): Stores exact term positions for phrase verification (Academic Requirement).
- Neural Vector Index (FAISS): Encodes documents into 384-dimensional vectors using `all-MiniLM-L6-v2` for semantic similarity search (Advanced Implementation); see the indexing sketch below.
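The following sketch shows how the two indices can be built, assuming `sentence-transformers` and `faiss-cpu`; the `docs` dictionary and output file names are placeholders, not the notebook's actual variables:

```python
# Sketch of the dual-index build (illustrative, not the notebook's verbatim code)
import json
import faiss
from sentence_transformers import SentenceTransformer

docs = {"d1": "a dog barked at the mail carrier", "d2": "canine training basics"}

# 1) Positional inverted index: term -> {doc_id: [positions]}
positional = {}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.lower().split()):
        positional.setdefault(term, {}).setdefault(doc_id, []).append(pos)
with open("index.json", "w") as f:
    json.dump(positional, f)

# 2) Dense vector index: 384-dim embeddings from all-MiniLM-L6-v2
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(list(docs.values()), normalize_embeddings=True)
vec_index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on unit vectors
vec_index.add(vectors)
faiss.write_index(vec_index, "vectors.faiss")
```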
- Step 1: Correction: Automated spell-checking via `pyspellchecker` (the full pipeline is sketched after this list).
- Step 2: Expansion: Symbolic query expansion using WordNet synonyms.
- Step 3: Retrieval: k-Nearest Neighbor (k-NN) search using FAISS.
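A condensed sketch of the three-step pipeline, reusing the model and FAISS index from the indexing sketch above; `process_query` and its parameters are hypothetical names, not the notebook's API:

```python
# Query pipeline sketch: correct -> expand -> retrieve
from spellchecker import SpellChecker
from nltk.corpus import wordnet  # requires nltk.download("wordnet") once

def process_query(query: str, model, vec_index, k: int = 5):
    # Step 1: spell correction
    sp = SpellChecker()
    corrected = " ".join(sp.correction(t) or t for t in query.lower().split())

    # Step 2: symbolic expansion with WordNet synonyms
    expanded = set(corrected.split())
    for term in corrected.split():
        for syn in wordnet.synsets(term):
            expanded.update(l.name().replace("_", " ") for l in syn.lemmas())

    # Step 3: dense k-NN retrieval over the FAISS index
    q_vec = model.encode([" ".join(expanded)], normalize_embeddings=True)
    scores, doc_ids = vec_index.search(q_vec, k)
    return corrected, expanded, list(zip(doc_ids[0], scores[0]))
```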
This is the intended method for viewing the report and executing the code.
- Upload the `IR_Project_Report.ipynb` file to your Google Drive.
- Open it with Google Colab.
- Run All Cells:
- The system will request permission to mount Google Drive.
- It will automatically install dependencies (`faiss-cpu`, `scrapy`, etc.); an illustrative setup cell is sketched below.
- It will crawl Wikipedia, build the index, and launch the Web UI.
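For reference, a setup cell along these lines handles the mount and installation; the exact package list and the `BASE_PATH` value are assumptions, not the notebook's verbatim cell:

```python
# Illustrative first cell of the notebook (package list and paths may differ)
from google.colab import drive
drive.mount("/content/drive")   # persist corpus and indices across runtime disconnects

!pip install -q faiss-cpu scrapy sentence-transformers pyspellchecker flask nltk

BASE_PATH = "/content/drive/MyDrive/ir_project"  # assumed location; adjust as needed
```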
If you wish to run this locally, you will need a reasonably powerful environment.
Prerequisites:
- Python 3.8+
- 16GB RAM (Recommended for Vector Embedding)
- Stable Internet Connection
Setup:
- Install dependencies: `pip install -r requirements.txt`
- Modify the Notebook (an example of the edited cell is sketched below):
  - Locate the `CONFIG` block in the "Setup" cell.
  - Change `BASE_PATH` from `/content/drive/...` to a local path (e.g., `./data`).
  - Remove `drive.mount()`.
- Run via Jupyter Lab.
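An example of what the edited "Setup" cell might look like after these changes; the `CONFIG` keys other than `BASE_PATH` are illustrative assumptions:

```python
# Example local configuration (values are illustrative)
import os

CONFIG = {
    "BASE_PATH": "./data",                            # was "/content/drive/..."
    "CRAWL_DIR": os.path.join("./data", "crawl_data"),
    "INDEX_DIR": os.path.join("./data", "index_data"),
}
os.makedirs(CONFIG["CRAWL_DIR"], exist_ok=True)
os.makedirs(CONFIG["INDEX_DIR"], exist_ok=True)
# drive.mount("/content/drive")  # removed: not available outside Colab
```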
- `IR_Project_Report.ipynb`: The complete source code, documentation, and execution log.
- `requirements.txt`: List of Python libraries required.
- `crawl_data`: (Optional) Pre-crawled HTML documents to save time.
- `index_data`: (Optional) Pre-built FAISS and JSON indices.
Upon execution, the system generates a submission folder containing:
- `queries.csv`: Sample queries with top-K ranked results.
- `results.csv`: Detailed relevance scores and DocIDs.
- `index.json`: A partial export of the Positional Inverted Index.
- `sample.html`: Representative raw HTML files from the crawl.
- `url.txt`: The seed URL used for the crawl.
- Semantic Matching: High semantic recall (can match "canine" to "dog").
- Scalability: Designed with `CloseSpider` limits to prevent infinite crawling loops.
- Interface: Includes a Flask-based Web UI with Keyword-In-Context (KWIC) snippets (a minimal KWIC sketch follows).
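A minimal sketch of how a KWIC snippet can be generated for the Web UI; the function name and window size are illustrative, not the notebook's implementation:

```python
# Keyword-In-Context (KWIC) snippet sketch: show the query term with surrounding words
def kwic_snippet(text: str, term: str, window: int = 5) -> str:
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?") == term.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            return f"... {left} **{tok}** {right} ..."
    return " ".join(tokens[:2 * window]) + " ..."   # fallback: document prefix

print(kwic_snippet("The quick brown fox jumps over the lazy dog near the river", "lazy"))
```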