This repository enables semantic search and question-answering on research papers using PDF ingestion, vector databases, and local or cloud-hosted LLMs. It supports both OpenAI-compatible models (e.g., via LM Studio) and HuggingFace models like Falcon, integrating them into a LangChain-powered RAG pipeline.
- Overview
- Architecture
- Components
- Data Extraction
- Data Transformation
- Data Loading
- Visualization
- Automation
- Prerequisites
- Setup
- Usage
- Contributing
- License
Demo video: https://www.loom.com/share/e317ef7be6cf4338bf0d58cb0be2a39a?sid=511c6d13-ffa2-4a46-b9a1-d004910ede5b
This project allows users to query research paper PDFs in natural language. It processes the paper, stores chunk embeddings in a Chroma vector database, and retrieves relevant context at query time using LangChain and either OpenAI-compatible APIs (e.g., LM Studio) or HuggingFace models.
```
PDF -> LangChain Splitter -> Embeddings (OpenAI / HF) -> ChromaDB
                                                           ↓ ↑
Query -> Embed -> Retrieve Similar Chunks -> Prompt LLM -> Answer
```
- LangChain: Provides the orchestration layer for RAG.
- ChromaDB: Vector database to store and retrieve document embeddings.
- HuggingFace / OpenAI Models: Used for embedding and LLM completion.
- LM Studio: Used as a local server to host OpenAI-compatible models.
The system loads a research paper from `../data/paper1.pdf` using `PyPDFLoader`. Each page is treated as a LangChain `Document`, and metadata is attached to track the source.
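A minimal sketch of this step, assuming the `langchain-community` loader; the actual code lives in `main.py` / `lmstudio.py` and may differ in detail:

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../data/paper1.pdf")
pages = loader.load()  # one LangChain Document per PDF page

# Attach source metadata so every chunk can be traced back to the paper.
for page in pages:
    page.metadata["source"] = "../data/paper1.pdf"
```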
The text is chunked into overlapping sections using `RecursiveCharacterTextSplitter`, then embedded using one of the following (a sketch of this step follows the list):

- `text-embedding-nomic-embed-text-v1.5` (via the LM Studio API in `lmstudio.py`), or
- `HuggingFaceEmbeddings()` (in `main.py`)
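A minimal sketch of the `main.py` (HuggingFace) path; the chunk sizes shown are illustrative, not the repository's actual settings, and package names vary slightly between LangChain versions:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

# Overlapping chunks; page metadata from the loader is preserved on each chunk.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(pages)

# main.py path; lmstudio.py calls the LM Studio embedding endpoint instead.
embeddings = HuggingFaceEmbeddings()
```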
Embeddings are persisted in a Chroma vector store:
- `lmstudio.py` uses the raw Chroma Python client (`chromadb`); it requires an LLM and an embedding model running locally in LM Studio (tested on Windows).
- `main.py` uses `langchain_chroma.Chroma` with built-in LangChain support.
Each document chunk is uniquely ID'd and tagged with its source metadata.
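A minimal sketch of the `langchain_chroma` path; the collection name, persist directory, and ID scheme here are assumptions for illustration:

```python
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="research_paper",   # assumed collection name
    embedding_function=embeddings,
    persist_directory="./chroma_db",    # assumed persist location
)

# Deterministic per-chunk IDs keyed on the source file, so re-ingestion does not duplicate entries.
ids = [f"{chunk.metadata['source']}-{i}" for i, chunk in enumerate(chunks)]
vector_store.add_documents(documents=chunks, ids=ids)
```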
While the system is CLI-driven, it provides the following (a small sketch follows this list):
- Terminal display of matching document chunks
- Full prompts for debugging
- Grounding reference for every LLM-generated answer
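A minimal sketch of this terminal output, continuing from the Chroma store built above; `k` and the preview length are arbitrary choices:

```python
# Show the retrieved chunks that ground each answer, with their similarity scores.
query = "What is the main contribution of the paper?"
results = vector_store.similarity_search_with_score(query, k=4)

for doc, score in results:
    print(f"--- {doc.metadata.get('source')} (score: {score:.3f}) ---")
    print(doc.page_content[:300], "...\n")  # truncated preview for readability
```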
Once set up, you can interactively query the ingested paper:
```bash
$ python main.py --llm mistral
Ask a question (or type 'exit'): What is the main contribution of the paper?
```

In `lmstudio.py`, the prompt is handcrafted and passed to the model via direct API calls to an OpenAI-compatible server running locally in LM Studio (a sketch follows).
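A minimal sketch of that direct-API path, assuming LM Studio's default port and the official `openai` Python client; the prompt template, model name, and placeholder chunks are illustrative, not the repository's exact code:

```python
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible endpoint; the API key is ignored locally.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

retrieved_chunks = ["<chunk text returned by the Chroma retrieval step>"]  # placeholder
question = "What is the main contribution of the paper?"

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n\n".join(retrieved_chunks) + "\n\n"
    "Question: " + question
)

response = client.chat.completions.create(
    model="mistral",  # whichever model is currently loaded in LM Studio
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```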
- Python 3.9+
- NVIDIA GPU (for local inference)
- LM Studio (optional for local inference)
- ChromaDB installed locally
1. Clone and install requirements

   ```bash
   git clone https://github.com/your-username/research-paper-qa.git
   cd research-paper-qa
   pip install -r requirements.txt
   ```

2. Add a `.env` file

   ```bash
   OPENAI_API_KEY=your-key-if-using-openai
   ```

3. Download a PDF and place your paper at:

   ```
   ../data/paper1.pdf
   ```

4. Optional: Start LM Studio

   - Load a model like `gemma:7b` or `mistral`
   - Expose the OpenAI-compatible API on `localhost:1234`
Run either entry point:

```bash
python lmstudio.py --llm openai
# or
python main.py --llm mistral
```

You'll be prompted to enter your question. The system will embed it, perform retrieval, and generate an answer using the selected LLM.
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License. See the LICENSE file for details.