A lightweight, fast, and secure RAG (Retrieval-Augmented Generation) system built using FastAPI, Qdrant, and a local LLaMA 2 model. This project demonstrates how to augment LLM responses with document-based context for improved accuracy and relevance.
- Vector-based document storage and retrieval using Qdrant
- Local LLaMA 2 integration using `llama-cpp-python`
- Basic API key protection for secure access
- Sentence Transformers for text embeddings
- FastAPI for the web interface
- Simple and modular architecture
- CLI and cURL testing support
- Logging for observability
```text
[ User Query ]
       ↓
  [ /ask API ]
       ↓
[ Qdrant Retriever ] ← [ Embedded Document Chunks ]
       ↓
[ LLaMA 2 Generator ]
       ↓
[ Response to User ]
```
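The flow above maps onto a single endpoint. A minimal sketch, assuming helper names such as `retrieve` and `generate` that are illustrative rather than the exact functions in `app/`:

```python
# Illustrative sketch of the /ask flow, not the project's exact code
from fastapi import FastAPI

from app.retriever import retrieve   # Qdrant similarity search (assumed helper)
from app.llm import generate         # llama-cpp-python wrapper (assumed helper)

app = FastAPI()

@app.get("/ask")
def ask(question: str):
    # 1. Retrieve the most relevant document chunks for the question
    chunks = retrieve(question, k=3)
    # 2. Combine the retrieved context with the question into a prompt
    context = "\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # 3. Generate the answer with the local LLaMA 2 model
    return {"question": question, "answer": generate(prompt)}
```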
- Install basic requirements: `pip install fastapi==0.110.0 uvicorn==0.29.0`
- Install sentence-transformers: `pip install sentence-transformers==2.5.1`
- Install PyTorch (CPU version): `pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cpu`
- Install transformers: `pip install transformers==4.40.1`
- Install Qdrant client: `pip install qdrant-client==1.7.0`
- Install remaining dependencies: `pip install python-dotenv==1.0.1 requests==2.31.0`
- Install llama-cpp-python with Metal support: `CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python==0.2.24`
- Initialize the database: `python scripts/process_documents.py`
- Start the server: `uvicorn app.main:app --reload`
- Test the API:
  - Using curl:

    ```bash
    curl -X 'GET' \
      'http://localhost:8000/ask?question=What%20is%20the%20opportunity%20cost%3F' \
      -H 'accept: application/json' \
      -H 'x-api-key: a9oH2PDkGTFSX6aYSHvyswtDKz9HYYWsM2DmnW-8qGk'
    ```

    Response body:

    ```json
    {
      "question": "What is the opportunity cost?",
      "answer": "The opportunity cost is what you give up by choosing one option over another. In this case, it's the amount of money that could have been earned through compound interest if the triplets had left their allowances in a savings account instead of giving them to their grandma for safekeeping."
    }
    ```
  - Using the Swagger UI:
    - Open http://localhost:8000/docs in your browser
    - Click on "Authorize", enter the API key, and save
    - Click on the `/ask` endpoint
    - Click "Try it out"
    - Enter your question and API key
    - Click "Execute"
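  - Using Python: a minimal sketch with the `requests` library (the key below is the same placeholder shown in the curl example; replace it with your own):

    ```python
    # Minimal sketch for querying the /ask endpoint from Python
    import requests

    API_URL = "http://localhost:8000/ask"
    API_KEY = "a9oH2PDkGTFSX6aYSHvyswtDKz9HYYWsM2DmnW-8qGk"  # replace with your own key

    response = requests.get(
        API_URL,
        params={"question": "What is the opportunity cost?"},
        headers={"accept": "application/json", "x-api-key": API_KEY},
    )
    response.raise_for_status()
    print(response.json()["answer"])
    ```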
- API Key:
  - Set your API key in `api_key_middleware.py`
  - For production, use environment variables
- LLaMA Model:
  - Update the model path in `llm.py`
  - Ensure you have the correct model file
- Qdrant Storage:
  - Default: in-memory storage
  - For persistence, update the client initialization in `retriever.py` (see the sketch below)
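How these three settings might look in code. This is a hedged sketch: the environment variable name `RAG_API_KEY`, the model path, and the storage path are illustrative assumptions, not the project's actual values.

```python
import os
from llama_cpp import Llama
from qdrant_client import QdrantClient

# API key: read from the environment instead of hard-coding it in
# api_key_middleware.py ("RAG_API_KEY" is an illustrative variable name).
API_KEY = os.environ.get("RAG_API_KEY", "change-me")

# LLaMA model: point llm.py at your local .gguf file (path is illustrative).
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

# Qdrant storage: switch from the in-memory default to on-disk persistence
# when initializing the client in retriever.py.
client = QdrantClient(path="./qdrant_data")   # persistent local storage
# client = QdrantClient(":memory:")           # default: in-memory only
```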
- The system uses CPU-based inference by default
- Response times may vary based on:
  - Hardware specifications
  - Number of documents in the database
  - Complexity of queries
- For better performance (see the sketch below):
  - Use GPU if available (modify the llama-cpp-python installation)
  - Adjust the number of retrieved documents (the k parameter)
  - Consider using a smaller LLaMA model variant
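The first two knobs translate roughly into the following. This is a sketch under assumptions: the collection name, model path, and embedding model are illustrative, and GPU offload only helps if your llama-cpp-python build has GPU (e.g. Metal) support.

```python
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama
from qdrant_client import QdrantClient

# GPU offload: n_gpu_layers=-1 moves every layer onto the GPU when the
# llama-cpp-python build supports it; 0 keeps inference on the CPU.
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=-1)

# Retrieval depth: `limit` is the k parameter -- fewer chunks means a shorter
# prompt and faster generation, at the cost of less context for the model.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(path="./qdrant_data")
hits = client.search(
    collection_name="documents",                                  # illustrative name
    query_vector=embedder.encode("What is simple interest?").tolist(),
    limit=3,                                                      # k
)
```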
- Documents are processed in chunks for better context management
- Embeddings are generated once during document addition
- Supported document formats:
  - Text files (.txt)
  - More formats can be added by extending the document processor
- Document chunks are stored in Qdrant with their embeddings (see the sketch below)
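A condensed sketch of that processing pipeline. The chunk size, collection name, and embedding model are illustrative assumptions and may differ from what `scripts/process_documents.py` actually does.

```python
import uuid
from pathlib import Path

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

CHUNK_SIZE = 500                      # illustrative chunk length in characters
embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(path="./qdrant_data")

# Create the collection sized to the embedding model's output dimension (384).
client.recreate_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

for doc in Path("documents").glob("*.txt"):
    text = doc.read_text()
    # Split into fixed-size chunks; the real processor may chunk differently.
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedder.encode(chunk).tolist(),
            payload={"text": chunk, "source": doc.name},
        )
        for chunk in chunks
    ]
    client.upsert(collection_name="documents", points=points)
```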
- API Key Issues:
  - Ensure API key is set in environment variables
  - Check API key format in requests
  - Verify API key middleware is properly configured
- Model Loading Issues:
  - Verify model file exists in the correct location
  - Check model file format (.gguf)
  - Ensure sufficient system memory
- Database Issues:
  - Check Qdrant data directory permissions
  - Verify document processing completed successfully
  - Check database connection settings
- Adding New Documents:

  ```bash
  # Add your document to the project
  # Then run the document processor
  python scripts/process_documents.py
  ```
- Modifying the System:
  - Document embeddings: Modify `app/retriever.py`
  - LLM settings: Update `app/llm.py`
  - API endpoints: Edit `app/main.py`
  - Security: Configure `app/api_key_middleware.py`
- Testing:
  - Use the provided test scripts in the `test/` directory
  - Run API tests: `python test/test_query.py`
  - Check logs for debugging information
Some example queries:
- Who receives the most money in interest?
- What is opportunity cost?
- What should people compare before they make a trade-off?
- What is simple interest?
- What is compound interest?
- What is the difference between simple and compound interest in the story?
- How much money will Diane have for the vacation?
- What is Brian's opportunity cost in the story?
- What lesson does the story teach about financial decisions?
- What is Python?
- What is Qdrant?