# pdfchatai

This project implements a Streamlit web application that allows users to upload PDF documents and ask questions about their content. It leverages Google's Gemini Pro model via the google-generativeai library to understand the context from the PDF and generate relevant answers.
## Features

- PDF Upload: Easily upload PDF files through the Streamlit interface.
- Question Answering: Ask natural language questions about the content of the uploaded PDF.
- Contextual Answers: Utilizes Google Gemini Pro to generate answers based only on the information present in the PDF.
- Efficient Text Processing:
  - Uses `PyPDF2` to extract text from PDF pages.
  - Employs `tiktoken` and `langchain.text_splitter.TokenTextSplitter` for intelligent, token-aware text chunking to handle large documents effectively (see the sketch after this list).
- Semantic Search:
  - Generates embeddings for text chunks using `sentence-transformers/all-MiniLM-L6-v2` via `langchain_community.embeddings.HuggingFaceEmbeddings`.
  - Stores and searches embeddings efficiently using FAISS, a library for efficient similarity search.
- Dockerized: Includes a `Dockerfile` for easy containerization and deployment.
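As a rough illustration of the extraction and token-aware chunking steps, here is a minimal sketch. The `chunk_pdf_text` helper and the chunk-size heuristic are illustrative assumptions, not this project's actual source.

```python
# Hypothetical sketch of PDF extraction plus token-aware chunking;
# the size heuristic below is an assumption, not the project's logic.
import tiktoken
from langchain.text_splitter import TokenTextSplitter
from PyPDF2 import PdfReader


def chunk_pdf_text(pdf_path: str) -> list[str]:
    # Extract raw text from every page of the PDF.
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Count tokens so the chunk size can scale with document length
    # (the thresholds here are illustrative).
    encoding = tiktoken.get_encoding("cl100k_base")
    total_tokens = len(encoding.encode(text))
    chunk_size = 256 if total_tokens < 4_000 else 512

    # Overlapping, token-aware chunks preserve context across boundaries.
    splitter = TokenTextSplitter(
        encoding_name="cl100k_base",
        chunk_size=chunk_size,
        chunk_overlap=chunk_size // 10,
    )
    return splitter.split_text(text)
```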
## How It Works

- Upload: The user uploads a PDF file via the Streamlit interface.
- Text Extraction: The application extracts text content from each page of the PDF using `PyPDF2`.
- Text Chunking: The extracted text is split into smaller, overlapping chunks using a token-based strategy (`TokenTextSplitter`). This ensures that semantic context is preserved across chunk boundaries and respects model token limits. The chunk size is dynamically adjusted based on the total number of tokens.
- Embedding: Each text chunk is converted into a numerical vector (embedding) using the `all-MiniLM-L6-v2` sentence transformer model. These embeddings capture the semantic meaning of the text.
- Vector Store: The embeddings and their corresponding text chunks are stored in a FAISS vector store, which allows for fast similarity searches.
- Question & Search: When the user asks a question, the application generates an embedding for the question and uses FAISS to find the most relevant text chunks from the PDF based on semantic similarity.
- Contextual Prompting: The relevant text chunks are combined with the user's question into a prompt for the Google Gemini model. The prompt explicitly instructs the model to answer based only on the provided context (a minimal sketch of this retrieval-and-prompting loop follows this list).
- Answer Generation: The Gemini model processes the prompt and generates an answer based on the retrieved context.
- Display: The generated answer is displayed to the user in the Streamlit interface.
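To make the retrieval-and-prompting loop concrete, the sketch below assumes chunks such as those produced by the chunking sketch above. The `answer_question` helper and the prompt wording are assumptions for illustration, not the project's actual code.

```python
# Illustrative sketch of the search-and-answer loop; the helper name and
# prompt wording are assumptions, not this project's actual source.
import google.generativeai as genai
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


def answer_question(chunks: list[str], question: str, api_key: str) -> str:
    # Embed the chunks and index them in FAISS for similarity search.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    store = FAISS.from_texts(chunks, embeddings)

    # Retrieve the chunks most semantically similar to the question.
    docs = store.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Ask Gemini, instructing it to answer only from the retrieved context.
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-pro")
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return model.generate_content(prompt).text
```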
## Prerequisites

- Python 3.9+
- Pip (Python package installer)
- Docker (Optional, for containerized deployment)
- Google Gemini API Key
## Installation

- Clone the Repository:

  ```bash
  git clone <your-repository-url>
  cd pdfchatai
  ```

- Install Dependencies: It's recommended to use a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

  Install the required packages using `pip` and the `pyproject.toml` file:

  ```bash
  pip install .
  ```

  Alternatively, if you create a `requirements.txt` file, you can use:

  ```bash
  pip install -r requirements.txt
  ```
## Configuration

The application requires a Google Gemini API key.

- Create a file named `.env` in the root directory of the project.
- Add your Google API key to the `.env` file:

  ```
  GOOGLE_API_KEY="...YOUR_API_KEY..."
  ```

- You can obtain an API key from Google AI Studio.

Important: The `.env` file is included in the `.gitignore` file to prevent accidentally committing your secret API key to version control.
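For reference, a typical way to load this key inside the app is via `python-dotenv`; whether this project uses that package is an assumption, so treat this as an illustrative pattern:

```python
# Minimal sketch of loading the key from .env and configuring the client;
# assumes python-dotenv is installed (an assumption, not confirmed here).
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()  # reads GOOGLE_API_KEY from the .env file into the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
```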
## Live Demo

https://daretny-pdfchatai-main-bl4cws.streamlit.app/
## Running with Docker

- Build the Docker Image: Make sure Docker Desktop or Docker Engine is running, then navigate to the project's root directory in your terminal and run:

  ```bash
  docker build -t pdfchatai .
  ```

- Run the Docker Container: Pass the `GOOGLE_API_KEY` as an environment variable to the container at runtime:

  ```bash
  docker run -p 8501:8501 -e GOOGLE_API_KEY="AIzaSy...YOUR_API_KEY..." --name pdf-qa-app pdfchatai
  ```

  - `-p 8501:8501`: Maps port 8501 on your host machine to port 8501 inside the container.
  - `-e GOOGLE_API_KEY="..."`: Sets the environment variable inside the container. Replace `"AIzaSy...YOUR_API_KEY..."` with your actual API key.
  - `--name pdf-qa-app`: Assigns a name to the running container (optional).
  - `pdfchatai`: The name of the image to run.

The application will be accessible at `http://localhost:8501`.
## Tech Stack

- Backend: Python
- Web Framework: Streamlit
- LLM: Google Gemini Pro (`google-generativeai`)
- Text Processing & Orchestration: LangChain (`langchain`, `langchain-community`)
- PDF Parsing: PyPDF2
- Text Splitting: Tiktoken, LangChain TokenTextSplitter
- Embeddings: Sentence Transformers (`sentence-transformers`), HuggingFaceEmbeddings (`langchain_community`)
- Vector Database: FAISS (`faiss-cpu`)
- Dependency Management: Pip, `pyproject.toml`
- Containerization: Docker
## Future Improvements

- Error Handling: More robust error handling for PDF parsing and API calls.
- Caching: Implement caching for embeddings to speed up processing for previously seen PDFs (see the sketch after this list).
- Asynchronous Processing: Handle PDF processing and API calls asynchronously for a more responsive UI, especially for large files.
- Chat History: Maintain a conversation history for follow-up questions.
- Alternative Embedders/LLMs: Allow selection of different embedding models or LLMs.
- GPU Support: Configure `faiss-gpu` and the necessary CUDA dependencies in Docker for faster FAISS indexing/search on compatible hardware.
- Deployment: Instructions for deploying to cloud platforms (e.g., Streamlit Community Cloud, Hugging Face Spaces, AWS, GCP, Azure).
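As one possible shape for the caching improvement, Streamlit's built-in `st.cache_resource` can keep the embedding model and per-file vector stores in memory across reruns. The helper names below are hypothetical:

```python
# Hypothetical caching sketch using Streamlit's cache decorators;
# not part of the current codebase.
import streamlit as st
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


@st.cache_resource
def get_embedder() -> HuggingFaceEmbeddings:
    # Load the sentence-transformer model once per process, not per rerun.
    return HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )


@st.cache_resource
def get_vector_store(pdf_hash: str, chunks: tuple[str, ...]) -> FAISS:
    # Keyed by a hash of the PDF bytes, so a previously seen file reuses
    # its index instead of re-embedding every chunk.
    return FAISS.from_texts(list(chunks), get_embedder())
```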
## License

Consider adding a LICENSE file (e.g., MIT, Apache 2.0) if you plan to share this project publicly.