# PDF Q&A Bot – Chunking & Embedding Playground
A Streamlit-powered application that allows you to upload a PDF, chunk its content, generate embeddings using multiple providers (Gemini, OpenAI, Cohere, HuggingFace), store them in a FAISS vector index, and ask natural language questions whose answers are retrieved from the PDF content.
- Upload any PDF and extract its text.
- Multiple chunking methods:
  - Simple split
  - Character-based split
  - Recursive character split
  - Token-based split
- Multiple embedding backends:
  - Google Gemini
  - OpenAI (still being improved)
  - HuggingFace Sentence Transformers (still being improved)
  - Cohere
- Semantic search with FAISS for fast retrieval.
- Text Extraction – PDF text is extracted with `pypdf`.
- Chunking – text is split into smaller segments to fit embedding model limits.
- Embedding – the selected embedding model converts chunks into numerical vectors.
- Indexing – FAISS stores the vectors for efficient similarity search.
- Retrieval – on a query, the most similar chunks are retrieved.
- Answer Generation – the model uses the retrieved chunks as context to answer.
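The six steps above can be sketched end to end. This is a minimal illustration of the data flow only: the PDF reader, embedding model, and FAISS index are replaced with self-contained stand-ins (a hard-coded string, a hash-based vector, and brute-force cosine search) so it runs without `pypdf`, API keys, or `faiss` installed. The hash embedding has no semantic meaning, so the retrieved chunk is arbitrary.

```python
import hashlib
import math

def extract_text(pdf_path):
    # Real app: pypdf.PdfReader(pdf_path), then page.extract_text() per page.
    # Hard-coded stand-in corpus so this sketch is self-contained:
    return ("FAISS builds an index over embedding vectors. "
            "Chunks that exceed the model limit must be split first.")

def chunk(text, size=60):
    # Simple fixed-size chunking (see the chunking methods table).
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text, dim=16):
    # Stand-in for a real embedding model (e.g. models/embedding-001):
    # deterministic but carries no semantics.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dim]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Indexing: the real app adds the vectors to a FAISS index instead of a list.
chunks = chunk(extract_text("doc.pdf"))
index = [(c, embed(c)) for c in chunks]

# Retrieval: the best-scoring chunk becomes the LLM's context.
query_vec = embed("What does FAISS index?")
best_chunk, _ = max(index, key=lambda item: cosine(query_vec, item[1]))
print(best_chunk)
```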
| Chunking Method | Description | Pros | Cons |
|---|---|---|---|
| Simple | Splits text into fixed-size chunks. | Fast, easy to implement. | May break sentences or lose context. |
| Character | Splits by a set number of characters. | Good for uniform text length. | Can split mid-word or mid-sentence. |
| Recursive | Splits respecting paragraphs/sentences first. | Preserves context, better for QA. | Slightly slower. |
| Token | Splits based on LLM token count. | Prevents token overflow errors. | Requires tokenizer, more complex. |
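To make the simple-vs-recursive trade-off concrete, here is a pure-Python sketch of both: a fixed-size split, and a recursive split that tries paragraph boundaries first, then sentence boundaries, then falls back to fixed-size cuts. This is a simplification of the real splitters (e.g. LangChain's `RecursiveCharacterTextSplitter`), not the app's exact code; note the sketch drops the separator itself when it splits.

```python
def simple_split(text, chunk_size):
    # Fixed-size chunks: fast, but may cut mid-word or mid-sentence.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def recursive_split(text, chunk_size, separators=("\n\n", ". ", "")):
    # Try the coarsest separator first; recurse with finer ones on
    # any piece that is still too long.
    if len(text) <= chunk_size:
        return [text] if text else []
    sep = separators[0]
    rest = separators[1:] if len(separators) > 1 else separators
    if sep == "":
        return simple_split(text, chunk_size)  # last resort: hard cut
    chunks = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            chunks.extend(recursive_split(part, chunk_size, rest))
        elif part:
            chunks.append(part)
    return chunks

text = "First paragraph about FAISS.\n\nSecond paragraph. It has two sentences."
print(simple_split(text, 30))     # cuts fall wherever 30 chars land
print(recursive_split(text, 30))  # cuts fall on paragraph/sentence bounds
```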
| Embedding Provider | Model Used | Strengths | Limitations |
|---|---|---|---|
| Google Gemini | models/embedding-001 | High quality, multilingual, good semantic understanding. | Requires Google API key. |
| OpenAI | text-embedding-3-small | High accuracy, widely supported in LangChain. | Paid API after free tier. |
| Hugging Face | sentence-transformers/all-MiniLM-L6-v2 | Free, runs locally, no API needed. | Slightly slower on large docs. |
| Cohere | embed-english-light-v3.0 | Fast, optimized for English. | Limited multilingual support. |
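One way to let the app switch between the backends above is a small factory that maps the provider name chosen in the UI to an embedding callable. The sketch below uses hypothetical names (`get_embedder`, `toy_local_embed`) and wires up only a deterministic hash-based stand-in for the local Hugging Face path, so it runs without API keys; the real branches would return the Gemini, OpenAI, or Cohere clients. Note that vectors from different providers live in different spaces, so the FAISS index must be rebuilt whenever the backend changes.

```python
import hashlib

def toy_local_embed(texts, dim=8):
    # Deterministic stand-in for sentence-transformers/all-MiniLM-L6-v2;
    # no semantics, just a fixed-dimension vector per text.
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([b / 255 for b in digest[:dim]])
    return vectors

def get_embedder(provider):
    # Real app: each branch would construct the matching client, e.g.
    # Gemini's models/embedding-001 or OpenAI's text-embedding-3-small.
    backends = {"local": toy_local_embed}
    if provider not in backends:
        raise ValueError(f"No embedder wired up for {provider!r}")
    return backends[provider]

embed = get_embedder("local")
vecs = embed(["hello world", "goodbye"])
print(len(vecs), len(vecs[0]))  # 2 vectors of dimension 8
```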
- Chunk Size – the maximum number of characters or tokens in each chunk.
  - Example: `chunk_size = 500` → each chunk is ~500 characters/tokens.
  - Effect: larger chunks = more context per query, but a higher risk of hitting model token limits.
- Chunk Overlap – the number of characters/tokens that overlap between consecutive chunks.
  - Example: `chunk_overlap = 50` → 50 characters/tokens are repeated between consecutive chunks.
  - Effect: higher overlap = better context retention between chunks, but more processing cost.
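A tiny character splitter makes the interaction of the two knobs visible (a hypothetical helper for illustration, not the app's exact function): the window advances by `chunk_size - chunk_overlap`, so each chunk repeats the tail of the previous one.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    # The step between chunk starts shrinks as overlap grows.
    assert chunk_overlap < chunk_size, "overlap must be smaller than chunk size"
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']  (each chunk repeats 2 chars)
```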