PDF-Q-A-CHATBOT

Chatbot using RAG with Different Chunking Methods and Embedding Methods

📘 PDF Q&A Bot: Chunking & Embedding Playground
A Streamlit application that lets you upload a PDF, chunk its content, generate embeddings with multiple providers (Gemini, OpenAI, Cohere, HuggingFace), store them in a FAISS vector index, and ask natural-language questions whose answers are retrieved from the PDF content.


🚀 Features

📂 Upload any PDF and extract its text.

🪓 Multiple chunking methods:

  • Simple split
  • Character-based split
  • Recursive character split
  • Token-based split
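
As a rough illustration (not code from this repo), a fixed-size character split with overlap, the core idea behind the simple and character-based methods above, can be sketched in plain Python:

```python
def split_fixed(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size character split: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_fixed("a" * 1200, chunk_size=500, chunk_overlap=50)
print(len(chunks))      # → 3 (starts at 0, 450, 900)
print(len(chunks[0]))   # → 500
```

The recursive and token-based variants refine this idea: the recursive splitter tries paragraph and sentence boundaries before falling back to raw characters, and the token splitter measures `chunk_size` in tokenizer tokens rather than characters.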

🧠 Multiple embedding backends:

  • Google Gemini
  • OpenAI (support still being improved)
  • HuggingFace Sentence Transformers (support still being improved)
  • Cohere

πŸ” Semantic search using FAISS for fast retrieval.


📚 How It Works

  1. Text Extraction → PDF text is extracted using pypdf.
  2. Chunking → Text is split into smaller segments to fit embedding model limits.
  3. Embedding → The selected embedding model converts each chunk into a numerical vector.
  4. Indexing → FAISS stores the vectors for efficient similarity search.
  5. Retrieval → On a query, the most similar chunks are retrieved.
  6. Answer Generation → The model uses the retrieved chunks as context to answer.
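
Steps 3–5 can be sketched end to end; in this toy version a bag-of-words count stands in for a real embedding model, and brute-force cosine similarity stands in for FAISS (all names and example strings below are illustrative, not from this repo):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "FAISS stores vectors for similarity search",
    "Streamlit renders the chat interface",
    "pypdf extracts text from the uploaded PDF",
]
index = [(c, embed(c)) for c in chunks]  # "indexing": precompute chunk vectors

query = "how is text extracted from the pdf"
best = max(index, key=lambda item: cosine(embed(query), item[1]))
print(best[0])  # → pypdf extracts text from the uploaded PDF
```

A real index replaces the linear `max` scan with FAISS's approximate nearest-neighbor search, which is what keeps retrieval fast on large documents.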

📊 Results

🔹 Chunking Methods Comparison

| Chunking Method | Description | Pros | Cons |
|---|---|---|---|
| Simple | Splits text into fixed-size chunks. | Fast, easy to implement. | May break sentences or lose context. |
| Character | Splits by a set number of characters. | Good for uniform text length. | Can split mid-word or mid-sentence. |
| Recursive | Splits respecting paragraphs/sentences first. | Preserves context; better for QA. | Slightly slower. |
| Token | Splits based on LLM token count. | Prevents token overflow errors. | Requires a tokenizer; more complex. |

🔹 Embedding Methods Comparison

| Embedding Provider | Model Used | Strengths | Limitations |
|---|---|---|---|
| Google Gemini | models/embedding-001 | High quality, multilingual, good semantic understanding. | Requires a Google API key. |
| OpenAI | text-embedding-3-small | High accuracy, widely supported in LangChain. | Paid API after the free tier. |
| Hugging Face | sentence-transformers/all-MiniLM-L6-v2 | Free, runs locally, no API needed. | Slightly slower on large docs. |
| Cohere | embed-english-light-v3.0 | Fast, optimized for English. | Limited multilingual support. |

🔹 Chunk Size & Chunk Overlap

  • Chunk Size → the maximum number of characters or tokens in each chunk.

    • Example: chunk_size = 500 → each chunk holds ~500 characters/tokens.
    • Effect: larger chunks carry more context per query but risk hitting model token limits.
  • Chunk Overlap → the number of characters/tokens repeated between consecutive chunks.

    • Example: chunk_overlap = 50 → the last 50 characters/tokens of one chunk reappear at the start of the next.
    • Effect: higher overlap preserves context across chunk boundaries but increases processing cost.
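
Since each new chunk starts chunk_size - chunk_overlap characters after the previous one, the number of chunks a document produces follows from simple arithmetic; a hypothetical helper (not part of this repo) makes the trade-off concrete:

```python
import math

def num_chunks(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Chunks needed to cover text_len characters when consecutive
    chunks advance by chunk_size - chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((text_len - chunk_overlap) / step))

# A 10,000-character document with chunk_size = 500:
print(num_chunks(10_000, 500, 0))   # → 20 (no overlap)
print(num_chunks(10_000, 500, 50))  # → 23 (overlap costs extra chunks)
```

So a 10% overlap adds roughly 10% more chunks to embed and store, which is the "more processing cost" noted above.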

📷 Output Screenshot

(screenshot of the app's output; see the repository page for the image)
