Home
Production-shaped RAG starter over PDFs: hybrid search, reranking, and streaming answers with page-level citations.
Built by Sarma Linux. MIT licence.
Upload one or many PDFs. The app extracts text page by page, chunks it, embeds it, and answers questions across the documents. Retrieval is hybrid (dense embeddings plus BM25, fused with Reciprocal Rank Fusion) and then reranked for precision. Answers stream in with citations that point back to the exact source and page.
This is a readable, framework-free RAG implementation that does what the serious systems do without hiding any of it. The whole retrieval pipeline is a few hundred lines you can read end to end.
- Teams who have internal PDFs nobody reads and want them searchable and citable.
- Builders prototyping a documentation chatbot grounded in actual content, with sources.
- Engineers who want to understand how production-grade RAG works under the hood without a framework hiding it.
- Hybrid search. Dense cosine plus BM25, fused with RRF. Dense handles meaning, BM25 handles exact terms. See How-RAG-Works.
- Reranker step. A wide recall pool is reordered for precision by an LLM reranker, with a deterministic lexical fallback.
- Citation streaming. The chat response is NDJSON: citations first, then answer tokens, then a done event.
- Multi-document chat. Index many PDFs, ask across all of them or scope to a subset.
- Page-level highlights. Every chunk knows its page; every citation carries the source filename, page number, and a snippet.
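The fusion step in the hybrid search bullet above can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the repo's code: the function name `reciprocalRankFusion`, the chunk ids, and the constant `k = 60` (a commonly used default) are all assumptions.

```typescript
// Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank) to a
// chunk's score, where rank starts at 1. Chunks that rank well in both
// the dense and the BM25 list accumulate the highest totals.
// The name and the k = 60 default are assumptions for this sketch.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  // Sort chunk ids by fused score, highest first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Hypothetical chunk ids, best match first in each list.
const denseRanking = ["c3", "c1", "c7"];
const bm25Ranking = ["c1", "c9", "c3"];
const fused = reciprocalRankFusion([denseRanking, bm25Ranking]);
```

Because RRF uses only ranks, the cosine and BM25 scores never need to be normalised onto a common scale before they are combined.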
Next.js 14 App Router, TypeScript, pdf-parse (page-by-page), OpenAI text-embedding-3-small and gpt-4o-mini, an in-repo BM25 index, in-memory cosine similarity, Tailwind CSS.
There are two pipelines and they share one store.
Indexing. POST /api/upload receives a PDF, extracts its text page by page with the pdf-parse pagerender hook (lib/pdf.ts), and splits the joined text into overlapping fixed-size chunks while tracking the page each chunk starts on (lib/chunker.ts). The chunks are embedded in a single batched call and written into the store (lib/vector-store.ts) under their own document. Uploading a second PDF adds to the store rather than replacing it, so the store holds many documents at once. The BM25 index rebuilds its corpus on every change.
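A rough sketch of the chunking step described above, assuming character-based sizes and a newline as the page separator. The names `chunkPages` and `Chunk` are hypothetical; the actual lib/chunker.ts may differ in both signature and detail.

```typescript
interface Chunk {
  text: string;
  page: number; // 1-based page on which the chunk starts
}

// Join the pages, then slide a fixed-size window with overlap over the
// joined text. Each chunk is attributed to the page containing its first
// character. Sizes are in characters (an assumption for this sketch).
function chunkPages(pages: string[], chunkSize: number, overlap: number): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require 0 <= overlap < chunkSize");
  }
  // Record the offset at which each page begins in the joined text.
  const pageStarts: number[] = [];
  let joined = "";
  for (const page of pages) {
    pageStarts.push(joined.length);
    joined += page + "\n";
  }
  const chunks: Chunk[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < joined.length; start += step) {
    // The starting page is the last page whose offset is <= start.
    let page = 1;
    for (let i = 0; i < pageStarts.length; i++) {
      if (pageStarts[i] <= start) page = i + 1;
    }
    chunks.push({ text: joined.slice(start, start + chunkSize), page });
    if (start + chunkSize >= joined.length) break; // final window reached the end
  }
  return chunks;
}

// Two tiny made-up "pages" of 10 characters each.
const chunks = chunkPages(["aaaaaaaaaa", "bbbbbbbbbb"], 8, 2);
```

In this example the second chunk straddles the page boundary and is still attributed to page 1, which is the starting-page behaviour the paragraph describes.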
Query. POST /api/chat embeds the question, runs hybrid retrieval (dense cosine plus BM25, fused with RRF) over a wide candidate pool, and hands that pool to the reranker (lib/reranker.ts), which reorders it for precision and trims to top-k (lib/retrieval.ts). The selected chunks are numbered and placed in the prompt as grounding context, and the model is told to cite the passages it uses. The response streams as NDJSON: a citation event first, then answer tokens, then a done event (lib/citations.ts).
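The NDJSON framing described above could look roughly like this on the wire and on the client. The event shapes (`citations`, `token`, `done`) and their field names are assumptions made for illustration; check lib/citations.ts for the actual schema. The sample payload is made up.

```typescript
// Assumed event shapes; one JSON object per line.
type StreamEvent =
  | { type: "citations"; citations: { source: string; page: number; snippet: string }[] }
  | { type: "token"; text: string }
  | { type: "done" };

// Server side: serialise one event as a single newline-terminated line.
function encodeEvent(event: StreamEvent): string {
  return JSON.stringify(event) + "\n";
}

// Client side: network chunks can split a line anywhere, so buffer the
// trailing partial line until its newline arrives.
function createNdjsonParser(): (chunk: string) => StreamEvent[] {
  let buffer = "";
  return (chunk: string): StreamEvent[] => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the incomplete tail for the next call
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as StreamEvent);
  };
}

// Simulate a response that arrives in two arbitrary pieces.
const wire =
  encodeEvent({
    type: "citations",
    citations: [{ source: "supplier.pdf", page: 4, snippet: "Either party may terminate" }],
  }) +
  encodeEvent({ type: "token", text: "Notice is " }) +
  encodeEvent({ type: "token", text: "30 days." }) +
  encodeEvent({ type: "done" });

const parse = createNdjsonParser();
const mid = Math.floor(wire.length / 2);
const events = parse(wire.slice(0, mid)).concat(parse(wire.slice(mid)));

let answer = "";
for (const event of events) {
  if (event.type === "token") answer += event.text;
}
```

Sending citations first lets a client render the source list before the first answer token arrives.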
The seam that matters. Everything routes through the vector store interface (add, search, hybridSearch, documents, clear). The retrieval pipeline never reaches past that interface, so moving from the in-memory index to pgvector, Supabase Vector, Pinecone, or Qdrant is a contained change. See Swap-to-pgvector.
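To make the seam concrete, here is one possible shape for that interface with a toy in-memory implementation. The five method names come from the text above; every signature, the `StoredChunk` type, and the `InMemoryStore` class are assumptions. The sketch is synchronous for brevity, whereas a database-backed store would almost certainly return Promises.

```typescript
interface StoredChunk {
  id: string;
  documentId: string;
  page: number;
  text: string;
  embedding: number[];
}

// Assumed signatures; only the method names are taken from the wiki.
interface VectorStore {
  add(chunks: StoredChunk[]): void;
  search(queryEmbedding: number[], k: number): StoredChunk[];
  hybridSearch(query: string, queryEmbedding: number[], k: number): StoredChunk[];
  documents(): string[];
  clear(): void;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class InMemoryStore implements VectorStore {
  private chunks: StoredChunk[] = [];

  add(chunks: StoredChunk[]): void {
    this.chunks.push(...chunks);
  }

  // Brute-force cosine ranking over every stored chunk.
  search(queryEmbedding: number[], k: number): StoredChunk[] {
    return this.chunks
      .slice()
      .sort((x, y) => cosine(queryEmbedding, y.embedding) - cosine(queryEmbedding, x.embedding))
      .slice(0, k);
  }

  // Placeholder only: this toy falls back to dense search and ignores the
  // query text. The real store fuses dense results with BM25.
  hybridSearch(_query: string, queryEmbedding: number[], k: number): StoredChunk[] {
    return this.search(queryEmbedding, k);
  }

  documents(): string[] {
    return Array.from(new Set(this.chunks.map((chunk) => chunk.documentId)));
  }

  clear(): void {
    this.chunks = [];
  }
}

// Callers depend on the interface only, so the backing store can change.
const store: VectorStore = new InMemoryStore();
store.add([
  { id: "a-1", documentId: "a.pdf", page: 1, text: "termination notice", embedding: [1, 0] },
  { id: "b-1", documentId: "b.pdf", page: 3, text: "payment terms", embedding: [0, 1] },
]);
const top = store.search([1, 0], 1);
const docs = store.documents();
```

A pgvector or Pinecone adapter would implement the same five methods, leaving the retrieval code that calls them unchanged.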
- Internal knowledge base. Index a folder of policy or process PDFs and let staff ask questions in plain language, with citations back to the exact page. Hybrid search means exact terms like form numbers and error codes are found even when the phrasing differs.
- Product documentation chatbot. Embed your docs and expose /api/chat behind a small widget so users get answers grounded in your content, with sources.
- Multi-contract Q and A. Upload several contracts, ask across all of them, or tick a single document to scope a question ("what is the termination notice period in the supplier agreement").
- Teaching production RAG. Read the library files end to end. Hybrid search, reranking, and citations are each a short, readable module with no framework abstraction in the way.
- OPENAI_API_KEY is not set at request time. The client is constructed lazily so next build succeeds without a key, but requests fail without one. Set OPENAI_API_KEY in .env.local for local runs or in your host's environment for deploys.
- PDF has no extractable text. The PDF is image-only or scanned. pdf-parse extracts embedded text, not pixels. Run the file through OCR first, then upload the text-bearing output.
- Citations point to the wrong page. Page numbers are resolved from the page each chunk starts on. A chunk that spans a page boundary is attributed to its starting page. If a PDF has unusual internal structure, raise CHUNK_SIZE so chunks align more cleanly with pages.
- Answers feel thin or miss obvious content. Retrieval is returning the wrong chunks. Raise TOP_K to widen the kept set, or increase CHUNK_SIZE and CHUNK_OVERLAP so related sentences stay together. See Configuration.
- An exact term is not being found. Hybrid search should catch this through BM25. Confirm the term actually appears in the extracted text; image-rendered text will not be indexed.
- Index is empty after a restart or redeploy. Expected. The default store is in-memory and clears on restart. Re-upload, or move to pgvector for persistence.
- pnpm build fails on a missing module. Run pnpm install against the committed pnpm-lock.yaml. This repo uses pnpm; do not mix in npm or yarn lockfiles.
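The lazy client construction mentioned in the first troubleshooting item can be sketched like this. `createLazyClient` and the `Client` type are hypothetical stand-ins and do not use the OpenAI SDK; the environment is passed in as a plain object to keep the example self-contained (in the app it would presumably be `process.env`).

```typescript
// Hypothetical stand-in for an SDK client; only holds the key here.
interface Client {
  apiKey: string;
}

// Nothing is validated when the module loads, so a build step that only
// imports this file succeeds without a key. The check runs on first use.
function createLazyClient(env: Record<string, string | undefined>): () => Client {
  let cached: Client | undefined;
  return (): Client => {
    if (cached) return cached;
    const apiKey = env.OPENAI_API_KEY;
    if (!apiKey) {
      throw new Error("OPENAI_API_KEY is not set");
    }
    cached = { apiKey };
    return cached;
  };
}

// Placeholder key value for illustration only.
const getClient = createLazyClient({ OPENAI_API_KEY: "sk-example" });
const client = getClient(); // constructed on first call, reused afterwards
```

With this pattern a missing key surfaces as a request-time error and never as a build failure, which matches the symptom described in that item.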
- Architecture: indexing and query flow diagrams, component table, failure modes
- Quick-Start: clone, install, configure, first question answered
- How-RAG-Works: hybrid retrieval, reranking, and citations explained
- Configuration: all env vars, tuning chunk size and top-k
- Swap-to-pgvector: SQL schema and migration path to Postgres, keeping BM25
- Cost-and-Performance: per-question cost breakdown and latency tuning
- Deployment: Vercel one-click, Docker, self-hosted
- Roadmap: what is shipped and what is next