Home

rag-over-pdf

Production-shaped RAG starter over PDFs: hybrid search, reranking, and streaming answers with page-level citations.

Built by Sarma Linux. MIT licence.

What this is

Upload one or many PDFs. The app extracts text page by page, chunks it, embeds it, and answers questions across the documents. Retrieval is hybrid (dense embeddings plus BM25, fused with Reciprocal Rank Fusion) and then reranked for precision. Answers stream in with citations that point back to the exact source and page.

This is a readable, framework-free RAG implementation that uses the same techniques as production systems (hybrid retrieval, reranking, cited answers) without hiding any of them. The whole retrieval pipeline is a few hundred lines you can read end to end.

Who this is for

Teams who have internal PDFs nobody reads and want them searchable and citable.
Builders prototyping a documentation chatbot grounded in actual content, with sources.
Engineers who want to understand how production-grade RAG works under the hood without a framework hiding it.

Key features

Hybrid search. Dense cosine plus BM25, fused with RRF. Dense handles meaning, BM25 handles exact terms. See How-RAG-Works.
Reranker step. A wide recall pool is reordered for precision by an LLM reranker, with a deterministic lexical fallback.
Citation streaming. The chat response is NDJSON: citations first, then answer tokens, then a done event.
Multi-document chat. Index many PDFs, ask across all of them or scope to a subset.
Page-level highlights. Every chunk knows its page; every citation carries the source filename, page number, and a snippet.
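
The hybrid search feature above fuses two rankings with Reciprocal Rank Fusion. A minimal sketch of that fusion step follows, assuming each retriever returns an ordered list of chunk ids. The function name reciprocalRankFusion, the sample ids, and the constant k = 60 (a commonly used default) are illustrative assumptions, not taken from the repository.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank of d),
// where rank starts at 1. k = 60 is a common default and an assumption here.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Hypothetical chunk ids, ordered best-first by each retriever.
const denseRanking = ["c3", "c1", "c7"]; // by cosine similarity
const bm25Ranking = ["c1", "c9", "c3"]; // by BM25 score

const fused = reciprocalRankFusion([denseRanking, bm25Ranking]);
// fused ids, best first: c1, c3, c9, c7
```

Here c1 comes out on top because it sits near the top of both lists. RRF only looks at ranks, never at raw scores, which is why it can combine cosine similarities and BM25 scores that live on different scales.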

Stack

Next.js 14 App Router, TypeScript, pdf-parse (page-by-page), OpenAI text-embedding-3-small and gpt-4o-mini, an in-repo BM25 index, in-memory cosine similarity, Tailwind CSS.

How it fits together

There are two pipelines and they share one store.

Indexing. POST /api/upload receives a PDF, extracts its text page by page with the pdf-parse pagerender hook (lib/pdf.ts), and splits the joined text into overlapping fixed-size chunks while tracking the page each chunk starts on (lib/chunker.ts). The chunks are embedded together in one batched call and written into the store (lib/vector-store.ts) under their own document. Uploading a second PDF adds to the store rather than replacing it, so the store holds many documents at once. The BM25 index rebuilds its corpus on every change.
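
The chunking step described above can be sketched as follows. This is a minimal illustration, not the code in lib/chunker.ts: the name chunkPages, the Chunk shape, the newline used to join pages, and character-based sizing are all assumptions.

```typescript
interface Chunk {
  text: string;
  page: number; // 1-based page on which the chunk starts
}

function chunkPages(pages: string[], chunkSize: number, overlap: number): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }

  // Join the pages, remembering the offset at which each page begins.
  const pageStarts: number[] = [];
  let joined = "";
  for (const pageText of pages) {
    pageStarts.push(joined.length);
    joined += pageText + "\n";
  }

  const chunks: Chunk[] = [];
  const step = chunkSize - overlap; // consecutive chunks share `overlap` characters
  let pageIndex = 0;
  for (let start = 0; start < joined.length; start += step) {
    // Advance to the last page whose first character is at or before `start`.
    while (pageIndex + 1 < pageStarts.length && pageStarts[pageIndex + 1] <= start) {
      pageIndex++;
    }
    chunks.push({ text: joined.slice(start, start + chunkSize), page: pageIndex + 1 });
    if (start + chunkSize >= joined.length) break; // the text is fully covered
  }
  return chunks;
}

// Two 10-character pages, chunk size 8, overlap 2.
const chunks = chunkPages(["aaaaaaaaaa", "bbbbbbbbbb"], 8, 2);
// chunks.map((c) => c.page) is [1, 1, 2, 2]
```

The second chunk in this example straddles the page break and is still attributed to page 1, which matches the behaviour the Troubleshooting section describes for citations on chunks that span a page boundary.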

Query. POST /api/chat embeds the question, runs hybrid retrieval (dense cosine plus BM25, fused with RRF) over a wide candidate pool, and hands that pool to the reranker (lib/reranker.ts), which reorders it for precision and trims to top-k (lib/retrieval.ts). The selected chunks are numbered and placed in the prompt as grounding context, and the model is told to cite the passages it uses. The response streams as NDJSON: a citation event first, then answer tokens, then a done event (lib/citations.ts).
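
The NDJSON response described above can be pictured as one JSON object per line. The sketch below assumes hypothetical event and field names (type, citations, text, and the Citation shape); the actual wire format lives in lib/citations.ts and may differ. It shows an encoder plus a client-side parser that tolerates network chunks ending mid-line.

```typescript
// All names below are illustrative assumptions, not the repository's types.
interface Citation {
  index: number; // the number used in the prompt and in the answer, e.g. [1]
  source: string; // source filename
  page: number;
  snippet: string;
}

type StreamEvent =
  | { type: "citations"; citations: Citation[] }
  | { type: "token"; text: string }
  | { type: "done" };

// One event per line. JSON.stringify escapes newlines inside strings,
// so an encoded event never contains a raw line break.
function encodeEvent(event: StreamEvent): string {
  return JSON.stringify(event) + "\n";
}

// Incremental parser: keep the unfinished tail until its newline arrives.
function createNdjsonParser(onEvent: (event: StreamEvent) => void): (chunk: string) => void {
  let buffer = "";
  return (chunk) => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // last piece is incomplete (or empty)
    for (const line of lines) {
      if (line.trim() !== "") onEvent(JSON.parse(line) as StreamEvent);
    }
  };
}

// Server side: citations first, then answer tokens, then done.
const wire = [
  encodeEvent({
    type: "citations",
    citations: [
      { index: 1, source: "handbook.pdf", page: 12, snippet: "Notice period is 30 days." },
    ],
  }),
  encodeEvent({ type: "token", text: "The notice " }),
  encodeEvent({ type: "token", text: "period is 30 days [1]." }),
  encodeEvent({ type: "done" }),
].join("");

// Client side: feed the stream in two arbitrary pieces, split mid-line.
const received: StreamEvent[] = [];
const feed = createNdjsonParser((event) => {
  received.push(event);
});
feed(wire.slice(0, 25));
feed(wire.slice(25));
```

Sending citations before any answer tokens lets a client render the source list immediately and resolve markers such as [1] as the tokens arrive.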

The seam that matters. Everything routes through the vector store interface (add, search, hybridSearch, documents, clear). The retrieval pipeline never reaches past that interface, so moving from the in-memory index to pgvector, Supabase Vector, Pinecone, or Qdrant is a contained change. See Swap-to-pgvector.
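
One way to picture that seam is sketched below. The method names (add, search, hybridSearch, documents, clear) come from the text above, but every signature, the StoredChunk shape, and the InMemoryStore class are assumptions made for illustration; the real interface is in lib/vector-store.ts. The methods are async here so that a database-backed store could satisfy the same contract.

```typescript
interface StoredChunk {
  id: string;
  documentId: string;
  page: number;
  text: string;
  embedding: number[];
}

interface ScoredChunk {
  chunk: StoredChunk;
  score: number;
}

// Assumed signatures; only the five method names are taken from the text.
interface VectorStore {
  add(chunks: StoredChunk[]): Promise<void>;
  search(queryEmbedding: number[], k: number): Promise<ScoredChunk[]>;
  hybridSearch(queryText: string, queryEmbedding: number[], k: number): Promise<ScoredChunk[]>;
  documents(): Promise<string[]>;
  clear(): Promise<void>;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors score 0 instead of NaN
}

// Brute-force ranking: score every chunk, highest similarity first.
function rankByCosine(chunks: StoredChunk[], queryEmbedding: number[]): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(chunk.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score);
}

class InMemoryStore implements VectorStore {
  private chunks: StoredChunk[] = [];

  async add(chunks: StoredChunk[]): Promise<void> {
    this.chunks.push(...chunks);
  }
  async search(queryEmbedding: number[], k: number): Promise<ScoredChunk[]> {
    return rankByCosine(this.chunks, queryEmbedding).slice(0, k);
  }
  async hybridSearch(_queryText: string, queryEmbedding: number[], k: number): Promise<ScoredChunk[]> {
    // Dense-only in this sketch; the BM25 side and the RRF fusion are omitted for brevity.
    return this.search(queryEmbedding, k);
  }
  async documents(): Promise<string[]> {
    return Array.from(new Set(this.chunks.map((c) => c.documentId)));
  }
  async clear(): Promise<void> {
    this.chunks = [];
  }
}
```

Because callers depend only on VectorStore, a Postgres-backed class could replace InMemoryStore without the retrieval code changing, as long as it honours the same method contract.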

Real-world examples

Internal knowledge base. Index a folder of policy or process PDFs and let staff ask questions in plain language, with citations back to the exact page. Hybrid search means exact terms like form numbers and error codes are found even when the phrasing differs.
Product documentation chatbot. Embed your docs and expose /api/chat behind a small widget so users get answers grounded in your content, with sources.
Multi-contract Q&A. Upload several contracts, ask across all of them, or tick a single document to scope a question ("what is the termination notice period in the supplier agreement?").
Teaching production RAG. Read the library files end to end. Hybrid search, reranking, and citations are each a short, readable module with no framework abstraction in the way.

Troubleshooting

OPENAI_API_KEY is not set at request time. The client is constructed lazily so next build succeeds without a key, but requests fail without one. Set OPENAI_API_KEY in .env.local for local runs or in your host's environment for deploys.
PDF has no extractable text. The PDF is image-only or scanned. pdf-parse extracts embedded text, not pixels. Run the file through OCR first, then upload the text-bearing output.
Citations point to the wrong page. Page numbers are resolved from the page each chunk starts on. A chunk that spans a page boundary is attributed to its starting page. If a PDF has unusual internal structure, raise CHUNK_SIZE so chunks align more cleanly with pages.
Answers feel thin or miss obvious content. Retrieval is probably returning the wrong chunks. Raise TOP_K to widen the kept set, or increase CHUNK_SIZE and CHUNK_OVERLAP so related sentences stay together. See Configuration.
An exact term is not being found. Hybrid search should catch this through BM25. Confirm the term actually appears in the extracted text; image-rendered text will not be indexed.
Index is empty after a restart or redeploy. Expected. The default store is in-memory and clears on restart. Re-upload, or move to pgvector for persistence.
pnpm build fails on a missing module. Run pnpm install against the committed pnpm-lock.yaml. This repo uses pnpm; do not mix in npm or yarn lockfiles.
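
The first item above relies on the client being constructed lazily. A minimal sketch of that pattern follows, assuming a hypothetical ApiClient stand-in and a plain object in place of the real environment; the repository's actual client setup may look different.

```typescript
// Hypothetical stand-in for the real SDK client.
interface ApiClient {
  apiKey: string;
}

// Returns a getter that builds the client on first use and then reuses it.
// Creating the getter does not read the key, so importing a module that
// defines one at build time is harmless.
function lazyClient(readKey: () => string | undefined): () => ApiClient {
  let cached: ApiClient | undefined;
  return () => {
    if (cached) return cached;
    const apiKey = readKey();
    if (!apiKey) throw new Error("OPENAI_API_KEY is not set");
    cached = { apiKey };
    return cached;
  };
}

// A plain object stands in for the process environment in this sketch.
const env: Record<string, string | undefined> = {};
const getClient = lazyClient(() => env["OPENAI_API_KEY"]);

// Without a key, only the first real request fails.
let failedWithoutKey = false;
try {
  getClient();
} catch {
  failedWithoutKey = true;
}

// Once the key is present, the same getter succeeds and caches the client.
env["OPENAI_API_KEY"] = "sk-example";
const client = getClient();
```

The trade-off is that a missing key surfaces at request time rather than at build or startup, which is why it shows up as a failed request rather than a failed deploy.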

Wiki pages

Architecture: indexing and query flow diagrams, component table, failure modes
Quick-Start: clone, install, configure, first question answered
How-RAG-Works: hybrid retrieval, reranking, and citations explained
Configuration: all env vars, tuning chunk size and top-k
Swap-to-pgvector: SQL schema and migration path to Postgres, keeping BM25
Cost-and-Performance: per-question cost breakdown and latency tuning
Deployment: Vercel one-click, Docker, self-hosted
Roadmap: what is shipped and what is next

Repository

github.com/sarmakska/rag-over-pdf
