LegalQA is a research project exploring Retrieval-Augmented Generation (RAG) for legal document question answering. It compares single-pass RAG pipelines against a multi-hop iterative retrieval agent on U.S. Supreme Court case law from the Caselaw Access Project (CAP).
Features:
Dataset: Built from the Caselaw Access Project (U.S. Supreme Court opinions, 1984–2014): https://static.case.law/us/
Data Processing:
- Normalize raw case JSON into a slim format (cases_slim.jsonl).
- Chunk opinions into passages for retrieval.
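A minimal sketch of these two steps, assuming each slim record keeps an id and the full opinion text under a text key (field names are illustrative, not the repo's exact schema):

import json

def chunk_text(text, max_words=220, overlap=40):
    # Slide a fixed-size word window over an opinion, with overlap between passages.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks

with open("data/processed/cases_slim.jsonl") as src, \
     open("data/processed/chunks.jsonl", "w") as dst:
    for line in src:
        case = json.loads(line)
        for i, passage in enumerate(chunk_text(case["text"])):
            record = {"case_id": case["id"], "chunk_id": f"{case['id']}_{i}", "text": passage}
            dst.write(json.dumps(record) + "\n")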
Retrieval: FAISS index with bge-small-en embeddings, plus optional cross-encoder reranker.
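A sketch of query-time retrieval with the optional reranker; the cross-encoder model named here is an illustrative choice, and the FAISS index path and chunk list are assumptions carried over from the indexing step:

import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # illustrative reranker choice
index = faiss.read_index("data/processed/chunks.faiss")           # index path is an assumption
chunks = [json.loads(line) for line in open("data/processed/chunks.jsonl")]

def retrieve(query, top_k=3, rerank=True, candidates=20):
    # Dense search over the FAISS index, then optional cross-encoder reranking.
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), candidates if rerank else top_k)
    hits = [chunks[i] for i in ids[0]]
    if rerank:
        scores = reranker.predict([(query, h["text"]) for h in hits])
        hits = [hits[i] for i in np.argsort(-scores)]
    return hits[:top_k]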
Pipelines:
- Baseline: Single-pass retrieval → prompt LLM with top-k chunks.
- Iterative Agent: Multi-step reasoning loop: retrieval → self-check → query refinement → final answer.
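A sketch of that loop, reusing the baseline helpers retrieve(), build_prompt(), and ask_llm() from the workflow below; the self-check prompt and hop limit are illustrative:

def iterative_agent(query, max_hops=3):
    # Retrieve, answer, self-check, and refine the query until the check passes.
    current_query, evidence = query, []
    for hop in range(max_hops):
        evidence += retrieve(current_query, top_k=3)
        answer = ask_llm(build_prompt(query, evidence))
        check = ask_llm(
            "Does the answer below fully address the question and stay grounded in "
            "the retrieved passages? Reply YES, or reply with a refined search query.\n\n"
            f"Question: {query}\nAnswer: {answer}"
        )
        if check.strip().upper().startswith("YES"):
            break
        current_query = check.strip()   # use the model's refinement as the next hop's query
    return answer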
Evaluation Metrics:
- Answer semantic similarity (vs. gold QA set).
- Citation precision & recall (sketched after this list).
- Hallucination rate.
- Hop effectiveness (did extra retrieval help?).
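A minimal sketch of the citation precision/recall metric, assuming predicted citations are extracted from the answer elsewhere and gold citations come with each record in the gold QA set:

def citation_precision_recall(predicted, gold):
    # Set overlap between cases cited in the answer and cases cited in the gold record.
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall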
Workflow:
- Data Prep: data/raw/json/*.json → data/processed/cases_slim.jsonl → data/processed/chunks.jsonl
- Indexing: Embed chunks with BAAI/bge-small-en-v1.5 and store in FAISS (sketched after this list).
- QA Pipelines: retrieve(query) → build_prompt() → ask_llm(); then iterative_agent(query) adds a self-check and query-refinement loop.
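A minimal sketch of the indexing step (an inner-product FAISS index over normalized embeddings; the output path is an assumption):

import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

chunks = [json.loads(line) for line in open("data/processed/chunks.jsonl")]
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
vectors = embedder.encode([c["text"] for c in chunks],
                          normalize_embeddings=True, show_progress_bar=True)
index = faiss.IndexFlatIP(vectors.shape[1])   # inner product on unit vectors = cosine similarity
index.add(np.asarray(vectors, dtype="float32"))
faiss.write_index(index, "data/processed/chunks.faiss")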
Evaluation:
Compare baseline vs iterative using gold_qa.jsonl.
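A sketch of that comparison loop, reusing the pipeline functions above; the gold_qa.jsonl field names (question, answer) are assumptions, and all-MiniLM-L6-v2 from the tech stack scores semantic similarity:

import json
from sentence_transformers import SentenceTransformer, util

sim_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(prediction, gold_answer):
    # Cosine similarity between sentence embeddings of the two answers.
    emb = sim_model.encode([prediction, gold_answer], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

for row in map(json.loads, open("gold_qa.jsonl")):
    baseline = ask_llm(build_prompt(row["question"], retrieve(row["question"], top_k=3)))
    agent = iterative_agent(row["question"])
    print(row["question"][:60],
          round(semantic_similarity(baseline, row["answer"]), 3),
          round(semantic_similarity(agent, row["answer"]), 3))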
Example:
query = "What did the Supreme Court say about international child abduction?"
retrieved = retrieve(query, top_k=3)
prompt = build_prompt(query, retrieved)
answer = ask_llm(prompt)
print(answer)
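The example assumes prompting/answering helpers along these lines (OpenAI Python client v1; the model name and prompt wording are illustrative, not the repo's exact implementation):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(query, retrieved):
    # Label each retrieved passage with its case id so the model can cite it.
    context = "\n\n".join(f"[{c['case_id']}] {c['text']}" for c in retrieved)
    return ("Answer the question using only the passages below, and cite the case ids "
            f"you rely on in brackets.\n\n{context}\n\nQuestion: {query}\nAnswer:")

def ask_llm(prompt, model="gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content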
Results (sample):
- Baseline: Higher semantic similarity to gold answers.
- Iterative: More correct citations, but sometimes drifts in phrasing.
Tech Stack:
- Python, FAISS, pandas
- SentenceTransformers (BAAI/bge-small-en, all-MiniLM-L6-v2)
- OpenAI GPT models for answering and self-checking
Future Work:
- Use more sophisticated rerankers.
- Enrich gold answers with full case syllabi.
- Evaluate on other datasets (e.g., EUR-Lex, CUAD).