Project Paper

The document presents a mid-semester report on a mini project aimed at developing an AI-driven legal platform to enhance access to justice in India. It highlights the challenges of traditional legal support and proposes a comprehensive solution utilizing advanced AI techniques, including a chatbot, hybrid search, and community engagement features. The project emphasizes the importance of bridging technology with legal assistance to create a scalable and inclusive platform for users.

Indian Institute of Information Technology Surat

Mid Semester Report on


Mini Project - (CS 604)

Submitted by

M.Nagavardhan- ui22cs50

Yogesh Nade- ui22cs51

Faculty Supervisor

Dr. Shraddha Patel

Department of Computer Science and Engineering
Indian Institute of Information Technology Surat
Gujarat-394190, India

March - 2025
Acknowledgement

I am deeply grateful to everyone who supported and guided me throughout the journey of
completing this project: Empowering Access to Justice: An AI-Driven Legal Platform for Instant
Assistance, Document Insights, Lawyer Discovery, and Community Support.

I extend my sincere thanks to my faculty supervisor, Dr. Shraddha Patel, whose thoughtful
insights and steady encouragement were instrumental in shaping both the technical foundation and
broader vision of this project. Their guidance pushed me to think critically and refine my ideas.

I am also truly appreciative of our esteemed director, Dr. Rajeev Shorey, for creating an environment
that nurtures creativity and innovation — a space where ideas grow and bold solutions emerge.

A heartfelt thank you to my friends and peers, whose honest feedback, collaborative mindset, and
engaging discussions added new dimensions to this work. Their support kept me motivated and open
to new perspectives.

Most importantly, I am forever thankful to my family for their unwavering belief in me. Their constant
encouragement gave me the strength to persevere and reminded me of the purpose behind this
project — to build something meaningful and impactful.

This project is not just a technical endeavor but a step toward bridging the gap between technology
and access to justice. I am truly grateful to everyone who played a part in this incredible learning
experience.

Abstract

Access to legal assistance in India remains a significant challenge, with millions facing barriers due to
high costs, lack of awareness, and limited access to legal professionals. Traditional legal support relies
on manual consultations, which are time-consuming, expensive, and geographically restrictive. This
study explores the development of an AI-driven legal platform that combines advanced AI techniques
like Retrieval-Augmented Generation (RAG) and graph-based reasoning to bridge this gap.

The project proposes a comprehensive platform with the following features:

● An AI-powered chatbot using LLaMA2-7B fine-tuned on Indian legal data.
● Hybrid search combining ChromaDB’s semantic retrieval and Neo4j’s
graph-based legal reasoning.
●​ A user-driven community forum to promote legal awareness and collaboration.
●​ Real-time legal updates on laws, amendments, and landmark judgments.
●​ AI-powered document analysis and Q&A on uploaded legal PDFs.

The platform aims to provide a scalable, inclusive, and AI-enhanced legal solution tailored for India’s
unique legal landscape.

Keywords: AI Legal Assistant, Retrieval-Augmented Generation (RAG), ChromaDB,
Neo4j, LLaMA2-7B, Indian Legal Data, Explainable AI, Legal Tech.

Table of Contents

S.No  Title

1.  List of Tables
2.  List of Figures
3.  Abbreviations
4.  Chapter 1: Problem Statement
5.  Chapter 2: Literature Survey
6.  Chapter 3: Novelty
7.  Chapter 4: Methodology
8.  Chapter 5: Result Analysis
9.  Chapter 6: Conclusion

List of Tables

S.No  Table

1.  Table 1: Literature Survey

List of Figures

S.No  Figure

1.  Fig 4.1 RAG Architecture
2.  Fig 4.2 RAG Query Flow
3.  Fig 4.3 LLaMA2-7B Fine-Tuning Pipeline
4.  Fig 4.4 Fine-Tuning LLaMA2-7B
5.  Fig 4.5 Neo4j diagram

Abbreviations /Notations
RAG: Retrieval-Augmented Generation

LLaMA2-7B: Large Language Model Meta AI, version 2 (7-billion-parameter variant)

AI: Artificial Intelligence

NLP: Natural Language Processing

ChromaDB: Vector Database for Semantic Search

Neo4j: Graph-Based Database

BERT: Bidirectional Encoder Representations from Transformers

LEGAL-BERT: BERT fine-tuned on legal data

RoBERTa: Robustly Optimized BERT Pretraining Approach

TF-IDF: Term Frequency-Inverse Document Frequency

BM25: Best Matching 25 (ranking function for information retrieval)

VAE: Variational Autoencoder

SAC: Soft Actor-Critic

Chapter 1: Problem Statement
Empowering Access to Justice: An AI-Driven Legal Platform for Instant Assistance,
Document Insights, Lawyer Discovery, and Community Support

Access to legal assistance in India remains a critical challenge, with millions struggling to
overcome barriers like high legal costs, lack of legal awareness, and limited access to verified
legal professionals. Traditional methods of legal support often involve manual consultations,
which are time-consuming, costly, and restricted by geographical boundaries. Existing online
legal platforms lack AI-powered real-time assistance, reliable legal document insights, and
community collaboration.

Key challenges identified:

1. Limited AI Legal Assistance: Basic guidance without trusted, verifiable
sources leads to misinformation.
2. Complex Legal Documents: Users face difficulties in interpreting legal texts.
3. Nearby Lawyer Discovery: Inefficient access to verified legal professionals.
4. Lack of Community Engagement: Few platforms provide
collaborative legal discussions.
5. Delayed Legal Updates: Access to recent laws and landmark judgments is often slow.

Chapter 2: Related Work and Literature Survey

Resource | Year | Title | Algorithm + Concept | Limitation
Resource1 | 2024 | AI-ML-Based Legal Assistant for Contracts | NLP, RAG, Transformer Models | No fine-tuning on legal QA datasets or Neo4j use.
Resource2 | 2024 | Legal AI for Document Retrieval | BERT, Semantic Search | Lacks graph-based case-law reasoning.
Resource3 | 2024 | Custom GPT Legal Model | GPT, OCR, UI integration | No graph-based reasoning, community Q&A, or RAG.
Resource4 | 2024 | Transformer-Based Legal Models | BERT, LEGAL-BERT, RoBERTa, TF-IDF, BM25 | No graph-based reasoning, real-time updates, or generative QA.
Resource5 | 2024 | SAC-VAE for Legal Text Summarization | VAE for dimensionality reduction, SAC for policy learning | Complexity, domain-specific focus, and high computational resource requirements.

Table 1: Literature Survey

Chapter 3: Novelty of the AI Legal Platform

Round-the-Clock AI Legal Support: Get 24/7 legal assistance powered by LLaMA2-7B,
fine-tuned on Indian legal data to provide accurate, context-aware advice tailored to
local laws.

Smart Legal Search with RAG + Neo4j: Combines ChromaDB’s advanced search
technology with Neo4j’s graph-based legal reasoning, offering deeper insights into case
laws and legal precedents.

Effortless Legal Document Analysis: Upload legal PDFs, quickly extract key points, and
ask AI-powered questions — simplifying complex legal jargon for everyone.

Community-Powered Legal Q&A: Join an interactive platform where both legal experts and
the public can ask, answer, and validate legal questions, encouraging collaborative
problem-solving.

Live Legal Updates and Insights: Stay informed with real-time updates on new laws,
important court rulings, and amendments — ensuring you're always up to date.

Accessible and Scalable Legal Aid: Built for individuals, small businesses, and
underrepresented communities, making legal help more inclusive and affordable through AI.

Chapter 4: Methodology

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) combines information retrieval with text generation
models, enhancing their ability to produce accurate and contextually relevant responses by
incorporating external knowledge sources. Proposed by researchers at Facebook AI, RAG
bridges the gap between static language models and dynamic information retrieval systems.

Fig 4.1: RAG Model Architecture

RAG Workflow
●​ Query Processing
○​ The user's legal query is preprocessed (tokenization, normalization).
○​ Embeddings are generated using a pre-trained
transformer model (LLaMA2-7B).
●​ Retrieval Step
○​ Vector embeddings of legal documents are stored in ChromaDB.
○​ Graph-based relationships between case laws, statutes, and
precedents are maintained in Neo4j.
○​ Hybrid retrieval combines vector similarity search (cosine
similarity) and graph traversal algorithms.
●​ Generation Step
○​ Retrieved documents and nodes are passed as context to
the LLaMA2-7B model.
○​ The model generates human-like, legally grounded responses.
●​ Post-processing
○​ Responses are filtered to remove irrelevant information.
○​ Citations are validated against authoritative sources.
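
The four workflow stages above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names (`preprocess`, `retrieve`, `build_prompt`, `generate`, `postprocess`) are hypothetical, the retriever is a token-overlap stand-in for the ChromaDB and Neo4j hybrid retrieval, the three-passage corpus is illustrative, and `generate` is a stub that returns a canned answer in place of a call to the fine-tuned LLaMA2-7B model.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # citation label (illustrative)
    text: str


# Toy corpus; the described system keeps embeddings in ChromaDB and relations in Neo4j.
CORPUS = [
    Passage("CrPC s.436", "Bail shall be granted in bailable offences."),
    Passage("CrPC s.437", "Bail in non-bailable offences is at the discretion of the court."),
    Passage("IPC s.378", "Theft is the dishonest taking of movable property."),
]


def preprocess(text: str) -> list[str]:
    """Query processing: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(tokens: list[str], k: int = 2) -> list[Passage]:
    """Retrieval stand-in: rank passages by the number of shared tokens."""
    def overlap(p: Passage) -> int:
        return len(set(tokens) & set(preprocess(p.text)))
    ranked = sorted(CORPUS, key=overlap, reverse=True)
    return [p for p in ranked[:k] if overlap(p) > 0]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Generation step input: retrieved passages are placed before the question."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations."


def generate(prompt: str) -> str:
    """Stub for the language model; the second citation is deliberately unsupported."""
    return "Bail is a matter of right in bailable offences [CrPC s.436] [CrPC s.999]."


def postprocess(answer: str, passages: list[Passage]) -> str:
    """Post-processing: drop bracketed citations that match no retrieved source."""
    allowed = {p.source for p in passages}
    cleaned = re.sub(
        r"\[([^\]]+)\]",
        lambda m: m.group(0) if m.group(1) in allowed else "",
        answer,
    )
    return re.sub(r"\s+", " ", cleaned).replace(" .", ".").strip()


def answer_query(query: str) -> str:
    passages = retrieve(preprocess(query))
    return postprocess(generate(build_prompt(query, passages)), passages)
```

Calling `answer_query("What are the bail provisions for bailable offences?")` keeps the citation to a retrieved passage and removes the one that no retrieved source supports, which is the simplest form of the citation validation named in the post-processing stage.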

Fig 4.2: RAG Query Flow

Fine-Tuning LLaMA2-7B
Fine-tuning involves training the LLaMA2-7B model on domain-specific datasets to
specialize it for legal applications. We used the following datasets:

● LawyerChat: A corpus of legal conversations.
● FALQU: Frequently Asked Legal Questions dataset.
● JEC-QA: Judicial Exam Corpus for legal question-answer pairs.

Fine-Tuning Steps:

1.​ Preprocessing
○​ Tokenizing and formatting the data into instruction-based prompts.
○​ Padding/truncating sequences to match model input size.
2.​ Training
○​ Using LoRA (Low-Rank Adaptation) to fine-tune select
layers without modifying the entire model.
○​ Optimizing with AdamW optimizer and a learning rate scheduler.
3.​ Evaluation
○​ Validating on a hold-out set and calculating metrics like BLEU,
ROUGE, and perplexity
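
The preprocessing step can be illustrated with a short sketch. The prompt template, the function names, and the padding id of 0 are assumptions made for illustration; the report does not specify the template or tokenizer settings actually used.

```python
def format_example(question: str, answer: str) -> str:
    """Wrap a QA pair in an instruction-style prompt (hypothetical template)."""
    return (
        "### Instruction:\nAnswer the legal question.\n\n"
        f"### Question:\n{question}\n\n"
        f"### Response:\n{answer}"
    )


def pad_or_truncate(token_ids: list[int], max_len: int, pad_id: int = 0):
    """Return (ids, attention_mask), each exactly max_len long.

    Sequences longer than max_len are cut; shorter ones are right-padded.
    The mask is 1 for real tokens and 0 for padding.
    """
    ids = token_ids[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding
```

The attention mask lets the training loop ignore padded positions, so padding does not contribute to the loss.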

Fig 4.3: LLaMA2-7B Fine-Tuning Pipeline

Graph-Based Retrieval with Neo4j

Neo4j, a graph database, helps model legal data by capturing relationships like precedents,
citations, and references.

Graph Schema:

●​ Nodes: Case Laws, Statutes, Legal Principles


●​ Edges: CITES, REFERS_TO, SIMILAR_TO

Traversal Algorithms:

●​ BFS (Breadth-First Search): For exploring case law references.


●​ Personalized PageRank: To rank relevant legal statutes.
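
Both traversal algorithms can be sketched in plain Python on a small in-memory graph. This is an illustration only: the four-node citation graph and the function names are hypothetical, the damping factor and iteration count are assumed defaults, and the described system would run the traversal inside Neo4j rather than in application code.

```python
from collections import deque

# Hypothetical citation graph: each case maps to the cases it cites.
CITES = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}


def bfs_citations(graph, start, max_depth):
    """Return {case: hops} for cases reachable from start within max_depth hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    del depth[start]
    return depth


def personalized_pagerank(graph, seed, damping=0.85, iters=50):
    """Power iteration in which every teleport jumps back to the seed node."""
    nodes = list(graph)
    rank = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        new = {n: 0.0 for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:
                # Dangling node: return its mass to the seed.
                new[seed] += damping * rank[n]
        new[seed] += 1.0 - damping
        rank = new
    return rank
```

BFS answers "which cases are within two citations of this one", while personalized PageRank orders nodes by how strongly they are connected to the seed, so a node reached along several citation paths ranks above one reached along a single path.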

Hybrid Retrieval Model

The hybrid model merges vector-based and graph-based retrieval strategies:

Final Score = α × Vector Score + (1 − α) × Graph Score

where α is a tunable hyperparameter.

● Vector Score: Cosine similarity between query and document embeddings.
● Graph Score: Relevance score from Neo4j traversal.

Benefits:

● Improves result accuracy by leveraging both semantic similarity
and legal relationships.
●​ Mitigates the limitations of pure vector search by incorporating domain knowledge.
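
A minimal sketch of the score fusion follows. The min-max rescaling step is an assumption added here so that the two score types share a common range before they are mixed; the report does not state how, or whether, scores are normalized. The function names and the default α of 0.7 are likewise illustrative.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and graph scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(vector_scores, graph_scores, alpha=0.7):
    """Rank documents by alpha * vector score + (1 - alpha) * graph score.

    Both inputs are non-empty dicts of document id to raw score. A document
    missing from one retriever contributes 0 for that component.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    v, g = min_max(vector_scores), min_max(graph_scores)
    final = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * g.get(doc, 0.0)
        for doc in set(v) | set(g)
    }
    return sorted(final.items(), key=lambda item: item[1], reverse=True)
```

With α = 1 the ranking reduces to pure vector search and with α = 0 to pure graph relevance, so α can be tuned on a validation set of queries.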

Chapter 5: Result Analysis
Dataset Used

1. Indian Legal Datasets for RAG-Based AI

● A curated collection of official legal documents, case
laws, and acts from trusted Indian legal sources.
● Includes statutes, Supreme Court & High Court
judgments, IPC, CrPC, and legal codes.
● Used to generate document embeddings for ChromaDB-based retrieval.

2. Fine-Tuning Datasets for LLaMA2-7B

● LawyerChat: Dataset containing Indian legal Q&A pairs
to improve conversational understanding.
● FALQU: A legal dataset covering frequently asked legal queries
and their expert responses.
● JEC-QA: Judicial and case-law-based question-answer dataset
used to enhance legal reasoning.
● Synthetic Data Generation: AI-generated legal QA pairs for
domain-specific adaptation.

3. Knowledge Graph Data (Neo4j Integration)

● Structured case-law data linking precedents, legal
entities, and statutory provisions.
● Enables contextual and relational understanding of legal
documents.
● Used for graph-based legal retrieval and reasoning.

The core of this legal AI system is Retrieval-Augmented Generation (RAG), which
improves response accuracy by retrieving relevant legal text before generating an
answer. This reduces the risk that the AI hallucinates information, because
responses are grounded in trusted legal documents.

How RAG Works in This Project

Step 1: Creating Legal Document Embeddings (ChromaDB)

●​ All legal texts (IPC, CrPC, Constitution, Supreme Court judgments, etc.)
are converted into vector embeddings using sentence-transformers
(like BERT-based models).
●​ These embeddings capture semantic meaning, allowing AI to
retrieve relevant legal information instead of relying on
keyword matches.
●​ The embeddings are stored in ChromaDB, a high-performance vector database
designed for fast similarity searches.

Step 2: Semantic Search Using User Queries

●​ When a user asks a legal question (e.g., "What are the bail
provisions under IPC?"), the system converts the query into an
embedding.
●​ This embedding is then matched against ChromaDB’s stored legal
document embeddings to retrieve the most relevant legal
sections, case laws, and provisions.
●​ Unlike traditional legal search engines (which rely on keywords),
this approach allows the system to understand the intent of the
query and fetch the most contextually relevant results.
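
The matching in Step 2 comes down to a nearest-neighbour search under cosine similarity. The sketch below shows that computation in plain Python. The three-dimensional vectors and document ids are hand-made for illustration, and the function names are hypothetical; in the described system a sentence-transformer model would produce the embeddings and ChromaDB would run the search.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs: dict[str, list[float]], k: int = 2):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Illustrative embeddings only; real ones have hundreds of dimensions.
DOCS = {
    "bail_provision": [0.9, 0.1, 0.0],
    "theft_definition": [0.0, 0.2, 0.9],
    "anticipatory_bail": [0.8, 0.3, 0.1],
}

hits = top_k([1.0, 0.2, 0.0], DOCS, k=2)
```

Because the score depends on the direction of the vectors and not on shared words, a query about bail retrieves both bail-related entries even though their ids and texts differ.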

Step 3: Passing Retrieved Legal Context to LLaMA2-7B for Response Generation

●​ The retrieved legal text is then fed into the LLaMA2-7B model, which is fine-tuned on
Indian legal datasets.
●​ The AI model generates responses based on both:
1.​ The retrieved legal text (retrieved via ChromaDB)
2.​ Its own knowledge from fine-tuning
●​ This reduces hallucination and ensures AI responses are factually accurate and
legally grounded.

Fine-Tuning LLaMA2-7B on Indian Legal Datasets

To enhance accuracy, the AI model is fine-tuned on legal question-answer pairs and
structured legal texts.

Fine-Tuning Process

1. Dataset Curation: The model is trained using Indian legal datasets, including:
○ LawyerChat (Legal Q&A dataset)
○ FALQU (Legal argumentation dataset)
○​ JEC-QA (Judicial case-law question-answer dataset)
○​ Synthetic data (generated using case laws and legal provisions)
2.​ LoRA-Based Training:
○​ Low-Rank Adaptation (LoRA) is used to fine-tune LLaMA2-7B efficiently on
consumer hardware.
○​ 8-bit quantization is applied to reduce memory
usage while preserving accuracy.
3. Legal Language Adaptation:
○ The model is trained to understand legal terminology, citations,
and act references.
○ AI-generated responses follow a structured legal format (e.g., "As
per IPC Section 376, the punishment for...").
4. Evaluation & Optimization:
○ Model responses are evaluated using BLEU, ROUGE, and BERTScore
to measure correctness.
○​ Human legal experts assess responses to ensure factual accuracy
and coherence.
○​ The best-performing model is integrated into the RAG system.
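
A short calculation shows why LoRA makes fine-tuning on modest hardware feasible. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors, B (d_out × r) and A (r × d_in), whose product is added to the frozen weights. The sketch below counts trainable parameters for one square projection matrix at the 4096-dimensional hidden size of LLaMA2-7B. The rank of 8 is an assumed value, since the report does not state the rank used, and the function names are illustrative.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the full weight matrix that LoRA keeps frozen."""
    return d_out * d_in


hidden = 4096  # hidden size of LLaMA2-7B
rank = 8       # assumed LoRA rank, not taken from the report

lora = lora_trainable_params(hidden, hidden, rank)
full = full_params(hidden, hidden)
fraction = lora / full
```

Under these assumptions each adapted matrix trains 65,536 parameters instead of 16,777,216, about 0.39% of the full matrix, and the 8-bit quantization mentioned above further reduces the memory needed to hold the frozen weights.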

Fig 4.4: Fine-Tuning LLaMA2-7B

Enhancing Case-Law Retrieval with Neo4j (Graph Database)
To further improve case-law reasoning, the system integrates Neo4j, a graph-based database
that models legal relationships.

Why Use Neo4j for Legal Data?

Legal cases and statutes have complex interconnections (e.g., one case may cite multiple previous
cases). A graph-based approach helps model:
● Case citations (Which cases refer to which?)
● Act-to-section relationships (Which sections fall under which act?)
● Lawyer & judge connections (Which lawyers have worked on which cases?)

How Neo4j Works in This System

1. Legal Data Structuring
○ Court judgments, legal provisions, and case-law citations are converted into
nodes and relationships.
○​ Example: A judgment citing another case is stored as a
“Cites” relationship in Neo4j.
2.​ Graph-Based Querying for Better Case Retrieval
○​ When a user asks a case-law-related question, the system
queries Neo4j to find relevant cases.
○​ This ensures that the AI retrieves relevant precedents,
improving legal reasoning.
3.​ Integration with RAG for Hybrid Search
○​ If a legal question requires a combination of case-law reasoning and
statutory provisions, Neo4j and ChromaDB are queried together, providing
richer legal context.
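
The structuring and querying steps above might translate into Cypher statements like the ones built below. This is a sketch under assumptions: the `Case` node label and `id` property are hypothetical, only the `CITES` relationship type comes from the graph schema in Chapter 4, and the helper names are illustrative. The functions only build a query string and a parameter dictionary; they do not connect to a database.

```python
def cite_statement(citing_id: str, cited_id: str):
    """Cypher that records 'citing case CITES cited case', creating nodes if absent."""
    query = (
        "MERGE (a:Case {id: $citing}) "
        "MERGE (b:Case {id: $cited}) "
        "MERGE (a)-[:CITES]->(b)"
    )
    return query, {"citing": citing_id, "cited": cited_id}


def precedents_query(case_id: str, max_hops: int = 2):
    """Cypher that finds cases reachable through 1..max_hops CITES relationships.

    The hop limit is validated and interpolated as an integer because a
    variable-length bound cannot be passed as a query parameter.
    """
    if not isinstance(max_hops, int) or not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be an integer between 1 and 5")
    query = (
        f"MATCH (c:Case {{id: $id}})-[:CITES*1..{max_hops}]->(p:Case) "
        "RETURN DISTINCT p.id AS precedent"
    )
    return query, {"id": case_id}
```

Case identifiers are always passed as parameters rather than concatenated into the query text, which avoids injection through user-supplied input. With the official Neo4j Python driver, each pair could then be executed with `session.run(query, params)`.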

Fig 4.5: Neo4j diagram

Hybrid Query Processing (Combining Keyword & Semantic Search)
Unlike traditional keyword-based legal search engines, this project employs hybrid search, which
combines:

● ChromaDB (Semantic Search) – Finds relevant legal provisions based on meaning.
● Neo4j (Graph Search) – Retrieves case-law citations and structured legal relations.
● BM25 (Keyword Matching) – Ensures exact legal terms are considered.

How Hybrid Search Works in Legal AI

User Query Understanding

AI determines whether the query is statutory (law-based), case-law-related, or mixed.

Retrieving Relevant Legal Data

If statutory, ChromaDB is queried for legal provisions & acts.

If case-law, Neo4j is queried for similar past cases.

If mixed, both are combined to generate a comprehensive response.
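
The routing logic described in these two steps can be sketched with a simple keyword heuristic. The cue-word sets, the fallback to "mixed" when no cue is found, and the function names are all assumptions for illustration; the report does not say how the query type is determined, and a learned classifier could take the place of `classify_query`. The two search functions are passed in as arguments so the sketch stays independent of any database client.

```python
import re

# Hypothetical cue words for each query type.
STATUTE_CUES = {"section", "sections", "act", "ipc", "crpc", "article",
                "provision", "provisions", "statute"}
CASE_CUES = {"case", "cases", "judgment", "judgments", "precedent",
             "precedents", "ruling", "rulings"}


def classify_query(query: str) -> str:
    """Label a query as 'statutory', 'case-law', or 'mixed'."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    statutory = bool(tokens & STATUTE_CUES)
    case_law = bool(tokens & CASE_CUES)
    if statutory and not case_law:
        return "statutory"
    if case_law and not statutory:
        return "case-law"
    return "mixed"  # both kinds of cue, or none: consult both stores


def route(query: str, search_vectors, search_graph):
    """Call the retriever(s) matching the query type and merge their results."""
    kind = classify_query(query)
    results = []
    if kind in ("statutory", "mixed"):
        results += search_vectors(query)
    if kind in ("case-law", "mixed"):
        results += search_graph(query)
    return kind, results
```

For example, `classify_query("What are the bail provisions under IPC?")` yields `"statutory"`, so only the vector store would be consulted for that query.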

Generating Final Response

Retrieved documents are passed to LLaMA2-7B, which synthesizes a legal response.

AI adds citations & reasoning, making responses more legally sound.
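
The keyword-matching component of the hybrid search, BM25, can be written out in a few lines. The sketch below implements the Okapi BM25 formula with the common defaults k1 = 1.5 and b = 0.75 and a non-negative IDF variant. These parameter values, the function name, and the pre-tokenized input format are assumptions, since the report does not describe its BM25 configuration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs (a non-empty list) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

Rare terms such as a specific section number receive a high IDF weight, which is why BM25 complements semantic search when a query contains an exact legal reference.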

Current Results

●​ Training Loss: 0.073


●​ Validation Loss: 0.128
●​ BLEU Score: 0.81
●​ ROUGE Score: 0.76

Expected Results

We aim for the following:

● Improved accuracy due to fine-tuning LLaMA2-7B.
●​ Enhanced retrieval precision with hybrid models (combining ChromaDB and Neo4j).
●​ Greater explainability through graph-based legal relationships.

Chapter 6: Conclusion

This project explores the application of RAG for AI-powered legal assistants.
By integrating LLaMA2-7B, ChromaDB, and Neo4j, we create a system that generates
accurate legal responses by combining semantic search and graph-based reasoning.
Fine-tuning LLaMA2-7B on domain-specific datasets further boosts performance, while
the hybrid retrieval model balances vector and graph search. Future work will focus on
scaling the system with more legal datasets and refining the hybrid retrieval strategy for
better real-world applicability.

Resources

Fig 4.1: https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.leewayhertz.com%2Fadvanced-rag%2F&psig=AOvVaw01obkb85CNOCDtc7Fgoz-c&ust=1741574724309000&source=images&cd=vfe&opi=89978449&ved=0CBQQjRxqFwoTCMiGj7L--4sDFQAAAAAdAAAAABA

Fig 4.4: https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.researchgate.net%2Ffigure%2FThe-approach-to-fine-tuning-the-pretrained-Llama-2-model-for-text-classification_fig2_373451301&psig=AOvVaw106m0JGrLSS0CouY8vHhpo&ust=1741575026160000&source=images&cd=vfe&opi=89978449&ved=0CBQQjRxqFwoTCOCem9-4sDFQAAAAAdAAAAAB

Fig 4.5: https://www.google.com/url?sa=i&url=https%3A%2F%2Fblog.gopenai.com%2Frag-application-with-neo4j-constructed-knowledge-graphs-and-vector-index-6178c9bb8386&psig=AOvVaw1pG4edlDsRWaJ-IAuw2POb&ust=1741575345862000&source=images&cd=vfe&opi=89978449&ved=0CBQQjRxqFwoTCLi5vOKA_IsDFQAAAAAdAAAAABAQ
