Source Code Analysis using Generative AI
Problem Statement & Usage:
Generative AI can be used across SDLC stages to automate tasks in the requirement analysis, design, development, and testing phases.
In the development and testing phases, project team members and testers need to understand the code and its functionality quickly to get an overview of the project. New team members also need to get familiar with the code base and the project.
We can ingest any Python-based repository consisting of modules and .py files and perform QA on the code, asking about the functionality, purpose, or usage of a particular class, function, or code module: what it does and what it contains.
Solution Approach
Main Components:
• Repository Cloning: Automatically duplicate the target repository for examination.
• File Loading and Parsing: Retrieve and parse the Python files from the repository.
• Chunking: Divide the code into smaller, manageable sections while maintaining context.
• Embedding Generation: Generate vector representations of code snippets using embedding models.
• Knowledge Base Creation: Store these embeddings in a vector database for fast retrieval.
• Querying with LLM: Use a language model to answer specific queries about the code.
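A minimal sketch of the first two components, assuming a LangChain-based stack with GitPython; the repository URL and local path are placeholders, not the project's actual values.

from git import Repo
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import Language

repo_path = "test_repo"  # placeholder local path
# 1. Clone the target repository for examination
Repo.clone_from("https://github.com/<user>/<repo>", to_path=repo_path)

# 2. Load and parse only the .py files, preserving code structure
loader = GenericLoader.from_filesystem(
    repo_path,
    glob="**/*",
    suffixes=[".py"],
    parser=LanguageParser(language=Language.PYTHON),
)
documents = loader.load()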
Chunking Strategy
Different chunking strategies and techniques are available depending on the type of data, the problem statement, and the retrieval complexity.
• Character Splitting - Simple static character chunks of data
• Recursive Character Text Splitting - Recursive chunking based on a list of separators.
• Document Specific Splitting - Various chunking methods for different document types (PDF, Python, Markdown)
• Semantic Splitting - Embedding-walk-based chunking
• Agentic Splitting - Agentic chunking based on propositions.
Example: "Vijayant went to the ground. He likes walking." > ["Vijayant went to the ground.", "Vijayant likes walking."]
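As an illustration of recursive character splitting, a minimal sketch; the separators, chunk size, and file name are illustrative choices, not values from the project.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # try paragraphs, then lines, words, characters
    chunk_size=200,
    chunk_overlap=20,
)
chunks = splitter.split_text(open("example.py").read())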
Context-Aware Chunking:
We have used context-aware splitting here, which chunks the code based on its structure. It extracts the code inside a Python function and keeps track of which chunks belong to which function, labelling each chunk under a particular function or class.
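In LangChain this can be done with the Python-aware recursive splitter, which prefers class and function boundaries before falling back to smaller separators; the chunk sizes below are illustrative.

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # split on class/def boundaries first
    chunk_size=500,
    chunk_overlap=50,
)
texts = python_splitter.split_documents(documents)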
Embeddings:
We have used the OpenAI API (OpenAI embeddings for the dense vectors, GPT-4 for answering) and stored the embeddings in a Chroma vector DB.
We have also added memory to the QA chats by summarizing the chat history and appending it to the current question being asked. In LangChain we have used ConversationSummaryMemory.
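A sketch of this setup, assuming OpenAIEmbeddings and GPT-4 via LangChain; the persist directory is a placeholder.

from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.memory import ConversationSummaryMemory
from langchain.vectorstores import Chroma

# Embed the code chunks and store them in Chroma
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings, persist_directory="./db")

# Summarize prior turns and feed the summary into each new question
llm = ChatOpenAI(model="gpt-4")
memory = ConversationSummaryMemory(
    llm=llm, memory_key="chat_history", return_messages=True
)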
Retrieval Process (Pre & Post)
Pre-retrieval optimizations: Before starting the search, the system refines the query for
better results, for example through query transformation and query routing.
Example: A query like "latest advances in AI" could be transformed into "recent
breakthroughs in artificial intelligence" to better target relevant documents.
Example: If the query is about scientific research, routing could direct it to a specialized
database of research papers rather than a general knowledge base.
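A minimal sketch of query transformation, using the LLM itself to rewrite the query before retrieval; the prompt wording is an assumption, not the project's actual prompt.

from langchain.chat_models import ChatOpenAI

rewriter = ChatOpenAI(model="gpt-4", temperature=0)
query = "latest advances in AI"
rewritten = rewriter.predict(
    "Rewrite this search query to be more specific for document retrieval: " + query
)  # e.g. "recent breakthroughs in artificial intelligence"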
Enhanced retrieval techniques: During the search, hybrid search combines keyword
and semantic search, ensuring a thorough scan. Through chunking and vectorization, large
documents are broken into smaller sections that are vectorized.
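Hybrid search is not part of the current pipeline, but as one possible implementation, LangChain's EnsembleRetriever can combine a BM25 keyword retriever with the vector store; the weights are illustrative, and BM25Retriever requires the rank_bm25 package.

from langchain.retrievers import BM25Retriever, EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(texts)       # lexical matching
semantic_retriever = db.as_retriever(search_kwargs={"k": 5})  # vector similarity
hybrid_retriever = EnsembleRetriever(
    retrievers=[keyword_retriever, semantic_retriever],
    weights=[0.5, 0.5],
)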
Post-retrieval refinements: After gathering the information, reranking and filtering steps
assess the relevance of the retrieved chunks. Rather than simply selecting the top results,
these processes surface the most useful data.
We have used MMR (Maximal Marginal Relevance) retrieval to improve the diversity and relevance of the search results. We also use a top-k setting to retrieve the k most similar documents based on cosine similarity.
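Both options map onto the vector store's retriever interface; the value of k below is illustrative.

# MMR balances similarity to the query against diversity among results
retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 8},
)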
For RAG retrieval we have used LangChain's ConversationalRetrievalChain pipeline, which takes care of the internal prompting and querying.
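Wiring the pieces together, a sketch of how the chain can be assembled from the LLM, retriever, and summary memory defined above:

from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(
    llm,                  # GPT-4 chat model from the embeddings section
    retriever=retriever,  # MMR retriever over the Chroma store
    memory=memory,        # conversation summary memory
)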
Example Use Case:
In this example, the system is used to query a codebase to understand specific parts, such as a machine learning training pipeline. By asking a question like "What is happening in the training pipeline?", the language model retrieves relevant code snippets and provides a detailed explanation of the process. The code below demonstrates this interaction by running the query and printing the explanation.
question = "what is happening in training pipeline?"
result = qa(question)
print(result['answer'])
question = "what is data ingestion class and what does it do ?"
result = qa(question)
print(result['answer'])
The purpose of the data ingestion class is to ingest data from the source formats. It is located in the components module. This includes downloading a file from the passed URL and extracting the zip into the given data directory.
Possible Enhancements
1. Add advanced RAG methods such as query transformation and routing, and post-retrieval steps that add more context.
2. Devise a strategy to chunk the code docstrings and comments, create summaries of the docstrings, generate their embeddings, and store them in the vector DB.
3. Also add metadata during indexing.