Source Code Analysis using Generative AI
Problem Statement & Usage:
Generative AI can be used across SDLC stages to automate tasks in the requirement analysis, design, development, and testing phases.
In the development and testing phases, project team members and testers need to understand the code and its functionality quickly to get an overview of the project. New team members also need to get familiar with the code base and the project.
We can ingest any Python-based repository consisting of modules and .py files and perform QA on the code, asking about the functionality, purpose, or usage of a particular class, function, or code module: what it does and what it contains.
Solution Approach
Main Components:
• Repository Cloning: Automatically duplicate the target repository for examination.
• File Loading and Parsing: Retrieve and parse the Python files from the repository.
• Chunking: Divide the code into smaller, manageable sections while maintaining context.
• Embedding Generation: Generate vector representations of code snippets using embedding models.
• Knowledge Base Creation: Store these embeddings in a vector database for fast retrieval.
• Querying with LLM: Use a language model to answer specific queries about the code.
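A minimal sketch of the first two components, assuming a LangChain-based stack with GitPython; the repository URL and local path are placeholders, not the project's actual values.

from git import Repo
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
from langchain.text_splitter import Language

repo_path = "test_repo"  # placeholder local path
# 1. Clone the target repository for examination
Repo.clone_from("https://github.com/<user>/<repo>", to_path=repo_path)

# 2. Load and parse only the .py files, preserving code structure
loader = GenericLoader.from_filesystem(
    repo_path,
    glob="**/*",
    suffixes=[".py"],
    parser=LanguageParser(language=Language.PYTHON),
)
documents = loader.load()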
Chunking Strategy
Different chunking strategies and techniques are available depending on the type of data, the problem statement, and the retrieval complexity.
• Character Splitting - Simple static character chunks of data
• Recursive Character Text Splitting - Recursive chunking based on a list of separators.
• Document Specific Splitting - Various chunking methods for different document types (PDF, Python, Markdown)
• Semantic Splitting - Embedding-walk-based chunking
• Agentic Splitting - Agentic chunking based on propositions.
Example: "Vijayant went to the ground. He likes walking." > ["Vijayant went to the ground.", "Vijayant likes walking."]
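As an illustration of recursive character splitting, a minimal sketch; the separators, chunk size, and file name are illustrative choices, not values from the project.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # try paragraphs, then lines, words, characters
    chunk_size=200,
    chunk_overlap=20,
)
chunks = splitter.split_text(open("example.py").read())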
Context-Aware Chunking:
We have used context-aware splitting here, which chunks the code based on its structure. It extracts the code inside a Python function and keeps track of which chunks belong to which function, labelling each chunk under a particular function or class.
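In LangChain this can be done with the Python-aware recursive splitter, which prefers class and function boundaries before falling back to smaller separators; the chunk sizes below are illustrative.

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # split on class/def boundaries first
    chunk_size=500,
    chunk_overlap=50,
)
texts = python_splitter.split_documents(documents)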
Embeddings:
We have used the OpenAI API (OpenAI embeddings for the dense vectors, GPT-4 for answering) and stored the embeddings in a Chroma vector DB.
We have also added memory to the QA chats by summarizing the chat history and appending it to the current question being asked. In LangChain we have used ConversationSummaryMemory.
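A sketch of this setup, assuming OpenAIEmbeddings and GPT-4 via LangChain; the persist directory is a placeholder.

from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.memory import ConversationSummaryMemory
from langchain.vectorstores import Chroma

# Embed the code chunks and store them in Chroma
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings, persist_directory="./db")

# Summarize prior turns and feed the summary into each new question
llm = ChatOpenAI(model="gpt-4")
memory = ConversationSummaryMemory(
    llm=llm, memory_key="chat_history", return_messages=True
)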
Retrieval Process (Pre & Post)
Pre-retrieval optimizations: Before starting the search, the system refines the query for
better results, for example through query transformation and query routing.
Example: A query like "latest advances in AI" could be transformed into "recent
breakthroughs in artificial intelligence" to better target relevant documents.
Example: If the query is about scientific research, routing could direct it to a specialized
database of research papers rather than a general knowledge base.
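A minimal sketch of query transformation, using the LLM itself to rewrite the query before retrieval; the prompt wording is an assumption, not the project's actual prompt.

from langchain.chat_models import ChatOpenAI

rewriter = ChatOpenAI(model="gpt-4", temperature=0)
query = "latest advances in AI"
rewritten = rewriter.predict(
    "Rewrite this search query to be more specific for document retrieval: " + query
)  # e.g. "recent breakthroughs in artificial intelligence"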
Enhanced retrieval techniques: During the search, hybrid search combines keyword
and semantic search, ensuring a thorough scan. Through chunking and vectorization, large
documents are broken into smaller sections that are vectorized.
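Hybrid search is not part of the current pipeline, but as one possible implementation, LangChain's EnsembleRetriever can combine a BM25 keyword retriever with the vector store; the weights are illustrative, and BM25Retriever requires the rank_bm25 package.

from langchain.retrievers import BM25Retriever, EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(texts)       # lexical matching
semantic_retriever = db.as_retriever(search_kwargs={"k": 5})  # vector similarity
hybrid_retriever = EnsembleRetriever(
    retrievers=[keyword_retriever, semantic_retriever],
    weights=[0.5, 0.5],
)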
Post-retrieval refinements: After gathering the information, reranking and filtering steps
assess the relevance of the retrieved chunks. Rather than simply selecting the top results,
these processes surface the most useful data.
We have used MMR (Maximal Marginal Relevance) retrieval to improve the diversity and relevance of the search results. We also use a top-k setting to retrieve the k most similar documents based on cosine similarity.
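Both options map onto the vector store's retriever interface; the value of k below is illustrative.

# MMR balances similarity to the query against diversity among results
retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 8},
)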
For RAG retrieval we have used LangChain's ConversationalRetrievalChain pipeline, which takes care of the internal prompting and querying.
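Wiring the pieces together, a sketch of how the chain can be assembled from the LLM, retriever, and summary memory defined above:

from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(
    llm,                  # GPT-4 chat model from the embeddings section
    retriever=retriever,  # MMR retriever over the Chroma store
    memory=memory,        # conversation summary memory
)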
Example Use Case:
In this example, the system is used to query a codebase to understand specific parts, such as a machine learning training pipeline. By asking a question like "What is happening in the training pipeline?", the language model retrieves relevant code snippets and provides a detailed explanation of the process. The code below demonstrates this interaction by running the query and printing the explanation.
question = "what is happening in training pipeline?"
result = qa(question)
print(result['answer'])
question = "what is data ingestion class and what does it do ?"
result = qa(question)
print(result['answer'])
The purpose of the data ingestion class is to ingest data from the source formats. It is located in the components module. This includes downloading a file from the passed URL and extracting the zip into the given data directory.
Possible Enhancements
1. Add advanced RAG methods such as query transformation and routing, and post-retrieval steps that add more context.
2. Devise a strategy to chunk the code docstrings and comments, create summaries of the docstrings, generate their embeddings, and store them in the vector DB.
3. Also add metadata during indexing.