Great choice!
“Using Large Language Models (LLMs) for Automated Test Case
Generation” is a cutting-edge and impactful research topic. It combines AI, software
engineering, and NLP, and there's a lot of scope for innovation—perfect for a university-level
team project.
🧠 Research Title:
"Exploring the Effectiveness of Large Language Models for Automated Software Test Case
Generation"
📝 Research Plan Outline
🎯 Objective:
Investigate how effective LLMs (like GPT-4, CodeLlama, or Codex) are at generating unit tests,
integration tests, or system tests for codebases in various programming languages (Python,
Java, etc.).
👥 Team Role Breakdown (for 5 Members)
1. Literature Reviewer: Research existing work on LLMs and test generation. Compare tools like Codex, ChatGPT, CodeT5, etc. Summarize findings.
2. Dataset Collector: Gather codebases from GitHub, Codeforces, or open-source projects. Prepare datasets for test generation.
3. Model Engineer: Prompt and fine-tune LLMs (or use their APIs) to generate test cases. Evaluate different prompting strategies.
4. Evaluation Lead: Design evaluation metrics (e.g., test coverage, correctness, mutation testing). Run the generated tests and analyze results.
5. Report & Presentation Lead: Coordinate documentation, write the final report, and prepare the presentation/slides. Assist others where needed.
🧪 Research Phases & Timeline
🔹 Phase 1: Background & Literature Review (Week 1–2)
Read 8–10 relevant research papers (see below).
Understand how LLMs are used for code-related tasks.
Study traditional vs. LLM-based test generation.
🔹 Phase 2: Dataset & Tool Setup (Week 2–3)
Collect code snippets or full projects (preferably in Python/Java).
Use GitHub repos, LeetCode/Codeforces problems with solutions, or open-source apps.
Set up tools: OpenAI API (for GPT), Hugging Face (CodeT5), or any open-source LLMs (a minimal setup sketch follows this list).
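A minimal setup sketch for Phase 2, assuming Python, the Hugging Face `datasets` package for HumanEval, and the OpenAI Python client; the model name `gpt-4o`, the `ask_llm` helper, and the use of an `OPENAI_API_KEY` environment variable are placeholder choices, not fixed requirements.

```python
# Setup sketch (assumptions: `pip install datasets openai`, an OPENAI_API_KEY
# environment variable, and "gpt-4o" as a placeholder model name).
from datasets import load_dataset
from openai import OpenAI

# HumanEval: 164 Python functions with docstrings and reference tests,
# hosted on the Hugging Face Hub under the "openai_humaneval" ID.
humaneval = load_dataset("openai_humaneval", split="test")
print(humaneval[0]["prompt"])  # function signature + docstring to generate tests for

# The OpenAI client reads OPENAI_API_KEY from the environment by default.
client = OpenAI()

def ask_llm(prompt: str) -> str:
    """Send a single prompt to the model and return its text response."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap for whichever model/API the team uses
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content
```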
🔹 Phase 3: Test Case Generation (Week 3–5)
Try multiple prompting strategies, e.g.:
- “Write unit tests for the following function…”
- “Generate boundary test cases for this method…”
Compare prompting modes (see the sketch after this list):
- Zero-shot
- Few-shot (showing 1–2 examples)
- Chain-of-thought prompting
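A sketch of how zero-shot and few-shot prompts could be built and sent, reusing the `ask_llm` helper from the setup sketch above; the template wording and the `add` example pair are illustrative assumptions, not a prescribed format.

```python
# Prompting sketch (reuses ask_llm from the setup sketch; templates are examples only).

ZERO_SHOT = "Write pytest unit tests for the following Python function:\n\n{code}\n"

FEW_SHOT = (
    "You write pytest unit tests for Python functions.\n\n"
    "Example function:\n"
    "def add(a, b):\n    return a + b\n\n"
    "Example tests:\n"
    "def test_add_positive():\n    assert add(2, 3) == 5\n\n"
    "def test_add_negative():\n    assert add(-1, -1) == -2\n\n"
    "Now write pytest unit tests for this function:\n\n{code}\n"
)

def generate_tests(code: str, strategy: str = "zero_shot") -> str:
    """Return LLM-generated test code for `code` under the chosen prompting strategy."""
    template = ZERO_SHOT if strategy == "zero_shot" else FEW_SHOT
    return ask_llm(template.format(code=code))

# Usage example on a toy function
sample = "def is_even(n):\n    return n % 2 == 0\n"
print(generate_tests(sample, strategy="few_shot"))
```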
🔹 Phase 4: Evaluation & Analysis (Week 5–7)
Evaluate (see the evaluation sketch after this list):
- Code coverage (e.g., using coverage.py or pytest-cov)
- Correctness (do the tests pass, and do they catch real bugs?)
- Comparison with human-written tests
- Mutation testing (e.g., using MutPy or PITest)
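A rough sketch of the evaluation step, assuming the generated tests are saved as `tests/test_generated.py` and the module under test is `my_module.py` (both placeholder names). It shells out to pytest with pytest-cov for coverage and to MutPy's `mut.py` CLI for mutation testing; check MutPy's documentation for runner options if the generated tests are pytest-style rather than unittest-style.

```python
# Evaluation sketch (assumptions: pytest, pytest-cov, and MutPy installed;
# my_module.py and tests/test_generated.py are placeholder names).
import subprocess

# 1) Run the generated tests and measure line coverage of the target module.
subprocess.run(
    ["pytest", "tests/test_generated.py",
     "--cov=my_module", "--cov-report=term-missing"],
    check=False,  # a failing test is a result to record, not a pipeline error
)

# 2) Mutation testing: MutPy mutates my_module, re-runs the test suite,
#    and reports how many mutants the generated tests kill.
subprocess.run(
    ["mut.py", "--target", "my_module",
     "--unit-test", "tests.test_generated"],
    check=False,
)
```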
🔹 Phase 5: Reporting & Presentation (Week 7–8)
Summarize findings in a research report (6–10 pages).
Create visualizations (bar graphs, pie charts for test coverage, etc.).
Prepare a final presentation (15–20 min talk with slides).
📚 Key Research Questions
1. Can LLMs reliably generate syntactically and semantically correct test cases?
2. What kind of prompting techniques give the best results?
3. How do LLM-generated tests compare to those from traditional automated test-generation tools?
4. Can LLMs detect edge cases or just basic scenarios?
5. What are the limitations of using LLMs in real-world CI/CD pipelines?
🧰 Tools & Technologies
LLMs/APIs: OpenAI GPT-4 API, CodeLlama, CodeT5, StarCoder
Languages: Python (easiest for testing), Java
Test Frameworks: unittest, pytest, JUnit (an example pytest test file appears after this list)
Coverage Tools: coverage.py, pytest-cov, JaCoCo
Mutation Testing: MutPy, PITest
IDE: VS Code, PyCharm
Version Control: Git, GitHub
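To make the tooling concrete, here is the kind of artifact the pipeline produces and consumes: a toy target function and a pytest-style test file an LLM might generate for it. Both are invented examples for illustration, not output from any actual model.

```python
# my_module.py (toy target function; an invented example)
def clamp(value, low, high):
    """Clamp value into the inclusive range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))
```

```python
# tests/test_generated.py (the kind of pytest file an LLM might return)
import pytest
from my_module import clamp

def test_value_inside_range():
    assert clamp(5, 0, 10) == 5

def test_value_below_range():
    assert clamp(-3, 0, 10) == 0

def test_value_above_range():
    assert clamp(42, 0, 10) == 10

def test_invalid_bounds_raise():
    with pytest.raises(ValueError):
        clamp(1, 10, 0)
```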
📖 Suggested Papers & Resources
Research Papers:
[1] "Automated Unit Test Generation with OpenAI's GPT Models" (Arxiv)
[2] "Evaluating Large Language Models for Code Generation and Debugging"
(Google DeepMind)
[3] "LLM4Code: Exploring Test Generation Using Language Models"
[4] "Can ChatGPT Write Effective Unit Tests?"
[5] "CodeXGLUE Benchmark" – useful for datasets and evaluations.
Datasets (optional):
CodeXGLUE
HumanEval
GitHub repos with test folders (e.g., open-source Python projects)
✅ Final Deliverables
1. Research Paper / Report (6–10 pages):
- Introduction, methodology, results, analysis, conclusion
2. Presentation Slides (10–15 slides)
3. Codebase (test scripts, prompt templates, results)
4. Evaluation Metrics Summary (charts, tables)
5. Optional: a short demo video of test generation in action
🚀 Bonus Ideas (if you want to go deeper):
Fine-tune CodeT5 or similar models on your dataset.
Develop a mini GUI or CLI tool that lets users input code and see generated test cases (a minimal CLI sketch follows this list).
Compare performance across languages (e.g., Python vs Java).
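A minimal CLI sketch for the bonus idea, reusing the `generate_tests` helper from the prompting sketch above; the argument names and default output path are placeholders.

```python
# CLI sketch for the bonus idea (assumes generate_tests from the prompting
# sketch above; argument names and defaults are placeholders).
import argparse
import pathlib

def main():
    parser = argparse.ArgumentParser(
        description="Generate unit tests for a Python source file with an LLM."
    )
    parser.add_argument("source", help="path to the Python file to test")
    parser.add_argument("--out", default="test_generated.py",
                        help="where to write the generated tests")
    parser.add_argument("--strategy", choices=["zero_shot", "few_shot"],
                        default="zero_shot")
    args = parser.parse_args()

    code = pathlib.Path(args.source).read_text()
    tests = generate_tests(code, strategy=args.strategy)
    pathlib.Path(args.out).write_text(tests)
    print(f"Wrote generated tests to {args.out}")

if __name__ == "__main__":
    main()
```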
Would you like a starter GitHub template or help drafting the prompt templates for GPT-4 to
generate test cases?