This repository provides a solution for detecting machine-generated code using AI-based models. It employs pretrained language models and fine-tuning techniques to analyze whether a given piece of code is AI-generated or human-written.
- Overview
- Features
- Frameworks and Libraries
- Setup Instructions
- Usage
- Folder and File Structure
- API Endpoints
- Datasets
- Test
- Evaluation
- Results
- Challenges
- Hardware Resource
- Conclusion
- Contributors
- References
## Overview

The project leverages transformer models from Hugging Face to determine the origin of code (machine-generated vs. human-written). It provides a Flask-based backend for serving the analysis and a minimalistic HTML frontend for interacting with the API. This project builds upon the research presented in the Binoculars paper. While the original implementation of Binoculars lacked the capability to detect AI-generated code, we have extended its functionality to include robust AI code detection.
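At its core, the Binoculars approach scores a text by dividing its perplexity under one model by the cross-perplexity between two closely related models; unusually low scores suggest machine-generated content. The snippet below is a minimal sketch of that scoring idea only, assuming the SmolLM checkpoints mentioned in this README as the observer/performer pair; the repository's backend is the authoritative implementation and may assign the model roles and thresholds differently.

```python
# Minimal sketch of a Binoculars-style score (Hans et al., 2024).
# The checkpoints and role assignment below are illustrative assumptions,
# not necessarily the repository's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

OBSERVER = "HuggingFaceTB/SmolLM-360M"            # assumed observer model
PERFORMER = "HuggingFaceTB/SmolLM-360M-Instruct"  # assumed performer model

tok = AutoTokenizer.from_pretrained(OBSERVER)
observer = AutoModelForCausalLM.from_pretrained(OBSERVER).eval()
performer = AutoModelForCausalLM.from_pretrained(PERFORMER).eval()

@torch.no_grad()
def binoculars_score(code: str) -> float:
    enc = tok(code, return_tensors="pt", truncation=True, max_length=512)
    targets = enc.input_ids[:, 1:]
    obs_logits = observer(**enc).logits[:, :-1]
    per_logits = performer(**enc).logits[:, :-1]

    # log-perplexity of the snippet under the performer model
    log_ppl = torch.nn.functional.cross_entropy(
        per_logits.reshape(-1, per_logits.size(-1)), targets.reshape(-1)
    )
    # cross-perplexity between the observer's and performer's next-token distributions
    x_ppl = -(obs_logits.softmax(-1) * per_logits.log_softmax(-1)).sum(-1).mean()

    return (log_ppl / x_ppl).item()
```

Lower scores indicate likely machine-generated code; the threshold used for the final yes/no decision is determined experimentally (see Challenges).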
## Features

- Model Integration: Uses Hugging Face's pretrained models (e.g., `SmolLM-360M`) for analysis.
- Frontend: A simple HTML page to upload code and display the results.
- Backend API: Flask server that processes the requests and returns AI analysis results.
- Custom Model Fine-Tuning: Scripts for fine-tuning the models using specific datasets.
- Cross-Origin Resource Sharing (CORS): Enables integration with external services.
## Frameworks and Libraries

- `transformers` (by Hugging Face) for pretrained language models
- `torch` for model training and inference
- `sklearn` for evaluation metrics
## Setup Instructions

Prerequisites:

- Python 3.8 or higher.
- A valid Hugging Face authentication token.
- GPU support is required for running large models.
- Clone the repository:

  ```bash
  git clone https://github.com/your_username/Machine_Generated_Code_Detection.git
  cd Machine_Generated_Code_Detection
  ```
- Create a `.gitignore` file:

  ```
  # Ignore Python virtual environments
  venv/
  __pycache__/

  # Ignore Hugging Face token
  hugging_face_auth_token.txt
  ```
- Create a virtual environment:

  ```bash
  python3 -m venv env_name
  ```
- Activate the virtual environment:

  ```bash
  source env_name/bin/activate
  ```
- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
- Add your Hugging Face authentication token:
  - Save the token in the `hugging_face_auth_token.txt` file.
- Open the file `model_fine_tuning.py` and make the changes below [optional] (a simplified fine-tuning sketch follows this list):
  - Select the model and dataset of your choice:

    ```python
    # MODEL_TO_FINETUNE = "HuggingFaceTB/SmolLM-360M"
    # MODEL_TO_FINETUNE = "HuggingFaceTB/SmolLM-360M-Instruct"
    # SAVE_NAME = "SmolLM-360M-LORA"
    # FINETUNE_DATASET = "ise-uiuc/Magicoder-Evol-Instruct-110K"
    # FINETUNE_DATASET = "bigcode/starcoderdata"
    # FINETUNE_DATASET = "iamtarun/code_instructions_120k_alpaca"
    ```

  - Set the number of epochs of your choice.
  - Execute the file:

    ```bash
    python model_fine_tuning.py
    ```

  - After fine-tuning finishes, the model is saved under `fine_tuned_model`, and the `results` and `log` directories are created with their contents.
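For reference, the sketch below shows one way the fine-tuning step could look using `peft` (LoRA) and the Hugging Face `Trainer` with the constants listed above. It is a simplified, hypothetical outline, not the exact contents of `model_fine_tuning.py`; the dataset column name, LoRA target modules, hyperparameters, and directory names are assumptions.

```python
# Hypothetical LoRA fine-tuning sketch; model_fine_tuning.py is the
# authoritative implementation. Column names and hyperparameters are assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_TO_FINETUNE = "HuggingFaceTB/SmolLM-360M"
FINETUNE_DATASET = "iamtarun/code_instructions_120k_alpaca"

tokenizer = AutoTokenizer.from_pretrained(MODEL_TO_FINETUNE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_TO_FINETUNE)
# target_modules assumes a Llama-style architecture for SmolLM.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

dataset = load_dataset(FINETUNE_DATASET, split="train[:1%]")  # small slice for illustration

def tokenize(batch):
    # "output" is an assumed column name; adjust it to the chosen dataset's schema.
    enc = tokenizer(batch["output"], truncation=True, max_length=512, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="results", num_train_epochs=3,
                           per_device_train_batch_size=1, logging_dir="log"),
    train_dataset=tokenized,
)
trainer.train()
model.save_pretrained("fine_tuned_model")  # directory mentioned in the step above
```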
## Usage

- Start the Flask server:

  ```bash
  python backend.py
  ```

- Open the frontend in a browser:
  - The server runs by default on `http://localhost:5000`.
- Paste the code you want to analyze into the text box and click "Analyze Code".
- The result will display whether the code is AI-generated, along with the confidence score.
## API Endpoints

- Method: `POST`
- Description: Analyze the submitted code to determine if it is machine-generated.
- Request Format:

  ```json
  {
    "content": "<code to analyze>",
    "type": "code"
  }
  ```

- Response Format:

  ```json
  {
    "codeclassifier": {
      "is_ai_generated": "yes/no",
      "score": 0.95,
      "result": "AI Generated (Score: 0.9500)"
    }
  }
  ```
## Datasets

- The project uses datasets containing human-written and machine-generated code for model training and validation, generated following the referenced research paper.
- Sources: Open-source repositories, GPT-generated code snippets, and the research paper.
- Format: JSON or text files, where each entry contains:
  - Code snippet.
  - Label specifying whether it is machine-generated (1) or human-written (0).
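As an illustration of that format, the snippet below loads such a JSON file into (snippet, label) pairs. The file name and field names (`code`, `label`) are assumptions, so adapt them to the actual dataset files in the repository.

```python
# Illustrative loader for a labelled dataset file; "dataset.json" and the
# "code"/"label" field names are assumptions about the format described above.
import json

with open("dataset.json", "r", encoding="utf-8") as f:
    entries = json.load(f)

pairs = [(e["code"], int(e["label"])) for e in entries]  # 1 = machine-generated, 0 = human-written
machine_generated = sum(label for _, label in pairs)
print(f"{len(pairs)} snippets, {machine_generated} machine-generated")
```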
## Test

- Integration Testing: Performed by calling the `codeclassifier` file, which reads its input from `test_prompt.txt`.
- System Testing: The tests for the code detection pipeline (`code_detector_validation_pipeline.py`) use the data provided in `validate_dataset/TestDataset.csv`.
- Test Cases:
  - Valid machine-generated code is labelled as 1.
  - Valid human-written code is labelled as 0.
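A minimal sanity check of the labelled test data could look like the following. The column names `code` and `label` are assumptions made for illustration; check `validate_dataset/TestDataset.csv` for the actual headers.

```python
# Hypothetical sanity check for the system-test dataset; column names are assumptions.
import pandas as pd

df = pd.read_csv("validate_dataset/TestDataset.csv")
assert set(df["label"].unique()) <= {0, 1}, "labels must be 1 (machine) or 0 (human)"
assert df["code"].notna().all(), "every row needs a code snippet"
print(f"{len(df)} test cases: {int((df['label'] == 1).sum())} machine-generated, "
      f"{int((df['label'] == 0).sum())} human-written")
```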
## Evaluation

- Accuracy, precision, recall, and the confusion matrix are plotted via `matplotlib` & `seaborn`.
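A minimal sketch of that evaluation step, assuming lists of true and predicted labels produced by the validation pipeline (the values below are placeholders):

```python
# Minimal evaluation sketch; y_true / y_pred stand in for the pipeline's outputs.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = machine-generated, 0 = human-written
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))

# Confusion matrix rendered as a seaborn heatmap, as described above.
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["human", "machine"], yticklabels=["human", "machine"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```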
## Challenges

- **Optimizing Fine-Tuning with Limited Resources:** The model fine-tuning process was constrained by limited GPU, CPU, and computational resources. As a result, we were able to fine-tune the model over a limited number of epochs.
- **Long Training Times vs. Resource Availability:** Fine-tuning the model for 3 epochs required approximately 18 hours. However, the project was executed on a Hopper system, where the maximum session availability was restricted to 12 hours, presenting a significant challenge.
- **Hyperparameter Optimization and Threshold Tuning:** Since the algorithms were implemented from scratch with custom improvements, determining the optimal thresholds and hyperparameters to accurately detect AI-generated content was a challenging, highly experimental process.
- **Curating High-Quality Datasets:** Identifying and sourcing high-quality datasets with a balanced mix of human-generated and machine-generated code required significant effort.
- **Addressing Dataset Bias:** Special attention was given to mitigating potential biases present in machine-generated code datasets to ensure fairness and accuracy in the model's predictions.
## Hardware Resource

- NVIDIA A100 (80 GB VRAM).
- Only one GPU with 40 GB was available per session.
- HPC cluster (Hopper).
## Conclusion

This project demonstrates the feasibility of detecting machine-generated code using state-of-the-art transformer models. Future work involves refining models, expanding datasets, and deploying the solution in production environments.
## Contributors

- Suhas
- Manish
- Kashish
## References

- Hugging Face Transformers: https://huggingface.co/transformers/
- PyTorch: https://pytorch.org/
- Binoculars (Hans et al., 2024):

  ```bibtex
  @article{hans2024spotting,
    title={Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text},
    author={Hans, Abhimanyu and Schwarzschild, Avi and Cherepanova, Valeriia and Kazemi, Hamid and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom},
    journal={arXiv preprint arXiv:2401.12070},
    year={2024}
  }
  ```