
Machine Generated Code Detection

This repository provides a solution for detecting machine-generated code using AI-based models. It employs pretrained language models and fine-tuning techniques to analyze whether a given piece of code is AI-generated or human-written.

Overview

The project leverages transformer models from Hugging Face to determine the origin of code (machine-generated vs. human-written). It provides a Flask-based backend for serving the analysis and a minimalistic HTML frontend for interacting with the API. This project builds upon the research presented in the paper Binoculars. While the original implementation of Binoculars lacked the capability to detect AI-generated code, we have extended its functionality to include robust AI code detection.
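
As we understand the Binoculars approach, the detector compares how two closely related language models (an "observer" and a "performer") score the same text: machine-generated text tends to look far less surprising to the observer relative to the performer's own predictions. The sketch below illustrates this idea; the model names, score formula, and interpretation are our assumptions, not necessarily the repository's implementation.

    # Hypothetical sketch of a Binoculars-style score; the repository's
    # implementation may differ in model choice and score definition.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    OBSERVER = "HuggingFaceTB/SmolLM-360M"            # assumed observer model
    PERFORMER = "HuggingFaceTB/SmolLM-360M-Instruct"  # assumed performer model

    tok = AutoTokenizer.from_pretrained(OBSERVER)
    observer = AutoModelForCausalLM.from_pretrained(OBSERVER).eval()
    performer = AutoModelForCausalLM.from_pretrained(PERFORMER).eval()

    @torch.no_grad()
    def binoculars_score(code: str) -> float:
        ids = tok(code, return_tensors="pt").input_ids
        obs_logits = observer(ids).logits[:, :-1]
        perf_logits = performer(ids).logits[:, :-1]
        targets = ids[:, 1:]

        # Log-perplexity of the performer on the code.
        log_ppl = torch.nn.functional.cross_entropy(
            perf_logits.transpose(1, 2), targets)

        # Cross-perplexity: the observer's next-token distribution
        # scored against the performer's predictions.
        x_ppl = -(obs_logits.softmax(-1)
                  * perf_logits.log_softmax(-1)).sum(-1).mean()

        # Lower scores suggest machine-generated code.
        return (log_ppl / x_ppl).item()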


Features

  • Model Integration: Uses Hugging Face's pretrained models (e.g., SmolLM-360M) for analysis.
  • Frontend: A simple HTML page to upload code and display the results.
  • Backend API: Flask server that processes the requests and returns AI analysis results.
  • Custom Model Fine-Tuning: Scripts for fine-tuning the models using specific datasets.
  • Cross-Origin Resource Sharing (CORS): Enables integration with external services.

Frameworks and Libraries

  • transformers (by Hugging Face) for pretrained language models (e.g., SmolLM)
  • torch for model training and inference
  • sklearn for evaluation metrics

Setup Instructions

Prerequisites

  1. Python 3.8 or higher.
  2. A valid Hugging Face authentication token.
  3. GPU support is required for running large models.

Installation

  1. Clone the repository:

    git clone https://github.com/your_username/Machine_Generated_Code_Detection.git
    cd Machine_Generated_Code_Detection
  2. Create a .gitignore file:

    # Ignore Python virtual environments
    venv/
    __pycache__/
    
    # Ignore Hugging Face token
    hugging_face_auth_token.txt
  3. Create a virtual environment:

    python3 -m venv env_name
  4. Activate the virtual environment:

    source env_name/bin/activate
  5. Install required Python packages:

    pip install -r requirements.txt
  6. Add your Hugging Face authentication token:

    • Save the token in the hugging_face_auth_token.txt file (see the loading sketch below).
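
A minimal sketch of how the token file might be consumed at startup; the file name comes from this README, while the loading code itself is an assumption:

    # Read the Hugging Face token from the (gitignored) file and log in.
    from huggingface_hub import login

    with open("hugging_face_auth_token.txt") as f:
        token = f.read().strip()
    login(token=token)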

Usage

Fine-Tuning the Model

  1. Open the file model_fine_tuning.py and make the changes below (optional).
  2. Select the model and dataset of your choice:
    # MODEL_TO_FINETUNE = "HuggingFaceTB/SmolLM-360M"
    # MODEL_TO_FINETUNE = "HuggingFaceTB/SmolLM-360M-Instruct"
    # SAVE_NAME = "SmolLM-360M-LORA"
    
    # FINETUNE_DATASET = "ise-uiuc/Magicoder-Evol-Instruct-110K"
    # FINETUNE_DATASET = "bigcode/starcoderdata"
    # FINETUNE_DATASET = "iamtarun/code_instructions_120k_alpaca"
  3. Set the number of epochs of your choice.
  4. Execute the file:
     python model_fine_tuning.py
  5. After fine-tuning completes, the script saves the model under fine_tuned_model and creates the results and log directories with their contents (a sketch of this workflow follows).
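
For orientation, below is a minimal LoRA fine-tuning sketch of the kind of workflow model_fine_tuning.py implements; the hyperparameters, dataset field names, and target modules are assumptions, and the actual script may differ.

    # Hypothetical LoRA fine-tuning sketch; model_fine_tuning.py may differ.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    MODEL_TO_FINETUNE = "HuggingFaceTB/SmolLM-360M"
    FINETUNE_DATASET = "iamtarun/code_instructions_120k_alpaca"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_TO_FINETUNE)
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
    model = AutoModelForCausalLM.from_pretrained(MODEL_TO_FINETUNE)

    # Wrap the base model with LoRA adapters so only a small fraction
    # of the parameters is trained.
    model = get_peft_model(model, LoraConfig(
        r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM"))

    def tokenize(batch):
        # "output" is an assumed field name for the code text.
        return tokenizer(batch["output"], truncation=True, max_length=512)

    dataset = load_dataset(FINETUNE_DATASET, split="train")
    dataset = dataset.map(tokenize, batched=True,
                          remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="results", logging_dir="log",
                               num_train_epochs=3,
                               per_device_train_batch_size=4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("fine_tuned_model")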

Running the Server

  1. Start the Flask server:
    python backend.py
  2. Open the frontend in a browser:
    • The server runs by default on http://localhost:5000.
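
For context, backend.py exposes an /analyze endpoint along these lines; the sketch below is our assumption of its shape (the classify helper is a placeholder), matching the request/response formats documented under API Endpoints.

    # Hypothetical sketch of backend.py; the actual file may differ.
    from flask import Flask, jsonify, request
    from flask_cors import CORS

    app = Flask(__name__)
    CORS(app)  # allow the HTML frontend to call the API cross-origin

    def classify(code: str) -> float:
        # Placeholder stand-in for the real model-based classifier.
        return 0.95

    @app.route("/analyze", methods=["POST"])
    def analyze():
        payload = request.get_json()
        code = payload.get("content", "")
        score = classify(code)

        # 0.5 is an assumed decision threshold, not necessarily the one used.
        verdict = "AI Generated" if score >= 0.5 else "Human Written"
        return jsonify({"codeclassifier": {
            "is_ai_generated": "yes" if score >= 0.5 else "no",
            "score": score,
            "result": f"{verdict} (Score: {score:.4f})",
        }})

    if __name__ == "__main__":
        app.run(port=5000)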

Frontend

  • Paste the code you want to analyze into the text box and click "Analyze Code".
  • The result will display whether the code is AI-generated, along with the confidence score.

Folder and File Structure
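
Key files and directories referenced elsewhere in this README (gathered from the sections below; the full repository layout may contain more):

  • backend.py: Flask server
  • model_fine_tuning.py: fine-tuning script
  • codeclassifier: classifier used for integration testing
  • code_detector_validation_pipeline.py: code detection validation pipeline
  • test_prompt.txt: integration-test input
  • validate_dataset/TestDataset.csv: system-test dataset
  • hugging_face_auth_token.txt: Hugging Face token (gitignored)
  • fine_tuned_model/: saved fine-tuned model
  • results/ and log/: training outputs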


API Endpoints

/analyze

  • Method: POST
  • Description: Analyze the submitted code to determine if it is machine-generated.
  • Request Format:
    {
        "content": "<code to analyze>",
        "type": "code"
    }
  • Response Format:
    {
        "codeclassifier": {
            "is_ai_generated": "yes/no",
            "score": 0.95,
            "result": "AI Generated (Score: 0.9500)"
        }
    }
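
For example, the endpoint can be exercised from Python with requests (assuming the server is running locally on the default port):

    import requests

    resp = requests.post(
        "http://localhost:5000/analyze",
        json={"content": "def add(a, b):\n    return a + b", "type": "code"},
    )
    print(resp.json()["codeclassifier"]["result"])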

Datasets

Dataset Description

  • The project uses datasets containing human-written and machine-generated code for model training and validation, generated following the approach described in the research paper.
  • Sources: open-source repositories, GPT-generated code snippets, and the referenced research paper.
  • Format: JSON or text files, where each entry contains:
    • Code snippet.
    • Label specifying if it's machine-generated (1) or human-written (0).
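
As an illustration, a single dataset entry might look like the following; the field names are our assumption, and the actual files may use different keys:

    # Hypothetical dataset entry matching the format described above.
    import json

    entry = {
        "code": "for i in range(10):\n    print(i)",
        "label": 1,  # 1 = machine-generated, 0 = human-written
    }
    print(json.dumps(entry, indent=2))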

Test

  • Integration Testing: performed by running the codeclassifier file, which reads its input from test_prompt.txt.
  • System Testing: The tests for the code detection pipeline (code_detector_validation_pipeline.py) are provided in validate_dataset/TestDataset.csv.
  • Test Cases:
    • Valid machine-generated code is labelled as 1.
    • Valid human-written code is labelled as 0.

Evaluation

  • Accuracy, precision, and recall are computed with sklearn, and the confusion matrix is plotted via matplotlib and seaborn (a sketch follows).
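
A minimal sketch of how these metrics might be computed and plotted; the label arrays below are placeholders, not real results:

    # Evaluation sketch; y_true/y_pred stand in for real pipeline outputs.
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    y_true = [1, 0, 1, 1, 0]  # ground-truth labels (1 = machine, 0 = human)
    y_pred = [1, 0, 1, 0, 0]  # model predictions

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))

    sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, fmt="d",
                xticklabels=["human", "AI"], yticklabels=["human", "AI"])
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()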

Results

  • The model achieves 87% accuracy in distinguishing machine-generated code from human-written code (see the ROC AUC plot in the repository).

Challenges

  • Optimizing Fine-Tuning with Limited Resources:
    The model fine-tuning process was constrained by limited GPU, CPU, and computational resources. As a result, we were able to fine-tune the model over a limited number of epochs.

  • Long Training Times vs. Resource Availability:
    Fine-tuning the model for 3 epochs required approximately 18 hours. However, the project was executed on a Hopper system, where the maximum session availability was restricted to 12 hours, presenting a significant challenge.

  • Hyperparameter Optimization and Threshold Tuning:
    Since the algorithms were implemented from scratch with custom improvements, determining the optimal thresholds and hyperparameters to accurately detect AI-generated content was a challenging, highly experimental process.

  • Curating High-Quality Datasets:
    Identifying and sourcing high-quality datasets with a balanced mix of human-generated and machine-generated code required significant effort.

  • Addressing Dataset Bias:
    Special attention was given to mitigating potential biases present in machine-generated code datasets to ensure fairness and accuracy in the model’s predictions.


Hardware Resources

  • NVIDIA A100 (80GB VRAM) on an HPC cluster.
  • Only one GPU with 40GB was available per session.

Conclusion

This project demonstrates the feasibility of detecting machine-generated code using state-of-the-art transformer models. Future work involves refining models, expanding datasets, and deploying the solution in production environments.


Contributors

  • Suhas
  • Manish
  • Kashish

References

Citation

@article{hans2024spotting,
  title={Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text},
  author={Hans, Abhimanyu and Schwarzschild, Avi and Cherepanova, Valeriia and Kazemi, Hamid and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom},
  journal={arXiv preprint arXiv:2401.12070},
  year={2024}
}
