# 🤖 Detecting AI-Generated Code vs. Human-Written Code

## 🔍 Overview

This project investigates the detection of AI-generated code versus human-written code, addressing academic integrity concerns in educational settings where generative models (e.g., ChatGPT, Codex) are increasingly used.

Traditional plagiarism tools such as MOSS and TF-IDF clustering fail to distinguish AI from human authorship. This notebook builds and compares several classification pipelines that use both lexical (TF-IDF) and structural (AST) features, integrating a contrastive learning network with a neural classifier for stronger discrimination.
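To illustrate the structural side, here is a minimal sketch of AST-based featurization, assuming the code samples are Python and using node-type counts as features; the function name `ast_node_counts` is illustrative, not the notebook's exact implementation:

```python
import ast
from collections import Counter

def ast_node_counts(source: str) -> Counter:
    """Parse Python source and count AST node types (e.g., FunctionDef, For)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

snippet = "def add(a, b):\n    return a + b\n"
print(ast_node_counts(snippet))
# e.g. Counter({'Name': 2, 'arg': 2, 'Module': 1, 'FunctionDef': 1, ...})
```

Counts like these can be turned into fixed-length vectors and concatenated with TF-IDF features before classification.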


## 📘 Notebook Objective

This notebook:

- Loads a labeled dataset of AI-generated vs. human-written code
- Extracts TF-IDF and AST-based vector features
- Trains machine learning classifiers (e.g., Random Forests)
- Implements a contrastive learning pipeline with triplet loss (a minimal sketch follows below)
- Evaluates models using standard classification metrics (Accuracy, F1, AUC)
- Visualizes learned embeddings using t-SNE and PCA

Simply run the notebook top to bottom. All experiments and visualizations are included.
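To make the contrastive step concrete, below is a minimal triplet-loss sketch, assuming PyTorch; the encoder layers, the input dimension of 500, and the batch size are illustrative assumptions, not the notebook's actual architecture:

```python
import torch
import torch.nn as nn

# Tiny encoder mapping TF-IDF/AST feature vectors to an embedding space.
# Input dim 500 and embedding dim 32 are illustrative choices.
encoder = nn.Sequential(
    nn.Linear(500, 128),
    nn.ReLU(),
    nn.Linear(128, 32),
)

triplet_loss = nn.TripletMarginLoss(margin=1.0)

# anchor/positive share a label (e.g., both AI-generated);
# negative comes from the other class (human-written).
anchor   = encoder(torch.randn(16, 500))
positive = encoder(torch.randn(16, 500))
negative = encoder(torch.randn(16, 500))

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # gradients flow into the encoder
```

The loss pulls same-class embeddings together and pushes cross-class embeddings apart, which is what makes the downstream neural classifier's job easier.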


## 📂 Files Used

- `data.jsonl`: the labeled dataset of AI-generated and human-written code samples
- `project_Cs6890_PLP.ipynb`: the main notebook containing all experiments and visualizations


## 📥 Data Preparation

After downloading data.jsonl, place it in the notebook directory and run:

```python
import json
import pandas as pd

# Read the newline-delimited JSON file: one JSON object per line.
with open('data.jsonl', 'r') as file:
    lines = file.readlines()

# Parse each line into a dict and assemble them into a DataFrame.
new_data = pd.DataFrame([json.loads(line) for line in lines])
display(new_data)  # display() is available inside Jupyter notebooks
```

This will load the dataset into a DataFrame for further processing.
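Equivalently, pandas can parse JSON Lines directly; this one-liner is an alternative shown for reference, not necessarily what the notebook does:

```python
import pandas as pd

# lines=True tells pandas to treat the file as newline-delimited JSON.
new_data = pd.read_json('data.jsonl', lines=True)
```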


## 🛠️ Libraries Used

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_curve, auc, classification_report, confusion_matrix
)
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import json
```
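As a reference for how the imported `PCA` and `TSNE` classes are typically applied to visualize embeddings, here is a minimal sketch; matplotlib and the placeholder matrix `X` are assumptions, and in the notebook the inputs are the learned embeddings and class labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in data: 200 samples with 50 features, binary labels.
X = np.random.rand(200, 50)
y = np.random.randint(0, 2, size=200)

# Project to 2-D with PCA and t-SNE for side-by-side inspection.
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', s=10)
axes[0].set_title('PCA')
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='coolwarm', s=10)
axes[1].set_title('t-SNE')
plt.show()
```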

## ⚙️ How to Run

1. Open `project_Cs6890_PLP.ipynb` and run all cells.
2. The notebook will:
   - Train and validate models using stratified 2-fold cross-validation (a minimal sketch follows below)
   - Display classification metrics (Accuracy, Precision, Recall, F1-score, AUC)
   - Visualize embeddings using t-SNE and PCA
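For reference, here is a minimal sketch of the stratified 2-fold cross-validation loop with a random forest, using only imports already listed above; `X` and `y` are placeholders for the notebook's feature matrix and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

# Placeholder data standing in for the real TF-IDF/AST feature matrix.
X = np.random.rand(100, 20)
y = np.random.randint(0, 2, size=100)

# StratifiedKFold preserves the class ratio in each fold.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    preds = clf.predict(X[test_idx])
    print(f"Fold {fold}: "
          f"accuracy={accuracy_score(y[test_idx], preds):.3f}, "
          f"F1={f1_score(y[test_idx], preds):.3f}")
```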

## 📊 Evaluation

Models are evaluated using:

- Accuracy
- Precision
- Recall
- F1 Score
- AUC (Area Under the ROC Curve), with a computation sketch below
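As an example of the AUC computation, a minimal sketch using the `roc_curve` and `auc` imports listed earlier; `y_true` and `y_score` are placeholder labels and predicted scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholder labels and scores; in the notebook these come from a
# trained classifier's probability output on the held-out fold.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)
print(f"AUC = {auc(fpr, tpr):.3f}")
```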

## 🧠 Author & Notes

Developed for CS6890 - Programming Language Principles (PLP) at Utah State University
