This repository contains the implementation of our CS577 project, which explores making AI-generated text detection more robust through adversarial learning, inspired by the RADAR framework.
Original paper reference: RADAR: Robust AI-text detection via adversarial learning
Recent advancements in large language models (LLMs) have made it increasingly difficult to distinguish between machine-generated and human-written text. This project aims to:
- Investigate the robustness of AI-text detectors under adversarial paraphrasing.
- Develop a hybrid paraphrasing strategy combining backtranslation and neural paraphrasing.
- Train a DistilBERT-based detector to classify texts as AI-generated or human-written.
- Hybrid paraphrasing pipeline: Combines multilingual backtranslation and lexical neural paraphrasing.
- Adversarial training loop: Reinforcement learning-based paraphraser competes against a binary classifier.
- Evaluation on real-world data: Includes a manually annotated dataset of LinkedIn posts.
- Python >= 3.9
- PyTorch >= 2.6.0 with CUDA support
- HuggingFace `transformers` & `datasets`
- NLTK
- Helsinki-NLP models
| Split | Source | Count |
|---|---|---|
| Training | OpenWebText (filtered) | 9,000 |
| Validation | OpenWebText (filtered) | 1,000 |
| Test | LinkedIn posts | 45 |
- Backtranslation: English → French → English using Helsinki-NLP.
- Neural paraphrasing: NLTK-based paraphraser to create more natural variations.
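
The backtranslation step can be sketched as below. This is a minimal illustration rather than the code in this repository: the function names `back_translate` and `make_marian_translator` are hypothetical, and the translators are passed in as plain callables so the round-trip logic can be tried without downloading any model.

```python
from typing import Callable

# A translator is any function mapping a string to its translation.
Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate text into a pivot language and back to obtain a paraphrase."""
    pivot_text = to_pivot(text)
    return from_pivot(pivot_text)


def make_marian_translator(model_name: str) -> Translator:
    """Wrap a Helsinki-NLP MarianMT checkpoint as a Translator (hypothetical helper).

    The import is local so the rest of the sketch runs without transformers installed.
    """
    from transformers import pipeline  # needs transformers and sentencepiece

    translator = pipeline("translation", model=model_name)

    def translate(text: str) -> str:
        # The pipeline returns a list with one dict per input text.
        return translator(text, max_length=512)[0]["translation_text"]

    return translate
```

For the English → French → English route, one would presumably build the two translators from `Helsinki-NLP/opus-mt-en-fr` and `Helsinki-NLP/opus-mt-fr-en` and pass them as `to_pivot` and `from_pivot`; the exact checkpoints used by the project are an assumption here.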
- Detector: `distilbert-base-uncased` binary classifier.
- Paraphraser: `t5-small` fine-tuned with PPO.
- Paraphraser generates samples to fool the detector.
- Detector learns from these new samples to improve classification.
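
One round of the two steps above might look roughly like the following. It is a schematic under stated assumptions, not the training code in `radar.py`: the function name `adversarial_round`, the reward `1 - P(AI)`, and the convention that label `1` means AI-generated are all illustrative choices, and the actual PPO update and detector fine-tuning are left out.

```python
from typing import Callable, List, Tuple


def adversarial_round(
    ai_texts: List[str],
    paraphrase: Callable[[str], str],
    detect_ai_prob: Callable[[str], float],
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, int]]]:
    """Run one paraphraser-vs-detector round on a batch of AI-generated texts.

    Returns (paraphraser_batch, detector_batch):
      - paraphraser_batch pairs each paraphrase with its reward,
      - detector_batch pairs each paraphrase with label 1 (AI-generated).
    """
    paraphraser_batch: List[Tuple[str, float]] = []
    detector_batch: List[Tuple[str, int]] = []
    for text in ai_texts:
        candidate = paraphrase(text)
        p_ai = detect_ai_prob(candidate)
        # Assumed reward: higher when the detector is fooled into a low AI probability.
        reward = 1.0 - p_ai
        paraphraser_batch.append((candidate, reward))
        # The paraphrase is still machine-written, so the detector trains on it as AI text.
        detector_batch.append((candidate, 1))
    return paraphraser_batch, detector_batch
```

In a full loop, `paraphraser_batch` would feed a policy-gradient update of the paraphraser and `detector_batch` would be mixed with human-written samples before fine-tuning the detector.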
- AUROC, Accuracy, F1, Precision, Recall.
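
As a reference for how these metrics relate to detector outputs, here is a small dependency-free sketch. The function name `binary_metrics`, the 0.5 decision threshold, and the convention that label `1` means AI-generated are assumptions for illustration; the evaluation script may compute them differently.

```python
from typing import Dict, List


def binary_metrics(labels: List[int], scores: List[float], threshold: float = 0.5) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1 and AUROC for binary labels.

    labels: 1 for AI-generated, 0 for human-written (assumed convention).
    scores: detector probability that each text is AI-generated.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    accuracy = (tp + tn) / len(labels) if labels else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # AUROC as the probability that a random positive outscores a random negative;
    # ties count as half a win. Quadratic in the number of samples, fine for small sets.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    auroc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auroc": auroc,
    }
```

Unlike the other four metrics, AUROC does not depend on the threshold, which makes it useful for comparing detectors before a decision threshold has been chosen.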
git clone https://github.com/Subangkar/cs577-project.git
cd cs577-project
python3 -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
pip install -r requirements.txt
Run training:
python radar.py
Run evaluation:
python radar_evaluate.py