Visualize AI-generated content in PDF files using colorized bounding boxes. This tool extracts text with bounding boxes from PDFs using Docling, analyzes each segment with AI detection algorithms inspired by Fast-DetectGPT, and creates a colorized PDF where:
- 🟢 Green = Likely human-written text
- 🟡 Yellow = Mixed/uncertain origin
- 🔴 Red = Likely AI-generated text
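
The legend can be read as a continuous ramp rather than three fixed buckets. Below is a minimal sketch of one way to map a detection score in the range 0.0 to 1.0 onto that ramp. The function name `score_to_rgb` and the linear interpolation through yellow at 0.5 are assumptions made for illustration; the tool's actual color mapping may differ.

```python
from typing import Tuple


def score_to_rgb(score: float) -> Tuple[float, float, float]:
    """Map a score in [0, 1] to an (r, g, b) triple on a green-yellow-red ramp.

    Hypothetical helper: 0.0 is pure green, 0.5 is yellow, 1.0 is pure red.
    """
    s = min(max(score, 0.0), 1.0)  # clamp out-of-range scores
    if s <= 0.5:
        # Green (0, 1, 0) fades to yellow (1, 1, 0) by raising the red channel.
        return (s * 2.0, 1.0, 0.0)
    # Yellow (1, 1, 0) fades to red (1, 0, 0) by lowering the green channel.
    return (1.0, 2.0 - s * 2.0, 0.0)
```

The returned components are floats between 0 and 1, which is the form most PDF drawing libraries accept for fill colors.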
- Accurate Text Extraction: Uses Docling to extract text with precise bounding box coordinates
- AI Detection: Analyzes text segments using perplexity-based detection algorithms
- Visual Output: Creates colorized PDFs with transparent overlays indicating AI likelihood
- Configurable: Adjust detection models, text granularity, merging, and visualization opacity
- Multiple Models: Supports various language models (GPT-2, DistilGPT-2, etc.)
UV is a fast Python package installer and environment manager:
# Clone the repository
git clone https://github.com/yourusername/pdf-ai-detect.git
cd pdf-ai-detect
# Create virtual environment and install dependencies
uv venv
uv pip install -r requirements.txt
# Run the script using uv
uv run python pdf_ai_colorize.py input.pdf output.pdf
# Clone the repository
git clone https://github.com/yourusername/pdf-ai-detect.git
cd pdf-ai-detect
# Run the setup script
bash setup.sh
# Install Python dependencies
pip install -r requirements.txt
# (Optional) Clone fast-detect-gpt for reference
git clone https://github.com/baoguangsheng/fast-detect-gpt.git
- Python 3.8+
- PyTorch 1.10.0+
- CUDA (optional, for GPU acceleration)
- ~2GB disk space for models
# Using uv (automatically uses the virtual environment)
uv run python pdf_ai_colorize.py input.pdf output.pdf
# Or activate the virtual environment manually
source .venv/bin/activate
python pdf_ai_colorize.py input.pdf output.pdf
uv run python pdf_ai_colorize.py input.pdf output.pdf \
--model gpt2-medium \
--unit-type line \
--merge-boxes 5 \
--opacity 0.3 \
--create-legend

| Argument | Description | Default |
|---|---|---|
| input_pdf | Path to input PDF file | Required |
| output_pdf | Path to output colorized PDF | Required |
| --model | Model for AI detection (gpt2, gpt2-medium, gpt2-large, distilgpt2) | gpt2 |
| --detector | Detector type (simple, fast-detect-gpt) | simple |
| --unit-type | Text extraction granularity (char, word, line) | line |
| --merge-boxes | Number of boxes to merge into segments (1=no merge) | 5 |
| --opacity | Color overlay opacity (0.0-1.0) | 0.3 |
| --create-legend | Generate a color scale legend PDF | False |
| --min-text-length | Minimum characters to analyze | 10 |
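
For readers who want to drive the tool from their own code, the argument list above can be mirrored with `argparse`. This is only a sketch reconstructed from the table: the function name `build_parser` is hypothetical, and the real parser in `pdf_ai_colorize.py` may validate values differently or accept other model names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Colorize a PDF by AI likelihood")
    parser.add_argument("input_pdf", help="Path to input PDF file")
    parser.add_argument("output_pdf", help="Path to output colorized PDF")
    # Model names are left unrestricted here; the table lists the GPT-2 family.
    parser.add_argument("--model", default="gpt2", help="Model for AI detection")
    parser.add_argument("--detector", default="simple",
                        choices=["simple", "fast-detect-gpt"])
    parser.add_argument("--unit-type", default="line",
                        choices=["char", "word", "line"])
    parser.add_argument("--merge-boxes", type=int, default=5,
                        help="Boxes to merge into one segment (1 = no merge)")
    parser.add_argument("--opacity", type=float, default=0.3,
                        help="Overlay opacity between 0.0 and 1.0")
    parser.add_argument("--create-legend", action="store_true",
                        help="Also generate a color scale legend PDF")
    parser.add_argument("--min-text-length", type=int, default=10,
                        help="Minimum characters to analyze")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["in.pdf", "out.pdf", "--model", "distilgpt2", "--merge-boxes", "3"]
)
```

Note that `argparse` converts dashes in option names to underscores, so `--merge-boxes` is read back as `args.merge_boxes`.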
uv run python pdf_ai_colorize.py research_paper.pdf analyzed.pdf
Use a larger model and finer granularity for better accuracy:
uv run python pdf_ai_colorize.py document.pdf output.pdf \
--model gpt2-large \
--unit-type word \
--merge-boxes 10
Use a smaller model for faster processing:
uv run python pdf_ai_colorize.py document.pdf output.pdf \
--model distilgpt2 \
--merge-boxes 3
- Text Extraction: Docling parses the PDF and extracts text with precise bounding box coordinates at character, word, or line level
- Text Segmentation: Nearby text boxes are optionally merged into larger segments for better detection accuracy
- AI Detection: Each text segment is analyzed using perplexity-based methods:
  - Simple Detector: Fast perplexity scoring using GPT-2 family models
  - Fast-DetectGPT: Advanced conditional probability curvature analysis
- Scoring: Text receives a score from 0.0 (human-like) to 1.0 (AI-like) based on:
  - Perplexity (lower = more predictable = more AI-like)
  - Language model probability distributions
  - Text complexity patterns
- Colorization: Bounding boxes are colorized on a green-yellow-red scale based on scores
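
The segmentation step can be sketched as follows. This is a simplified illustration, not the code in `pdf_processor.py`: the `TextBox` class and `merge_boxes` function are hypothetical, and the sketch assumes boxes arrive in reading order and are grouped by count (as `--merge-boxes` suggests), with each segment taking the union of its members' rectangles. The real tool may also consider spatial proximity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    """A piece of text with its bounding rectangle (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def merge_boxes(boxes: List[TextBox], group_size: int) -> List[TextBox]:
    """Merge every `group_size` consecutive boxes into one larger segment."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    merged: List[TextBox] = []
    for start in range(0, len(boxes), group_size):
        group = boxes[start:start + group_size]
        merged.append(TextBox(
            text=" ".join(b.text for b in group),
            # The segment rectangle is the union of the member rectangles.
            x0=min(b.x0 for b in group),
            y0=min(b.y0 for b in group),
            x1=max(b.x1 for b in group),
            y1=max(b.y1 for b in group),
        ))
    return merged
```

With `group_size=1` the input is returned unchanged in content, which matches the "1=no merge" behavior described for `--merge-boxes`. Larger groups give the detector more text per segment at the cost of coarser highlighting.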
This tool uses perplexity-based detection inspired by Fast-DetectGPT (Bao et al., ICLR 2024). The core idea:
- Lower perplexity → More predictable text → Likely AI-generated
- Higher perplexity → Less predictable text → Likely human-written
The detector calculates log-likelihood scores using pretrained language models and converts them to probability estimates.
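
To make the relationship between log-likelihood, perplexity, and score concrete, here is a small sketch in plain Python. It assumes the per-token log-probabilities (natural log) have already been obtained from a language model such as GPT-2. The logistic conversion in `ai_score`, along with its `midpoint` and `steepness` values, is a hypothetical choice for illustration; this README does not specify the exact conversion used in `ai_detector.py`.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean per-token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ai_score(ppl: float, midpoint: float = 30.0, steepness: float = 1.0) -> float:
    """Squash perplexity into (0, 1); lower perplexity gives a higher score.

    `midpoint` is the perplexity that maps to exactly 0.5. Both parameters
    are illustrative assumptions, not values taken from the tool.
    """
    z = steepness * (math.log(ppl) - math.log(midpoint))
    return 1.0 / (1.0 + math.exp(z))
```

For example, a segment whose tokens each have probability 0.5 under the model has perplexity 2, which this mapping would place close to the AI-like end of the scale.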
- distilgpt2: Fast, lightweight (~300MB)
- gpt2: Balanced speed/accuracy (~500MB)
- gpt2-medium: Better accuracy (~1.5GB)
- gpt2-large: Best accuracy (~3GB)
- Processing Speed: ~1-5 pages/minute (depending on model and hardware)
- GPU Acceleration: Automatically used if available (5-10x faster)
- Memory Usage: 2-8GB RAM (depending on model)
pdf-ai-detect/
├── pdf_ai_colorize.py # Main script
├── pdf_processor.py # PDF text extraction and colorization
├── ai_detector.py # AI detection algorithms
├── requirements.txt # Python dependencies
├── setup.sh # Setup script
├── .gitignore # Git ignore rules
└── README.md # This file
- Accuracy: Detection is probabilistic and not 100% accurate
- Short Text: Very short segments (<10 words) are unreliable
- Language: Currently optimized for English text
- PDF Types: Works best with text-based PDFs (not scanned images)
- Model Bias: Detection quality depends on the model used
Contributions are welcome! Please feel free to submit issues or pull requests.
See LICENSE file for details.
If you use this tool in research, please cite the Fast-DetectGPT paper:
@inproceedings{bao2024fast,
title={Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature},
author={Bao, Guangsheng and Zhao, Yanbin and Teng, Zhiyang and Yang, Linyi and Zhang, Yue},
booktitle={ICLR},
year={2024}
}
- Docling Project for excellent PDF parsing tools
- Bao et al. for Fast-DetectGPT algorithm
- Hugging Face for transformer models