A comprehensive pipeline for extracting high-quality function-documentation pairs from open-source repositories and fine-tuning language models for code documentation generation.
This project provides:
- Data Extraction Pipeline - Scrape GitHub repositories and extract function-documentation pairs using tree-sitter parsers
- Multi-Stage Filtering - Quality scoring, deduplication, and AI-generated content detection
- Dataset Preparation - Create train/val/test splits for model training
- Fine-tuning Ready - Dataset published on HuggingFace for easy integration
| Language | Parser | Documentation Style |
|---|---|---|
| Python | tree-sitter-python | Google, NumPy, Sphinx |
| TypeScript | tree-sitter-typescript | JSDoc |
| JavaScript | tree-sitter-javascript | JSDoc |
| Java | tree-sitter-java | Javadoc |
| C++ | tree-sitter-cpp | Doxygen |
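The extractors in `src/parsers/` build on the tree-sitter grammars listed above. As a rough illustration of the idea (not the project's actual parser code), here is how a Python docstring can be pulled from a parsed function; the py-tree-sitter API differs slightly between versions:

```python
# Minimal sketch: locate a function's docstring with tree-sitter.
# Illustrative only; the project's parsers in src/parsers/ handle many more
# cases. Requires the tree-sitter and tree-sitter-python packages.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)  # older versions: parser = Parser(); parser.set_language(PY_LANGUAGE)

source = b'''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''

tree = parser.parse(source)
func = tree.root_node.children[0]          # the function_definition node
body = func.child_by_field_name("body")    # the function's block
first_stmt = body.children[0]              # first statement in the body

# A docstring is a bare string literal as the first statement of the body.
if first_stmt.type == "expression_statement" and first_stmt.children[0].type == "string":
    print(first_stmt.children[0].text.decode())  # prints the raw, quoted docstring
```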
```bash
# Clone the repository
git clone https://github.com/kaanrkaraman/DocstringGeneratorPEFT.git
cd DocstringGeneratorPEFT

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Set your GitHub token
export GITHUB_TOKEN="your_token_here"

# Extract data from configured repositories
python scripts/extract_data_parallel.py --languages python java --workers 4

# Run the filtering pipeline
python scripts/run_pipeline.py --input data/raw/ --output data/processed/

# Create train/val/test splits
python scripts/create_splits.py --input data/processed/filtered_data.jsonl --output data/final

# Upload the dataset to HuggingFace
export HF_TOKEN="your_hf_token"
python scripts/upload_to_hf.py --repo-name "username/dataset-name"
```
```
DocstringGeneratorPEFT/
├── src/
│   ├── parsers/                  # Language-specific parsers
│   │   ├── base_parser.py
│   │   ├── python_parser.py
│   │   ├── typescript_parser.py
│   │   ├── java_parser.py
│   │   └── cpp_parser.py
│   ├── filters/                  # Data filtering pipeline
│   │   ├── basic_filters.py
│   │   ├── quality_scorer.py
│   │   ├── deduplicator.py
│   │   ├── ai_detector.py
│   │   └── pipeline.py
│   ├── scrapers/                 # GitHub data extraction
│   │   ├── github_scraper.py
│   │   └── repo_selector.py
│   └── utils/                    # Shared utilities
│       └── logging_config.py
├── scripts/                      # CLI tools
│   ├── extract_data_parallel.py
│   ├── run_pipeline.py
│   ├── create_splits.py
│   └── upload_to_hf.py
├── config/
│   ├── repos.yaml                # Repository configuration
│   └── filtering.yaml            # Filtering parameters
└── data/
    ├── raw/                      # Extracted raw data
    ├── processed/                # Filtered data
    └── final/                    # Train/val/test splits
```
The filtering pipeline applies multiple stages to ensure data quality:
| Stage | Description | Metrics |
|---|---|---|
| Basic Filtering | Length constraints, test function removal | ~50% retention |
| Quality Scoring | Documentation completeness, code quality | Score 0-10 |
| Deduplication | Exact + near-duplicate removal (MinHash LSH) | ~10% removal |
| AI Detection | Flag potentially AI-generated docs | Pattern matching |
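Near-duplicate removal relies on MinHash LSH. The snippet below is a minimal sketch of that idea using the `datasketch` library; the project's deduplicator in `src/filters/deduplicator.py` may shingle code and tune `num_perm` and `threshold` differently:

```python
# Rough sketch of MinHash-LSH near-duplicate filtering (requires `datasketch`).
from datasketch import MinHash, MinHashLSH

def minhash(code: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over a function's whitespace-split tokens."""
    m = MinHash(num_perm=num_perm)
    for token in code.split():
        m.update(token.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard similarity threshold
kept = {}

samples = {
    "a": "def add(a, b):\n    return a + b",
    "b": "def add(a, b):\n        return a + b",   # duplicate of "a" up to whitespace
    "c": "def greet(name):\n    return f'hi {name}'",
}

for key, code in samples.items():
    sig = minhash(code)
    if lsh.query(sig):      # some previously kept sample is too similar
        continue            # drop as a near-duplicate
    lsh.insert(key, sig)
    kept[key] = code

print(sorted(kept))         # ['a', 'c'] -- "b" is dropped as a near-duplicate
```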
Repository configuration (`config/repos.yaml`):

```yaml
python:
  - name: django/django
  - name: pytorch/pytorch
  - name: pandas-dev/pandas

java:
  - name: google/guava
  - name: spring-projects/spring-framework
```

Filtering parameters (`config/filtering.yaml`):

```yaml
basic_filters:
  min_doc_length: 20
  max_doc_length: 10000
  min_code_length: 50
  max_code_length: 20000

quality_scoring:
  min_score: 3.0
  weights:
    has_description: 2.0
    has_parameters: 1.5
    has_return_type: 1.0
```
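The quality-scoring weights can be read as a weighted checklist compared against `min_score`. This is a minimal sketch of that idea; the real scorer in `src/filters/quality_scorer.py` uses more features and maps scores onto the 0-10 range reported earlier:

```python
# Illustrative weighted checklist based on the filtering.yaml weights above;
# feature names and scaling in the real scorer may differ.
WEIGHTS = {"has_description": 2.0, "has_parameters": 1.5, "has_return_type": 1.0}
MIN_SCORE = 3.0

def score_pair(doc_features: dict[str, bool]) -> float:
    """Sum the weights of the documentation features that are present."""
    return sum(w for name, w in WEIGHTS.items() if doc_features.get(name))

example = {"has_description": True, "has_parameters": True, "has_return_type": False}
print(score_pair(example))               # 3.5
print(score_pair(example) >= MIN_SCORE)  # True: this pair passes the threshold
```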
The curated dataset is published on HuggingFace as [`kaanrkaraman/code2doc`](https://huggingface.co/datasets/kaanrkaraman/code2doc):

- Total Samples: 13,358
- Train: 10,684 (80%)
- Validation: 1,334 (10%)
- Test: 1,340 (10%)
```python
from datasets import load_dataset

dataset = load_dataset("kaanrkaraman/code2doc")

# Format for training
def format_prompt(example):
    return {
        "input": f"Generate documentation for:\n{example['function_code']}",
        "output": example["documentation"],
    }

train_data = dataset["train"].map(format_prompt)
```
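From here the formatted split can feed a PEFT/LoRA fine-tune. The sketch below is illustrative only; the base checkpoint, target modules, and hyperparameters are placeholders, not the project's actual training configuration:

```python
# Illustrative LoRA setup with Hugging Face PEFT; all values are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "Qwen/Qwen2.5-Coder-1.5B"   # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

lora_config = LoraConfig(
    r=16,                                # rank of the LoRA update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapter weights are trainable

# `train_data` from the snippet above can then be tokenized and passed to a
# transformers Trainer (or TRL's SFTTrainer) as usual.
```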
For development, run type checking and formatting with:

```bash
# Run type checking
mypy src

# Format code
black src scripts
isort src scripts
```

If you use the dataset, please cite:

```bibtex
@misc{recep_kaan_karaman_2025,
  author    = {Recep Kaan Karaman and Meftun Akarsu},
  title     = {code2doc (Revision cadd4e4)},
  year      = 2025,
  url       = {https://huggingface.co/datasets/kaanrkaraman/code2doc},
  doi       = {10.57967/hf/7310},
  publisher = {Hugging Face}
}
```

This project is licensed under the MIT License. See LICENSE for details.
- Recep Kaan Karaman
- Meftun Akarsu