Releases: kaanrkaraman/code2doc
Code2Doc Dataset Curation Pipeline (v1.0.0)
This release corresponds to the version of the Code2Doc dataset curation pipeline
described in the accompanying research paper:
“Code2Doc: A Curated Dataset for High-Quality Code Documentation.”
Purpose
This release provides a frozen, reproducible implementation of the full dataset
curation pipeline, including:
- Repository-level data extraction
- Multi-stage heuristic filtering
- Documentation quality scoring
- Exact and near-duplicate removal
- Heuristic detection of potentially AI-generated documentation
All thresholds, heuristics, and design choices correspond exactly to those reported
in the paper.
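As an illustration of the exact and near-duplicate removal stage listed above, the following is a minimal sketch and not the pipeline's actual code. It assumes exact duplicates are caught by hashing whitespace-normalized text and near-duplicates by Jaccard similarity over token shingles. The function names, the shingle size, and the `near_threshold` value of 0.8 are all hypothetical and are not taken from the paper.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-token shingles of the normalized text."""
    tokens = normalize(text).split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list, near_threshold: float = 0.8) -> list:
    """Keep the first occurrence of each docstring, dropping exact and near duplicates.

    near_threshold is a placeholder value (assumption), not the paper's setting.
    Documents are visited in input order, so the same input always yields
    the same output.
    """
    seen_hashes = set()
    kept = []
    kept_shingles = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= near_threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

The pairwise comparison is quadratic in the number of kept documents, which is acceptable for a sketch; a large-scale pipeline would more plausibly use an approximate technique such as MinHash with locality-sensitive hashing.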
Reproducibility
- All filtering thresholds are centralized and configurable
- Deduplication is deterministic
- This version is frozen: assume no changes to it after release
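One way to keep thresholds centralized and configurable is a single immutable configuration object, sketched below. This is an assumption about how such a setup could look, not a description of the repository's layout. The `Thresholds` class, its field names, and every default value are hypothetical placeholders; they are not the values reported in the paper.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    """Single place for every filtering threshold (all values are placeholders)."""
    min_doc_tokens: int = 10                 # hypothetical minimum docstring length
    min_quality_score: float = 0.5           # hypothetical quality cutoff
    near_duplicate_similarity: float = 0.8   # hypothetical similarity cutoff
    ai_likelihood_cutoff: float = 0.9        # hypothetical AI-detection cutoff


def load_thresholds(overrides: Optional[dict] = None) -> Thresholds:
    """Build the configuration, applying any explicit overrides to the defaults."""
    return Thresholds(**(overrides or {}))


def fingerprint(thresholds: Thresholds) -> str:
    """Stable hash of the configuration, useful for recording which settings produced a run."""
    payload = json.dumps(asdict(thresholds), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the dataclass is frozen, a threshold cannot be changed by accident partway through a run, and the fingerprint can be stored alongside the output so that two runs can be checked for identical settings.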
This release is archived via Zenodo and intended for use in reproducible research.
Future development will occur in subsequent versions.