Releases: kaanrkaraman/code2doc

Code2Doc Dataset Curation Pipeline (v1.0.0)

21 Dec 07:35
2be322d

This release corresponds to the version of the Code2Doc dataset curation pipeline
described in the accompanying research paper:

“Code2Doc: A Curated Dataset for High-Quality Code Documentation.”

Purpose

This release provides a frozen, reproducible implementation of the full dataset
curation pipeline, including:

  • Repository-level data extraction
  • Multi-stage heuristic filtering
  • Documentation quality scoring
  • Exact and near-duplicate removal
  • Heuristic detection of potentially AI-generated documentation

All thresholds, heuristics, and design choices correspond exactly to those reported
in the paper.
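
To make the deduplication stage above concrete, here is a minimal sketch of deterministic exact and near-duplicate removal. It is an illustration only, not the pipeline's actual code: the function names, the 3-word shingle size, and the 0.8 Jaccard threshold are all assumptions chosen for the example, and the release itself may use a different technique.

```python
from __future__ import annotations

import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, k: int = 3) -> frozenset:
    """Return the set of k-word shingles of the normalized text (k is an assumed value)."""
    words = normalize(text).split()
    if len(words) < k:
        return frozenset([" ".join(words)])
    return frozenset(" ".join(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(samples: list[str], near_threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each sample; drop exact and near duplicates.

    Samples are processed in input order, so the same input always
    yields the same output. The threshold is a hypothetical default.
    """
    seen_hashes = set()
    kept = []
    kept_shingles = []
    for sample in samples:
        # Exact duplicates: identical after normalization.
        digest = hashlib.sha256(normalize(sample).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Near duplicates: high shingle overlap with an already kept sample.
        sample_shingles = shingles(sample)
        if any(jaccard(sample_shingles, other) >= near_threshold for other in kept_shingles):
            continue
        seen_hashes.add(digest)
        kept.append(sample)
        kept_shingles.append(sample_shingles)
    return kept
```

The pairwise comparison is quadratic in the number of kept samples, which is acceptable for a sketch but would likely need an approximate method such as MinHash at dataset scale.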

Reproducibility

  • All filtering thresholds are centralized and configurable
  • Deduplication is deterministic
  • This version is frozen and will not be changed after release
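
One common way to centralize thresholds so that they stay configurable yet fixed during a run is an immutable configuration object. The sketch below assumes this pattern; the class name, field names, and default values are hypothetical placeholders and are not the thresholds reported in the paper or the identifiers used in the repository.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class FilterConfig:
    # Placeholder values for illustration only, not the paper's thresholds.
    min_doc_words: int = 5
    min_quality_score: float = 0.5
    near_duplicate_threshold: float = 0.8


def load_config(overrides: Optional[dict] = None) -> FilterConfig:
    """Build the single config object, applying any user overrides."""
    return FilterConfig(**(overrides or {}))


config = load_config({"min_doc_words": 10})

# Serializing with sorted keys gives a stable record of the exact settings used.
print(json.dumps(asdict(config), sort_keys=True))
```

Because the dataclass is frozen, any attempt to change a threshold after construction raises an error, so every stage of a run sees the same values.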

This release is archived via Zenodo and intended for use in reproducible research.
Future development will occur in subsequent versions.