
Code2Doc: Docstring Generation with PEFT

A comprehensive pipeline for extracting high-quality function-documentation pairs from open-source repositories and fine-tuning language models for code documentation generation.

Dataset DOI · Python 3.11+ · License: MIT · Code style: black

Overview

This project provides:

  1. Data Extraction Pipeline - Scrape GitHub repositories and extract function-documentation pairs using tree-sitter parsers
  2. Multi-Stage Filtering - Quality scoring, deduplication, and AI-generated content detection
  3. Dataset Preparation - Create train/val/test splits for model training
  4. Fine-tuning Ready - Dataset published on HuggingFace for easy integration

Supported Languages

Language   | Parser                 | Documentation Style
---------- | ---------------------- | ----------------------
Python     | tree-sitter-python     | Google, NumPy, Sphinx
TypeScript | tree-sitter-typescript | JSDoc
JavaScript | tree-sitter-javascript | JSDoc
Java       | tree-sitter-java       | Javadoc
C++        | tree-sitter-cpp        | Doxygen
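
Each parser walks the syntax tree of a source file and pairs every documented function with its docstring or doc comment. The real parsers in src/parsers/ are built on tree-sitter; the sketch below illustrates the same idea for Python using only the standard-library ast module (the extract_pairs name is illustrative and not part of the codebase):

import ast

def extract_pairs(source: str):
    """Yield (function_code, documentation) pairs from Python source.

    Simplified stand-in for the tree-sitter based parsers in src/parsers/.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only documented functions
                yield ast.unparse(node), doc

example = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''

for code, doc in extract_pairs(example):
    print(doc)  # -> Return the sum of a and b.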

Installation

# Clone the repository
git clone https://github.com/kaanrkaraman/DocstringGeneratorPEFT.git
cd DocstringGeneratorPEFT

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Quick Start

1. Extract Data from GitHub

# Set your GitHub token
export GITHUB_TOKEN="your_token_here"

# Extract data from configured repositories
python scripts/extract_data_parallel.py --languages python java --workers 4

2. Run Filtering Pipeline

python scripts/run_pipeline.py --input data/raw/ --output data/processed/

3. Create Train/Val/Test Splits

python scripts/create_splits.py --input data/processed/filtered_data.jsonl --output data/final

4. Upload to HuggingFace (Optional)

export HF_TOKEN="your_hf_token"
python scripts/upload_to_hf.py --repo-name "username/dataset-name"
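
The upload step essentially loads the final splits and pushes them to the Hub. A minimal sketch of that step, assuming JSONL split files named as below (file names and the repo id are placeholders; scripts/upload_to_hf.py is the authoritative implementation):

from datasets import load_dataset

# Load the final splits produced by create_splits.py (paths are illustrative)
dataset = load_dataset("json", data_files={
    "train": "data/final/train.jsonl",
    "validation": "data/final/val.jsonl",
    "test": "data/final/test.jsonl",
})

# Requires HF_TOKEN in the environment (or a prior `huggingface-cli login`)
dataset.push_to_hub("username/dataset-name")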

Project Structure

DocstringGeneratorPEFT/
├── src/
│   ├── parsers/           # Language-specific parsers
│   │   ├── base_parser.py
│   │   ├── python_parser.py
│   │   ├── typescript_parser.py
│   │   ├── java_parser.py
│   │   └── cpp_parser.py
│   ├── filters/           # Data filtering pipeline
│   │   ├── basic_filters.py
│   │   ├── quality_scorer.py
│   │   ├── deduplicator.py
│   │   ├── ai_detector.py
│   │   └── pipeline.py
│   ├── scrapers/          # GitHub data extraction
│   │   ├── github_scraper.py
│   │   └── repo_selector.py
│   └── utils/             # Shared utilities
│       └── logging_config.py
├── scripts/               # CLI tools
│   ├── extract_data_parallel.py
│   ├── run_pipeline.py
│   ├── create_splits.py
│   └── upload_to_hf.py
├── config/
│   ├── repos.yaml         # Repository configuration
│   └── filtering.yaml     # Filtering parameters
└── data/
    ├── raw/               # Extracted raw data
    ├── processed/         # Filtered data
    └── final/             # Train/val/test splits

Filtering Pipeline

The filtering pipeline applies multiple stages to ensure data quality:

Stage           | Description                                  | Metrics
--------------- | -------------------------------------------- | -----------------
Basic Filtering | Length constraints, test function removal    | ~50% retention
Quality Scoring | Documentation completeness, code quality     | Score 0-10
Deduplication   | Exact + near-duplicate removal (MinHash LSH) | ~10% removal
AI Detection    | Flag potentially AI-generated docs           | Pattern matching
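
To illustrate the near-duplicate stage, here is a minimal MinHash LSH sketch using the datasketch library (the tokenization and the 0.8 Jaccard threshold are assumptions; src/filters/deduplicator.py holds the actual logic):

from datasketch import MinHash, MinHashLSH

def minhash(code: str, num_perm: int = 128) -> MinHash:
    # Hash the function's whitespace-separated tokens
    m = MinHash(num_perm=num_perm)
    for token in code.split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)

samples = {
    "fn_0": "def add(a, b): return a + b",
    "fn_1": "def add(a,  b):   return a + b",   # same tokens, different whitespace
    "fn_2": "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)",
}

kept = []
for key, code in samples.items():
    m = minhash(code)
    if not lsh.query(m):   # no near-duplicate kept so far
        lsh.insert(key, m)
        kept.append(key)

print(kept)  # -> ['fn_0', 'fn_2']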

Configuration

Repository Config (config/repos.yaml)

python:
  - name: django/django
  - name: pytorch/pytorch
  - name: pandas-dev/pandas

java:
  - name: google/guava
  - name: spring-projects/spring-framework

Filtering Config (config/filtering.yaml)

basic_filters:
  min_doc_length: 20
  max_doc_length: 10000
  min_code_length: 50
  max_code_length: 20000

quality_scoring:
  min_score: 3.0
  weights:
    has_description: 2.0
    has_parameters: 1.5
    has_return_type: 1.0
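
The weights translate into a simple weighted sum that is compared against min_score. A rough sketch of how such a scorer could apply them (the feature checks are assumptions; src/filters/quality_scorer.py implements the full 0-10 scoring):

import re

# Mirrors the weights and threshold in config/filtering.yaml
WEIGHTS = {"has_description": 2.0, "has_parameters": 1.5, "has_return_type": 1.0}
MIN_SCORE = 3.0

def score_docstring(doc: str) -> float:
    features = {
        "has_description": len(doc.strip().splitlines()[0]) >= 10,
        "has_parameters": bool(re.search(r"Args:|Parameters|@param|:param", doc)),
        "has_return_type": bool(re.search(r"Returns:|@return|:return", doc)),
    }
    return sum(WEIGHTS[name] for name, present in features.items() if present)

doc = """Compute the arithmetic mean of a list.

Args:
    values: Numbers to average.

Returns:
    The mean as a float.
"""

score = score_docstring(doc)
print(score, score >= MIN_SCORE)  # 4.5 True -> sample is kept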

Dataset

The curated dataset is available on HuggingFace:

kaanrkaraman/code2doc (https://huggingface.co/datasets/kaanrkaraman/code2doc)

Statistics

  • Total Samples: 13,358
  • Train: 10,684 (80%)
  • Validation: 1,334 (10%)
  • Test: 1,340 (10%)

Usage

from datasets import load_dataset

dataset = load_dataset("kaanrkaraman/code2doc")

# Format for training
def format_prompt(example):
    return {
        "input": f"Generate documentation for:\n{example['function_code']}",
        "output": example["documentation"]
    }

train_data = dataset["train"].map(format_prompt)
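
Because the dataset targets PEFT fine-tuning, a minimal LoRA setup with the peft library could look like the sketch below (the base model and every hyperparameter here are illustrative choices, not the project's settings):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model = "codellama/CodeLlama-7b-hf"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with LoRA adapters; only the adapter weights are trained
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()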

Development

# Run type checking
mypy src

# Format code
black src scripts
isort src scripts

Citation

@misc{recep_kaan_karaman_2025,
  author       = {Recep Kaan Karaman and Meftun Akarsu},
  title        = {code2doc (Revision cadd4e4)},
  year         = 2025,
  url          = {https://huggingface.co/datasets/kaanrkaraman/code2doc},
  doi          = {10.57967/hf/7310},
  publisher    = {Hugging Face}
}

License

This project is licensed under the MIT License. See LICENSE for details.

Authors

  • Recep Kaan Karaman
  • Meftun Akarsu
