ASTChunk is a Python toolkit for code chunking using Abstract Syntax Trees (ASTs), designed to create structurally sound and meaningful code segments.

yilinjz/astchunk

ASTChunk

This repository contains code for AST-based code chunking that preserves syntactic structure and semantic boundaries. ASTChunk divides source code into chunks that follow the Abstract Syntax Tree (AST) structure, which makes it well suited to code analysis, documentation generation, and machine learning applications.
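To make the idea of structure-aware chunking concrete, here is a minimal sketch that groups top-level statements of a Python file into chunks under a non-whitespace character budget. It is an illustration only: it uses Python's standard `ast` module instead of tree-sitter, the names `greedy_ast_chunks` and `non_ws_len` are hypothetical, and it is not ASTChunk's implementation. It also drops comments that sit between statements and does not split a single statement that is larger than the budget.

```python
import ast


def non_ws_len(text: str) -> int:
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def greedy_ast_chunks(source: str, max_size: int) -> list[str]:
    """Group consecutive top-level statements into chunks of at most max_size
    non-whitespace characters, never cutting through a statement."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[str] = []
    current: list[str] = []
    current_size = 0
    for node in tree.body:
        # Include decorator lines, which start before node.lineno.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        # ast line numbers are 1-based and end_lineno is inclusive.
        segment = "\n".join(lines[start - 1 : node.end_lineno])
        size = non_ws_len(segment)
        # Close the current chunk if adding this statement would exceed the budget.
        if current and current_size + size > max_size:
            chunks.append("\n".join(current))
            current, current_size = [], 0
        current.append(segment)
        current_size += size
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Because chunk boundaries only fall between whole statements, a function or class is never cut in half, which is the property a fixed-size line or character splitter cannot guarantee.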

This work is described in the following paper:

cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree
Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu

Bibtex for citations:

@misc{zhang-etal-2025-astchunk,
      title={cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree}, 
      author={Yilin Zhang and Xinran Zhao and Zora Zhiruo Wang and Chenyang Yang and Jiayi Wei and Tongshuang Wu},
      year={2025},
      url={https://arxiv.org/abs/2506.15655}, 
}

Installation

From PyPI:

pip install astchunk

From source:

git clone git@github.com:yilinjz/astchunk.git
cd astchunk
pip install -e .

ASTChunk depends on tree-sitter for parsing. The required language parsers are automatically installed:

# Core dependencies (automatically installed)
pip install numpy pyrsistent tree-sitter
pip install tree-sitter-python tree-sitter-java tree-sitter-c-sharp tree-sitter-typescript
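After installing, it can be useful to confirm that the dependencies are importable. The sketch below uses only the standard library. The module names are the usual import names for the distributions listed above (for example `tree_sitter_c_sharp` for `tree-sitter-c-sharp`), and `check_dependencies` is a hypothetical helper, not part of ASTChunk.

```python
from importlib.util import find_spec

# Import names assumed for the pip distributions listed above.
REQUIRED_MODULES = [
    "numpy",
    "pyrsistent",
    "tree_sitter",
    "tree_sitter_python",
    "tree_sitter_java",
    "tree_sitter_c_sharp",
    "tree_sitter_typescript",
]


def check_dependencies(modules=REQUIRED_MODULES):
    """Map each module name to True if it can be found, without importing it."""
    return {name: find_spec(name) is not None for name in modules}


for name, available in check_dependencies().items():
    print(f"{name}: {'ok' if available else 'MISSING'}")
```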

Configuration Options

  • max_chunk_size: Maximum non-whitespace characters per chunk
  • language: Programming language for parsing (python, java, csharp, or typescript)
  • metadata_template: Format for chunk metadata
  • repo_level_metadata (optional): Repository-level metadata (e.g., repo name, file path)
  • chunk_overlap (optional): Number of AST nodes to overlap between chunks
  • chunk_expansion (optional): Whether to perform chunk expansion (i.e., add metadata headers to chunks)
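The options above can be checked before a builder is constructed. The following is a minimal sketch of such a pre-flight check. `validate_configs` is a hypothetical helper rather than part of ASTChunk, and the rules it applies (which keys are required, which language identifiers are valid, integer ranges) are assumptions read off this README. The library itself may accept other values.

```python
REQUIRED_KEYS = {"max_chunk_size", "language", "metadata_template"}
OPTIONAL_KEYS = {"repo_level_metadata", "chunk_overlap", "chunk_expansion"}
SUPPORTED_LANGUAGES = {"python", "java", "csharp", "typescript"}


def validate_configs(configs: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(configs)):
        problems.append(f"missing required option: {key}")
    for key in sorted(set(configs) - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append(f"unknown option: {key}")

    size = configs.get("max_chunk_size")
    if size is not None and (not isinstance(size, int) or size <= 0):
        problems.append("max_chunk_size should be a positive integer")

    language = configs.get("language")
    if language is not None and language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language: {language}")

    overlap = configs.get("chunk_overlap", 0)
    if not isinstance(overlap, int) or overlap < 0:
        problems.append("chunk_overlap should be a non-negative integer")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every configuration issue in a single pass.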

Quick Start

from astchunk import ASTChunkBuilder

# Your source code
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

class Calculator:
    def add(self, a, b):
        return a + b
    
    def multiply(self, a, b):
        return a * b
"""

# Initialize the chunk builder
configs = {
    "max_chunk_size": 100,             # Maximum non-whitespace characters per chunk
    "language": "python",              # Supported: python, java, csharp, typescript
    "metadata_template": "default"     # Metadata format for output
}
chunk_builder = ASTChunkBuilder(**configs)

# Create chunks
chunks = chunk_builder.chunkify(code)

# Each chunk contains content and metadata
for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)

Advanced Usage

Customizing Chunk Parameters

# Add repo-level metadata
configs['repo_level_metadata'] = {
    "filepath": "src/calculator.py"
}

# Enable overlapping between chunks
configs['chunk_overlap'] = 1

# Add chunk expansion (metadata headers)
configs['chunk_expansion'] = True

# NOTE: max_chunk_size applies to the chunks before overlapping or chunk expansion.
# The final chunk size after overlapping or chunk expansion may exceed max_chunk_size.


# Extend current code for illustration
code += """
def divide(self, a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

# This is a comment
# Another comment

def subtract(self, a, b):
    return a - b

def exponent(self, a, b):
    return a ** b
"""


# Create chunks
chunks = chunk_builder.chunkify(code, **configs)

for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
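The note about sizes exceeding max_chunk_size after overlapping can be illustrated without the library. In the sketch below each chunk is a list of node texts, and overlap is modelled by prepending the last nodes of the previous chunk. This one-sided scheme and the name `add_overlap` are assumptions made for illustration. How ASTChunk applies `chunk_overlap` internally may differ.

```python
def add_overlap(groups, overlap):
    """Prepend the last `overlap` nodes of the previous group to each group."""
    if overlap <= 0:
        return [list(group) for group in groups]
    result = []
    for i, group in enumerate(groups):
        # The first group has no predecessor, so it gets no prefix.
        prefix = groups[i - 1][-overlap:] if i > 0 else []
        result.append(list(prefix) + list(group))
    return result


# Hypothetical node groups, as if produced by a chunker before overlapping.
groups = [["def a(): ...", "def b(): ..."], ["def c(): ..."]]
for before, after in zip(groups, add_overlap(groups, overlap=1)):
    print(len(before), "->", len(after), "nodes")
```

Every group except the first grows by up to `overlap` nodes, so a chunk that was exactly at the size limit before overlapping ends up above it.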

Working with Files

# Process a single file
with open("example.py", "r", encoding="utf-8") as f:
    code = f.read()

# Optional arguments can also be passed as single-use configs on each chunkify() call
single_use_configs = {
    "repo_level_metadata": {
        "filepath": "example.py"
    },
    "chunk_expansion": True
}

chunks = chunk_builder.chunkify(code, **single_use_configs)

# Save chunks to separate files
for i, chunk in enumerate(chunks):
    with open(f"chunk_{i+1}.py", "w", encoding="utf-8") as f:
        f.write(chunk['content'])
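Writing only `chunk['content']` discards the metadata. When the metadata is needed later, for example to build a retrieval index, one option is to store each chunk as a JSON line. This is a sketch under assumptions: `write_chunks_jsonl` is a hypothetical helper, and it assumes the metadata is a dictionary. Values that are not JSON-serializable are converted with `str`.

```python
import json
from pathlib import Path


def write_chunks_jsonl(chunks, out_path):
    """Write one JSON object per chunk, keeping content and metadata together."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks, start=1):
            record = {
                "index": i,
                "content": chunk["content"],
                "metadata": chunk["metadata"],
            }
            # default=str is a fallback for values json cannot encode directly.
            f.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
    return out_path
```

A call such as `write_chunks_jsonl(chunks, "chunks.jsonl")` then produces a single file that can be read back line by line with `json.loads`.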

Processing Multiple Languages

# Python code
python_builder = ASTChunkBuilder(
    max_chunk_size=1500,
    language="python",
    metadata_template="default"
)

# Java code  
java_builder = ASTChunkBuilder(
    max_chunk_size=2000,
    language="java", 
    metadata_template="default"
)

# TypeScript code
ts_builder = ASTChunkBuilder(
    max_chunk_size=1800,
    language="typescript",
    metadata_template="default"
)

Supported Languages

Language   | File Extensions | Status
Python     | .py             | ✅ Full support
Java       | .java           | ✅ Full support
C#         | .cs             | ✅ Full support
TypeScript | .ts, .tsx       | ✅ Full support
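When processing a mixed repository, the table above can be turned into a lookup from file extension to the `language` value. The sketch below is hypothetical helper code, not part of ASTChunk. The language identifiers are the ones named in the Quick Start comment, and treating `.tsx` as `typescript` is an assumption based on the table.

```python
from pathlib import Path

# Extension -> language identifier, following the table above.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".java": "java",
    ".cs": "csharp",
    ".ts": "typescript",
    ".tsx": "typescript",
}


def language_for(path):
    """Return the language identifier for a file path, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


for name in ["src/calculator.py", "Main.java", "App.tsx", "README.md"]:
    print(name, "->", language_for(name))
```

Files that map to `None` can be skipped or passed to a plain-text splitter, depending on the application.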

License

This project is licensed under the MIT License - see the LICENSE file for details.

Version

Current version: 0.1.0
