This repository contains the code for ASTChunk, an AST-based code chunker that preserves syntactic structure and semantic boundaries. ASTChunk divides source code into meaningful chunks that respect the Abstract Syntax Tree (AST) structure, which makes it well suited to code analysis, documentation generation, and machine learning applications.
This work is described in the following paper:
cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree
Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu
Bibtex for citations:
@misc{zhang-etal-2025-astchunk,
  title={cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree},
  author={Yilin Zhang and Xinran Zhao and Zora Zhiruo Wang and Chenyang Yang and Jiayi Wei and Tongshuang Wu},
  year={2025},
  url={https://arxiv.org/abs/2506.15655},
}
From PyPI:
pip install astchunk
From source:
git clone git@github.com:yilinjz/astchunk.git
cd astchunk
pip install -e .
ASTChunk depends on tree-sitter for parsing. The required language parsers are automatically installed:
# Core dependencies (automatically installed)
pip install numpy pyrsistent tree-sitter
pip install tree-sitter-python tree-sitter-java tree-sitter-c-sharp tree-sitter-typescript
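To confirm that the dependencies are importable after installation, a quick check along these lines may help. This is a minimal sketch and not part of ASTChunk: `missing_modules` is a hypothetical helper, and the module names are assumed to be the import names of the packages listed above (hyphens replaced by underscores).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages listed above.
REQUIRED = [
    "numpy",
    "pyrsistent",
    "tree_sitter",
    "tree_sitter_python",
    "tree_sitter_java",
    "tree_sitter_c_sharp",
    "tree_sitter_typescript",
]

missing = missing_modules(REQUIRED)
print("All dependencies found" if not missing else f"Missing: {missing}")
```

An empty result only shows that the modules can be located; it does not verify their versions.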
- `max_chunk_size`: Maximum non-whitespace characters per chunk
- `language`: Programming language for parsing
- `metadata_template`: Format for chunk metadata
- `repo_level_metadata` (optional): Repository-level metadata (e.g., repo name, file path)
- `chunk_overlap` (optional): Number of AST nodes to overlap between chunks
- `chunk_expansion` (optional): Whether to perform chunk expansion (i.e., add metadata headers to chunks)
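Because `max_chunk_size` is measured in non-whitespace characters, the budget a snippet consumes is smaller than its raw length. The sketch below illustrates the idea with a hypothetical helper, `non_whitespace_length`; it is an assumption about how such a count can be computed, not ASTChunk's internal implementation.

```python
def non_whitespace_length(text: str) -> int:
    """Count the characters in text that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


snippet = "def add(a, b):\n    return a + b\n"
print(len(snippet))                    # raw length, including spaces and newlines
print(non_whitespace_length(snippet))  # length that would count toward the budget
```

Indentation and blank lines therefore do not push a chunk over the limit under this way of counting.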
from astchunk import ASTChunkBuilder
# Your source code
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

class Calculator:
    def add(self, a, b):
        return a + b

    def multiply(self, a, b):
        return a * b
"""
# Initialize the chunk builder
configs = {
    "max_chunk_size": 100,             # Maximum non-whitespace characters per chunk
    "language": "python",              # Supported: python, java, csharp, typescript
    "metadata_template": "default"     # Metadata format for output
}
chunk_builder = ASTChunkBuilder(**configs)
# Create chunks
chunks = chunk_builder.chunkify(code)
# Each chunk contains content and metadata
for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
# Add repo-level metadata
configs['repo_level_metadata'] = {
    "filepath": "src/calculator.py"
}
# Enable overlapping between chunks
configs['chunk_overlap'] = 1
# Add chunk expansion (metadata headers)
configs['chunk_expansion'] = True
# NOTE: max_chunk_size applies to chunks before overlapping or chunk expansion.
# The final chunk size after overlapping or chunk expansion may exceed max_chunk_size.
# Extend current code for illustration
code += """
    def divide(self, a, b):
        if b == 0:
            raise ValueError("Cannot divide by zero")
        return a / b

    # This is a comment
    # Another comment
    def subtract(self, a, b):
        return a - b

    def exponent(self, a, b):
        return a ** b
"""
# Create chunks
chunks = chunk_builder.chunkify(code, **configs)
for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
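Since the final chunks may exceed `max_chunk_size` once overlap or expansion is applied, it can be useful to see which ones do. The following is a minimal sketch under stated assumptions: `oversized_chunks` is a hypothetical helper, it relies only on the `content` key shown in the output above, and it counts non-whitespace characters in a way that may differ from ASTChunk's internal accounting.

```python
def oversized_chunks(chunks, max_chunk_size):
    """Return (index, size) pairs for chunks whose non-whitespace size exceeds the budget."""
    result = []
    for i, chunk in enumerate(chunks):
        size = sum(1 for ch in chunk["content"] if not ch.isspace())
        if size > max_chunk_size:
            result.append((i, size))
    return result


# Stand-in data shaped like the chunkify() output; not real ASTChunk output.
example_chunks = [
    {"content": "def add(a, b):\n    return a + b\n", "metadata": {}},
    {"content": "x = 1\n", "metadata": {}},
]
print(oversized_chunks(example_chunks, max_chunk_size=10))
```

With real output, the same call would be `oversized_chunks(chunks, configs["max_chunk_size"])`.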
# Process a single file
with open("example.py", "r") as f:
    code = f.read()
# Alternatively, pass single-use configs for the optional arguments to an individual chunkify() call
single_use_configs = {
    "repo_level_metadata": {
        "filepath": "example.py"
    },
    "chunk_expansion": True
}
chunks = chunk_builder.chunkify(code, **single_use_configs)
# Save chunks to separate files
for i, chunk in enumerate(chunks):
    with open(f"chunk_{i+1}.py", "w") as f:
        f.write(chunk['content'])
# Python code
python_builder = ASTChunkBuilder(
    max_chunk_size=1500,
    language="python",
    metadata_template="default"
)
# Java code
java_builder = ASTChunkBuilder(
    max_chunk_size=2000,
    language="java",
    metadata_template="default"
)
# TypeScript code
ts_builder = ASTChunkBuilder(
    max_chunk_size=1800,
    language="typescript",
    metadata_template="default"
)
| Language | File Extensions | Status |
|---|---|---|
| Python | .py | ✅ Full support |
| Java | .java | ✅ Full support |
| C# | .cs | ✅ Full support |
| TypeScript | .ts, .tsx | ✅ Full support |
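When chunking a mixed-language repository, the table above can be turned into a lookup that picks the `language` value for each file. This is a minimal sketch: `EXTENSION_TO_LANGUAGE` and `language_for_path` are hypothetical names, and it assumes that `.tsx` files use the same `typescript` identifier as `.ts` files.

```python
from pathlib import Path

# Extension-to-language mapping based on the table above. The identifiers
# follow the supported values listed for the "language" option; using
# "typescript" for .tsx is an assumption.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".java": "java",
    ".cs": "csharp",
    ".ts": "typescript",
    ".tsx": "typescript",
}


def language_for_path(path):
    """Return the language identifier for a file path, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


print(language_for_path("src/calculator.py"))
print(language_for_path("README.md"))
```

Files that map to `None` can be skipped, or handled by a fallback chunker of your choice.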
This project is licensed under the MIT License - see the LICENSE file for details.
Current version: 0.1.0