AST-based code chunking library for improved code analysis and processing
ASTChunk
This repository contains code for AST-based code chunking that preserves syntactic structure and semantic boundaries. ASTChunk divides source code into chunks while respecting the Abstract Syntax Tree (AST) structure, which makes it suitable for code analysis, documentation generation, and machine learning applications.
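To make the idea of chunking along AST boundaries concrete, here is a minimal sketch. It is not ASTChunk's algorithm: it uses Python's standard-library `ast` module instead of tree-sitter, only looks at top-level statements, and the names `non_ws_len` and `chunk_top_level` are hypothetical helpers written for this illustration.

```python
import ast


def non_ws_len(text: str) -> int:
    """Count the characters in text that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def chunk_top_level(source: str, max_chunk_size: int) -> list[str]:
    """Greedily group top-level statements into chunks under a size budget.

    A statement larger than the budget becomes its own chunk in this sketch;
    a fuller implementation would descend into its child nodes instead.
    """
    tree = ast.parse(source)
    chunks: list[str] = []
    current: list[str] = []
    current_size = 0
    for node in tree.body:
        segment = ast.get_source_segment(source, node) or ""
        size = non_ws_len(segment)
        # Close the current chunk if this statement would exceed the budget.
        if current and current_size + size > max_chunk_size:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(segment)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = (
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n-1) + fibonacci(n-2)\n"
    "\n"
    "class Calculator:\n"
    "    def add(self, a, b):\n"
    "        return a + b\n"
)

for piece in chunk_top_level(sample, max_chunk_size=80):
    print(piece)
    print("-" * 20)
```

Because splits only happen between whole statements, no function or class is cut in the middle, which is the property a purely line-based or character-based splitter cannot guarantee.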
Installation
From PyPI:
pip install astchunk
From source:
git clone git@github.com:yilinjz/astchunk.git
cd astchunk
pip install -e .
ASTChunk depends on tree-sitter for parsing. The required language parsers are automatically installed:
# Core dependencies (automatically installed)
pip install numpy pyrsistent tree-sitter
pip install tree-sitter-python tree-sitter-java tree-sitter-c-sharp tree-sitter-typescript
Configuration Options
- max_chunk_size: Maximum non-whitespace characters per chunk
- language: Programming language for parsing
- metadata_template: Format for chunk metadata
- repo_level_metadata (optional): Repository-level metadata (e.g., repo name, file path)
- chunk_overlap (optional): Number of AST nodes to overlap between chunks
- chunk_expansion (optional): Whether to perform chunk expansion (i.e., add metadata headers to chunks)
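Since max_chunk_size is measured in non-whitespace characters, indentation and blank lines do not count toward the budget. The snippet below is a small illustration of that metric; `non_whitespace_size` is a hypothetical helper, not a function exported by ASTChunk, and the library's own counting may differ in detail.

```python
def non_whitespace_size(text: str) -> int:
    """Return the number of characters in text that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


# The same function, once at module level and once nested one level deeper.
flat = "def add(a, b):\n    return a + b\n"
nested = "    def add(a, b):\n        return a + b\n"

# Raw lengths differ because of the extra indentation.
print(len(flat), len(nested))
# Non-whitespace sizes are identical for both versions.
print(non_whitespace_size(flat), non_whitespace_size(nested))
```

Under this metric, a deeply nested method consumes the same budget as the identical code at top level.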
Quick Start
from astchunk import ASTChunkBuilder
# Your source code
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

class Calculator:
    def add(self, a, b):
        return a + b

    def multiply(self, a, b):
        return a * b
"""
# Initialize the chunk builder
configs = {
    "max_chunk_size": 100,          # Maximum non-whitespace characters per chunk
    "language": "python",           # Supported: python, java, csharp, typescript
    "metadata_template": "default"  # Metadata format for output
}
chunk_builder = ASTChunkBuilder(**configs)
# Create chunks
chunks = chunk_builder.chunkify(code)
# Each chunk contains content and metadata
for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
Advanced Usage
Customizing Chunk Parameters
# Add repo-level metadata
configs['repo_level_metadata'] = {
    "filepath": "src/calculator.py"
}
# Enable overlapping between chunks
configs['chunk_overlap'] = 1
# Add chunk expansion (metadata headers)
configs['chunk_expansion'] = True
# NOTE: max_chunk_size applies to chunks before overlapping or chunk expansion.
# The final chunk size after overlapping or chunk expansion may exceed max_chunk_size.
# Extend current code for illustration
code += """
    def divide(self, a, b):
        if b == 0:
            raise ValueError("Cannot divide by zero")
        return a / b

    # This is a comment
    # Another comment
    def subtract(self, a, b):
        return a - b

    def exponent(self, a, b):
        return a ** b
"""
# Create chunks
chunks = chunk_builder.chunkify(code, **configs)
for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
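The note above says a chunk can grow past max_chunk_size once overlap is applied. The sketch below shows why, under a stated assumption: it treats each chunk as a list of node texts and borrows `overlap` nodes from each neighbouring chunk. ASTChunk's actual overlap strategy is not documented here and may differ; `add_overlap` is a hypothetical helper for illustration only.

```python
def add_overlap(groups, overlap):
    """Extend each group with `overlap` nodes from each neighbouring group.

    Assumption for this sketch: overlap is taken from both the previous
    and the next group. The real library may choose differently.
    """
    if overlap <= 0:
        return [list(group) for group in groups]
    result = []
    for i, group in enumerate(groups):
        before = groups[i - 1][-overlap:] if i > 0 else []
        after = groups[i + 1][:overlap] if i + 1 < len(groups) else []
        result.append(before + list(group) + after)
    return result


# Each inner list stands for the AST nodes assigned to one chunk.
groups = [["a = 1", "b = 2"], ["c = 3"], ["d = 4", "e = 5"]]

for original, widened in zip(groups, add_overlap(groups, overlap=1)):
    print(len(original), "->", len(widened), widened)
```

Every widened group holds more nodes than it was originally assigned, so a group that was sized right at the budget ends up over it after overlap.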
Working with Files
# Process a single file
with open("example.py", "r") as f:
    code = f.read()
# Alternatively, pass the optional arguments as single-use configs on each chunkify() call
single_use_configs = {
    "repo_level_metadata": {
        "filepath": "example.py"
    },
    "chunk_expansion": True
}
chunks = chunk_builder.chunkify(code, **single_use_configs)
# Save chunks to separate files
for i, chunk in enumerate(chunks):
    with open(f"chunk_{i+1}.py", "w") as f:
        f.write(chunk['content'])
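When chunks from several source files land in one directory, it helps to name each output after its source file. The following sketch uses only the standard library and assumes each chunk is a dict with a 'content' string, as in the examples above. `save_chunks` is a hypothetical helper, and the stand-in chunks (including the empty metadata dicts) are made up so the snippet runs without ASTChunk installed.

```python
import tempfile
from pathlib import Path


def save_chunks(chunks, source_path, out_dir):
    """Write each chunk's content to out_dir, named after the source file."""
    source = Path(source_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, chunk in enumerate(chunks, start=1):
        # Keep the original extension so editors still highlight the chunk.
        target = out / f"{source.stem}_chunk_{i}{source.suffix}"
        target.write_text(chunk["content"], encoding="utf-8")
        written.append(target)
    return written


# Stand-in data with the same keys as the chunks used in this README.
fake_chunks = [
    {"content": "def add(a, b):\n    return a + b\n", "metadata": {}},
    {"content": "def sub(a, b):\n    return a - b\n", "metadata": {}},
]

with tempfile.TemporaryDirectory() as tmp:
    for path in save_chunks(fake_chunks, "example.py", tmp):
        print(path.name)
```

In real use, the list returned by chunkify() would take the place of fake_chunks.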
Processing Multiple Languages
# Python code
python_builder = ASTChunkBuilder(
    max_chunk_size=1500,
    language="python",
    metadata_template="default"
)
# Java code
java_builder = ASTChunkBuilder(
    max_chunk_size=2000,
    language="java",
    metadata_template="default"
)
# TypeScript code
ts_builder = ASTChunkBuilder(
    max_chunk_size=1800,
    language="typescript",
    metadata_template="default"
)
Supported Languages
| Language | File Extensions | Status |
|---|---|---|
| Python | .py | ✅ Full support |
| Java | .java | ✅ Full support |
| C# | .cs | ✅ Full support |
| TypeScript | .ts, .tsx | ✅ Full support |
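When processing a mixed-language repository, the table above can be turned into a lookup that picks the language value for each file. This is a sketch rather than part of ASTChunk: `EXTENSION_TO_LANGUAGE` and `language_for` are hypothetical names, and the language strings are the ones listed in the Quick Start comment (python, java, csharp, typescript). Mapping .tsx to "typescript" is an assumption based on the table.

```python
from pathlib import Path

# Extensions from the Supported Languages table, mapped to the
# language names shown in the Quick Start configuration comment.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".java": "java",
    ".cs": "csharp",
    ".ts": "typescript",
    ".tsx": "typescript",
}


def language_for(path):
    """Return the language name for a file path, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


for name in ["src/calculator.py", "App.tsx", "Program.cs", "notes.md"]:
    print(name, "->", language_for(name))
```

Files that map to None can be skipped or handled by a fallback splitter before any builder is created.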
License
This project is licensed under the MIT License - see the LICENSE file for details.
Version
Current version: 0.1.0