AST-based code chunking library for improved code analysis and processing

ASTChunk

ASTChunk splits source code into chunks that follow Abstract Syntax Tree (AST) boundaries, so each chunk preserves syntactic structure and stays semantically coherent. This makes it well suited to code analysis, documentation generation, and machine learning applications.

Installation

From PyPI:

pip install astchunk

From source:

git clone git@github.com:yilinjz/astchunk.git
cd astchunk
pip install -e .

ASTChunk depends on tree-sitter for parsing. The required language parsers are installed automatically with the package; the equivalent manual commands are:

# Core dependencies (automatically installed)
pip install numpy pyrsistent tree-sitter
pip install tree-sitter-python tree-sitter-java tree-sitter-c-sharp tree-sitter-typescript
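
To confirm that the dependencies are importable after installation, a small check like the following can help. It is a minimal sketch using only the standard library; the module names are the usual import names for the pip packages listed above, and find_missing is a hypothetical helper, not part of ASTChunk.

```python
from importlib.util import find_spec

# Import names corresponding to the pip packages listed above.
REQUIRED_MODULES = [
    "numpy",
    "pyrsistent",
    "tree_sitter",
    "tree_sitter_python",
    "tree_sitter_java",
    "tree_sitter_c_sharp",
    "tree_sitter_typescript",
]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

An empty result means every listed module can be located; it does not verify versions.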

Configuration Options

  • max_chunk_size: Maximum non-whitespace characters per chunk
  • language: Programming language for parsing
  • metadata_template: Format for chunk metadata
  • repo_level_metadata (optional): Repository-level metadata (e.g., repo name, file path)
  • chunk_overlap (optional): Number of AST nodes to overlap between chunks
  • chunk_expansion (optional): Whether to perform chunk expansion (i.e., add metadata headers to chunks)
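
Because max_chunk_size is measured in non-whitespace characters, indentation and blank lines do not count toward the limit. The sketch below shows one way such a size could be computed; non_whitespace_size is a hypothetical helper for illustration and is not taken from the ASTChunk source.

```python
def non_whitespace_size(text: str) -> int:
    """Count characters that are not whitespace (spaces, tabs, newlines)."""
    return sum(1 for ch in text if not ch.isspace())


snippet = "def add(a, b):\n    return a + b\n"
print(len(snippet))                  # raw length, whitespace included: 32
print(non_whitespace_size(snippet))  # non-whitespace characters only: 21
```

Under this metric, deeply indented code is not penalized relative to flat code with the same tokens.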

Quick Start

from astchunk import ASTChunkBuilder

# Your source code
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

class Calculator:
    def add(self, a, b):
        return a + b
    
    def multiply(self, a, b):
        return a * b
"""

# Initialize the chunk builder
configs = {
    "max_chunk_size": 100,             # Maximum non-whitespace characters per chunk
    "language": "python",              # Supported: python, java, csharp, typescript
    "metadata_template": "default"     # Metadata format for output
}
chunk_builder = ASTChunkBuilder(**configs)

# Create chunks
chunks = chunk_builder.chunkify(code)

# Each chunk contains content and metadata
for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
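
To build intuition for what chunking along AST boundaries means, the sketch below packs whole top-level statements into chunks under a size budget. It uses Python's built-in ast module rather than tree-sitter, and greedy packing is an assumption made for illustration; it is not ASTChunk's actual algorithm, and greedy_ast_chunks and non_whitespace_size are hypothetical names.

```python
import ast


def non_whitespace_size(text: str) -> int:
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def greedy_ast_chunks(source: str, max_chunk_size: int) -> list[str]:
    """Pack whole top-level statements into chunks without splitting any of them.

    A statement larger than max_chunk_size becomes its own oversized chunk;
    this sketch does not recurse into it.
    """
    tree = ast.parse(source)
    chunks, current, current_size = [], [], 0
    for node in tree.body:
        segment = ast.get_source_segment(source, node)
        size = non_whitespace_size(segment)
        # Start a new chunk when adding this statement would exceed the budget.
        if current and current_size + size > max_chunk_size:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(segment)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


source = '''
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

class Calculator:
    def add(self, a, b):
        return a + b
'''

for chunk in greedy_ast_chunks(source, max_chunk_size=100):
    print(chunk)
    print("-" * 20)
```

With a budget of 100 the function and the class do not fit together, so each lands in its own chunk and neither is cut mid-definition.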

Advanced Usage

Customizing Chunk Parameters

# Add repo-level metadata
configs['repo_level_metadata'] = {
    "filepath": "src/calculator.py"
}

# Enable overlapping between chunks
configs['chunk_overlap'] = 1

# Add chunk expansion (metadata headers)
configs['chunk_expansion'] = True

# NOTE: max_chunk_size applies to chunks before overlapping or chunk expansion.
# The final chunk size after overlapping or chunk expansion may exceed max_chunk_size.


# Extend current code for illustration
code += """
def divide(self, a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

# This is a comment
# Another comment

def subtract(self, a, b):
    return a - b

def exponent(self, a, b):
    return a ** b
"""


# Create chunks
chunks = chunk_builder.chunkify(code, **configs)

for i, chunk in enumerate(chunks):
    print(f"[Chunk {i+1}]")
    print(f"{chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 50)
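
The note above about max_chunk_size can be illustrated with a small sketch. It assumes that overlap borrows up to chunk_overlap neighbouring units from the previous and the next chunk; whether ASTChunk overlaps on both sides is an assumption here, and add_overlap is a hypothetical helper, not the library's code.

```python
def non_whitespace_size(text: str) -> int:
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def add_overlap(groups: list[list[str]], overlap: int) -> list[list[str]]:
    """Extend each group with up to `overlap` units from each neighbouring group."""
    result = []
    for i, group in enumerate(groups):
        # Guard overlap > 0 because a slice like [-0:] would return the whole list.
        before = groups[i - 1][-overlap:] if i > 0 and overlap > 0 else []
        after = groups[i + 1][:overlap] if i + 1 < len(groups) and overlap > 0 else []
        result.append(before + group + after)
    return result


# Each inner list stands in for the AST nodes assigned to one chunk.
groups = [
    ["import os", "import sys"],
    ["x = os.getcwd()"],
    ["print(x)", "print(sys.argv)"],
]

for original, extended in zip(groups, add_overlap(groups, overlap=1)):
    size_before = non_whitespace_size("".join(original))
    size_after = non_whitespace_size("".join(extended))
    print(f"{size_before} -> {size_after} non-whitespace characters")
```

Every chunk grows once neighbours are added, which is why a budget enforced before overlap cannot bound the final size.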

Working with Files

# Process a single file
with open("example.py", "r") as f:
    code = f.read()

# Optional arguments can also be passed as single-use configs on each chunkify() call
single_use_configs = {
    "repo_level_metadata": {
        "filepath": "example.py"
    },
    "chunk_expansion": True
}

chunks = chunk_builder.chunkify(code, **single_use_configs)

# Save chunks to separate files
for i, chunk in enumerate(chunks):
    with open(f"chunk_{i+1}.py", "w") as f:
        f.write(chunk['content'])
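
Writing only chunk['content'] discards the metadata. If both should be kept together, one option is a JSON Lines file with one chunk per line. The sketch below uses only the standard library; the example chunks are hand-written stand-ins for chunkify() output, and save_chunks_jsonl and load_chunks_jsonl are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path


def save_chunks_jsonl(chunks, path):
    """Write one JSON object per line, keeping content and metadata together."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            record = {"content": chunk["content"], "metadata": chunk["metadata"]}
            # default=str is a fallback for metadata values that are not JSON types.
            f.write(json.dumps(record, default=str) + "\n")


def load_chunks_jsonl(path):
    """Read the chunks back as a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-ins for real chunks; actual metadata fields depend on the metadata template.
example_chunks = [
    {"content": "def add(a, b):\n    return a + b", "metadata": {"filepath": "example.py"}},
    {"content": "def sub(a, b):\n    return a - b", "metadata": {"filepath": "example.py"}},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "chunks.jsonl"
    save_chunks_jsonl(example_chunks, out_path)
    restored = load_chunks_jsonl(out_path)

print(restored == example_chunks)
```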

Processing Multiple Languages

# Python code
python_builder = ASTChunkBuilder(
    max_chunk_size=1500,
    language="python",
    metadata_template="default"
)

# Java code
java_builder = ASTChunkBuilder(
    max_chunk_size=2000,
    language="java",
    metadata_template="default"
)

# TypeScript code
ts_builder = ASTChunkBuilder(
    max_chunk_size=1800,
    language="typescript",
    metadata_template="default"
)

Supported Languages

Language    File Extensions  Status
Python      .py              ✅ Full support
Java        .java            ✅ Full support
C#          .cs              ✅ Full support
TypeScript  .ts, .tsx        ✅ Full support
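
When processing a mixed repository, the language setting can be chosen from the file extension. The sketch below encodes the table above, using the language names shown in the Quick Start comment (python, java, csharp, typescript); language_for_path is a hypothetical helper, not part of ASTChunk.

```python
from pathlib import Path

# File extensions from the table above, mapped to the `language` config value.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".java": "java",
    ".cs": "csharp",
    ".ts": "typescript",
    ".tsx": "typescript",
}


def language_for_path(path):
    """Return the language name for a file path, or None if the extension is unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


for name in ["src/calculator.py", "src/Calculator.java", "app/View.tsx", "README.md"]:
    print(name, "->", language_for_path(name))
```

Files that map to None can be skipped or handled with a different chunking strategy.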

License

This project is licensed under the MIT License - see the LICENSE file for details.

Version

Current version: 0.1.0
