# 🧩 Semantic Code Chunker

> Semantic code chunking for LLM processing with policy-based configuration
## 📋 Table of Contents

- Overview
- Features
- Installation
- Quick Start
- Supported Languages
- Configuration
- API Reference
- Advanced Usage
- Architecture
- Contributing
- License
## 🎯 Overview

Semantic Code Chunker is a Python library that intelligently splits source code into meaningful chunks based on Abstract Syntax Tree (AST) analysis. Unlike naive character-based chunking, this library understands the structure of your code and ensures that:

- ✅ Functions are not split in half
- ✅ Classes stay together with their methods
- ✅ Context is preserved for LLM understanding
- ✅ Chunking behavior is configurable through per-language policies

Perfect for RAG systems, code analysis tools, and AI-powered development assistants.
## ✨ Features

| Feature | Description |
|---|---|
| 🌳 AST-Based Parsing | Uses tree-sitter for accurate code structure understanding |
| 🎯 Semantic Chunking | Chunks by functions, classes, and methods, not arbitrary positions |
| ⚙️ Policy-Driven | YAML configuration per language with fine-grained control |
| 🌍 Multi-Language | Supports 12+ languages, including legacy systems (COBOL, JCL) |
| 🔧 Fallback Mode | Regex-based parsing when tree-sitter is unavailable |
| 📊 Rich Metadata | Each chunk includes type, line numbers, and function names |
| ⚡ High Performance | Optimized for large codebases |
## 📦 Installation

### Using pip

```bash
pip install rag-semantic-chunker
```

### Using uv (recommended, faster)

```bash
uv add rag-semantic-chunker
```

### Development Installation

```bash
git clone https://github.com/zthanos/SemanticCodeChunker.git
cd SemanticCodeChunker
uv venv .venv
source .venv/bin/activate   # Linux/macOS
.venv\Scripts\activate      # Windows
uv pip install -e ".[dev]"
```
## 🚀 Quick Start

### Basic Example

```python
from semantic_chunker import SemanticCodeChunker

# Initialize with policy file
chunker = SemanticCodeChunker("policy.yaml")

# Sample Python code
code = """
def hello_world():
    print("Hello World")

class Calculator:
    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b
"""

# Chunk the code (max 500 chars per chunk for demo)
chunks = chunker.chunk_code(
    language="python",
    source_code=code,
    context_size=500
)

print(f"Created {len(chunks)} chunks:")
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk[:200] + "..." if len(chunk) > 200 else chunk)
```
### Output

```text
Created 2 chunks:

--- Chunk 1 ---
def hello_world():
    print("Hello World")

--- Chunk 2 ---
class Calculator:
    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b
```
## 🌍 Supported Languages

| Language | Parser | Target Nodes | Notes |
|---|---|---|---|
| Python | `python` | functions, classes, decorators | Full support |
| Java | `java` | methods, classes, interfaces | Enterprise ready |
| JavaScript | `javascript` | functions, arrow funcs, classes | ES6+ supported |
| TypeScript | `typescript` | all JS + types, interfaces | Type-aware |
| C | `c` | functions, structs, enums | Systems programming |
| C++ | `cpp` | classes, templates, namespaces | OOP features |
| C# | `c_sharp` | methods, properties, events | .NET ecosystem |
| COBOL | `cobol` | paragraphs, sections | Legacy/mainframe ⚠️ |
| JCL | `jcl` | jobs, EXEC, DD statements | Mainframe jobs ⚠️ |
| CLIST | `clist` | procedures, functions | TSO/E commands ⚠️ |
| Visual Basic | `vbnet` | subs, functions, classes | VB.NET support |
| Elixir | `elixir` | def, modules, protocols | Functional paradigm |

> ⚠️ **Note:** Legacy languages (COBOL, JCL, CLIST) fall back to regex-based parsing when tree-sitter grammars are unavailable.
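For legacy languages, the fallback idea is to match structural markers with regular expressions instead of a grammar. A minimal standalone illustration of that idea (not the library's actual implementation) for COBOL paragraphs:

```python
import re

# Simplified assumption: a paragraph header is an unindented name
# followed by a period (real COBOL area A/B rules are stricter).
PARAGRAPH_RE = re.compile(r'^([A-Z0-9][A-Z0-9-]*)\.\s*$', re.MULTILINE)

def chunk_cobol_paragraphs(source: str) -> list[tuple[str, str]]:
    """Split COBOL source into (paragraph_name, body) chunks."""
    matches = list(PARAGRAPH_RE.finditer(source))
    chunks = []
    for i, m in enumerate(matches):
        # Each paragraph runs until the next header (or end of file)
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        chunks.append((m.group(1), source[m.start():end].strip()))
    return chunks

cobol = """\
MAIN-LOGIC.
    PERFORM INIT-VALUES.
    STOP RUN.
INIT-VALUES.
    MOVE 0 TO WS-COUNTER.
"""

for name, body in chunk_cobol_paragraphs(cobol):
    print(name, "->", len(body.splitlines()), "lines")
```

Each chunk stays aligned to a paragraph boundary, which is the same guarantee the AST path gives for functions and classes.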
## ⚙️ Configuration

### Policy File Structure

Create a `policy.yaml` file to configure chunking behavior:

```yaml
defaults:
  max_context_size: 2048   # Default chars per chunk
  min_chunk_size: 100      # Minimum chunk size
  include_comments: true   # Include comments in chunks

languages:
  python:
    parser_name: "python"
    target_nodes:
      - "function_definition"
      - "class_definition"
      - "method_definition"
    metadata_fields:
      - "node_type"
      - "line_number"
      - "function_name"
  java:
    parser_name: "java"
    target_nodes:
      - "method_declaration"
      - "class_declaration"
```
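A natural reading of this layout is that per-language sections inherit from `defaults` for any key they omit. A sketch of that merge logic over the parsed form of the policy above (`resolve_language` is illustrative, not the library's API):

```python
# Parsed form of the policy.yaml above, as a YAML loader would produce it
policy = {
    "defaults": {
        "max_context_size": 2048,
        "min_chunk_size": 100,
        "include_comments": True,
    },
    "languages": {
        "python": {
            "parser_name": "python",
            "target_nodes": ["function_definition", "class_definition",
                             "method_definition"],
        },
        "java": {
            "parser_name": "java",
            "target_nodes": ["method_declaration", "class_declaration"],
        },
    },
}

def resolve_language(policy: dict, language: str) -> dict:
    """Merge defaults with a language section; language keys win."""
    merged = dict(policy.get("defaults", {}))
    merged.update(policy.get("languages", {}).get(language, {}))
    return merged

java_cfg = resolve_language(policy, "java")
print(java_cfg["max_context_size"])  # inherited from defaults
print(java_cfg["parser_name"])       # set in the java section
```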
### Programmatic Configuration

```python
from semantic_chunker import SemanticCodeChunker
from semantic_chunker.config import PolicyConfig, LanguageConfig

# Create a custom configuration
config = PolicyConfig(
    max_context_size=4096,
    languages={
        "python": LanguageConfig(
            parser_name="python",
            target_nodes=["function_definition", "class_definition"]
        )
    }
)

# Pass the config directly instead of a policy file path
chunker = SemanticCodeChunker(config=config)
```
## 📚 API Reference

### SemanticCodeChunker

Main class for chunking code.

#### Constructor

```python
def __init__(self, policy_path: str | None = None, config: PolicyConfig | None = None):
    """
    Initialize the chunker.

    Args:
        policy_path: Path to YAML policy file (optional)
        config: Direct PolicyConfig object (optional)

    Raises:
        FileNotFoundError: If policy_path doesn't exist
        ValueError: If the configuration is invalid
    """
```
#### chunk_code() Method

```python
def chunk_code(
    self,
    language: str,
    source_code: str,
    context_size: int | None = None
) -> List[str]:
    """
    Chunk source code semantically.

    Args:
        language: Language identifier (e.g., 'python', 'java')
        source_code: Source code string to chunk
        context_size: Override the default max chunk size

    Returns:
        List of chunk strings respecting semantic boundaries

    Example:
        >>> chunks = chunker.chunk_code("python", code, context_size=2048)
        >>> len(chunks)  # Number of chunks created
        5
    """
```
### ParsedChunk Data Class

Detailed chunk information with metadata.

```python
@dataclass
class ParsedChunk:
    code: str                # The actual code content
    metadata: ChunkMetadata  # Metadata about the chunk
```

```python
# Accessing metadata
chunk = chunks[0]
print(chunk.metadata.node_type)      # "function_definition"
print(chunk.metadata.line_number)    # 42
print(chunk.metadata.function_name)  # "calculate_sum"
```
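`ChunkMetadata` itself is not shown above; a plausible shape, inferred only from the fields these examples access (this is a sketch, not the library's definition):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkMetadata:
    node_type: str                        # e.g. "function_definition"
    line_number: int                      # first source line of the chunk
    end_line: int                         # last source line of the chunk
    function_name: Optional[str] = None   # None for non-function nodes

# Hypothetical instance matching the access patterns above
meta = ChunkMetadata("function_definition", 42, 57, "calculate_sum")
print(meta.function_name)
```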
## 🔬 Advanced Usage

### Getting Detailed Chunk Information

```python
from semantic_chunker.parser import parse_and_chunk

# Get chunks with full metadata
chunks = parse_and_chunk(
    source_code=code,
    language="python",
    target_nodes=["function_definition", "class_definition"],
    max_context_size=2048
)

for chunk in chunks:
    print(f"Type: {chunk.metadata.node_type}")
    print(f"Lines: {chunk.metadata.line_number}-{chunk.metadata.end_line}")
    print(f"Function: {chunk.metadata.function_name}")
    print(f"Code:\n{chunk.code}\n")
```
### Processing Large Files

```python
from typing import List

from semantic_chunker import SemanticCodeChunker

def process_large_file(file_path: str, language: str) -> List[str]:
    """Read and chunk a large source file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        source_code = f.read()

    chunker = SemanticCodeChunker("policy.yaml")

    # Use a larger context size for big files
    chunks = chunker.chunk_code(
        language=language,
        source_code=source_code,
        context_size=4096  # Larger chunks for complex codebases
    )
    return chunks

# Usage
chunks = process_large_file("src/main.py", "python")
```
### Custom Language Support

```python
from typing import Dict

from semantic_chunker.parser import ASTParser

class RustParser(ASTParser):
    """Custom parser for Rust (not yet in tree-sitter-languages)."""

    def _get_fallback_patterns(self) -> Dict[str, str]:
        return {
            "function": r'\bfn\s+(\w+)\s*\(',
            "struct": r'\bstruct\s+(\w+)',
            "impl_block": r'\bimpl\s+',
        }

# Use the custom parser
rust_parser = RustParser("rust")
```
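To see what those fallback patterns actually match, you can run them over a Rust snippet with plain `re` (a standalone illustration; none of the `ASTParser` machinery is needed for this):

```python
import re

# Same patterns as in RustParser._get_fallback_patterns above
patterns = {
    "function": r'\bfn\s+(\w+)\s*\(',
    "struct": r'\bstruct\s+(\w+)',
    "impl_block": r'\bimpl\s+',
}

rust_code = """
struct Point { x: f64, y: f64 }

impl Point {
    fn norm(&self) -> f64 {
        (self.x * self.x + self.y * self.y).sqrt()
    }
}
"""

for kind, pattern in patterns.items():
    for m in re.finditer(pattern, rust_code):
        # impl_block has no capture group, so fall back to a placeholder
        name = m.group(1) if m.groups() else "<anonymous>"
        print(f"{kind}: {name} at offset {m.start()}")
```

Each match offset marks a candidate chunk boundary, which is all the fallback mode needs to split the file.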
## 🏗️ Architecture

### System Diagram

```text
┌──────────────────────────────────────────────────────────┐
│                  Semantic Code Chunker                   │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │    Policy    │   │     AST      │   │    Chunk     │  │
│  │    Config    │──▶│    Parser    │──▶│  Extractor   │  │
│  │ (policy.yaml)│   │ (tree-sitter)│   │              │  │
│  └──────────────┘   └──────────────┘   └──────────────┘  │
│         │                  │                  │          │
│         ▼                  ▼                  ▼          │
│  ┌────────────────────────────────────────────────────┐  │
│  │                SemanticCodeChunker                 │  │
│  │                                                    │  │
│  │  chunk_code(language, source_code, context_size)   │  │
│  └────────────────────────────────────────────────────┘  │
│                            │                             │
│                            ▼                             │
│                 ┌─────────────────────┐                  │
│                 │  List[ParsedChunk]  │                  │
│                 │                     │                  │
│                 │  - code: str        │                  │
│                 │  - metadata         │                  │
│                 └─────────────────────┘                  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
### Data Flow

1. **Load Policy** → read the YAML configuration for the target language
2. **Parse AST** → tree-sitter creates the Abstract Syntax Tree
3. **Query Nodes** → extract functions, classes, and methods based on the policy
4. **Group Chunks** → combine nodes while respecting the context_size limit
5. **Extract Metadata** → add line numbers, names, and types to each chunk
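Step 4 can be sketched as a greedy pass: keep appending node texts to the current chunk until adding the next one would exceed the limit, then start a new chunk (illustrative only; the library's actual grouping strategy may differ):

```python
def group_chunks(nodes: list[str], max_context_size: int) -> list[str]:
    """Greedily pack semantic node texts into chunks under a size limit."""
    chunks: list[str] = []
    current = ""
    for node in nodes:
        candidate = f"{current}\n\n{node}" if current else node
        if current and len(candidate) > max_context_size:
            chunks.append(current)  # flush: the next node would overflow
            current = node
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

nodes = ["def a(): pass", "def b(): pass", "class C:\n    pass"]
print(group_chunks(nodes, max_context_size=30))
```

Because whole nodes are the unit of packing, no function or class is ever split mid-body; an oversized single node simply becomes its own chunk.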
## 🤝 Contributing

We welcome contributions! Here's how you can help:

### Setting Up the Development Environment

```bash
# Clone the repository
git clone https://github.com/zthanos/SemanticCodeChunker.git
cd SemanticCodeChunker

# Create a virtual environment
uv venv .venv
source .venv/bin/activate  # Linux/macOS

# Install with dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black src/semantic_chunker tests/

# Type checking
mypy src/semantic_chunker
```
### Adding Support for a New Language

1. Add the parser name to `LANGUAGE_PARSERS` in `parser.py`
2. Add a default config in `config.py` → `get_default_languages()`
3. Add target nodes to the example `policy.yaml`
4. Write tests in `tests/test_<language>.py`
### Running Tests

```bash
# All tests
pytest

# Specific test file
pytest tests/test_python_chunking.py

# With coverage
pytest --cov=src/semantic_chunker --cov-report=html

# Verbose output
pytest -v
```
## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 📞 Support & Contact

- Issues: GitHub Issues
- Discussions: GitHub Discussions

⭐ Star this repo if you find it helpful!