Semantic code chunking for LLM processing with policy-based configuration

🧩 Semantic Code Chunker

Semantic code chunking library optimized for LLM processing and RAG systems

PyPI version · License: MIT · Python 3.9+ · Code style: black


🎯 Overview

Semantic Code Chunker is a Python library that intelligently splits source code into meaningful chunks based on Abstract Syntax Tree (AST) analysis. Unlike naive character-based chunking, this library understands the structure of your code and ensures that:

  • ✅ Functions are not split in half
  • ✅ Classes stay together with their methods
  • ✅ Context is preserved for LLM understanding
  • ✅ Chunking behavior is configurable per language

Perfect for RAG systems, code analysis tools, and AI-powered development assistants.
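To make the contrast concrete, here is a small standalone demonstration (independent of the library; `naive_split` is a hypothetical helper, not part of its API) of how fixed-width chunking cuts a function in half:

```python
# Naive fixed-width chunking: split every N characters, ignoring structure.
code = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def sub(a, b):\n"
    "    return a - b\n"
)

def naive_split(text: str, size: int) -> list:
    """Cut text into fixed-size pieces with no regard for code structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

pieces = naive_split(code, 20)
# The first piece ends mid-statement, inside the body of add():
print(repr(pieces[0]))
```

A semantic chunker instead emits each definition as its own chunk, as the Quick Start shows.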


✨ Features

Feature | Description
🌳 AST-Based Parsing | Uses tree-sitter for accurate code structure understanding
🎯 Semantic Chunking | Chunks by functions, classes, and methods, not arbitrary positions
⚙️ Policy-Driven | YAML configuration per language with fine-grained control
🌍 Multi-Language | Supports 12+ languages, including legacy systems (COBOL, JCL)
🔧 Fallback Mode | Regex-based parsing when tree-sitter is unavailable
📊 Rich Metadata | Each chunk includes type, line numbers, and function names
⚡ High Performance | Optimized for large codebases

📦 Installation

Using pip

pip install rag-semantic-chunker

Using uv (recommended, faster)

uv add rag-semantic-chunker

Development Installation

git clone https://github.com/zthanos/SemanticCodeChunker.git
cd SemanticCodeChunker
uv venv .venv
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows
uv pip install -e ".[dev]"

🚀 Quick Start

Basic Example

from semantic_chunker import SemanticCodeChunker

# Initialize with policy file
chunker = SemanticCodeChunker("policy.yaml")

# Sample Python code
code = """
def hello_world():
    print("Hello World")

class Calculator:
    def add(self, a, b):
        return a + b
    
    def subtract(self, a, b):
        return a - b
"""

# Chunk the code (max 500 chars per chunk for demo)
chunks = chunker.chunk_code(
    language="python", 
    source_code=code,
    context_size=500
)

print(f"Created {len(chunks)} chunks:")
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk[:200] + "..." if len(chunk) > 200 else chunk)

Output

Created 2 chunks:

--- Chunk 1 ---
def hello_world():
    print("Hello World")

--- Chunk 2 ---
class Calculator:
    def add(self, a, b):
        return a + b
    
    def subtract(self, a, b):
        return a - b

🌍 Supported Languages

Language | Parser | Target Nodes | Notes
Python | python | functions, classes, decorators | Full support
Java | java | methods, classes, interfaces | Enterprise ready
JavaScript | javascript | functions, arrow funcs, classes | ES6+ supported
TypeScript | typescript | all JS + types, interfaces | Type-aware
C | c | functions, structs, enums | Systems programming
C++ | cpp | classes, templates, namespaces | OOP features
C# | c_sharp | methods, properties, events | .NET ecosystem
COBOL | cobol | paragraphs, sections | Legacy/mainframe ⚠️
JCL | jcl | jobs, EXEC, DD statements | Mainframe jobs ⚠️
CLIST | clist | procedures, functions | TSO/E commands ⚠️
Visual Basic | vbnet | subs, functions, classes | VB.NET support
Elixir | elixir | def, modules, protocols | Functional paradigm

โš ๏ธ Note: Legacy languages (COBOL, JCL, CLIST) use fallback regex parsing when tree-sitter grammars unavailable.


⚙️ Configuration

Policy File Structure

Create a policy.yaml file to configure chunking behavior:

defaults:
  max_context_size: 2048      # Default chars per chunk
  min_chunk_size: 100         # Minimum chunk size
  include_comments: true      # Include comments in chunks
  
languages:
  python:
    parser_name: "python"
    target_nodes:
      - "function_definition"
      - "class_definition"
      - "method_definition"
    
    metadata_fields:
      - "node_type"
      - "line_number"
      - "function_name"
      
  java:
    parser_name: "java"
    target_nodes:
      - "method_declaration"
      - "class_declaration"
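Because the policy is plain data, it can be sanity-checked before constructing the chunker. The sketch below operates on the dict you would get from loading the YAML above (e.g. with `yaml.safe_load`); `validate_policy` is a hypothetical helper, not part of the library's API:

```python
# A policy as it looks after loading policy.yaml (e.g. via yaml.safe_load).
policy = {
    "defaults": {"max_context_size": 2048, "min_chunk_size": 100},
    "languages": {
        "python": {
            "parser_name": "python",
            "target_nodes": ["function_definition", "class_definition"],
        },
        "java": {
            "parser_name": "java",
            "target_nodes": ["method_declaration", "class_declaration"],
        },
    },
}

def validate_policy(policy: dict) -> list:
    """Return a list of problems; an empty list means the policy looks usable."""
    problems = []
    for name, lang in policy.get("languages", {}).items():
        if not lang.get("parser_name"):
            problems.append(f"{name}: missing parser_name")
        if not lang.get("target_nodes"):
            problems.append(f"{name}: no target_nodes listed")
    return problems

print(validate_policy(policy))  # an empty list for the policy above
```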

Programmatic Configuration

from semantic_chunker.config import PolicyConfig, LanguageConfig

# Create custom configuration
config = PolicyConfig(
    max_context_size=4096,
    languages={
        "python": LanguageConfig(
            parser_name="python",
            target_nodes=["function_definition", "class_definition"]
        )
    }
)

chunker = SemanticCodeChunker(config=config)  # Pass config directly

📚 API Reference

SemanticCodeChunker

Main class for chunking code.

Constructor

def __init__(self, policy_path: str | None = None, config: PolicyConfig | None = None):
    """
    Initialize the chunker.
    
    Args:
        policy_path: Path to YAML policy file (optional)
        config: Direct PolicyConfig object (optional)
        
    Raises:
        FileNotFoundError: If policy_path doesn't exist
        ValueError: If configuration is invalid
    """

chunk_code() Method

def chunk_code(
    self, 
    language: str, 
    source_code: str, 
    context_size: int | None = None
) -> List[str]:
    """
    Chunk source code semantically.
    
    Args:
        language: Language identifier (e.g., 'python', 'java')
        source_code: Source code string to chunk
        context_size: Override default max chunk size
        
    Returns:
        List of chunk strings respecting semantic boundaries
        
    Example:
        >>> chunks = chunker.chunk_code("python", code, context_size=2048)
        >>> len(chunks)  # Number of chunks created
        5
    """

ParsedChunk Data Class

Detailed chunk information with metadata.

@dataclass
class ParsedChunk:
    code: str                    # The actual code content
    metadata: ChunkMetadata      # Metadata about the chunk
    
# Accessing metadata (ParsedChunk objects are returned by
# parse_and_chunk; see Advanced Usage)
chunk = chunks[0]
print(chunk.metadata.node_type)     # "function_definition"
print(chunk.metadata.line_number)   # 42
print(chunk.metadata.function_name) # "calculate_sum"

🔬 Advanced Usage

Getting Detailed Chunk Information

from semantic_chunker.parser import parse_and_chunk

# Get chunks with full metadata
chunks = parse_and_chunk(
    source_code=code,
    language="python",
    target_nodes=["function_definition", "class_definition"],
    max_context_size=2048
)

for chunk in chunks:
    print(f"Type: {chunk.metadata.node_type}")
    print(f"Lines: {chunk.metadata.line_number}-{chunk.metadata.end_line}")
    print(f"Function: {chunk.metadata.function_name}")
    print(f"Code:\n{chunk.code}\n")

Processing Large Files

from typing import List

def process_large_file(file_path: str, language: str) -> List[str]:
    """Read an entire file into memory and chunk it semantically."""
    
    with open(file_path, 'r', encoding='utf-8') as f:
        source_code = f.read()
        
    chunker = SemanticCodeChunker("policy.yaml")
    
    # Process with larger context for big files
    chunks = chunker.chunk_code(
        language=language,
        source_code=source_code,
        context_size=4096  # Larger chunks for complex codebases
    )
    
    return chunks

# Usage
chunks = process_large_file("src/main.py", "python")

Custom Language Support

from typing import Dict

from semantic_chunker.parser import ASTParser

class RustParser(ASTParser):
    """Custom parser for Rust (not yet in tree-sitter-languages)."""
    
    def _get_fallback_patterns(self) -> Dict[str, str]:
        return {
            "function": r'\bfn\s+(\w+)\s*\(',
            "struct": r'\bstruct\s+(\w+)',
            "impl_block": r'\bimpl\s+',
        }

# Use custom parser
rust_parser = RustParser("rust")
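Outside the library, fallback parsing of this kind is ordinary regex scanning. This standalone snippet applies the same illustrative patterns to a small piece of Rust to show what they would match (a sketch, not a complete Rust grammar):

```python
import re

# The same illustrative fallback patterns used by RustParser above.
FALLBACK_PATTERNS = {
    "function": r"\bfn\s+(\w+)\s*\(",
    "struct": r"\bstruct\s+(\w+)",
}

rust_code = """
struct Point { x: i64, y: i64 }

fn add(a: i64, b: i64) -> i64 { a + b }
"""

# Collect (kind, name) pairs for every pattern hit.
matches = []
for kind, pattern in FALLBACK_PATTERNS.items():
    for m in re.finditer(pattern, rust_code):
        matches.append((kind, m.group(1)))

print(sorted(matches))
```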

๐Ÿ—๏ธ Architecture

System Diagram

┌──────────────────────────────────────────────────────────────┐
│                    Semantic Code Chunker                     │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    │
│  │   Policy     │    │     AST      │    │    Chunk     │    │
│  │   Config     │───▶│    Parser    │───▶│  Extractor   │    │
│  │ (policy.yaml)│    │(tree-sitter) │    │              │    │
│  └──────────────┘    └──────────────┘    └──────────────┘    │
│         │                   │                   │            │
│         ▼                   ▼                   ▼            │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                  SemanticCodeChunker                   │  │
│  │                                                        │  │
│  │    chunk_code(language, source_code, context_size)     │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│                              ▼                               │
│                    ┌───────────────────┐                     │
│                    │ List[ParsedChunk] │                     │
│                    │  - code: str      │                     │
│                    │  - metadata       │                     │
│                    └───────────────────┘                     │
└──────────────────────────────────────────────────────────────┘

Data Flow

  1. Load Policy → Read the YAML configuration for the target language
  2. Parse AST → tree-sitter builds an Abstract Syntax Tree
  3. Query Nodes → Extract functions, classes, and methods based on the policy
  4. Group Chunks → Combine nodes while respecting the context_size limit
  5. Extract Metadata → Add line numbers, names, and types to each chunk
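Step 4 can be pictured as a greedy packing pass: append whole nodes to the current chunk until adding one more would exceed the limit, then start a new chunk. This standalone function illustrates the idea; it is a sketch, not the library's actual implementation:

```python
def group_chunks(nodes, max_context_size):
    """Greedily pack whole nodes into chunks, never splitting a node."""
    chunks, current = [], ""
    for node in nodes:
        candidate = current + "\n\n" + node if current else node
        if current and len(candidate) > max_context_size:
            chunks.append(current)  # current chunk is full; start a new one
            current = node
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

nodes = ["def a():\n    pass", "def b():\n    pass", "def c():\n    pass"]
print(group_chunks(nodes, max_context_size=40))
```

Note that a node larger than the limit still becomes its own chunk, which is exactly the "functions are not split in half" guarantee.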

🤝 Contributing

We welcome contributions! Here's how you can help:

Setting Up Development Environment

# Clone repository
git clone https://github.com/zthanos/SemanticCodeChunker.git
cd SemanticCodeChunker

# Create virtual environment
uv venv .venv
source .venv/bin/activate  # Linux/macOS

# Install with dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black src/semantic_chunker tests/

# Type checking
mypy src/semantic_chunker

Adding Support for a New Language

  1. Add the parser name to LANGUAGE_PARSERS in parser.py
  2. Add a default config in config.py → get_default_languages()
  3. Add target nodes to the example policy.yaml
  4. Write tests in tests/test_<language>.py

Running Tests

# All tests
pytest

# Specific test file
pytest tests/test_python_chunking.py

# With coverage
pytest --cov=src/semantic_chunker --cov-report=html

# Verbose output
pytest -v

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


📞 Support & Contact


Made with ❤️ for the AI and Developer Community

⭐ Star this repo if you find it helpful!
