Semantic code chunking for LLM processing with policy-based configuration

🧩 Semantic Code Chunker

Semantic code chunking library optimized for LLM processing and RAG systems

PyPI version · License: MIT · Python 3.9+ · Code style: black


🎯 Overview

Semantic Code Chunker is a Python library that intelligently splits source code into meaningful chunks based on Abstract Syntax Tree (AST) analysis. Unlike naive character-based chunking, this library understands the structure of your code and ensures that:

  • ✅ Functions are not split in half
  • ✅ Classes stay together with their methods
  • ✅ Context is preserved for LLM understanding
  • ✅ Chunking behavior is configurable per language

Perfect for RAG systems, code analysis tools, and AI-powered development assistants.
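To make the contrast concrete, here is a small standalone demonstration (independent of the library; `naive_split` is a hypothetical helper, not part of its API) of how fixed-width chunking cuts a function in half:

```python
# Naive fixed-width chunking: split every N characters, ignoring structure.
code = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def sub(a, b):\n"
    "    return a - b\n"
)

def naive_split(text: str, size: int) -> list:
    """Cut text into fixed-size pieces with no regard for code structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

pieces = naive_split(code, 20)
# The first piece ends mid-statement, inside the body of add():
print(repr(pieces[0]))
```

A semantic chunker instead emits each definition as its own chunk, as the Quick Start shows.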


✨ Features

Feature | Description
🌳 AST-Based Parsing | Uses tree-sitter for accurate code structure understanding
🎯 Semantic Chunking | Chunks by functions, classes, and methods, not arbitrary positions
⚙️ Policy-Driven | YAML configuration per language with fine-grained control
🌍 Multi-Language | Supports 12+ languages, including legacy systems (COBOL, JCL)
🔧 Fallback Mode | Regex-based parsing when tree-sitter is unavailable
📊 Rich Metadata | Each chunk includes type, line numbers, and function names
⚡ High Performance | Optimized for large codebases

📦 Installation

Using pip

pip install rag-semantic-chunker

Using uv (recommended, faster)

uv add rag-semantic-chunker

Development Installation

git clone https://github.com/zthanos/SemanticCodeChunker.git
cd SemanticCodeChunker
uv venv .venv
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows
uv pip install -e ".[dev]"

🚀 Quick Start

Basic Example

from semantic_chunker import SemanticCodeChunker

# Initialize with policy file
chunker = SemanticCodeChunker("policy.yaml")

# Sample Python code
code = """
def hello_world():
    print("Hello World")

class Calculator:
    def add(self, a, b):
        return a + b
    
    def subtract(self, a, b):
        return a - b
"""

# Chunk the code (max 500 chars per chunk for demo)
chunks = chunker.chunk_code(
    language="python", 
    source_code=code,
    context_size=500
)

print(f"Created {len(chunks)} chunks:")
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk[:200] + "..." if len(chunk) > 200 else chunk)

Output

Created 2 chunks:

--- Chunk 1 ---
def hello_world():
    print("Hello World")

--- Chunk 2 ---
class Calculator:
    def add(self, a, b):
        return a + b
    
    def subtract(self, a, b):
        return a - b

🌍 Supported Languages

Language | Parser | Target Nodes | Notes
Python | python | functions, classes, decorators | Full support
Java | java | methods, classes, interfaces | Enterprise ready
JavaScript | javascript | functions, arrow funcs, classes | ES6+ supported
TypeScript | typescript | all JS + types, interfaces | Type-aware
C | c | functions, structs, enums | Systems programming
C++ | cpp | classes, templates, namespaces | OOP features
C# | c_sharp | methods, properties, events | .NET ecosystem
COBOL | cobol | paragraphs, sections | Legacy/mainframe ⚠️
JCL | jcl | jobs, EXEC, DD statements | Mainframe jobs ⚠️
CLIST | clist | procedures, functions | TSO/E commands ⚠️
Visual Basic | vbnet | subs, functions, classes | VB.NET support
Elixir | elixir | def, modules, protocols | Functional paradigm

โš ๏ธ Note: Legacy languages (COBOL, JCL, CLIST) use fallback regex parsing when tree-sitter grammars unavailable.


⚙️ Configuration

Policy File Structure

Create a policy.yaml file to configure chunking behavior:

defaults:
  max_context_size: 2048      # Default chars per chunk
  min_chunk_size: 100         # Minimum chunk size
  include_comments: true      # Include comments in chunks
  
languages:
  python:
    parser_name: "python"
    target_nodes:
      - "function_definition"
      - "class_definition"
      - "method_definition"
    
    metadata_fields:
      - "node_type"
      - "line_number"
      - "function_name"
      
  java:
    parser_name: "java"
    target_nodes:
      - "method_declaration"
      - "class_declaration"
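Because the policy is plain data, it can be sanity-checked before constructing the chunker. The sketch below operates on the dict you would get from loading the YAML above (e.g. with `yaml.safe_load`); `validate_policy` is a hypothetical helper, not part of the library's API:

```python
# A policy as it looks after loading policy.yaml (e.g. via yaml.safe_load).
policy = {
    "defaults": {"max_context_size": 2048, "min_chunk_size": 100},
    "languages": {
        "python": {
            "parser_name": "python",
            "target_nodes": ["function_definition", "class_definition"],
        },
        "java": {
            "parser_name": "java",
            "target_nodes": ["method_declaration", "class_declaration"],
        },
    },
}

def validate_policy(policy: dict) -> list:
    """Return a list of problems; an empty list means the policy looks usable."""
    problems = []
    for name, lang in policy.get("languages", {}).items():
        if not lang.get("parser_name"):
            problems.append(f"{name}: missing parser_name")
        if not lang.get("target_nodes"):
            problems.append(f"{name}: no target_nodes listed")
    return problems

print(validate_policy(policy))  # an empty list for the policy above
```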

Programmatic Configuration

from semantic_chunker.config import PolicyConfig, LanguageConfig

# Create custom configuration
config = PolicyConfig(
    max_context_size=4096,
    languages={
        "python": LanguageConfig(
            parser_name="python",
            target_nodes=["function_definition", "class_definition"]
        )
    }
)

chunker = SemanticCodeChunker(config=config)  # Pass config directly

📚 API Reference

SemanticCodeChunker

Main class for chunking code.

Constructor

def __init__(self, policy_path: str | None = None, config: PolicyConfig | None = None):
    """
    Initialize the chunker.
    
    Args:
        policy_path: Path to YAML policy file (optional)
        config: Direct PolicyConfig object (optional)
        
    Raises:
        FileNotFoundError: If policy_path doesn't exist
        ValueError: If configuration is invalid
    """

chunk_code() Method

def chunk_code(
    self, 
    language: str, 
    source_code: str, 
    context_size: int | None = None
) -> List[str]:
    """
    Chunk source code semantically.
    
    Args:
        language: Language identifier (e.g., 'python', 'java')
        source_code: Source code string to chunk
        context_size: Override default max chunk size
        
    Returns:
        List of chunk strings respecting semantic boundaries
        
    Example:
        >>> chunks = chunker.chunk_code("python", code, context_size=2048)
        >>> len(chunks)  # Number of chunks created
        5
    """

ParsedChunk Data Class

Detailed chunk information with metadata.

@dataclass
class ParsedChunk:
    code: str                    # The actual code content
    metadata: ChunkMetadata      # Metadata about the chunk
    
# Accessing metadata (ParsedChunk objects are returned by
# parse_and_chunk; see Advanced Usage)
chunk = chunks[0]
print(chunk.metadata.node_type)     # "function_definition"
print(chunk.metadata.line_number)   # 42
print(chunk.metadata.function_name) # "calculate_sum"

🔬 Advanced Usage

Getting Detailed Chunk Information

from semantic_chunker.parser import parse_and_chunk

# Get chunks with full metadata
chunks = parse_and_chunk(
    source_code=code,
    language="python",
    target_nodes=["function_definition", "class_definition"],
    max_context_size=2048
)

for chunk in chunks:
    print(f"Type: {chunk.metadata.node_type}")
    print(f"Lines: {chunk.metadata.line_number}-{chunk.metadata.end_line}")
    print(f"Function: {chunk.metadata.function_name}")
    print(f"Code:\n{chunk.code}\n")

Processing Large Files

from typing import List

def process_large_file(file_path: str, language: str) -> List[str]:
    """Read an entire file into memory and chunk it semantically."""
    
    with open(file_path, 'r', encoding='utf-8') as f:
        source_code = f.read()
        
    chunker = SemanticCodeChunker("policy.yaml")
    
    # Process with larger context for big files
    chunks = chunker.chunk_code(
        language=language,
        source_code=source_code,
        context_size=4096  # Larger chunks for complex codebases
    )
    
    return chunks

# Usage
chunks = process_large_file("src/main.py", "python")

Custom Language Support

from typing import Dict

from semantic_chunker.parser import ASTParser

class RustParser(ASTParser):
    """Custom parser for Rust (not yet in tree-sitter-languages)."""
    
    def _get_fallback_patterns(self) -> Dict[str, str]:
        return {
            "function": r'\bfn\s+(\w+)\s*\(',
            "struct": r'\bstruct\s+(\w+)',
            "impl_block": r'\bimpl\s+',
        }

# Use custom parser
rust_parser = RustParser("rust")
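Outside the library, fallback parsing of this kind is ordinary regex scanning. This standalone snippet applies the same illustrative patterns to a small piece of Rust to show what they would match (a sketch, not a complete Rust grammar):

```python
import re

# The same illustrative fallback patterns used by RustParser above.
FALLBACK_PATTERNS = {
    "function": r"\bfn\s+(\w+)\s*\(",
    "struct": r"\bstruct\s+(\w+)",
}

rust_code = """
struct Point { x: i64, y: i64 }

fn add(a: i64, b: i64) -> i64 { a + b }
"""

# Collect (kind, name) pairs for every pattern hit.
matches = []
for kind, pattern in FALLBACK_PATTERNS.items():
    for m in re.finditer(pattern, rust_code):
        matches.append((kind, m.group(1)))

print(sorted(matches))
```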

๐Ÿ—๏ธ Architecture

System Diagram

┌──────────────────────────────────────────────────────────────┐
│                    Semantic Code Chunker                     │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    │
│  │   Policy     │    │     AST      │    │    Chunk     │    │
│  │   Config     │───▶│    Parser    │───▶│  Extractor   │    │
│  │ (policy.yaml)│    │(tree-sitter) │    │              │    │
│  └──────────────┘    └──────────────┘    └──────────────┘    │
│         │                   │                   │            │
│         ▼                   ▼                   ▼            │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                  SemanticCodeChunker                   │  │
│  │                                                        │  │
│  │    chunk_code(language, source_code, context_size)     │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│                              ▼                               │
│                    ┌───────────────────┐                     │
│                    │ List[ParsedChunk] │                     │
│                    │  - code: str      │                     │
│                    │  - metadata       │                     │
│                    └───────────────────┘                     │
└──────────────────────────────────────────────────────────────┘

Data Flow

  1. Load Policy → Read the YAML configuration for the target language
  2. Parse AST → tree-sitter builds an Abstract Syntax Tree
  3. Query Nodes → Extract functions, classes, and methods based on the policy
  4. Group Chunks → Combine nodes while respecting the context_size limit
  5. Extract Metadata → Add line numbers, names, and types to each chunk
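Step 4 can be pictured as a greedy packing pass: append whole nodes to the current chunk until adding one more would exceed the limit, then start a new chunk. This standalone function illustrates the idea; it is a sketch, not the library's actual implementation:

```python
def group_chunks(nodes, max_context_size):
    """Greedily pack whole nodes into chunks, never splitting a node."""
    chunks, current = [], ""
    for node in nodes:
        candidate = current + "\n\n" + node if current else node
        if current and len(candidate) > max_context_size:
            chunks.append(current)  # current chunk is full; start a new one
            current = node
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

nodes = ["def a():\n    pass", "def b():\n    pass", "def c():\n    pass"]
print(group_chunks(nodes, max_context_size=40))
```

Note that a node larger than the limit still becomes its own chunk, which is exactly the "functions are not split in half" guarantee.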

🤝 Contributing

We welcome contributions! Here's how you can help:

Setting Up Development Environment

# Clone repository
git clone https://github.com/zthanos/SemanticCodeChunker.git
cd SemanticCodeChunker

# Create virtual environment
uv venv .venv
source .venv/bin/activate  # Linux/macOS

# Install with dev dependencies
uv pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black src/semantic_chunker tests/

# Type checking
mypy src/semantic_chunker

Adding Support for a New Language

  1. Add the parser name to LANGUAGE_PARSERS in parser.py
  2. Add a default config in config.py → get_default_languages()
  3. Add target nodes to the example policy.yaml
  4. Write tests in tests/test_<language>.py

Running Tests

# All tests
pytest

# Specific test file
pytest tests/test_python_chunking.py

# With coverage
pytest --cov=src/semantic_chunker --cov-report=html

# Verbose output
pytest -v

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


📞 Support & Contact


Made with ❤️ for the AI and Developer Community

⭐ Star this repo if you find it helpful!
