Tree-sitter Chunker
A high-performance semantic code chunker that leverages Tree-sitter parsers to intelligently split source code into meaningful chunks like functions, classes, and methods.
PyPI Release Flow: Published releases come from the GitHub release workflow, not from ordinary main CI runs. See docs/packaging.md and docs/development/RELEASE_CHECKLIST.md.
🏗️ Architecture Overview
Tree-sitter Chunker is designed as a modular, high-performance semantic code analysis system. The following C4-style diagram illustrates how the major components fit together:
```mermaid
flowchart TB
    subgraph "External Systems"
        DEV[👤 Developer<br/>CLI/SDK User]
        AGENT[🤖 LLM/Agent<br/>REST API Consumer]
    end
    subgraph chunker["Tree-sitter Chunker System"]
        subgraph interface["Interface Layer"]
            CLI[CLI Interface<br/>treesitter-chunker]
            API[REST API<br/>FastAPI :8000]
            SDK[Python SDK<br/>import chunker]
        end
        subgraph core["Core Processing"]
            CORE[Core Chunker<br/>chunk_file・chunk_text]
            TOKEN[Token-Aware Chunker<br/>pack_hint・max_tokens]
            REPO[Repository Processor<br/>parallel・git-aware]
        end
        subgraph lang["Language Support"]
            PARSER[Parser Factory<br/>caching・pooling]
            PLUGINS[Language Plugins<br/>36+ built-in]
            GRAMMAR[Grammar Manager<br/>100+ auto-download]
        end
        subgraph graph["Graph & Analysis"]
            XREF[XRef Builder<br/>build_xref]
            CUT[Graph Cut<br/>BFS・scoring]
            META[Metadata Extractor<br/>calls・symbols・complexity]
        end
        subgraph export["Export Layer"]
            PG[(PostgreSQL)]
            NEO[(Neo4j)]
            FILES[JSON・JSONL<br/>Parquet・GraphML]
        end
    end
    subgraph external["External Dependencies"]
        TS[🌳 Tree-sitter<br/>AST Parsing]
        TIK[🔢 tiktoken<br/>Token Counting]
    end
    DEV --> CLI & SDK
    AGENT --> API
    CLI & API & SDK --> CORE
    CORE --> TOKEN
    REPO --> CORE
    CORE --> PARSER
    PARSER --> PLUGINS
    PARSER --> GRAMMAR
    PARSER --> TS
    CORE --> META
    META --> XREF
    XREF --> CUT
    TOKEN --> TIK
    XREF --> PG & NEO & FILES
    CUT --> API
```
Data Flow: From Code to Chunks
```mermaid
flowchart LR
    subgraph input["📥 Input"]
        FILE[Source Files]
        TEXT[Code Text]
    end
    subgraph process["⚙️ Processing Pipeline"]
        PARSE[1. Parse<br/>Tree-sitter AST]
        WALK[2. Walk<br/>Extract Nodes]
        CHUNK[3. Chunk<br/>Create CodeChunk]
        ENRICH[4. Enrich<br/>Metadata・Tokens]
    end
    subgraph output["📤 Output"]
        CHUNKS[CodeChunks<br/>with stable IDs]
        GRAPH[Graph Model<br/>nodes・edges]
        SPANS[Byte Spans<br/>file_id・symbol_id]
    end
    FILE & TEXT --> PARSE --> WALK --> CHUNK --> ENRICH
    ENRICH --> CHUNKS --> GRAPH & SPANS
```
For detailed architecture documentation, see docs/architecture.md.
📊 Performance Benchmarks
Tree-sitter Chunker is designed for high-performance code analysis:
| Metric | Performance | Comparison |
|---|---|---|
| Speed | 11.9x faster with AST caching | vs. repeated parsing |
| Memory | Streaming support for 10GB+ files | vs. loading entire files |
| Languages | 36+ built-in, 100+ auto-download | vs. manual grammar setup |
| Parallel | 8x speedup on 8-core systems | vs. single-threaded |
| Cache Hit | 95%+ for repeated files | vs. no caching |
✨ Key Features
- 🎯 Semantic Understanding - Extracts functions, classes, methods based on AST
- 🚀 Blazing Fast - 11.9x speedup with intelligent AST caching
- 🌍 Universal Language Support - Auto-download and support for 100+ Tree-sitter grammars
- 🔌 Plugin Architecture - Built-in plugins for 29 languages, plus auto-download support for 100+ more covering all major programming languages
- 🎛️ Flexible Configuration - TOML/YAML/JSON config files with per-language settings
- 📊 14 Export Formats - JSON, JSONL, Parquet, CSV, XML, GraphML, Neo4j, DOT, SQLite, PostgreSQL, and more
- ⚡ Parallel Processing - Process entire codebases with configurable workers
- 🌊 Streaming Support - Handle files larger than memory
- 🎨 Rich CLI - Progress bars, batch processing, and filtering
- 🤖 LLM-Ready - Token counting, chunk optimization, and context-aware splitting
- 📝 Text File Support - Markdown, logs, config files with intelligent chunking
- 🔍 Advanced Query - Natural language search across your codebase
- 📈 Graph Export - Visualize code structure in yEd, Neo4j, or Graphviz
- 🐛 Debug Tools - AST visualization, chunk inspection, performance profiling
- 🔧 Developer Tools - Pre-commit hooks, CI/CD generation, quality metrics
- 📦 Multi-Platform Distribution - PyPI, Docker, Homebrew packages
- 🌐 Zero-Configuration - Automatic language detection and grammar download
- 🚀 Production Ready - Prebuilt wheels with embedded grammars, no local compilation required
📦 Installation
Prerequisites
- Python 3.11+ (for Python usage)
- C compiler (for building Tree-sitter grammars - only needed if using languages not included in prebuilt wheels)
Installation Methods
From PyPI (Recommended)
```shell
# Install the latest stable version
pip install treesitter-chunker

# With REST API support
pip install "treesitter-chunker[api]"

# With visualization tools (requires graphviz system package)
pip install "treesitter-chunker[viz]"

# With all optional dependencies
pip install "treesitter-chunker[all]"
```
Using UV (Fast Python Package Manager)
```shell
# Install UV if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the latest stable version
uv pip install treesitter-chunker

# With REST API support
uv pip install "treesitter-chunker[api]"

# With visualization tools
uv pip install "treesitter-chunker[viz]"

# With all optional dependencies
uv pip install "treesitter-chunker[all]"
```
Note: Prebuilt wheels include compiled Tree-sitter grammars for common languages (Python, JavaScript, Rust, C, C++), so no local compilation is required for basic usage.
No Local Builds Required
Current PyPI releases include precompiled Tree-sitter grammars for common languages. This means:
- ✅ Immediate Use: No C compiler or build tools required for basic languages
- ✅ Faster Installation: Wheels install instantly without compilation
- ✅ Consistent Performance: Same grammar versions across all installations
- ✅ Offline Capable: Works without internet access after installation
Supported Languages in Prebuilt Wheels:
- Python, JavaScript, TypeScript, JSX, TSX
- C, C++, Rust
- Additional languages can be built on-demand if needed
🌍 Language Support Matrix
| Language | Status | Plugin | Auto-Download | Prebuilt |
|---|---|---|---|---|
| Python | ✅ Production | ✅ Built-in | ✅ Available | ✅ Included |
| JavaScript/TypeScript | ✅ Production | ✅ Built-in | ✅ Available | ✅ Included |
| Rust | ✅ Production | ✅ Built-in | ✅ Available | ✅ Included |
| C/C++ | ✅ Production | ✅ Built-in | ✅ Available | ✅ Included |
| Go | ✅ Production | ✅ Built-in | ✅ Available | 🔧 Buildable |
| Java | ✅ Production | ✅ Built-in | ✅ Available | 🔧 Buildable |
| Ruby | ✅ Production | ✅ Built-in | ✅ Available | 🔧 Buildable |
| PHP | ✅ Production | ✅ Built-in | ✅ Available | 🔧 Buildable |
| C# | ✅ Production | ✅ Built-in | ✅ Available | 🔧 Buildable |
| Swift | ✅ Production | ✅ Built-in | ✅ Available | 🔧 Buildable |
| Kotlin | ✅ Production | ✅ Built-in | ✅ Available | 🔧 Buildable |
| + 26 more | ✅ Production | ✅ Built-in | ✅ Available | 🔧 Buildable |
Legend: ✅ Production Ready, 🔧 Buildable on-demand, 🚧 Experimental
For Advanced Usage: If you need languages not included in prebuilt wheels, the package can still build them locally using the same build system used during wheel creation.
For Other Languages
See Cross-Language Usage Guide for using from JavaScript, Go, Ruby, etc.
Using Docker
```shell
docker pull ghcr.io/consiliency/treesitter-chunker:latest
docker run -v $(pwd):/workspace treesitter-chunker chunk /workspace/example.py -l python
```
Using Homebrew (macOS/Linux)
```shell
brew tap consiliency/treesitter-chunker
brew install treesitter-chunker
```
For Debian/Ubuntu
```shell
# Download .deb package from releases
sudo dpkg -i python3-treesitter-chunker_1.0.0-1_all.deb
```
For Fedora/RHEL
```shell
# Download .rpm package from releases
sudo rpm -i python-treesitter-chunker-1.0.0-1.noarch.rpm
```
Quick Install (Development)
```shell
# Clone the repository
git clone https://github.com/Consiliency/treesitter-chunker.git
cd treesitter-chunker

# Install with uv (recommended)
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"
uv pip install git+https://github.com/tree-sitter/py-tree-sitter.git

# Build language grammars
python scripts/fetch_grammars.py
python scripts/build_lib.py

# Verify installation
python -c "from chunker.parser import list_languages; print(list_languages())"
# Output: ['c', 'cpp', 'javascript', 'python', 'rust']
```
Grammar Setup
Tree-sitter Chunker requires compiled grammar libraries for parsing. Prebuilt wheels include common languages (Python, JavaScript, Rust), but you can set up additional grammars using the CLI.
CLI Setup (Recommended)
```shell
# Set up default languages (python, javascript, rust)
treesitter-chunker setup grammars

# Set up specific languages
treesitter-chunker setup grammars python go java typescript

# Set up all extended languages (10 common languages)
treesitter-chunker setup grammars --all

# Check setup status
treesitter-chunker setup status

# List all available grammars
treesitter-chunker setup list-available

# Clean up grammar files
treesitter-chunker setup clean --builds  # Remove built libraries only
treesitter-chunker setup clean --all     # Remove sources and builds
```
Environment Configuration
You can customize where grammars are stored:
```shell
# Set custom build directory (persists across installations)
export CHUNKER_GRAMMAR_BUILD_DIR="$HOME/.cache/treesitter-chunker/build"
```
Programmatic Setup
For advanced use cases, you can set up grammars programmatically:
```python
from pathlib import Path
from chunker.grammar.manager import TreeSitterGrammarManager

cache = Path.home() / ".cache" / "treesitter-chunker"
gm = TreeSitterGrammarManager(grammars_dir=cache / "grammars", build_dir=cache / "build")
gm.add_grammar("python", "https://github.com/tree-sitter/tree-sitter-python")
gm.fetch_grammar("python")
gm.build_grammar("python")
```
Requirements for Building Grammars
Building grammars from source requires:
- C compiler (gcc, clang, or MSVC)
- Git (for fetching grammar sources)
- Python development headers (usually included with Python installation)
🚀 Quick Start
Python Usage
```python
from chunker import chunk_file, chunk_text, chunk_directory

# Extract chunks from a Python file
chunks = chunk_file("example.py", "python")

# Or chunk text directly
chunks = chunk_text(code_string, "javascript")

for chunk in chunks:
    print(f"{chunk.node_type} at lines {chunk.start_line}-{chunk.end_line}")
    print(f"  Context: {chunk.parent_context or 'module level'}")
```
Incremental Processing
Efficiently detect changes after edits and update only what changed:
```python
from chunker import DefaultIncrementalProcessor, chunk_file
from pathlib import Path

processor = DefaultIncrementalProcessor()
file_path = Path("example.py")
old_chunks = chunk_file(file_path, "python")
processor.store_chunks(str(file_path), old_chunks)

# ... modify example.py ...

new_chunks = chunk_file(file_path, "python")

# API 1: file path + new chunks
diff = processor.compute_diff(str(file_path), new_chunks)
for added in diff.added:
    print("Added:", added.chunk_id)

# API 2: old chunks + new text + language
# diff = processor.compute_diff(old_chunks, file_path.read_text(), "python")
```
Smart Context and Natural-Language Query (optional)
Advanced features depend on heavy optional packages (NumPy, PyArrow) and are skipped at import time when those are missing. When available:
```python
from chunker import (
    TreeSitterSmartContextProvider,
    InMemoryContextCache,
    AdvancedQueryIndex,
    NaturalLanguageQueryEngine,
)
from chunker import chunk_file

chunks = chunk_file("api/server.py", "python")

# Semantic context
ctx = TreeSitterSmartContextProvider(cache=InMemoryContextCache(ttl=3600))
context, metadata = ctx.get_semantic_context(chunks[0])

# Query
index = AdvancedQueryIndex()
index.build_index(chunks)
engine = NaturalLanguageQueryEngine()
results = engine.search("API endpoints", chunks)
for r in results[:3]:
    print(r.score, r.chunk.node_type)
```
Streaming Large Files
```python
from chunker import chunk_file_streaming

for chunk in chunk_file_streaming("big.sql", language="sql"):
    print(chunk.node_type, chunk.start_line, chunk.end_line)
```
Cross-Language Usage
```shell
# CLI with JSON output (callable from any language)
treesitter-chunker chunk file.py --lang python --json

# REST API
curl -X POST http://localhost:8000/chunk/text \
  -H "Content-Type: application/json" \
  -d '{"content": "def hello(): pass", "language": "python"}'
```
See Cross-Language Usage Guide for JavaScript, Go, and other language examples.
Note: By default, chunks smaller than 3 lines are filtered out. Adjust `min_chunk_size` in configuration if needed.
Zero-Configuration Usage (New!)
```python
from chunker.auto import ZeroConfigAPI

# Create API instance - no setup required!
api = ZeroConfigAPI()

# Automatically detects language and downloads grammar if needed
result = api.auto_chunk_file("example.rs")
for chunk in result.chunks:
    print(f"{chunk.node_type} at lines {chunk.start_line}-{chunk.end_line}")

# Preload languages for offline use
api.preload_languages(["python", "rust", "go", "typescript"])
```
Using Plugins
```python
from chunker.core import chunk_file
from chunker.plugin_manager import get_plugin_manager

# Load built-in language plugins
manager = get_plugin_manager()
manager.load_built_in_plugins()

# Now chunking uses plugin-based rules
chunks = chunk_file("example.py", "python")
```
Parallel Processing
```python
from chunker.parallel import chunk_files_parallel, chunk_directory_parallel

# Process multiple files in parallel
results = chunk_files_parallel(
    ["file1.py", "file2.py", "file3.py"],
    "python",
    max_workers=4,
    show_progress=True,
)

# Process entire directory
results = chunk_directory_parallel(
    "src/",
    "python",
    pattern="**/*.py",
)
```
Build Wheels (for contributors)
The build system supports environment flags to speed up or stabilize local builds:
```shell
# Limit grammars included in combined wheels (comma-separated subset)
export CHUNKER_WHEEL_LANGS=python,javascript,rust

# Verbose build logs
export CHUNKER_BUILD_VERBOSE=1

# Optional build timeout in seconds (per compilation unit)
export CHUNKER_BUILD_TIMEOUT=240
```
Export Formats
```python
from chunker.core import chunk_file
from chunker.export.json_export import JSONExporter, JSONLExporter
from chunker.export.formatters import SchemaType
from chunker.exporters.parquet import ParquetExporter

chunks = chunk_file("example.py", "python")

# Export to JSON with nested schema
json_exporter = JSONExporter(schema_type=SchemaType.NESTED)
json_exporter.export(chunks, "chunks.json")

# Export to JSONL for streaming
jsonl_exporter = JSONLExporter()
jsonl_exporter.export(chunks, "chunks.jsonl")

# Export to Parquet for analytics
parquet_exporter = ParquetExporter(compression="snappy")
parquet_exporter.export(chunks, "chunks.parquet")
```
CLI Usage
```shell
# Basic chunking
treesitter-chunker chunk example.py -l python

# Process directory with progress bar
treesitter-chunker batch src/ --recursive

# Export as JSON
treesitter-chunker chunk example.py -l python --json > chunks.json

# With configuration file
treesitter-chunker chunk src/ --config .chunkerrc

# Override exclude patterns (default excludes files with 'test' in name)
treesitter-chunker batch src/ --exclude "*.tmp,*.bak" --include "*.py"

# List available languages
treesitter-chunker languages

# Get help for specific commands
treesitter-chunker chunk --help
treesitter-chunker batch --help
```
Zero-Config CLI (auto-detection)
```shell
# Automatically detect language and chunk a file
treesitter-chunker auto-chunk example.rs

# Auto-chunk a directory using detection + intelligent fallbacks
treesitter-chunker auto-batch repo/
```
Debug and Visualization
```shell
# Debug commands (requires graphviz or install with [viz] extra)
treesitter-chunker debug --help

# AST visualization (requires graphviz system package)
python scripts/visualize_ast.py example.py --lang python --out example.svg
```
VS Code Extension
The Tree-sitter Chunker VS Code extension provides integrated chunking capabilities:
- Install the extension: Search for "TreeSitter Chunker" in the VS Code marketplace
- Commands available:
  - `TreeSitter Chunker: Chunk Current File` - Analyze the active file
  - `TreeSitter Chunker: Chunk Workspace` - Process all supported files
  - `TreeSitter Chunker: Show Chunks` - View chunks in a webview
  - `TreeSitter Chunker: Export Chunks` - Export to JSON/JSONL/Parquet
- Features:
  - Visual chunk boundaries in the editor
  - Context menu integration
  - Configurable chunk types per language
  - Progress tracking for large operations
🎯 Features
Plugin Architecture
The chunker uses a flexible plugin system for language support:
- Built-in Plugins: 29 languages with dedicated plugins: Python, JavaScript (includes TypeScript/TSX), Rust, C, C++, Go, Ruby, Java, Dockerfile, SQL, MATLAB, R, Julia, OCaml, Haskell, Scala, Elixir, Clojure, Dart, Vue, Svelte, Zig, NASM, WebAssembly, XML, YAML, TOML
- Auto-Download Support: 100+ additional languages via automatic grammar download including PHP, Kotlin, C#, Swift, CSS, HTML, JSON, and many more
- Custom Plugins: Easy to add new languages using the TemplateGenerator
- Configuration: Per-language chunk types and rules
- Hot Loading: Load plugins from directories
Performance Features
- AST Caching: 11.9x speedup for repeated processing
- Parallel Processing: Utilize multiple CPU cores
- Streaming: Process files larger than memory
- Progress Tracking: Rich progress bars with ETA
Configuration System
Support for multiple configuration formats:
```toml
# .chunkerrc
min_chunk_size = 3
max_chunk_size = 300

[languages.python]
chunk_types = ["function_definition", "class_definition", "async_function_definition"]
min_chunk_size = 5
```
Export Formats
- JSON: Human-readable, supports nested/flat/relational schemas
- JSONL: Line-delimited JSON for streaming
- Parquet: Columnar format for analytics with compression
Recent Feature Additions
Phase 9 Features (Completed)
- Token Integration: Count tokens for LLM context windows
- Chunk Hierarchy: Build hierarchical chunk relationships
- Metadata Extraction: Extract TODOs, complexity metrics, etc.
- Semantic Merging: Intelligently merge related chunks
- Custom Rules: Define custom chunking rules per language
- Repository Processing: Process entire repositories efficiently
- Overlapping Fallback: Handle edge cases with smart fallbacks
- Cross-Platform Packaging: Distribute as wheels for all platforms
Phase 14: Universal Language Support (Completed)
- Automatic Grammar Discovery: Discovers 100+ Tree-sitter grammars from GitHub
- On-Demand Download: Downloads and compiles grammars automatically when needed
- Zero-Configuration API: Simple API that just works without setup
- Smart Caching: Local cache with 24-hour refresh for offline use
- Language Detection: Automatic language detection from file extensions
Phase 15: Production Readiness & Comprehensive Testing (Completed)
- 900+ Tests: All tests passing across unit, integration, and language-specific test suites
- Test Fixes: Fixed fallback warnings, CSV header inclusion, and large file streaming
- Comprehensive Methodology: Full testing coverage for security, performance, reliability, and operations
- 36+ Languages: Production-ready support across all built-in languages
Phase 19: Comprehensive Language Expansion (Completed)
- Template Generator: Automated plugin and test generation with Jinja2
- Grammar Manager: Dynamic grammar source management with parallel compilation
- 36+ Built-in Languages: Added 22 new language plugins across 4 tiers
- Contract-Driven Development: Clean component boundaries for parallel implementation
- ExtendedLanguagePluginContract: Enhanced contract for consistent plugin behavior
🔧 Troubleshooting
Common Issues & Solutions
Grammar Build Failures
```shell
# If you encounter grammar compilation errors:
export CHUNKER_GRAMMAR_BUILD_DIR="$HOME/.cache/treesitter-chunker/build"
python -c "from chunker.grammar.manager import TreeSitterGrammarManager; gm = TreeSitterGrammarManager(); gm.build_grammar('python')"
```
Memory Issues with Large Files
```python
# Use streaming for files larger than memory:
from chunker import chunk_file_streaming
chunks = chunk_file_streaming("large_file.py", "python", chunk_size=1000)
```
Language Detection Issues
```python
# Force language detection:
from chunker import chunk_file
chunks = chunk_file("file.xyz", language="python", force_language=True)
```
Performance Optimization
```python
# Enable AST caching for repeated processing:
from chunker import ASTCache
cache = ASTCache(max_size=1000)
# Cache is automatically used by chunk_file()
```
Getting Help
- Documentation: Full documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Examples: Cookbook with working examples
📚 API Overview
Tree-sitter Chunker exports 110+ APIs organized into logical groups:
Core Functions
- `chunk_file()` - Extract chunks from a file
- `CodeChunk` - Data class representing a chunk
- `chunk_text()` - Chunk raw source text (convenience wrapper)
- `chunk_directory()` - Parallel directory chunking (convenience alias)
Parser Management
- `get_parser()` - Get parser for a language
- `list_languages()` - List available languages
- `get_language_info()` - Get language metadata
- `return_parser()` - Return parser to pool
- `clear_cache()` - Clear parser cache
Plugin System
- `PluginManager` - Manage language plugins
- `LanguagePlugin` - Base class for plugins
- `PluginConfig` - Plugin configuration
- `get_plugin_manager()` - Get global plugin manager
Performance Features
- `chunk_files_parallel()` - Process files in parallel
- `chunk_directory_parallel()` - Process directories
- `chunk_file_streaming()` - Stream large files
- `ASTCache` - Cache parsed ASTs
- `StreamingChunker` - Streaming chunker class
- `ParallelChunker` - Parallel processing class
Incremental Processing
- `DefaultIncrementalProcessor` - Compute diffs between old/new chunks
- `DefaultChangeDetector`, `DefaultChunkCache` - Helpers and caching
Advanced Query (optional)
- `AdvancedQueryIndex` - Text/AST/embedding indexes
- `NaturalLanguageQuery` - Query code using natural language
- `SemanticSearch` - Find code by meaning, not just text
🤝 Contributing
We welcome contributions! Tree-sitter Chunker is built by the community for the community.
How to Contribute
- Fork the repository and create a feature branch
- Make your changes following our coding standards
- Add tests for new functionality
- Update documentation as needed
- Submit a pull request with a clear description
Development Setup
```shell
# Clone and setup development environment
git clone https://github.com/Consiliency/treesitter-chunker.git
cd treesitter-chunker
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"

# Run tests
pytest

# Build documentation
mkdocs serve
```
Contribution Guidelines
- Code Style: Follow PEP 8 and use Black for formatting
- Testing: Maintain 95%+ test coverage
- Documentation: Update docs for all new features
- Performance: Consider performance impact of changes
Getting Help
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Contributing Guide
🔐 Stable IDs & Spans
Tree-sitter Chunker generates stable, deterministic identifiers for all code entities, enabling reliable cross-referencing and incremental processing.
ID Generation
```python
from chunker.types import CodeChunk

chunk = chunk_file("example.py", "python")[0]

# Stable identifiers (SHA1-based, deterministic)
chunk.node_id    # Unique ID based on file + language + route + content hash
chunk.file_id    # Hash of the file path
chunk.symbol_id  # Hash of language + file + symbol name
chunk.chunk_id   # Full 40-char SHA1 for backward compatibility

# Hierarchical context
chunk.parent_route    # ["module", "ClassName", "method_name"]
chunk.parent_context  # "ClassName" (immediate parent)
```
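The key property is determinism: the same file, language, route, and content always hash to the same ID. A minimal sketch of that idea, using field order and separators that are illustrative only (the library's exact scheme may differ):

```python
# Illustrative sketch of deterministic SHA1-based IDs; not the library's
# exact hashing scheme, just a demonstration of the determinism property.
import hashlib

def stable_node_id(file: str, lang: str, route: list[str], content: str) -> str:
    content_hash = hashlib.sha1(content.encode()).hexdigest()
    material = "\x00".join([file, lang, "/".join(route), content_hash])
    return hashlib.sha1(material.encode()).hexdigest()

a = stable_node_id("example.py", "python", ["module", "Foo", "bar"], "def bar(): pass")
b = stable_node_id("example.py", "python", ["module", "Foo", "bar"], "def bar(): pass")
print(a == b)        # True: identical inputs always produce the same ID
print(len(a))        # 40-character hex digest, like chunk_id
```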
Byte-Accurate Spans
Every chunk includes precise byte offsets for source mapping:
```python
chunk.byte_start  # Start byte offset in file
chunk.byte_end    # End byte offset in file
chunk.start_line  # 1-indexed start line
chunk.end_line    # 1-indexed end line
```
These spans are propagated through:
- Repository processing and incremental watch
- Graph exporters (PostgreSQL, Neo4j)
- REST API responses
- XRef graph nodes
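Byte-accurate spans mean a chunk's exact source text can always be recovered by slicing the raw file bytes. A small self-contained sketch (the offsets here are computed by hand for a toy source, not produced by the chunker):

```python
# Byte spans let consumers slice the original bytes to recover a chunk's
# exact text. Offsets are hand-computed for this toy example.
source = b"def a():\n    pass\n\ndef b():\n    pass\n"

byte_start = source.index(b"def b")   # stand-in for chunk.byte_start
byte_end = len(source)                # stand-in for chunk.byte_end
chunk_text = source[byte_start:byte_end].decode()
print(chunk_text)
```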
📊 Unified Graph Model
The chunker uses a unified graph model for cross-reference analysis, graph export, and agent platform integration.
Graph Node Schema
```python
@dataclass
class UnifiedGraphNode:
    id: str                # Stable node ID (node_id from CodeChunk)
    file: str              # Source file path
    lang: str              # Language identifier
    symbol: str | None     # Symbol name (function/class name)
    kind: str              # Node type (function_definition, class_definition)
    attrs: dict[str, Any]  # Metadata (token_count, complexity, change_freq)
```
Graph Edge Schema
```python
@dataclass
class UnifiedGraphEdge:
    src: str       # Source node ID
    dst: str       # Destination node ID
    type: str      # Relationship type (CALLS, DEFINES, IMPORTS, INHERITS)
    weight: float  # Edge weight (default 1.0)
```
Building Cross-Reference Graphs
```python
from chunker import chunk_file
from chunker.graph.xref import build_xref

chunks = chunk_file("src/app.py", "python")
nodes, edges = build_xref(chunks)
# nodes: list of dicts with {id, file, lang, symbol, kind, attrs}
# edges: list of dicts with {src, dst, type, weight}
```
Graph Cut for Context Selection
Extract minimal subgraphs for LLM context:
```python
from chunker.graph.cut import graph_cut

# Select nodes within 2 hops of seeds, up to 200 nodes
selected_ids, induced_edges = graph_cut(
    seeds=["function_abc_node_id"],
    nodes=nodes,
    edges=edges,
    radius=2,    # BFS depth
    budget=200,  # Max nodes to return
    weights={
        "distance": 1.0,    # Favor nodes closer to seeds
        "publicness": 0.5,  # Favor high out-degree nodes
        "hotspots": 0.3,    # Favor frequently changed nodes
    },
)
```
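To see what the `radius` and `budget` parameters control, here is a minimal, self-contained BFS sketch of the same idea. It is not the library's implementation (which additionally scores candidates by distance, publicness, and hotspots), just the traversal skeleton:

```python
# Minimal BFS sketch of the radius/budget idea behind graph_cut. The real
# implementation also scores candidate nodes; this only shows the traversal.
from collections import deque

def bfs_cut(seeds: set, edges: list, radius: int, budget: int) -> set:
    adj: dict = {}
    for src, dst in edges:  # treat edges as undirected for neighborhood growth
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, []).append(src)
    selected = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue and len(selected) < budget:
        node, depth = queue.popleft()
        if depth == radius:  # stop expanding past the BFS depth limit
            continue
        for nxt in adj.get(node, []):
            if nxt not in selected:
                selected.add(nxt)
                queue.append((nxt, depth + 1))
    return selected

edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(bfs_cut({"a"}, edges, radius=2, budget=10))  # {'a', 'b', 'c'}
```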
🤖 Agent Platform REST API
Tree-sitter Chunker provides a REST API for LLM agents and external tools.
Starting the Server
```shell
# Install with API support
pip install "treesitter-chunker[api]"

# Start server
uvicorn api.server:app --host 0.0.0.0 --port 8000

# Or run directly
python -m api.server
```
Available Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API info and available endpoints |
| `/health` | GET | Health check |
| `/languages` | GET | List supported languages |
| `/chunk/text` | POST | Chunk source code text |
| `/chunk/file` | POST | Chunk file from filesystem |
| `/graph/xref` | POST | Build cross-reference graph |
| `/graph/cut` | POST | Extract subgraph via BFS |
| `/export/postgres` | POST | Export to PostgreSQL |
| `/nearest-tests` | POST | Find related test files |
Example: Chunk Code via API
```shell
curl -X POST http://localhost:8000/chunk/text \
  -H "Content-Type: application/json" \
  -d '{
    "content": "def hello():\n return \"world\"",
    "language": "python"
  }'
```
Example: Build Graph and Cut
```shell
# Step 1: Build cross-reference graph
curl -X POST http://localhost:8000/graph/xref \
  -H "Content-Type: application/json" \
  -d '{"paths": ["/path/to/file.py"]}'

# Step 2: Extract subgraph around seeds
curl -X POST http://localhost:8000/graph/cut \
  -H "Content-Type: application/json" \
  -d '{
    "seeds": ["node_id_1", "node_id_2"],
    "nodes": [...],
    "edges": [...],
    "params": {"radius": 2, "budget": 100}
  }'
```
API Documentation
Interactive API docs are available at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
🛡️ Security Posture
Tree-sitter Chunker follows security best practices for production deployments.
Exception Handling
- No bare `except:` clauses - All exception handlers specify explicit exception types
- Structured error logging - Errors are logged with context for debugging
- Graceful degradation - Failures in non-critical paths don't crash the system
SQL Injection Prevention
- Parameterized queries - Database exporters use parameterized SQL for all user data
- Safe script generation - SQL file exports use proper escaping for string literals
- Separated concerns - Direct DB access uses `executemany()` with parameters; file export uses escaped literals
```python
# Internal parameterized query pattern
cursor.executemany(
    "INSERT INTO chunks (id, content) VALUES (%s, %s)",
    [(chunk.id, chunk.content) for chunk in chunks],
)
```
Shell Command Safety
- Argument lists - Subprocess calls use `shell=False` with argument lists
- Input validation - File paths and language names are validated before use
- No string interpolation - Commands are built from safe argument arrays
```python
# Safe subprocess pattern
subprocess.run(
    ["git", "diff", "--name-only", commit_hash],
    capture_output=True,
    check=True,
)
```
Input Validation
- File size limits - Streaming mode for large files prevents memory exhaustion
- Parser timeouts - Tree-sitter parsing has configurable timeouts
- Path validation - File operations validate paths exist and are accessible
- Encoding safety - Text decoding uses `errors="replace"` for malformed input
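The `errors="replace"` policy can be demonstrated in isolation: malformed bytes become U+FFFD replacement characters instead of raising `UnicodeDecodeError`, so processing can continue. The byte string below is a contrived example, not real chunker input:

```python
# Demonstration of errors="replace": invalid UTF-8 bytes are substituted
# with U+FFFD rather than aborting the decode with an exception.
malformed = b"def f():\n    return '\xff\xfe'\n"  # \xff\xfe is invalid UTF-8
text = malformed.decode("utf-8", errors="replace")
print("\ufffd" in text)  # True: bad bytes replaced, parsing can proceed
```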
Thread Safety
- Immutable registries - Language registry is read-only after initialization
- Synchronized access - Parser factory uses locks for thread-safe caching
- No shared mutable state - CodeChunk objects are independent
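The synchronized-access pattern described above can be sketched as a lock-guarded cache: the lock makes the check-then-insert atomic, so concurrent callers share one parser per language. This `ParserCache` class is illustrative, not the library's actual factory:

```python
# Illustrative sketch of the lock-guarded parser-cache pattern; class and
# method names here are hypothetical, not the library's API.
import threading

class ParserCache:
    def __init__(self, factory):
        self._factory = factory
        self._cache: dict[str, object] = {}
        self._lock = threading.Lock()

    def get(self, language: str):
        with self._lock:  # atomic check-then-insert prevents duplicate parsers
            if language not in self._cache:
                self._cache[language] = self._factory(language)
            return self._cache[language]

cache = ParserCache(factory=lambda lang: object())  # stand-in for real parsers
p1 = cache.get("python")
p2 = cache.get("python")
print(p1 is p2)  # True: one shared parser instance per language
```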
🔢 LLM Token Packing
The pack_hint metadata helps prioritize chunks for LLM context windows.
Pack Hint Calculation
```python
from chunker.packing import compute_pack_hint

# Returns float in [0.0, 1.0] - higher = more important
hint = compute_pack_hint(chunk)

# Factors considered:
# - token_count (smaller = higher hint)
# - complexity (higher cyclomatic = higher hint)
# - degree (more xref connections = higher hint)
# - recent_changes (frequently modified = higher hint)
```
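To make the factor list concrete, here is a hypothetical sketch of how such signals could combine into a [0, 1] score. The weights and normalization constants are invented for illustration; `compute_pack_hint`'s real formula is not specified here:

```python
# Hypothetical scoring sketch: weights and caps below are illustrative only,
# not the actual compute_pack_hint() formula.
def pack_hint_sketch(token_count: int, complexity: int,
                     degree: int, recent_changes: int,
                     max_tokens: int = 4000) -> float:
    smallness = 1.0 - min(token_count / max_tokens, 1.0)   # smaller = higher
    score = (0.4 * smallness
             + 0.3 * min(complexity / 20.0, 1.0)           # busier = higher
             + 0.2 * min(degree / 10.0, 1.0)               # connected = higher
             + 0.1 * min(recent_changes / 5.0, 1.0))       # hot = higher
    return round(score, 3)

small_hot = pack_hint_sketch(token_count=200, complexity=8, degree=6, recent_changes=4)
large_cold = pack_hint_sketch(token_count=3800, complexity=1, degree=0, recent_changes=0)
print(small_hot > large_cold)  # small, busy chunks rank higher for packing
```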
Automatic Token Enrichment
```python
from chunker.token.chunker import TreeSitterTokenAwareChunker

chunker = TreeSitterTokenAwareChunker()
chunks = chunker.chunk_file("app.py", "python")
for chunk in chunks:
    print(f"{chunk.node_type}: {chunk.metadata['token_count']} tokens")
    print(f"  pack_hint: {chunk.metadata['pack_hint']:.2f}")
```
Token-Limited Chunking
```python
from chunker import chunk_text_with_token_limit

# Automatically split chunks exceeding token limit
chunks = chunk_text_with_token_limit(
    code,
    language="python",
    max_tokens=4000,  # GPT-4 context budget
    model="gpt-4",
)
```
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Tree-sitter: For the excellent parsing infrastructure
- Contributors: Everyone who has helped improve this project
- Community: Users and developers who provide feedback and ideas
Made with ❤️ by the Tree-sitter Chunker community