A pragmatic multi-language code parser optimized for LLM applications
Project description
Code Chunker
A pragmatic multi-language code parser optimized for LLM applications and RAG systems.
Features
- Multi-language support: Python, JavaScript, TypeScript, Solidity, Go, Rust
- Optimized for LLMs: Provides structured output ideal for language models
- Lightweight: Minimal dependencies, fast parsing
- Configurable: Adjust chunk sizes, confidence thresholds, and more
- Easy to use: Simple API with both file and directory parsing
- Incremental parsing: Efficiently update parse results when code changes
- Enhanced language support:
- TypeScript/React: Component, Hook, and Context detection
- Solidity: Smart contract metadata extraction (visibility, modifiers, payable)
- Go: Concurrency pattern detection (goroutines, channels, mutexes)
Installation
pip install code-chunker
Quick Start
from code_chunker import CodeChunker
# Initialize the chunker
chunker = CodeChunker()
# Parse a code string
code = """
def hello_world():
print("Hello, World!")
"""
result = chunker.parse(code, language='python')
# Print the chunks
for chunk in result.chunks:
print(f"{chunk.type.value}: {chunk.name} (lines {chunk.start_line}-{chunk.end_line})")
# Parse a file
result = chunker.parse_file('example.py')
# Parse a directory
results = chunker.parse_directory('src/')
Configuration
from code_chunker import CodeChunker, ChunkerConfig
config = ChunkerConfig(
max_chunk_size=2000,
min_chunk_size=100,
include_comments=True,
confidence_threshold=0.8
)
chunker = CodeChunker(config=config)
Incremental Parsing
Incremental parsing allows you to efficiently update parse results when code changes, without reparsing the entire file.
from code_chunker import CodeChunker, IncrementalParser
# Initialize the incremental parser
incremental_parser = IncrementalParser()
# First parse (full parse)
result1 = incremental_parser.full_parse("path/to/file.py")
# After file changes, perform an incremental parse
result2 = incremental_parser.incremental_parse("path/to/file.py")
# Compare the results
print(f"Full parse chunks: {len(result1.chunks)}")
print(f"Incremental parse chunks: {len(result2.chunks)}")
Enhanced Language Support
TypeScript/React Support
Code Chunker provides specialized support for React components, hooks, and contexts:
from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case
# Get React-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('typescript', 'react'))
chunker = CodeChunker(config=config)
# Parse React component
result = chunker.parse(react_code, language='typescript')
# Filter for React components
components = [chunk for chunk in result.chunks if chunk.type.value == 'component']
for component in components:
print(f"Component: {component.name} (type: {component.metadata.get('component_type')})")
Solidity Smart Contract Support
Enhanced metadata extraction for smart contracts:
from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case
# Get Solidity-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('solidity', 'contract'))
chunker = CodeChunker(config=config)
# Parse Solidity contract
result = chunker.parse(contract_code, language='solidity')
# Find payable functions
payable_functions = [
chunk for chunk in result.chunks
if chunk.type.value == 'function' and chunk.metadata.get('is_payable', False)
]
Go Concurrency Pattern Detection
Automatically detect concurrency patterns in Go code:
from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case
# Get Go-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('go', 'performance'))
chunker = CodeChunker(config=config)
# Parse Go code
result = chunker.parse(go_code, language='go')
# Find functions with goroutines
concurrent_funcs = [
chunk for chunk in result.chunks
if chunk.type.value in ['function', 'method']
and 'goroutines' in chunk.metadata.get('concurrency_patterns', {})
]
Supported Languages
- Python (.py)
- JavaScript (.js, .jsx)
- TypeScript (.ts, .tsx)
- Solidity (.sol)
- Go (.go)
- Rust (.rs)
Examples
The examples/ directory contains several examples demonstrating different features:
Basic Usage
Simple parsing examples:
python examples/basic_usage.py
Advanced Usage
Custom configuration and analysis:
python examples/advanced_usage.py
Incremental Parsing
Efficient parsing of code changes:
python examples/incremental_parsing.py
RAG Integration
Integration with RAG systems:
python examples/rag_integration.py
Edge Cases
Testing various edge cases across languages:
python examples/edge_cases.py
Performance Analysis
Analyze parsing performance:
python examples/performance_analysis.py
Code Quality Analysis
Analyze code quality metrics:
python examples/quality_analysis.py <file_path>
Visualization
Generate code structure visualization:
python examples/visualization.py <file_path>
API Reference
CodeChunker
The main class for parsing code.
chunker = CodeChunker(config=None)
Methods
parse(code: str, language: str) -> ParseResult: Parse a code stringparse_file(file_path: Union[str, Path]) -> ParseResult: Parse a fileparse_directory(directory: Union[str, Path], recursive: bool = True, extensions: Optional[List[str]] = None) -> List[ParseResult]: Parse a directory
IncrementalParser
For efficient incremental parsing.
parser = IncrementalParser(chunker=None)
Methods
full_parse(file_path: str) -> ParseResult: Perform a full parse and cache the resultparse_incremental(file_path: str, changes: List[Tuple[int, int, str]]) -> ParseResult: Parse incrementally based on changesinvalidate_cache(file_path: Optional[str] = None) -> None: Invalidate cache for a file or all files
How Incremental Parsing Works
- Initial Parse: The first parse of a file is a full parse, which is cached
- Change Detection: When changes are made, only affected code regions are identified
- Selective Reparsing: Only affected chunks are reparsed, preserving the rest
- Result Merging: Updated chunks are merged with unchanged chunks
- Smart Caching: Results are cached for future incremental updates
ParseResult
The result of parsing code.
Attributes
language: str: The programming languagefile_path: Optional[str]: Path to the source filechunks: List[CodeChunk]: List of code chunksimports: List[Import]: List of importsexports: List[str]: List of exportsraw_code: str: The original code
CodeChunk
Represents a piece of code.
Attributes
type: ChunkType: The type of chunk (function, class, etc.)name: Optional[str]: The name of the chunkcode: str: The actual codestart_line: int: Starting line numberend_line: int: Ending line numberlanguage: str: Programming languageconfidence: float: Confidence score (0-1)metadata: Dict[str, Any]: Additional metadata
Dependencies
- For basic usage: No external dependencies
- For performance analysis:
psutil - For visualization: Modern web browser to view generated HTML
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Development Setup
- Clone the repository
- Install development dependencies:
pip install -e ".[dev]"
- Run tests:
pytest
- Format code:
black code_chunker/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
If you find this project helpful, consider supporting its development:
- ⭐ Star this repository
- 🐛 Report bugs and suggest features
- 🤝 Submit pull requests
- 💰 EVM(ETH, ARB, BNB, OP..etc):
0x8f74959530dba14394b27faac92955aa96927e8b
Acknowledgments
Thanks to all contributors and the open-source community for their support.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file code_chunker-1.3.2.tar.gz.
File metadata
- Download URL: code_chunker-1.3.2.tar.gz
- Upload date:
- Size: 33.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
73b68e18bc092d52aca7d70d52a6feee9f8ca9446863bbdb34110b9d899c46aa
|
|
| MD5 |
4c0e93a722ecb2b7feaa72d3b7cfbb89
|
|
| BLAKE2b-256 |
71e567357da7b5f87bd200bc96a19e593c63f0c4794f2fab9780f0c452b560b5
|
File details
Details for the file code_chunker-1.3.2-py3-none-any.whl.
File metadata
- Download URL: code_chunker-1.3.2-py3-none-any.whl
- Upload date:
- Size: 39.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3e7bbea5ddd94a61c5dad571f79f8f2640c713cf237b671ad16f8c353b22a8d5
|
|
| MD5 |
6157a4ca79e97090d8022e6b80078eb8
|
|
| BLAKE2b-256 |
5dbcbbda6befb69d0496b273cc1b4d155d62ae9c58fec50da5ea7a79ec0be3ed
|