Code Chunker

A pragmatic multi-language code parser optimized for LLM applications and RAG systems.

Features

  • Multi-language support: Python, JavaScript, TypeScript, Solidity, Go, Rust
  • Optimized for LLMs: Returns each chunk with its type, name, line range, and metadata, ready to pass to a language model
  • Lightweight: No external dependencies for basic usage, fast parsing
  • Configurable: Adjust chunk sizes, confidence thresholds, and more
  • Easy to use: Simple API with both file and directory parsing
  • Incremental parsing: Efficiently update parse results when code changes
  • Enhanced language support:
    • TypeScript/React: Component, Hook, and Context detection
    • Solidity: Smart contract metadata extraction (visibility, modifiers, payable)
    • Go: Concurrency pattern detection (goroutines, channels, mutexes)

Installation

pip install code-chunker

Quick Start

from code_chunker import CodeChunker

# Initialize the chunker
chunker = CodeChunker()

# Parse a code string
code = """
def hello_world():
    print("Hello, World!")
"""

result = chunker.parse(code, language='python')

# Print the chunks
for chunk in result.chunks:
    print(f"{chunk.type.value}: {chunk.name} (lines {chunk.start_line}-{chunk.end_line})")

# Parse a file
result = chunker.parse_file('example.py')

# Parse a directory
results = chunker.parse_directory('src/')

Configuration

from code_chunker import CodeChunker, ChunkerConfig

config = ChunkerConfig(
    max_chunk_size=2000,
    min_chunk_size=100,
    include_comments=True,
    confidence_threshold=0.8
)

chunker = CodeChunker(config=config)

Incremental Parsing

Incremental parsing allows you to efficiently update parse results when code changes, without reparsing the entire file.

from code_chunker import CodeChunker, IncrementalParser

# Initialize the incremental parser
incremental_parser = IncrementalParser()

# First parse (full parse)
result1 = incremental_parser.full_parse("path/to/file.py")

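# After the file changes, perform an incremental parse.
# parse_incremental takes a list of (int, int, str) change tuples (see API Reference);
# the layout (start_line, end_line, new_text) used here is an assumption.
changes = [(2, 2, '    print("Hello again!")')]
result2 = incremental_parser.parse_incremental("path/to/file.py", changes)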

# Compare the results
print(f"Full parse chunks: {len(result1.chunks)}")
print(f"Incremental parse chunks: {len(result2.chunks)}")

Enhanced Language Support

TypeScript/React Support

Code Chunker provides specialized support for React components, hooks, and contexts:

from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case

# Get React-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('typescript', 'react'))
chunker = CodeChunker(config=config)

# Parse a React component (react_code is a str holding the TypeScript/TSX source)
result = chunker.parse(react_code, language='typescript')

# Filter for React components
components = [chunk for chunk in result.chunks if chunk.type.value == 'component']
for component in components:
    print(f"Component: {component.name} (type: {component.metadata.get('component_type')})")

Solidity Smart Contract Support

Enhanced metadata extraction for smart contracts:

from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case

# Get Solidity-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('solidity', 'contract'))
chunker = CodeChunker(config=config)

# Parse a Solidity contract (contract_code is a str holding the contract source)
result = chunker.parse(contract_code, language='solidity')

# Find payable functions
payable_functions = [
    chunk for chunk in result.chunks 
    if chunk.type.value == 'function' and chunk.metadata.get('is_payable', False)
]

Go Concurrency Pattern Detection

Automatically detect concurrency patterns in Go code:

from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case

# Get Go-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('go', 'performance'))
chunker = CodeChunker(config=config)

# Parse Go code (go_code is a str holding the Go source)
result = chunker.parse(go_code, language='go')

# Find functions with goroutines
concurrent_funcs = [
    chunk for chunk in result.chunks 
    if chunk.type.value in ['function', 'method'] 
    and 'goroutines' in chunk.metadata.get('concurrency_patterns', {})
]

Supported Languages

  • Python (.py)
  • JavaScript (.js, .jsx)
  • TypeScript (.ts, .tsx)
  • Solidity (.sol)
  • Go (.go)
  • Rust (.rs)
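
The list above can be turned into a lookup that picks the language argument for parse() from a file name. This is a minimal sketch, not part of the library: guess_language and EXTENSION_TO_LANGUAGE are hypothetical names, and the strings 'javascript' and 'rust' are assumed by analogy with 'python', 'typescript', 'solidity', and 'go', which appear in the examples above.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the Supported Languages list.
# The language strings 'javascript' and 'rust' are assumptions.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".sol": "solidity",
    ".go": "go",
    ".rs": "rust",
}


def guess_language(file_path: str) -> Optional[str]:
    """Return the language for a path, or None if the extension is not listed."""
    return EXTENSION_TO_LANGUAGE.get(Path(file_path).suffix.lower())
```

A caller could skip files for which guess_language returns None rather than passing an unsupported language to the parser.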

Examples

The examples/ directory contains several examples demonstrating different features:

Basic Usage

Simple parsing examples:

python examples/basic_usage.py

Advanced Usage

Custom configuration and analysis:

python examples/advanced_usage.py

Incremental Parsing

Efficient parsing of code changes:

python examples/incremental_parsing.py

RAG Integration

Integration with RAG systems:

python examples/rag_integration.py

Edge Cases

Testing various edge cases across languages:

python examples/edge_cases.py

Performance Analysis

Analyze parsing performance:

python examples/performance_analysis.py

Code Quality Analysis

Analyze code quality metrics:

python examples/quality_analysis.py <file_path>

Visualization

Generate code structure visualization:

python examples/visualization.py <file_path>

API Reference

CodeChunker

The main class for parsing code.

chunker = CodeChunker(config=None)

Methods

  • parse(code: str, language: str) -> ParseResult: Parse a code string
  • parse_file(file_path: Union[str, Path]) -> ParseResult: Parse a file
  • parse_directory(directory: Union[str, Path], recursive: bool = True, extensions: Optional[List[str]] = None) -> List[ParseResult]: Parse a directory
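
To illustrate what the recursive and extensions parameters of parse_directory describe, here is a standard-library sketch of file selection. It is not the library's implementation: collect_source_files is a hypothetical helper, and the assumption is that extensions are given with a leading dot (for example '.py').

```python
from pathlib import Path
from typing import List, Optional, Union


def collect_source_files(
    directory: Union[str, Path],
    recursive: bool = True,
    extensions: Optional[List[str]] = None,
) -> List[Path]:
    """Return files under directory, optionally filtered by extension."""
    root = Path(directory)
    # "**/*" descends into subdirectories, "*" stays at the top level.
    pattern = "**/*" if recursive else "*"
    wanted = {ext.lower() for ext in extensions} if extensions else None
    return sorted(
        path
        for path in root.glob(pattern)
        if path.is_file() and (wanted is None or path.suffix.lower() in wanted)
    )
```

Each returned path could then be handed to parse_file one at a time.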

IncrementalParser

For efficient incremental parsing.

parser = IncrementalParser(chunker=None)

Methods

  • full_parse(file_path: str) -> ParseResult: Perform a full parse and cache the result
  • parse_incremental(file_path: str, changes: List[Tuple[int, int, str]]) -> ParseResult: Parse incrementally based on changes
  • invalidate_cache(file_path: Optional[str] = None) -> None: Invalidate cache for a file or all files
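
The changes argument of parse_incremental is typed as List[Tuple[int, int, str]], but the meaning of each field is not documented here. The sketch below assumes (start_line, end_line, new_text) with 1-based, inclusive line numbers in the old text, and builds such a list from two versions of a file using difflib. compute_changes is a hypothetical helper, not part of the library.

```python
import difflib
from typing import List, Tuple


def compute_changes(old_code: str, new_code: str) -> List[Tuple[int, int, str]]:
    """Diff two versions into (start_line, end_line, new_text) tuples.

    Assumed layout: 1-based, inclusive line numbers in the OLD text.
    A pure insertion has end_line == start_line - 1 (an empty old range).
    """
    old_lines = old_code.splitlines()
    new_lines = new_code.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changes: List[Tuple[int, int, str]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Opcodes use 0-based, half-open ranges; convert to 1-based, inclusive.
        changes.append((i1 + 1, i2, "\n".join(new_lines[j1:j2])))
    return changes
```

If the library's tuple layout differs from this assumption, only the final append needs to change.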

How Incremental Parsing Works

  1. Initial Parse: The first parse of a file is a full parse, which is cached
  2. Change Detection: When changes are made, only affected code regions are identified
  3. Selective Reparsing: Only affected chunks are reparsed, preserving the rest
  4. Result Merging: Updated chunks are merged with unchanged chunks
  5. Smart Caching: Results are cached for future incremental updates
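
Steps 2 to 4 above can be sketched with plain data structures. This is an illustration of the general technique, not the library's internals: Chunk is a stand-in for CodeChunk, and split_by_change and merge are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Stand-in for CodeChunk with only the fields this sketch needs."""
    name: str
    start_line: int
    end_line: int


def split_by_change(
    chunks: List[Chunk], change_start: int, change_end: int
) -> Tuple[List[Chunk], List[Chunk]]:
    """Partition chunks into (unchanged, affected) by overlap with the edit."""
    unchanged: List[Chunk] = []
    affected: List[Chunk] = []
    for chunk in chunks:
        overlaps = chunk.start_line <= change_end and chunk.end_line >= change_start
        (affected if overlaps else unchanged).append(chunk)
    return unchanged, affected


def merge(
    unchanged: List[Chunk], reparsed: List[Chunk], change_end: int, line_delta: int
) -> List[Chunk]:
    """Shift chunks located after the edit by line_delta, then merge in order."""
    shifted = [
        Chunk(c.name, c.start_line + line_delta, c.end_line + line_delta)
        if c.start_line > change_end
        else c
        for c in unchanged
    ]
    return sorted(shifted + reparsed, key=lambda c: c.start_line)
```

For example, if lines 6-7 inside one function are replaced by four lines, only that function is reparsed, and every chunk after it is shifted down by two lines.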

ParseResult

The result of parsing code.

Attributes

  • language: str: The programming language
  • file_path: Optional[str]: Path to the source file
  • chunks: List[CodeChunk]: List of code chunks
  • imports: List[Import]: List of imports
  • exports: List[str]: List of exports
  • raw_code: str: The original code

CodeChunk

Represents a piece of code.

Attributes

  • type: ChunkType: The type of chunk (function, class, etc.)
  • name: Optional[str]: The name of the chunk
  • code: str: The actual code
  • start_line: int: Starting line number
  • end_line: int: Ending line number
  • language: str: Programming language
  • confidence: float: Confidence score (0-1)
  • metadata: Dict[str, Any]: Additional metadata
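
As a rough sketch of how these attributes could feed a RAG index, the code below turns a chunk into a text-plus-metadata record and drops low-confidence chunks. CodeChunk and ChunkType here are stand-ins that mirror the attribute list above so the example runs without the library; to_rag_document and the 0.8 default are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class ChunkType(Enum):
    """Stand-in enum; the real ChunkType may define more members."""
    FUNCTION = "function"
    CLASS = "class"


@dataclass
class CodeChunk:
    """Stand-in mirroring the attributes documented above."""
    type: ChunkType
    name: Optional[str]
    code: str
    start_line: int
    end_line: int
    language: str
    confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_rag_document(
    chunk: CodeChunk, min_confidence: float = 0.8
) -> Optional[Dict[str, Any]]:
    """Build an indexable record, or return None for low-confidence chunks."""
    if chunk.confidence < min_confidence:
        return None
    header = (
        f"[{chunk.language}] {chunk.type.value} {chunk.name or '<anonymous>'} "
        f"(lines {chunk.start_line}-{chunk.end_line})"
    )
    return {
        "text": f"{header}\n{chunk.code}",
        "metadata": {
            "name": chunk.name,
            "type": chunk.type.value,
            "language": chunk.language,
            **chunk.metadata,
        },
    }
```

Putting the header line in front of the code gives an embedding model the chunk's name and location along with its body.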

Dependencies

  • For basic usage: No external dependencies
  • For performance analysis: psutil
  • For visualization: Modern web browser to view generated HTML

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

  1. Clone the repository
  2. Install development dependencies:
    pip install -e ".[dev]"
    
  3. Run tests:
    pytest
    
  4. Format code:
    black code_chunker/
    

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

If you find this project helpful, consider supporting its development:

  • ⭐ Star this repository
  • 🐛 Report bugs and suggest features
  • 🤝 Submit pull requests
  • 💰 EVM (ETH, ARB, BNB, OP, etc.): 0x8f74959530dba14394b27faac92955aa96927e8b

Acknowledgments

Thanks to all contributors and the open-source community for their support.

