
A Python-based parallel file chunking system for large codebases


A Python-based parallel file chunking system designed for processing large codebases into LLM-friendly chunks. The tool provides intelligent file filtering, multi-threaded processing, and advanced chunking capabilities optimized for machine learning contexts.

Core Features

  • NEW: Front-end tool for chunking. Run komodo --dashboard

  • Parallel Processing: Multi-threaded file reading with configurable thread pools

  • Smart File Filtering:

    • Built-in patterns for common excludes (.git, node_modules, __pycache__, etc.)
    • Customizable ignore/unignore patterns
    • Intelligent binary file detection
  • Flexible Chunking:

    • Equal-parts chunking: Split content into N equal chunks
    • Size-based chunking: Split by maximum chunk size
    • Semantic (AST-based) chunking for Python files
    • Dry-run mode: Preview which files would be chunked without writing any output
    • Token-based chunking: Split by token count for LLMs
  • LLM Optimizations:

    • Metadata extraction (functions, classes, imports, docstrings)
    • Content relevance scoring
    • Redundancy removal across chunks
    • Configurable context window sizes
  • Chunking PDF Files:

    • Split PDF content by pages and paragraphs (rather than lines)
    • Perform basic text cleanup to handle multi-column layouts, or text from HTML-like elements if present
    • Create multiple chunks for large PDFs while preserving some logical structure
  • Secret redaction: Repositories are scanned for API keys, which are automatically redacted; .env files are also ignored
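As a rough illustration, secret redaction can be imagined as a regex pass over file contents before chunking. The patterns and function below are hypothetical sketches, not pykomodo's actual scanner:

```python
import re

# Hypothetical illustration of secret redaction, not pykomodo's actual scanner.
# Real scanners use many more patterns (AWS, GitHub, Slack token shapes, ...).
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key\s*[=:]\s*)\S+"),  # KEY=value assignments
    re.compile(r"sk-[A-Za-z0-9]{20,}"),             # OpenAI-style key shape
]

def redact(text: str) -> str:
    for pat in SECRET_PATTERNS:
        if pat.groups:  # keep the "API_KEY=" prefix, redact only the value
            text = pat.sub(lambda m: m.group(1) + "[REDACTED]", text)
        else:
            text = pat.sub("[REDACTED]", text)
    return text

print(redact("API_KEY=abc123secret"))  # API_KEY=[REDACTED]
```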

Installation

pip install pykomodo==0.2.5

Link to pypi: https://pypi.org/project/pykomodo/

Quick Start

Command Line Usage

Here’s a complete list of available command-line options for the komodo tool:

Option | Description | Default Value
------ | ----------- | -------------
--dashboard | Launches the front end for chunking | N/A
--version | Show the version of komodo | N/A
dirs | Directories to process (space-separated; e.g., komodo dir1/ dir2/) | Current directory (.)
--equal-chunks N | Split content into N equal chunks; mutually exclusive with --max-chunk-size | None
--max-chunk-size M | Maximum size per chunk (tokens without --semantic-chunks; lines for .py with it) | None
--max-tokens N | Maximum tokens per chunk (uses token-based chunking) | None
--output-dir DIR | Directory where chunk files are saved | "chunks"
--ignore PATTERN | Add a pattern to ignore (repeatable, e.g., --ignore "*.log") | None
--unignore PATTERN | Add a pattern to unignore (repeatable; overrides ignores) | None
--dry-run | List files that would be processed without creating chunks | False
--priority PATTERN,SCORE | Set priority for file patterns (repeatable, e.g., --priority "*.py,10") | None
--num-threads N | Number of threads for parallel processing | 4
--enhanced | Use EnhancedParallelChunker for LLM optimizations | False
--semantic-chunks | Enable AST-based chunking for .py files (splits by functions/classes) | False
--context-window N | Target LLM context window size in bytes (used with --enhanced) | 4096
--min-relevance F | Minimum relevance score for chunks (0.0-1.0, used with --enhanced) | 0.3
--no-metadata | Disable metadata extraction (used with --enhanced) | False (metadata enabled)
--keep-redundant | Keep redundant content across chunks (used with --enhanced) | False (removes redundancy)
--no-summaries | Disable summary generation (used with --enhanced; currently a placeholder in code) | False (summaries enabled)
--file-type TYPE | Only process files of this extension (e.g., pdf, py) | None

Notes:

  • Options like --equal-chunks and --max-chunk-size cannot be used together (enforced by the CLI).
  • Use --dry-run to test your ignore/unignore patterns or priority rules without generating output.
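Mutually exclusive flags like --equal-chunks and --max-chunk-size are the kind of constraint argparse can enforce directly. A minimal sketch of the idea (illustrative only; not komodo's actual CLI code):

```python
import argparse

# Sketch of enforcing "--equal-chunks XOR --max-chunk-size" with argparse.
# Illustrative only; komodo's real CLI may be wired differently.
parser = argparse.ArgumentParser(prog="komodo-demo")
group = parser.add_mutually_exclusive_group()
group.add_argument("--equal-chunks", type=int)
group.add_argument("--max-chunk-size", type=int)

args = parser.parse_args(["--equal-chunks", "5"])
print(args.equal_chunks)  # 5
```

Passing both flags at once makes argparse exit with a usage error, which matches the behavior the CLI documents.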

Basic usage

CLI
# Split into 5 equal chunks
komodo . --equal-chunks 5

# Process multiple directories
komodo path1/ path2/ --max-chunk-size 1000

Chunking Modes

Komodo offers flexible chunking strategies, with behavior varying based on options and the chunker type (ParallelChunker or EnhancedParallelChunker with --enhanced).

  • Fixed Number of Chunks (--equal-chunks N):

    • Base Chunker: Keeps files whole, distributing them into N chunks with approximately equal total character counts (e.g., 5 chunk files).

      komodo . --equal-chunks 5 --output-dir chunks
      
    • Enhanced Chunker: Combines all file contents into one text blob, then splits into N chunks of roughly equal byte size, potentially splitting files mid-content.

      komodo . --equal-chunks 5 --enhanced
      
  • Fixed Size Chunks (--max-chunk-size M): Without --semantic-chunks, splits each file into chunks of at most M tokens (words), keeping lines whole (e.g., chunks of 2000 tokens each).

    komodo . --max-chunk-size 2000
    

    Important: You must specify either --equal-chunks or --max-chunk-size, but not both.

  • With --semantic-chunks:

  • For .py files: Aims for chunks of M lines, grouping top-level functions/classes as atomic units. If a function exceeds M lines, it becomes a single chunk.

  • For non-.py files: Still splits by M tokens.

    komodo . --max-chunk-size 200 --semantic-chunks
    
  • With --max-tokens:

    komodo . --max-tokens 1000 --output-dir chunks
    
  • Precise token limits: Chunks content based on token counts rather than line counts
  • Tiktoken integration: Uses OpenAI's tiktoken library when available for accurate LLM token counting
  • Fallback tokenization: Falls back to word-based splitting when tiktoken is unavailable
  • PDF Chunking:

    Uses PyMuPDF to split PDFs by pages and paragraphs, respecting --max-chunk-size in tokens.

    komodo . --max-chunk-size 500 --output-dir /path/to/output --file-type pdf
    

    or

    komodo . --equal-chunks 10 --output-dir /path/to/output --file-type pdf
    

    IMPORTANT: The PDF chunker does NOT work on image-heavy PDFs; it cannot chunk formulas or images.
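The token counting behind --max-tokens can be sketched as a tiktoken-first helper that falls back to word splitting, mirroring the fallback behavior described above (a sketch, not pykomodo's actual implementation):

```python
# Sketch of token counting with a tiktoken-first strategy and a word-based
# fallback, as described above. Illustrative, not pykomodo's actual code.
def count_tokens(text: str, model: str = "gpt-4") -> int:
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:  # tiktoken missing or unavailable: approximate by words
        return len(text.split())

print(count_tokens("def add(a, b): return a + b"))
```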

Ignoring & Unignoring Files

  • Add ignore patterns with --ignore.

  • Unignore specific patterns with --unignore.

  • Komodo also has built-in ignores like .git, __pycache__, node_modules, etc.

    # Skip everything in "results/" (relative) and "docs/" (relative)
    komodo . --equal-chunks 5 \
      --ignore "results/**" \
      --ignore "docs/**"
    
    # Skip an absolute path
    komodo . --equal-chunks 5 \
      --ignore "/Users/oha/komodo/results/**"
    
    # Skip all .rst files, but unignore README.rst
    komodo . --equal-chunks 5 \
      --ignore "*.rst" \
      --unignore "README.rst"
    

    Note: If node_modules is still not ignored, specify the file type as well: komodo . --equal-chunks 5 --file-type js --ignore "**/node_modules/**". The key is restricting processing to a single file type.

    Safest (Recursive) Ignoring

    If you want to ensure that Komodo skips all files inside a particular directory (including all subfolders), you can use the ** wildcard before and after the folder name:

    # safest mode: skip everything in "results/" and "docs/" recursively
    komodo . --equal-chunks 5 \
      --ignore "**/results/**" \
      --ignore "**/docs/**"
    

    Pro Tip: If in doubt, use **/folder/** to recursively ignore a folder and everything beneath it. This is the most reliable way to avoid processing unwanted files in subdirectories.

    Fixed Number of Chunks with ignore mode
    • --ignore "/Users/oha/treeline/results/**" tells the chunker to skip any files in that absolute directory path.

    • --ignore "docs/*" tells it to skip any files under a relative folder named docs/.

      komodo . --equal-chunks 5 --ignore "/Users/oha/treeline/results/**" --ignore "docs/*" 
      
    Priority Rules

    Priority rules determine which files should be processed first or given more importance; files with higher priority scores are processed first.

    # With equal chunks: *.py (priority 10) is processed before *.md (priority 5)
    komodo . \
      --equal-chunks 5 \
      --priority "*.py,10" \
      --priority "*.md,5" \
      --output-dir chunks
    
    # Or with max chunk size
    komodo . \
      --max-chunk-size 1000 \
      --priority "*.py,10" \
      --priority "*.md,5" \
      --output-dir chunks
    

LLM Optimization Options

Enable metadata extraction and content optimization:

komodo . \
  --equal-chunks 5 \
  --enhanced \
  --context-window 4096 \
  --min-relevance 0.3
komodo . \
  --equal-chunks 5 \
  --enhanced \
  --keep-redundant \
  --min-relevance 0.5
komodo . \
  --equal-chunks 5 \
  --enhanced \
  --no-metadata \
  --context-window 8192

Dry Run

If you only want to see which files would be chunked (and in what priority order) without actually writing any output, specify --dry-run. This is especially helpful when testing new ignore/unignore patterns or priority rules. Note: no chunking is performed; the run only shows which files would be chunked.

Example:

## vanilla approach 
komodo . --equal-chunks 5 --dry-run

## with priorities: *.py files are ranked first (still a dry run)
komodo . --equal-chunks 5 --dry-run \
    --priority "*.py,10" \
    --priority "*.md,5"

No chunks are created. Komodo simply prints the would-be processed files, sorted by priority. This is an easy way to confirm your ignore patterns and see exactly which files the chunker will pick up.

Python API Usage

Basic usage:

from pykomodo.multi_dirs_chunker import ParallelChunker

# Split into 5 equal chunks
chunker = ParallelChunker(
    equal_chunks=5,
    output_dir="chunks"
)
chunker.process_directory("path/to/code")
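Conceptually, equal-parts chunking with whole files resembles greedy load balancing: assign each file, largest first, to the least-full chunk. A toy sketch of the idea (not pykomodo's actual algorithm):

```python
# Toy sketch of "equal chunks, files kept whole": greedily assign each file
# (largest first) to the chunk with the smallest running character count.
# Illustrative of the strategy, not pykomodo's actual algorithm.
def balance_files(files, n):
    chunks = [[] for _ in range(n)]
    sizes = [0] * n
    for name, content in sorted(files.items(), key=lambda kv: -len(kv[1])):
        i = sizes.index(min(sizes))  # least-loaded chunk so far
        chunks[i].append(name)
        sizes[i] += len(content)
    return chunks

demo = {"a.py": "x" * 100, "b.py": "x" * 60, "c.py": "x" * 50}
print(balance_files(demo, 2))  # [['a.py'], ['b.py', 'c.py']]
```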

Advanced configuration:

chunker = ParallelChunker(
    equal_chunks=5,  # or max_chunk_size=1000
    
    user_ignore=["*.log", "node_modules/**"],
    user_unignore=["important.log"],
    binary_extensions=["exe", "dll", "so", "bin"],
    
    priority_rules=[
        ("*.py", 10),
        ("*.md", 5),
        ("*.txt", 1)
    ],
    
    output_dir="chunks",
    num_threads=4
)

chunker.process_directories(["src/", "docs/", "tests/"])

Basic configuration with file_type:

import os
from pykomodo.multi_dirs_chunker import ParallelChunker

os.makedirs("/Users/test/komodo/pdf", exist_ok=True)
output_dir = "/Users/test/komodo/pdf"

chunker = ParallelChunker(
    max_chunk_size=1000,
    output_dir=output_dir,
    file_type="pdf" 
)

chunker.process_directory("/Users/test/komodo/")

print("PDF processing completed successfully!")

Front-end Usage

To run the front end for chunking, just use komodo --dashboard

Advanced LLM Features

Metadata Extraction

Each chunk automatically extracts and includes:

  • Function definitions
  • Class declarations
  • Import statements
  • Docstrings
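Python's ast module is enough to sketch this kind of extraction. The helper below is illustrative; pykomodo's internals may differ:

```python
import ast

# Minimal sketch of the metadata extraction described above, using Python's
# ast module. Illustrative only; pykomodo's internals may differ.
def extract_metadata(source: str) -> dict:
    tree = ast.parse(source)
    return {
        "functions": [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)],
        "classes": [n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)],
        "imports": [a.name for n in ast.walk(tree) if isinstance(n, ast.Import) for a in n.names],
        "docstring": ast.get_docstring(tree),
    }

code = '"""Module doc."""\nimport os\n\nclass A:\n    pass\n\ndef f():\n    pass\n'
print(extract_metadata(code))
```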

Relevance Scoring

Chunks are scored based on:

  • Code/comment ratio
  • Function/class density
  • Documentation quality
  • Import significance
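A toy score combining two of these signals (definition density and code/comment ratio) might look like the following. The formula is purely illustrative, not pykomodo's actual scoring function:

```python
# Toy relevance score combining definition density and code/comment ratio.
# Purely illustrative; not pykomodo's actual formula.
def relevance_score(chunk: str) -> float:
    lines = [l for l in chunk.splitlines() if l.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for l in lines if l.strip().startswith("#"))
    defs = sum(1 for l in lines if l.strip().startswith(("def ", "class ", "import ")))
    score = 0.5 * (defs / len(lines)) + 0.5 * (1 - comments / len(lines))
    return min(1.0, score)

print(relevance_score("def f():\n    return 1\n"))  # 0.75
```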

Redundancy Removal

Automatically removes duplicate content across chunks while preserving unique context.
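One common strategy for this is hashing paragraphs and keeping only the first occurrence across chunks. A sketch of that strategy (illustrative; not necessarily how pykomodo implements it):

```python
import hashlib

# Sketch of cross-chunk redundancy removal: drop repeated paragraphs by
# content hash, keeping the first occurrence. Illustrative strategy only.
def remove_redundancy(chunks):
    seen = set()
    out = []
    for chunk in chunks:
        kept = []
        for para in chunk.split("\n\n"):
            h = hashlib.sha256(para.strip().encode()).hexdigest()
            if h not in seen:
                seen.add(h)
                kept.append(para)
        out.append("\n\n".join(kept))
    return out

print(remove_redundancy(["a\n\nshared", "shared\n\nb"]))  # ['a\n\nshared', 'b']
```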

Example with LLM optimizations:

chunker = ParallelChunker(
    equal_chunks=5,
    extract_metadata=True,
    remove_redundancy=True,
    context_window=4096,
    min_relevance_score=0.3
)

File Type Restriction

The file_type parameter of the ParallelChunker constructor lets you restrict which file extensions you process.

import os
from pykomodo.multi_dirs_chunker import ParallelChunker

os.makedirs("/path/to/dir", exist_ok=True)
output_dir = "/path/to/dir"

chunker = ParallelChunker(
    max_chunk_size=1000,
    output_dir=output_dir,
    file_type="pdf" 
)

chunker.process_directory("/path/to/dir")

print("PDF processing completed successfully!")

Typed Classes & Pydantic-Based Configuration

Komodo’s main classes (ParallelChunker, EnhancedParallelChunker, etc.) now include type hints. Nothing changes at runtime, but if you’re using an IDE or a type checker like mypy, you’ll get improved error checking and auto-completion.

You can also use Pydantic to configure Komodo with strongly typed settings. For instance:

from pydantic import BaseModel, Field
from typing import List, Optional
from pykomodo.multi_dirs_chunker import ParallelChunker
from pykomodo.enhanced_chunker import EnhancedParallelChunker

class KomodoConfig(BaseModel):
    directories: List[str] = Field(default_factory=lambda: ["."], description="Directories to process.")
    equal_chunks: Optional[int] = None
    max_chunk_size: Optional[int] = None
    output_dir: str = "chunks"
    semantic_chunking: bool = False
    enhanced: bool = False
    context_window: int = 4096
    min_relevance_score: float = 0.3
    remove_redundancy: bool = True
    extract_metadata: bool = True

def run_chunker_with_config(config: KomodoConfig):
    ChunkerClass = EnhancedParallelChunker if config.enhanced else ParallelChunker

    kwargs = dict(
        equal_chunks=config.equal_chunks,
        max_chunk_size=config.max_chunk_size,
        output_dir=config.output_dir,
        semantic_chunking=config.semantic_chunking,
    )
    if config.enhanced:
        # Enhanced-only options; the base chunker may not accept them.
        kwargs.update(
            context_window=config.context_window,
            min_relevance_score=config.min_relevance_score,
            remove_redundancy=config.remove_redundancy,
            extract_metadata=config.extract_metadata,
        )
    chunker = ChunkerClass(**kwargs)

    chunker.process_directories(config.directories)
    chunker.close()

if __name__ == "__main__":
    # example use with typed + validated config
    cfg = KomodoConfig(directories=["src/", "docs/"], equal_chunks=5, enhanced=True)
    run_chunker_with_config(cfg)

Common Use Cases

1. Preparing Context for LLMs

Split a large codebase into equal chunks suitable for LLM context windows:

chunker = ParallelChunker(
    equal_chunks=5,
    priority_rules=[
        ("*.py", 10),    
        ("README*", 8), 
    ],
    user_ignore=["tests/**", "**/__pycache__/**"],
    output_dir="llm_chunks"
)
chunker.process_directory("my_project")

Built-in Ignore Patterns

The chunker automatically ignores common non-text and build-related files:

  • **/.git/**
  • **/.idea/**
  • __pycache__
  • *.pyc
  • *.pyo
  • **/node_modules/**
  • target
  • venv

Common Gotchas

  1. Leading Slash for Absolute Paths
  • If you omit the leading / in a pattern like /Users/oha/..., Komodo treats it as relative and won’t match your actual absolute path.
  2. /** vs. /*
  • folder/** matches all files and subfolders under folder.
  • folder/* only matches the immediate contents of folder, not deeper subdirectories.
  3. Multiple --ignore Flags
  • Repeated --ignore flags accumulate; a later flag adds a pattern rather than overwriting earlier ones.
  4. Folder Name vs. Actual Path
  • If your path is really src/komodo/content/results but you only wrote results/**, you may need a double-star pattern (**/results/**) to cover deeper paths.
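The /** vs. /* distinction can be seen with pathlib's glob-style, right-anchored matching (shown here for illustration; komodo's own pattern matcher may differ in details):

```python
from pathlib import PurePath

# Illustrates the "/** vs /*" gotcha with pathlib's glob-style matching.
# Note: komodo's own pattern matcher may behave differently in edge cases.
print(PurePath("docs/a.txt").match("docs/*"))      # True: immediate child
print(PurePath("docs/sub/b.txt").match("docs/*"))  # False: deeper file
```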

Acknowledgments

This project was inspired by repomix, a repository content chunking tool.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

Apache 2.0
