Production-grade document chunking library for RAG systems - Rust-powered Python library

Krira Augment ⚡🦀

The High-Performance Rust Chunking Engine for RAG Pipelines

Krira Augment is a production-grade Python library backed by a highly optimized Rust core. It is designed to replace slow, memory-intensive preprocessing steps in large-scale Retrieval Augmented Generation (RAG) systems.

It processes gigabytes of raw unstructured data (CSV, PDF, DOCX, JSON, URLs, etc.) into clean chunks in seconds, using zero-copy memory mapping and segment-based parallel CPU execution.


🚀 Performance Benchmarks

Benchmarks run on a standard 8-core machine (M2 Air equivalent).

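| Dataset Size | Legacy (LangChain/Pandas) | Krira V2 (Rust Core) | Speedup |
|--------------|---------------------------|----------------------|---------|
| 100 MB       | ~45 sec                   | ~0.8 sec             | 56x 🚀  |
| 1 GB         | ~8.0 min                  | ~12.0 sec            | 40x 🚀  |
| 5.28 GB      | Crash / OOM               | ~58.0 sec            | Stable  |
| 10 GB+       | N/A                       | ~2.1 min             | Scalable |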
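
The speedup column follows from the timings in the table. The sketch below reproduces that arithmetic and derives an approximate throughput. It assumes 1 GB means 1024 MB, which the benchmark does not state, and the function names are illustrative only.

```python
# Timings copied from the benchmark table, converted to seconds.
# Assumption: 1 GB is treated as 1024 MB; the source does not say which unit it uses.
benchmarks = {
    "100 MB": {"size_mb": 100, "legacy_s": 45.0, "krira_s": 0.8},
    "1 GB": {"size_mb": 1024, "legacy_s": 8.0 * 60, "krira_s": 12.0},
}


def speedup(legacy_s: float, krira_s: float) -> float:
    """Ratio of legacy wall-clock time to Krira wall-clock time."""
    return legacy_s / krira_s


def throughput_mb_s(size_mb: float, seconds: float) -> float:
    """Average megabytes processed per second."""
    return size_mb / seconds


for label, row in benchmarks.items():
    ratio = speedup(row["legacy_s"], row["krira_s"])
    rate = throughput_mb_s(row["size_mb"], row["krira_s"])
    print(f"{label}: {ratio:.1f}x faster, about {rate:.0f} MB/s")
```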

Note: Krira uses a segment-based parallel strategy. It divides large files into 32 MB segments to keep all CPU cores busy while maintaining a strict, low memory footprint.
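
To illustrate the idea of segment-based slicing, here is a minimal Python sketch that computes byte ranges of roughly 32 MB over a memory-mapped file. Aligning each boundary to the next newline is an assumption made here so that no record is cut in half; the actual Rust core may choose boundaries differently, and `segment_ranges` is a hypothetical name.

```python
import mmap
import os

SEGMENT_SIZE = 32 * 1024 * 1024  # 32 MB, the segment size quoted in the note


def segment_ranges(path, segment_size=SEGMENT_SIZE):
    """Return (start, end) byte ranges that each end just after a newline."""
    size = os.path.getsize(path)
    if size == 0:
        return []  # mmap cannot map an empty file
    ranges = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 0
        while start < size:
            end = min(start + segment_size, size)
            if end < size:
                # Extend to the first newline at or after the nominal boundary
                # so that a line is never split across two segments.
                newline = mm.find(b"\n", end - 1)
                end = size if newline == -1 else newline + 1
            ranges.append((start, end))
            start = end
    return ranges
```

Each range can then be handed to a separate worker, which reads only its own slice of the mapped file.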


📦 Installation

# Basic installation
pip install krira-augment

# Install with optional multi-format support
pip install "krira-augment[all]"

Requirements: Python 3.8+


🛠️ Usage

1. Quick Start

The output_path argument of the process method is optional. If it is omitted, Krira automatically generates an output filename based on the input filename.

from krira_augment import Pipeline

# Initialize the pipeline
pipeline = Pipeline()

# Process any file (CSV, JSONL, TXT, XML, etc.)
# Logic: If no output_path is provided, results go to 'my_data_processed.jsonl'
stats = pipeline.process(input_path="my_data.csv")

print("✅ Processing complete!")
print(f"Output saved to: {stats.output_file}")
print(f"Throughput: {stats.mb_per_second:.2f} MB/s")
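
The default naming rule described above ('my_data.csv' becomes 'my_data_processed.jsonl') can be sketched as follows. `default_output_path` is a hypothetical helper written for illustration, not part of the library; how Krira names outputs for URLs or nested paths is not documented here.

```python
from pathlib import Path


def default_output_path(input_path: str) -> str:
    """Hypothetical helper: 'my_data.csv' -> 'my_data_processed.jsonl'."""
    source = Path(input_path)
    # Keep the directory, drop the original extension, append the suffix.
    return str(source.with_name(f"{source.stem}_processed.jsonl"))


print(default_output_path("my_data.csv"))
```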

2. Multi-Format Support

Krira Augment handles the heavy lifting of extracting text from complex formats and passing it to the high-speed Rust core.

pipeline = Pipeline()

# Process a Website URL
pipeline.process("https://example.com/docs")

# Process a PDF Document
pipeline.process("internal_report.pdf")

# Process an Excel Spreadsheet or DOCX
pipeline.process("user_feedback.xlsx")
pipeline.process("contract.docx")
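
Since a single process call accepts both URLs and local files, some routing step has to decide which extractor to use. The sketch below shows one plausible way to classify an input. `classify_source` and its return labels are assumptions for illustration; the library's real dispatch logic is not documented in this README.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Hypothetical helper: label an input as 'url' or by its file extension."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Fall back to the lowercase file extension without the leading dot.
    suffix = Path(source).suffix.lower().lstrip(".")
    return suffix or "unknown"


for item in ["https://example.com/docs", "internal_report.pdf", "contract.docx"]:
    print(item, "->", classify_source(item))
```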

3. Advanced Configuration (Professional)

For production RAG, you need fine-grained control over chunking strategies and data cleaning.

from krira_augment import Pipeline, PipelineConfig, SplitStrategy

# Define a robust configuration
config = PipelineConfig(
    chunk_size=512,               # Target characters per chunk
    strategy=SplitStrategy.SMART, # Respects sentence/paragraph boundaries
    clean_html=True,              # Remove <div>, <br>, etc.
    clean_unicode=True,           # Normalize whitespace and emojis
)

pipeline = Pipeline(config=config)

# Execute
result = pipeline.process("large_corpus.csv", output_path="custom_output.jsonl")

print(f"Chunks Created: {result.chunks_created}")  # -1 when the chunk count is unknown (streaming mode)
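
To give a feel for what a boundary-respecting strategy does, here is a minimal pure-Python sketch that packs whole sentences into chunks of at most chunk_size characters. It is not the SplitStrategy.SMART implementation, whose exact rules are not described here; the regex-based sentence split and the `smart_split` name are assumptions.

```python
import re
from typing import List


def smart_split(text: str, chunk_size: int = 512) -> List[str]:
    """Greedy sentence packing: fill each chunk up to chunk_size characters."""
    # Split after '.', '!' or '?' followed by whitespace (a rough heuristic).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            # Keep packing; a single over-long sentence is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(smart_split("One. Two. Three.", chunk_size=9))
```

A fixed-width splitter would cut the text every chunk_size characters, even in the middle of a sentence. Packing whole sentences keeps each chunk self-contained, at the cost of uneven chunk lengths.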

🏗️ Architecture

Krira differs from standard Python loaders by offloading the entire ETL process to a compiled Rust core, which brings Rust's memory-safety guarantees to the pipeline.

  1. Memory Mapping (mmap): Files are mapped directly from disk. No loading massive files into Python RAM.
  2. Segmented Parallelism: The file is sliced into 32MB segments processed via the Rayon work-stealing scheduler.
  3. Bounded Backpressure: A 1024-item bounded MPSC channel manages data flow from processing threads to the disk writer, preventing runaway memory growth even if processing speed exceeds disk I/O.
  4. Serde Serialization: Chunks are serialized to JSONL directly on Rust threads, bypassing the Python GIL.
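
The bounded-backpressure step in the list above can be illustrated in Python with a fixed-capacity queue. This is an analogy only: the Rust core uses a bounded MPSC channel, while this sketch uses `queue.Queue` and threads, and the `run_pipeline` name and the upper-casing "work" are placeholders.

```python
import queue
import threading


def run_pipeline(segments, capacity=1024):
    """Workers push results into a bounded queue; one writer drains it."""
    channel = queue.Queue(maxsize=capacity)  # put() blocks while the queue is full
    written = []
    done = object()  # sentinel: one per worker, signals that the worker finished

    def worker(segment):
        for item in segment:
            channel.put(item.upper())  # stand-in for real chunk processing
        channel.put(done)

    def writer(n_workers):
        finished = 0
        while finished < n_workers:
            item = channel.get()
            if item is done:
                finished += 1
            else:
                written.append(item)  # stand-in for a disk write

    workers = [threading.Thread(target=worker, args=(seg,)) for seg in segments]
    sink = threading.Thread(target=writer, args=(len(workers),))
    sink.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    sink.join()
    return written


print(sorted(run_pipeline([["a", "b"], ["c"]], capacity=1)))
```

Because put() blocks once the queue holds capacity items, fast producers are forced to wait for the writer, so memory use stays bounded no matter how far processing outpaces disk I/O.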

🤝 Integration Example

import json

def stream_chunks(jsonl_path):
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

# Usage
for chunk in stream_chunks("my_data_processed.jsonl"):
    # Send to Vector DB or OpenAI Embedding API
    pass
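
Embedding APIs and vector databases usually accept several records per request, so it can help to group the streamed chunks into batches. The sketch below is one way to do that. The `batched` helper and the "text" field name are assumptions, since the JSONL schema is not documented here.

```python
import json
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without loading everything at once."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# In-memory stand-in for lines read from the output JSONL file.
lines = ['{"text": "a"}', '{"text": "b"}', '{"text": "c"}']
records = (json.loads(line) for line in lines)

for batch in batched(records, 2):
    texts = [record["text"] for record in batch]  # "text" is an assumed field name
    print(texts)  # replace with a call to your embedding client
```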

🧑‍💻 Development

  1. Clone the repo
  2. Install Maturin
    pip install maturin
    
  3. Build and Install locally
    python -m build
    pip install dist/*.whl --force-reinstall
    

License

MIT License. (c) 2024 Krira Labs.

