Production-grade document chunking library for RAG systems - Rust-powered Python library
Krira Augment ⚡🦀
The High-Performance Rust Chunking Engine for RAG Pipelines
Krira Augment is a production-grade Python library backed by a highly optimized Rust core. It is designed to replace slow, memory-intensive preprocessing steps in large-scale Retrieval Augmented Generation (RAG) systems.
It processes gigabytes of raw unstructured data (CSV, PDF, DOCX, JSON, URLs, etc.) into clean chunks in seconds, using zero-copy memory mapping and segment-based parallel CPU execution.
🚀 Performance Benchmarks
Benchmarks run on a standard 8-core machine (M2 Air equivalent).
| Dataset Size | Legacy (LangChain/Pandas) | Krira V2 (Rust Core) | Speedup |
|---|---|---|---|
| 100 MB | ~45 sec | ~0.8 sec | 56x 🚀 |
| 1 GB | ~8.0 min | ~12.0 sec | 40x 🚀 |
| 5.28 GB | Crash / OOM | ~58.0 sec | Stable ✅ |
| 10 GB+ | N/A | ~2.1 min | Scalable ✅ |
Note: Krira uses a segment-based parallel strategy. It divides large files into 32MB segments so that all CPU cores stay busy while the memory footprint stays low and bounded.
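To make the segment arithmetic concrete, here is a minimal sketch of how a file could be cut into fixed-size byte ranges. It is an illustration only, not Krira's internal code: `segment_ranges` is a hypothetical helper, "32MB" is assumed to mean 32 MiB, and a real engine would presumably also align segment edges to record boundaries, which is not shown.

```python
SEGMENT_SIZE = 32 * 1024 * 1024  # assumption: "32MB" read as 32 MiB


def segment_ranges(file_size, segment_size=SEGMENT_SIZE):
    """Return (start, end) byte offsets that cover file_size in fixed-size segments."""
    return [
        (start, min(start + segment_size, file_size))
        for start in range(0, file_size, segment_size)
    ]


one_gib = 1024 ** 3
ranges = segment_ranges(one_gib)
print(len(ranges))  # 32 segments for a 1 GiB file
print(ranges[0])    # (0, 33554432)
```

Each range can be handed to a separate worker, so memory use is tied to the segment size and worker count rather than to the file size.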
📦 Installation
# Basic installation
pip install krira-augment
# Install with optional multi-format support
pip install "krira-augment[all]"
Requirements: Python 3.8+
🛠️ Usage
1. Quick Start
The output_path argument of process is optional. If it is omitted, Krira derives the output filename from the input filename.
from krira_augment import Pipeline
# Initialize the pipeline
pipeline = Pipeline()
# Process any file (CSV, JSONL, TXT, XML, etc.)
# Logic: If no output_path is provided, results go to 'my_data_processed.jsonl'
stats = pipeline.process(input_path="my_data.csv")
print("✅ Processing complete!")
print(f"Output saved to: {stats.output_file}")
print(f"Throughput: {stats.mb_per_second:.2f} MB/s")
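The default naming rule can be pictured with a few lines of pathlib. This is a sketch of the documented behaviour ('my_data.csv' becomes 'my_data_processed.jsonl'), not the library's implementation: `default_output_path` is a hypothetical helper, and placing the output next to the input file is an assumption.

```python
from pathlib import Path


def default_output_path(input_path):
    """Hypothetical helper: derive '<stem>_processed.jsonl' from the input filename."""
    p = Path(input_path)
    # Assumption: the output is written alongside the input file.
    return p.with_name(f"{p.stem}_processed.jsonl")


print(default_output_path("my_data.csv"))  # my_data_processed.jsonl
```

When the location matters, pass output_path explicitly instead of relying on the default.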
2. Multi-Format Support
Krira Augment handles the heavy lifting of extracting text from complex formats and passing it to the high-speed Rust core.
pipeline = Pipeline()
# Process a Website URL
pipeline.process("https://example.com/docs")
# Process a PDF Document
pipeline.process("internal_report.pdf")
# Process an Excel Spreadsheet or DOCX
pipeline.process("user_feedback.xlsx")
pipeline.process("contract.docx")
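Since process accepts both URLs and file paths, some form of input detection has to happen before text extraction. The sketch below shows one plausible way to classify a source string. `classify_source` is a hypothetical function written for illustration, and nothing here is confirmed to match how Krira dispatches internally.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(source):
    """Hypothetical dispatcher: return 'url' or the lowercase file extension."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    suffix = Path(source).suffix.lower()
    return suffix.lstrip(".") or "unknown"


for source in ["https://example.com/docs", "internal_report.pdf", "contract.docx"]:
    print(source, "->", classify_source(source))
```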
3. Advanced Configuration (Professional)
For production RAG, you need fine-grained control over chunking strategies and data cleaning.
from krira_augment import Pipeline, PipelineConfig, SplitStrategy
# Define a robust configuration
config = PipelineConfig(
    chunk_size=512,                # Target characters per chunk
    strategy=SplitStrategy.SMART,  # Respects sentence/paragraph boundaries
    clean_html=True,               # Remove <div>, <br>, etc.
    clean_unicode=True,            # Normalize whitespace and emojis
)
pipeline = Pipeline(config=config)
# Execute
result = pipeline.process("large_corpus.csv", output_path="custom_output.jsonl")
print(f"Chunks Created: {result.chunks_created}")  # -1 when the count is unknown (streaming)
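To give a feel for what a boundary-respecting strategy does with chunk_size, here is a small pure-Python stand-in that packs whole sentences into chunks of at most chunk_size characters. It is not the algorithm behind SplitStrategy.SMART, which runs in the Rust core and is not documented here. `split_smart` and its regex-based sentence detection are assumptions made for illustration.

```python
import re


def split_smart(text, chunk_size=512):
    """Greedy sentence packing: an illustrative stand-in for a boundary-aware splitter."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than chunk_size is hard-split.
        while len(sentence) > chunk_size:
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_smart("One. Two. Three.", chunk_size=9))  # ['One. Two.', 'Three.']
```

Chunks end at sentence boundaries whenever a sentence fits, and fall back to a hard cut only for sentences longer than the limit.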
🏗️ Architecture
Krira differs from standard Python loaders by offloading the entire ETL process to a compiled Rust binary with industrial-strength safety.
- Memory Mapping (mmap): Files are mapped directly from disk. No loading massive files into Python RAM.
- Segmented Parallelism: The file is sliced into 32MB segments processed via the Rayon work-stealing scheduler.
- Bounded Backpressure: A 1024-item bounded MPSC channel manages data flow from processing threads to the disk writer, preventing runaway memory growth even if processing speed exceeds disk I/O.
- Serde Serialization: Chunks are serialized to JSONL directly on Rust threads, bypassing the Python GIL.
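The bounded-backpressure idea in the list above can be mimicked in Python with a fixed-capacity queue between worker threads and a single writer. This is an analogy rather than the Rust implementation: Krira uses Rayon and an MPSC channel, while the sketch uses `queue.Queue` and `ThreadPoolExecutor`, is subject to the GIL, and omits error handling in the writer. `run_pipeline`, `process_segment`, and `sink` are hypothetical names.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

CHANNEL_CAPACITY = 1024  # same capacity as the bounded channel described above
_DONE = object()         # sentinel that tells the writer to stop


def run_pipeline(segments, process_segment, sink, workers=4):
    """Process segments in parallel while one writer drains a bounded queue."""
    channel = queue.Queue(maxsize=CHANNEL_CAPACITY)

    def writer():
        while True:
            item = channel.get()
            if item is _DONE:
                break
            sink(item)

    def worker(segment):
        for chunk in process_segment(segment):
            channel.put(chunk)  # blocks while the queue is full: backpressure

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(worker, segments))  # re-raises worker exceptions
    finally:
        channel.put(_DONE)
        writer_thread.join()


written = []
run_pipeline([range(0, 5), range(5, 10)], lambda seg: (n * 2 for n in seg), written.append)
print(sorted(written))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because `put` blocks once 1024 items are waiting, producers slow down to the writer's pace instead of accumulating unwritten chunks in memory.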
🤝 Integration Example
import json
def stream_chunks(jsonl_path):
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)
# Usage
for chunk in stream_chunks("my_data_processed.jsonl"):
    # Send to Vector DB or OpenAI Embedding API
    pass
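Embedding APIs and vector databases usually accept several records per request, so it is often worth grouping the streamed chunks before sending them. The generator below is a generic batching sketch, with `batched` as a hypothetical helper name. It works on any iterable, including the output of stream_chunks.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without loading everything into memory."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

In practice this would wrap the stream, for example `for batch in batched(stream_chunks("my_data_processed.jsonl"), 64): ...`, where 64 is an arbitrary batch size chosen for illustration.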
🧑‍💻 Development
- Clone the repo
- Install Maturin
pip install maturin
- Build and Install locally
python -m build
pip install dist/*.whl --force-reinstall
License
MIT License. (c) 2024 Krira Labs.