Production-grade document chunking library for RAG systems - Rust-powered Python library
Krira Augment ⚡🦀
The High-Performance Rust Chunking Engine for RAG Pipelines
Krira Augment is a production-grade Python library backed by a highly optimized Rust core. It is designed to replace slow, memory-intensive preprocessing steps in large-scale Retrieval Augmented Generation (RAG) systems.
It processes gigabytes of raw unstructured data (CSV, PDF, DOCX, JSON, URLs, etc.) into clean chunks in seconds to minutes, using zero-copy memory mapping and segment-based parallel CPU execution.
🚀 Performance Benchmarks
Benchmarks were run on an 8-core machine (M2 Air equivalent).
| Dataset Size | Legacy (LangChain/Pandas) | Krira V2 (Rust Core) | Speedup |
|---|---|---|---|
| 100 MB | ~45 sec | ~0.8 sec | 56x 🚀 |
| 1 GB | ~8.0 min | ~12.0 sec | 40x 🚀 |
| 5.28 GB | Crash / OOM | ~58.0 sec | Stable ✅ |
| 10 GB+ | N/A | ~2.1 min | Scalable ✅ |
Note: Krira uses a segment-based parallel strategy. It divides large files into 32 MB segments, which keeps all CPU cores busy while maintaining a strict, low memory footprint.
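As a rough illustration of segment-based splitting, the Python sketch below computes byte ranges of roughly 32 MB each and moves every boundary forward to the next newline so that no line is cut in half. This is not Krira's Rust implementation: the names `segment_ranges` and `segment_file` and the newline-alignment rule are assumptions made for illustration.

```python
import mmap

SEGMENT_SIZE = 32 * 1024 * 1024  # 32 MB, the segment size quoted in the note above


def segment_ranges(buf, segment_size=SEGMENT_SIZE):
    """Return (start, end) byte ranges covering buf, each ending on a newline.

    buf is any bytes-like object supporting len() and .find(), such as
    bytes or an mmap.mmap object. Hypothetical helper, for illustration only.
    """
    ranges = []
    size = len(buf)
    start = 0
    while start < size:
        end = min(start + segment_size, size)
        if end < size:
            # Extend the segment to just past the next newline so a line
            # is never split across two segments.
            newline = buf.find(b"\n", end)
            end = size if newline == -1 else newline + 1
        ranges.append((start, end))
        start = end
    return ranges


def segment_file(path, segment_size=SEGMENT_SIZE):
    """Memory-map a non-empty file read-only and compute its segment ranges."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return segment_ranges(mm, segment_size)
```

Each range could then be handed to a separate worker, since the ranges are contiguous and never overlap.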
📦 Installation
# Basic installation
pip install krira-augment
# Install with optional multi-format support
pip install "krira-augment[all]"
Requirements: Python 3.8+
🛠️ Usage
1. Quick Start
The output_path argument of process is optional. If it is omitted, Krira generates an output filename from the input filename (my_data.csv becomes my_data_processed.jsonl in the example below).
from krira_augment import Pipeline
# Initialize the pipeline
pipeline = Pipeline()
# Process any file (CSV, JSONL, TXT, XML, etc.)
stats = pipeline.process(input_path="my_data.csv")
# Print the beautiful formatted output
print(stats)
Output:
============================================================
✅ KRIRA AUGMENT - Processing Complete
============================================================
📊 Chunks Created: 1,247
⏱️ Execution Time: 0.85 seconds
🚀 Throughput: 118.24 MB/s
📁 Output File: my_data_processed.jsonl
============================================================
📝 Preview (Top 3 Chunks):
------------------------------------------------------------
[1] This is the first chunk of processed text from your file...
[2] Here is the second chunk with more content from the data...
[3] And the third chunk showing a sample of the output...
------------------------------------------------------------
You can also access individual stats:
print(f"Chunks: {stats.chunks_created}")
print(f"Time: {stats.execution_time:.2f}s")
print(f"Output: {stats.output_file}")
print(f"Preview: {stats.preview_chunks}")
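The default naming seen in the output above (`my_data.csv` becomes `my_data_processed.jsonl`) can be mimicked with a few lines of Python. This is a minimal sketch: `default_output_path` is a hypothetical helper, not part of the krira_augment API, and the library's real rule may differ, for example in where it places the file.

```python
from pathlib import Path


def default_output_path(input_path):
    """Derive '<stem>_processed.jsonl' next to the input file (assumed rule)."""
    p = Path(input_path)
    return str(p.with_name(f"{p.stem}_processed.jsonl"))
```

Computing the expected name up front can be handy when a later step needs to locate the output without inspecting `stats.output_file`.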
2. Multi-Format Support
Krira Augment handles the heavy lifting of extracting text from complex formats and passing it to the high-speed Rust core.
📄 CSV Files
Process CSV files directly. Each row is treated as a separate text unit for chunking.
from krira_augment import Pipeline
pipeline = Pipeline()
# Process CSV - output auto-generated as 'data_processed.jsonl'
stats = pipeline.process("data.csv")
print(f"Output: {stats.output_file}")
# Or specify custom output path
stats = pipeline.process("data.csv", output_path="chunked_data.jsonl")
📕 PDF Documents
Extract text from PDF files page by page. Requires: pip install pdfplumber
from krira_augment import Pipeline
pipeline = Pipeline()
# Process PDF - extracts text from all pages
stats = pipeline.process("document.pdf")
print(f"Output: {stats.output_file}")
# With custom output
stats = pipeline.process("report.pdf", output_path="report_chunks.jsonl")
📗 Excel Spreadsheets (.xlsx)
Process Excel files with automatic sheet and row handling. Requires: pip install openpyxl
from krira_augment import Pipeline
pipeline = Pipeline()
# Process Excel - each row becomes a text chunk
stats = pipeline.process("spreadsheet.xlsx")
print(f"Output: {stats.output_file}")
# With custom output
stats = pipeline.process("data.xlsx", output_path="excel_chunks.jsonl")
📘 Word Documents (.docx)
Extract paragraphs from Word documents. Requires: pip install python-docx
from krira_augment import Pipeline
pipeline = Pipeline()
# Process DOCX - each paragraph becomes a text unit
stats = pipeline.process("document.docx")
print(f"Output: {stats.output_file}")
# With custom output
stats = pipeline.process("contract.docx", output_path="contract_chunks.jsonl")
🌐 Website URLs
Fetch and process web pages. Requires: pip install requests beautifulsoup4
from krira_augment import Pipeline
pipeline = Pipeline()
# Process URL - auto-generates output filename from URL hash
stats = pipeline.process("https://example.com/docs")
print(f"Output: {stats.output_file}")
# With custom output
stats = pipeline.process("https://example.com/article", output_path="article_chunks.jsonl")
📙 XML Files
Process XML files by extracting text from each child element.
from krira_augment import Pipeline
pipeline = Pipeline()
# Process XML - each child element text becomes a chunk
stats = pipeline.process("data.xml")
print(f"Output: {stats.output_file}")
# With custom output
stats = pipeline.process("config.xml", output_path="xml_chunks.jsonl")
📋 JSON Files
Process JSON arrays or objects by flattening to JSONL.
from krira_augment import Pipeline
pipeline = Pipeline()
# Process JSON - arrays are flattened, objects are chunked
stats = pipeline.process("data.json")
print(f"Output: {stats.output_file}")
# With custom output
stats = pipeline.process("config.json", output_path="json_chunks.jsonl")
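Flattening to JSONL might look like the sketch below: a top-level array yields one line per element, and any other top-level value yields a single line. `json_to_jsonl_lines` is a hypothetical name, and the library's actual flattening rules (for nested arrays, say) are not documented here.

```python
import json


def json_to_jsonl_lines(text):
    """Yield JSONL lines for a JSON document given as a string."""
    data = json.loads(text)
    # A top-level array becomes one record per element; anything else is one record.
    records = data if isinstance(data, list) else [data]
    for record in records:
        yield json.dumps(record, ensure_ascii=False)
```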
📝 JSONL Files
Process JSONL files directly (native format for Rust core).
from krira_augment import Pipeline
pipeline = Pipeline()
# Process JSONL - direct pass-through to Rust core
stats = pipeline.process("data.jsonl")
print(f"Output: {stats.output_file}")
# With custom output
stats = pipeline.process("logs.jsonl", output_path="processed_logs.jsonl")
📃 Text Files (.txt)
Process plain text files line by line.
from krira_augment import Pipeline
pipeline = Pipeline()
# Process TXT - each line is processed
stats = pipeline.process("notes.txt")
print(f"Output: {stats.output_file}")
# With custom output
stats = pipeline.process("corpus.txt", output_path="corpus_chunks.jsonl")
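Several of the formats above rely on optional packages, so it can help to check for them before starting a long run. The sketch below maps each input type to the import names of the packages listed in the sections above and reports which ones are missing. The helper names `required_modules` and `missing_modules` are hypothetical and not part of krira_augment.

```python
import importlib.util
from pathlib import Path
from urllib.parse import urlparse

# Import names of the optional packages listed in the sections above.
# Note that python-docx is imported as "docx" and beautifulsoup4 as "bs4".
OPTIONAL_MODULES = {
    ".pdf": ["pdfplumber"],
    ".xlsx": ["openpyxl"],
    ".docx": ["docx"],
}
URL_MODULES = ["requests", "bs4"]


def required_modules(source):
    """Return the optional modules a given path or URL would need."""
    if urlparse(source).scheme in ("http", "https"):
        return list(URL_MODULES)
    return list(OPTIONAL_MODULES.get(Path(source).suffix.lower(), []))


def missing_modules(source):
    """Return the required modules that are not importable in this environment."""
    return [m for m in required_modules(source)
            if importlib.util.find_spec(m) is None]
```

Calling `missing_modules("report.pdf")` before `pipeline.process("report.pdf")` gives an early, readable list of what still needs to be installed.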
3. Advanced Configuration (Professional)
For production RAG, you need fine-grained control over chunking strategies and data cleaning.
from krira_augment import Pipeline, PipelineConfig, SplitStrategy
# Define a robust configuration
config = PipelineConfig(
    chunk_size=512, # Target characters per chunk
    strategy=SplitStrategy.SMART, # Respects sentence/paragraph boundaries
    clean_html=True, # Remove <div>, <br>, etc.
    clean_unicode=True, # Normalize whitespace and emojis
)
pipeline = Pipeline(config=config)
# Execute
result = pipeline.process("large_corpus.csv", output_path="custom_output.jsonl")
# Beautiful formatted output
print(result)
# Or access individual stats
print(f"Chunks Created: {result.chunks_created}")
print(f"Execution Time: {result.execution_time:.2f}s")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
print(f"Preview: {result.preview_chunks[:2]}") # First 2 chunks
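To make the idea of boundary-aware splitting concrete, here is a pure-Python sketch that packs whole sentences into chunks of at most `chunk_size` characters. It is not the implementation behind `SplitStrategy.SMART`. The function name `split_on_sentences` and the sentence regex are assumptions made for illustration.

```python
import re


def split_on_sentences(text, chunk_size=512):
    """Greedily pack sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is emitted as its own
    oversized chunk rather than being cut mid-sentence.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Compared with a fixed-width cut every 512 characters, this keeps each chunk readable on its own, at the cost of slightly uneven chunk lengths.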
🏗️ Architecture
Krira differs from standard Python loaders by offloading the entire ETL process to a compiled Rust binary with industrial-strength safety.
- Memory Mapping (mmap): Files are mapped directly from disk. No loading massive files into Python RAM.
- Segmented Parallelism: The file is sliced into 32MB segments processed via the Rayon work-stealing scheduler.
- Bounded Backpressure: A 1024-item bounded MPSC channel manages data flow from processing threads to the disk writer, preventing runaway memory growth even if processing speed exceeds disk I/O.
- Serde Serialization: Chunks are serialized to JSONL directly on Rust threads, bypassing the Python GIL.
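The bounded-backpressure idea can be sketched in Python with a bounded `queue.Queue`: producers block on `put` once the queue is full, so memory stays bounded when the writer is the slower side. This mirrors the description above in spirit only. The actual engine uses Rust threads and an MPSC channel, and both the function name `write_chunks_bounded` and the `"text"` key in each record are assumptions made for illustration.

```python
import json
import queue
import threading

_DONE = object()  # sentinel telling the writer that all producers have finished


def write_chunks_bounded(segments, out, capacity=1024):
    """Process segments on worker threads and write JSONL through a bounded queue.

    segments: list of lists of text chunks (stand-ins for file segments).
    out: a writable text file object. Returns the number of lines written.
    """
    q = queue.Queue(maxsize=capacity)
    written = 0

    def producer(segment):
        for text in segment:
            # put() blocks while the queue is full: this is the backpressure.
            q.put(json.dumps({"text": text}, ensure_ascii=False))

    def writer():
        nonlocal written
        while True:
            item = q.get()
            if item is _DONE:
                break
            out.write(item + "\n")
            written += 1

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    workers = [threading.Thread(target=producer, args=(s,)) for s in segments]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    q.put(_DONE)  # all producers are done, so the sentinel is the last item
    writer_thread.join()
    return written
```

A single writer thread keeps output lines intact, while the order of lines across segments depends on thread scheduling.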
🤝 Integration Example
import json

def stream_chunks(jsonl_path):
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

# Usage
for chunk in stream_chunks("my_data_processed.jsonl"):
    # Send to Vector DB or OpenAI Embedding API
    pass
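Embedding APIs and vector databases generally accept several items per request, so grouping the streamed chunks into batches is a common next step. A minimal sketch, where `batched` is a hypothetical helper and the default batch size of 100 is an arbitrary assumption:

```python
from itertools import islice


def batched(iterable, batch_size=100):
    """Yield lists of up to batch_size items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Wrapping the generator from the example above, as in `batched(stream_chunks("my_data_processed.jsonl"))`, keeps only one batch in memory at a time.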
🧑‍💻 Development
- Clone the repo
- Install Maturin
pip install maturin
- Build and Install locally
python -m build
pip install dist/*.whl --force-reinstall
License
MIT License. (c) 2024 Krira Labs.