Production-grade document chunking library for RAG systems - Rust-powered Python library
Krira Augment ⚡🦀
The High-Performance Rust Chunking Engine for RAG Pipelines
Krira Augment is a production-grade Python library backed by a highly optimized Rust core. It is designed to replace slow, memory-intensive preprocessing steps in large-scale Retrieval Augmented Generation (RAG) systems.
It turns gigabytes of raw unstructured data (CSV, JSONL, TXT) into clean chunks in seconds to minutes, depending on file size, using zero-copy memory mapping and parallel CPU execution.
🚀 Performance Benchmarks
Benchmarks were run on an 8-core machine (M2 Air equivalent).
| Dataset Size | Legacy (LangChain/Pandas) | Krira V2 (Rust Core) | Speedup |
|---|---|---|---|
| 100 MB | ~45 sec | ~0.8 sec | 56x 🚀 |
| 1 GB | ~8.0 min | ~12.0 sec | 40x 🚀 |
| 10 GB | Timeout / OOM | ~2.1 min | Stable ✅ |
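Note: Krira's memory use stays roughly constant as the input grows (O(1) in file size), because the input is memory-mapped and chunks are written to disk in batches. Processing a 100GB file should need about the same amount of RAM as a 10MB file.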
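To check figures like these on your own hardware, a small timing harness along the following lines may help. It is only a sketch: `measure_throughput` and `speedup` are hypothetical helpers, not part of Krira, and the callable you pass in is whatever processing step you want to time.

```python
import os
import time


def measure_throughput(process, input_path):
    """Time process(input_path) and return (seconds, MB per second)."""
    size_mb = os.path.getsize(input_path) / 1_000_000
    start = time.perf_counter()
    process(input_path)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return elapsed, size_mb / elapsed


def speedup(baseline_seconds, candidate_seconds):
    """Ratio used in the Speedup column: baseline time / candidate time."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Timings taken from the benchmark table.
    print(round(speedup(45, 0.8)))       # 100 MB row -> 56
    print(round(speedup(8.0 * 60, 12)))  # 1 GB row -> 40
```

Run each configuration several times and compare medians, since disk caching can make the first run look slower than later ones.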
📦 Installation
pip install krira-augment
Requirements: Python 3.8+
🛠️ Usage
1. Quick Start
For standard use cases, use the default high-throughput pipeline.
from krira_augment import Pipeline
# Initialize the pipeline
pipeline = Pipeline()
# Process a 1GB file in seconds
stats = pipeline.process(
    input_path="data/raw_knowledge_base.csv",
    output_path="data/processed_chunks.jsonl"
)
print(f"✅ Chunking job complete: {stats}")
2. Advanced Configuration (Professional)
For production RAG, you need fine-grained control over chunking strategies, overlap, and data cleaning.
from krira_augment import Pipeline, PipelineConfig, SplitStrategy
# Define a robust configuration
config = PipelineConfig(
    # Chunking Strategy
    chunk_size=512,                # Target characters per chunk
    chunk_overlap=50,              # Context overlap for better retrieval
    strategy=SplitStrategy.SMART,  # Respects sentence/paragraph boundaries
    # Data Cleaning Rules (Rust-native regex)
    clean_html=True,               # Remove <div>, <br>, etc.
    clean_unicode=True,            # Normalize whitespace and emojis
    min_chunk_len=20,              # Discard garbage/empty chunks
    # System Performance
    threads=8,                     # Force usage of 8 CPU cores
    batch_size=1000                # Write to disk every 1k chunks (Low RAM usage)
)
# Initialize with config
pipeline = Pipeline(config=config)
# Execute
result = pipeline.process(
    input_path="large_corpus.csv",
    output_path="corpus_vectors.jsonl"
)
print(f"Job ID: {result.job_id}")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
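To make the interaction between `chunk_size`, `chunk_overlap`, and `min_chunk_len` concrete, here is a pure-Python sketch of a fixed-window splitter. It is an illustration only: `sliding_chunks` is a hypothetical function, not Krira's Rust implementation, and it does not attempt the boundary-aware behaviour of `SplitStrategy.SMART`.

```python
def sliding_chunks(text, chunk_size=512, chunk_overlap=50, min_chunk_len=20):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_len:  # drop fragments that are too short
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1, min_chunk_len=2))
    # ['abcd', 'defg', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk reappear at the start of the next.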
📄 Output Format
The library outputs standard JSONL (JSON Lines), ready for direct ingestion into vector databases (Pinecone, Weaviate, Qdrant).
processed_chunks.jsonl:
{"text": "The mitochondria is the powerhouse...", "metadata": {"source": "doc1.csv", "row": 1, "chunk_index": 0}}
{"text": "It generates most of the chemical energy...", "metadata": {"source": "doc1.csv", "row": 1, "chunk_index": 1}}
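Before loading the file into a vector database, it can be worth checking that every line matches the shape shown above. The following is a minimal sketch; `validate_chunk_line` is a hypothetical helper, and the required keys are assumed from the two sample records rather than from a published schema.

```python
import json


def validate_chunk_line(line):
    """Parse one JSONL line and check it has the fields in the sample output."""
    record = json.loads(line)
    if not isinstance(record, dict) or not isinstance(record.get("text"), str):
        raise ValueError("missing or non-string 'text'")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        raise ValueError("missing 'metadata' object")
    for key in ("source", "row", "chunk_index"):  # keys assumed from the samples
        if key not in metadata:
            raise ValueError(f"metadata is missing {key!r}")
    return record


sample = (
    '{"text": "The mitochondria is the powerhouse...", '
    '"metadata": {"source": "doc1.csv", "row": 1, "chunk_index": 0}}'
)
record = validate_chunk_line(sample)
print(record["metadata"]["chunk_index"])  # prints 0
```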
🏗️ Architecture
Krira differs from standard Python loaders by offloading the entire ETL process to a compiled Rust binary.
- Memory Mapping (mmap): The file is mapped directly from disk to virtual memory. No loading 1GB CSVs into Python RAM.
- Rayon Parallelism: The file is sliced into segments and processed across all available CPU cores simultaneously.
- Serde Serialization: Chunks are serialized to JSONL directly on the Rust thread, minimizing Python GIL interaction.
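The three steps above can be imitated in plain Python to show the overall shape of the approach. This is a rough analogue under stated assumptions, not Krira's code: `segment_bounds`, `process_segment`, and `chunk_file` are hypothetical names, each input line is treated as one chunk, and Python threads remain limited by the GIL in a way a Rayon thread pool is not.

```python
import json
import mmap
import os
from concurrent.futures import ThreadPoolExecutor


def segment_bounds(data, n_segments):
    """Cut the data into byte ranges that each end just after a newline."""
    size = len(data)
    bounds, start = [], 0
    for i in range(1, n_segments + 1):
        if start >= size:
            break
        target = max(size * i // n_segments, start + 1)
        newline = data.find(b"\n", target - 1)
        end = size if newline == -1 or i == n_segments else newline + 1
        bounds.append((start, end))
        start = end
    return bounds


def process_segment(data, start, end, source):
    """Turn each non-empty line of one segment into a JSONL string."""
    out = []
    # Slicing copies the bytes in Python; a Rust core can borrow them instead.
    for raw in data[start:end].splitlines():
        text = raw.decode("utf-8", errors="replace").strip()
        if text:
            out.append(json.dumps({"text": text, "metadata": {"source": source}}))
    return out


def chunk_file(input_path, output_path, workers=4):
    """Memory-map the input, process segments in parallel, write JSONL."""
    with open(input_path, "rb") as f, open(output_path, "w", encoding="utf-8") as out:
        if os.path.getsize(input_path) == 0:
            return  # an empty file cannot be memory-mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            bounds = segment_bounds(mm, workers)
            with ThreadPoolExecutor(max_workers=workers) as pool:
                results = pool.map(
                    lambda b: process_segment(mm, b[0], b[1], input_path), bounds
                )
                for lines in results:  # map() yields results in segment order
                    for line in lines:
                        out.write(line + "\n")
```

Aligning segment ends to newlines keeps any record from being split between two workers, which is the main subtlety when slicing a file by byte offset.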
🤝 Integration Example
Stream the output JSONL through a plain Python generator to feed an embedding API and a vector database.
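import json
# import openai  # needed only once the embedding call below is enabled
def stream_chunks(jsonl_path):
    """Yield one parsed chunk per JSONL line, without loading the whole file."""
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)
# Use in your downstream application
for chunk in stream_chunks("processed_chunks.jsonl"):
    # Output records have no 'id' field, so derive one from the metadata
    meta = chunk['metadata']
    chunk_id = f"{meta['source']}:{meta['row']}:{meta['chunk_index']}"
    # Mock embedding call (adapt to your embedding client)
    # embedding = openai.Embedding.create(input=chunk['text'])
    # Upsert to Vector DB (e.g., Pinecone)
    # index.upsert(vectors=[(chunk_id, embedding, chunk['metadata'])])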
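Embedding and upsert endpoints generally accept several items per request, so grouping chunks into batches usually cuts down on round trips. The sketch below shows one way to do that. `in_batches` and `chunk_id` are hypothetical helpers, and the ID format is an assumption built from the metadata fields in the output format, which carries no `id` field of its own.

```python
from itertools import islice


def chunk_id(record):
    """Build a stable ID from the metadata fields of one output record."""
    meta = record["metadata"]
    return f"{meta['source']}:{meta['row']}:{meta['chunk_index']}"


def in_batches(records, batch_size):
    """Group any iterable of records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in records with the same shape as the JSONL output.
records = [
    {"text": f"chunk {i}", "metadata": {"source": "doc1.csv", "row": 1, "chunk_index": i}}
    for i in range(5)
]
for batch in in_batches(records, 2):
    ids = [chunk_id(r) for r in batch]
    texts = [r["text"] for r in batch]
    # Send `texts` to your embedding client here, then upsert with `ids`.
    print(ids)
```

Because `in_batches` consumes any iterable lazily, it can wrap a generator that reads the JSONL file line by line, and memory use stays bounded by the batch size.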
🧑‍💻 Development
If you want to modify the Rust core:
- Clone the repo
- Install Maturin (Rust-Python bridge builder)
pip install maturin
- Build and Install locally
maturin develop --release
License
MIT License.
File details
Details for the file krira_augment-2.0.2.tar.gz.
File metadata
- Download URL: krira_augment-2.0.2.tar.gz
- Upload date:
- Size: 41.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a57a016d890915707c70206be7540cf827cf189fb87635483c6820cef94f2a4 |
| MD5 | 1755bdf6511ae5fe52647e91d74adea7 |
| BLAKE2b-256 | 1ca892035a5a77345c1c452e71c6b2ed1198a9abde944a5040c09344be464e0d |
File details
Details for the file krira_augment-2.0.2-cp313-cp313-win_amd64.whl.
File metadata
- Download URL: krira_augment-2.0.2-cp313-cp313-win_amd64.whl
- Upload date:
- Size: 672.0 kB
- Tags: CPython 3.13, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 153ee53fda459b35fa3eca6113101b307e3ca61c5a4a7faad65ed6e8f46f4c77 |
| MD5 | 920c44f03aa11e3e1c387e99aef23971 |
| BLAKE2b-256 | fa4edf55a0288993c6c9d8aeb6e495d660b86e193fceb688e1f36567e9e3a235 |