Production-grade document chunking library for RAG systems - Rust-powered Python library

Krira Augment ⚡🦀

The High-Performance Rust Chunking Engine for RAG Pipelines

Krira Augment is a production-grade Python library backed by a highly optimized Rust core. It is designed to replace slow, memory-intensive preprocessing steps in large-scale Retrieval Augmented Generation (RAG) systems.

It processes gigabytes of raw unstructured data (CSV, JSONL, TXT) into clean chunks in seconds, using zero-copy memory mapping and parallel CPU execution.


🚀 Performance Benchmarks

Benchmarks run on a standard 8-core machine (M2 Air equivalent).

Dataset Size   Legacy (LangChain/Pandas)   Krira V2 (Rust Core)   Speedup
100 MB         ~45 sec                     ~0.8 sec               56x 🚀
1 GB           ~8.0 min                    ~12.0 sec              40x 🚀
10 GB          Timeout / OOM               ~2.1 min               Stable

Note: Krira's memory use is O(1) in the input size. Processing a 100 GB file uses the same amount of RAM as a 10 MB file.
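
The speedup column can be re-derived from the timings in the table. The sketch below does that arithmetic and also computes average throughput. It assumes 1 GB = 1000 MB, and the helper names `speedup` and `throughput_mb_s` are illustrative, not part of Krira.

```python
# Timings copied from the benchmark table, converted to seconds.
# Assumption: 1 GB is treated as 1000 MB.
benchmarks = [
    # (dataset size in MB, legacy seconds, Krira seconds)
    (100, 45.0, 0.8),
    (1000, 8.0 * 60, 12.0),
]

def speedup(legacy_s, krira_s):
    """Ratio of legacy runtime to Krira runtime."""
    return legacy_s / krira_s

def throughput_mb_s(size_mb, seconds):
    """Average throughput in MB per second."""
    return size_mb / seconds

for size_mb, legacy_s, krira_s in benchmarks:
    print(
        f"{size_mb} MB: {speedup(legacy_s, krira_s):.0f}x speedup, "
        f"{throughput_mb_s(size_mb, krira_s):.0f} MB/s"
    )
```

The 10 GB row has no speedup figure because the legacy run did not finish.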


📦 Installation

pip install krira-augment

Requirements: Python 3.8+
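
To confirm the install, one option is to check the interpreter version and look up the installed distribution with the standard library. This is a generic sketch, not a Krira feature, and `installed_version` is a hypothetical helper name.

```python
import sys
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# The README states Python 3.8+ as the requirement.
python_ok = sys.version_info >= (3, 8)
print("Python OK:", python_ok)
print("krira-augment version:", installed_version("krira-augment"))
```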


🛠️ Usage

1. Quick Start

For standard use cases, use the default high-throughput pipeline.

from krira_augment import Pipeline

# Initialize the pipeline
pipeline = Pipeline()

# Process a 1GB file in seconds
stats = pipeline.process(
    input_path="data/raw_knowledge_base.csv",
    output_path="data/processed_chunks.jsonl"
)

print("✅ Chunking job complete.")

2. Advanced Configuration (Professional)

For production RAG, you need fine-grained control over chunking strategies, overlap, and data cleaning.

from krira_augment import Pipeline, PipelineConfig, SplitStrategy

# Define a robust configuration
config = PipelineConfig(
    # Chunking Strategy
    chunk_size=512,             # Target characters per chunk
    chunk_overlap=50,           # Context overlap for better retrieval
    strategy=SplitStrategy.SMART, # Respects sentence/paragraph boundaries
    
    # Data Cleaning Rules (Rust-native regex)
    clean_html=True,            # Remove <div>, <br>, etc.
    clean_unicode=True,         # Normalize whitespace and emojis
    min_chunk_len=20,           # Discard garbage/empty chunks
    
    # System Performance
    threads=8,                  # Use 8 worker threads
    batch_size=1000             # Write to disk every 1k chunks (Low RAM usage)
)

# Initialize with config
pipeline = Pipeline(config=config)

# Execute
result = pipeline.process(
    input_path="large_corpus.csv", 
    output_path="corpus_vectors.jsonl"
)

print(f"Job ID: {result.job_id}")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
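
To make `chunk_size`, `chunk_overlap`, and `min_chunk_len` concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is an illustration only: `chunk_text` is a hypothetical helper, not Krira's API, and it does not attempt the sentence and paragraph awareness of `SplitStrategy.SMART`.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=50, min_chunk_len=20):
    """Split text into chunk_size-character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_len:  # drop fragments that are too short
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghij" * 10  # 100 characters
chunks = chunk_text(sample, chunk_size=40, chunk_overlap=10, min_chunk_len=5)
print([len(c) for c in chunks])  # [40, 40, 40]
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 10 characters of one chunk reappear as the first 10 of the next.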

📄 Output Format

The library outputs standard JSONL (JSON Lines), ready for direct ingestion into vector databases (Pinecone, Weaviate, Qdrant).

processed_chunks.jsonl:

{"text": "The mitochondria is the powerhouse...", "metadata": {"source": "doc1.csv", "row": 1, "chunk_index": 0}}
{"text": "It generates most of the chemical energy...", "metadata": {"source": "doc1.csv", "row": 1, "chunk_index": 1}}
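
Before loading records into a vector database, it can help to check that each line has the shape shown above. The sketch below assumes only the fields visible in the sample (`text`, plus `metadata` with `source`, `row`, and `chunk_index`). `parse_chunk_line` is a hypothetical helper, not part of the library.

```python
import json

def parse_chunk_line(line):
    """Parse one JSONL line and check it has the fields shown in the sample."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    if not isinstance(record.get("text"), str):
        raise ValueError("missing or non-string 'text'")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        raise ValueError("missing 'metadata' object")
    for key in ("source", "row", "chunk_index"):
        if key not in metadata:
            raise ValueError(f"metadata is missing '{key}'")
    return record

line = (
    '{"text": "The mitochondria is the powerhouse...", '
    '"metadata": {"source": "doc1.csv", "row": 1, "chunk_index": 0}}'
)
record = parse_chunk_line(line)
print(record["metadata"]["source"], record["metadata"]["chunk_index"])
```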

🏗️ Architecture

Krira differs from standard Python loaders by offloading the entire ETL process to a compiled Rust binary.

  1. Memory Mapping (mmap): The file is mapped directly from disk to virtual memory, so a 1 GB CSV is never loaded into Python RAM.
  2. Rayon Parallelism: The file is sliced into segments and processed across all available CPU cores simultaneously.
  3. Serde Serialization: Chunks are serialized to JSONL directly on the Rust thread, minimizing Python GIL interaction.
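
The three steps above can be mirrored in pure Python to show the data flow. This is not Krira's Rust code. `segment_bounds`, `process_segment`, and `chunk_file` are hypothetical names, a thread pool stands in for Rayon, and `json.dumps` stands in for Serde. Slicing the map copies bytes, so the sketch is not zero-copy, and Python threads share the GIL, so it shows structure rather than speed.

```python
import json
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def segment_bounds(buf, n_segments):
    """Cut a byte buffer into ranges that end on newline boundaries."""
    size = len(buf)
    bounds, start = [], 0
    for i in range(1, n_segments + 1):
        if start >= size:
            break
        end = size if i == n_segments else max(start, size * i // n_segments)
        if end < size:
            nl = buf.find(b"\n", end)  # extend to the end of the current line
            end = size if nl == -1 else nl + 1
        bounds.append((start, end))
        start = end
    return bounds

def process_segment(buf, start, end):
    """Turn each non-empty line of one segment into a JSON record string."""
    lines = buf[start:end].decode("utf-8").splitlines()
    return [json.dumps({"text": ln.strip()}) for ln in lines if ln.strip()]

def chunk_file(path, out_path, n_segments=4):
    """Map the input file, process segments in parallel, write JSONL in order."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        if os.path.getsize(path) == 0:
            return count  # an empty file cannot be memory-mapped
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            bounds = segment_bounds(mm, n_segments)
            with ThreadPoolExecutor(max_workers=n_segments) as pool:
                # pool.map returns results in segment order
                for records in pool.map(lambda b: process_segment(mm, *b), bounds):
                    for rec in records:
                        out.write(rec + "\n")
                        count += 1
    return count

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "chunks.jsonl")
    with open(src, "w", encoding="utf-8") as f:
        f.write("alpha\nbeta\n\ngamma\ndelta\n")
    written = chunk_file(src, dst, n_segments=2)
    with open(dst, encoding="utf-8") as f:
        texts = [json.loads(line)["text"] for line in f]

print(written, texts)
```

Aligning segment ends to newlines keeps any single line from being split between two workers.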

🤝 Integration Example

Stream the output JSONL through a plain Python generator to feed an embedding API.

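import json
# import openai  # only needed once the embedding call below is enabled

def stream_chunks(jsonl_path):
    """Yield one chunk dict per non-empty JSONL line."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Use in your downstream application
for chunk in stream_chunks("processed_chunks.jsonl"):
    # Output records carry no "id" field, so build one from the metadata
    meta = chunk["metadata"]
    chunk_id = f"{meta['source']}:{meta['row']}:{meta['chunk_index']}"

    # Mock embedding call (pre-1.0 openai API; it also needs a model argument)
    # embedding = openai.Embedding.create(input=chunk['text'])

    # Upsert to Vector DB (e.g., Pinecone)
    # index.upsert(vectors=[(chunk_id, embedding, chunk['metadata'])])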
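
Embedding APIs usually accept several inputs per request, so it is common to group the stream into batches before calling them. The sketch below shows one way to do that with the standard library. `batched` is a hypothetical helper, and the batch size of 2 is only for the demonstration.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for the records a JSONL chunk stream would yield
records = [{"text": f"chunk {i}"} for i in range(5)]
batch_sizes = [len(batch) for batch in batched(records, 2)]
print(batch_sizes)  # [2, 2, 1]
```

Because `batched` consumes its input lazily, it keeps at most one batch in memory at a time.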

🧑‍💻 Development

If you want to modify the Rust core:

  1. Clone the repo
  2. Install Maturin (Rust-Python bridge builder)
    pip install maturin
    
  3. Build and Install locally
    maturin develop --release
    

License

MIT License.
