Skip to main content

A zero-framework, low-latency semantic chunker with visual HTML debugging.

Project description

PyPI Version Python Version License

AeroChunk: Stateless Semantic Chunker for Production RAG

1. Overview

AeroChunk is a high-performance, stateless semantic text chunker engineered for production-grade Retrieval-Augmented Generation (RAG) systems. It provides a robust, thread-safe solution for segmenting text into conceptually coherent blocks, operating with a minimal memory footprint and zero external dependencies or API calls.

The library has been re-architected from a prototype to a production-ready tool, replacing fragile regex-based sentence splitting with a sophisticated NLP-powered pipeline. It is designed for high-concurrency environments (e.g., FastAPI, Django) where statelessness and computational efficiency are critical.

2. Quick Start

Install the library and its "enterprise" dependencies, which include spacy and the required NLP model.

pip install "aerochunk[enterprise]"
python -m spacy download en_core_web_sm

The chunk_text method now returns a ChunkingResult object, providing access to the chunks and other metadata in a stateless manner.

from aerochunk import AeroChunker

text = """Machine learning (ML) is a field of study in artificial intelligence. It is concerned with the development and study of statistical algorithms that can learn from data. For example, Mr. Smith noted a 45.5% increase. These algorithms generalize to unseen data. Recently, artificial neural networks have surpassed many previous approaches in performance."""

# Initialize the chunker for stateless, production use
chunker = AeroChunker(enterprise=True)

# Process the text; returns a ChunkingResult object
result = chunker.chunk_text(text)

# Access the chunks
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i+1}: {chunk}\n")

# The result object also contains the similarity scores and sentence boundaries
# print(result.sentences)
# print(result.similarities)

3. Core Architecture (v0.2.1+ "Enterprise" Overhaul)

The AeroChunk pipeline is a multi-stage process designed for semantic accuracy and computational efficiency.

  1. NLP Sentence Boundary Detection: The input text is first segmented into sentences using spaCy's sentencizer. This provides robust, context-aware tokenization that correctly handles complex cases like abbreviations ("Mr. Smith"), decimals ("45.5%"), and nested punctuation, eliminating the fragility of regex-based splitting.
  2. Vectorization: Each sentence is converted into a 384-dimensional vector embedding using a local sentence-transformers model (all-MiniLM-L6-v2 by default).
  3. Windowed Semantic Similarity: To determine chunk boundaries, a new sentence is not just compared to its immediate predecessor. Instead, its embedding is compared against the rolling mean embedding of the current chunk's sentences. This method respects the progressive narrative flow and ensures that new sentences are evaluated against the broader semantic context of the chunk.
  4. Chunking with Contextual Overlap: A semantic boundary is declared when the similarity score drops below a threshold. To maintain contextual continuity (e.g., for pronoun resolution in RAG), the overlap_sentences parameter carries over the last N sentences of a completed chunk to the beginning of the next one.
  5. Structural Guardrails: The chunking process is constrained by min_sentences and max_sentences parameters. These act as a floor and ceiling, preventing the formation of semantically fragmented micro-chunks or excessively long chunks that could overflow an LLM's context window.

4. Performance Benchmarks

AeroChunk's stateless architecture and optimized processing deliver significant performance advantages, especially under high-load conditions. The following benchmark was conducted on a 15,000+ word document.

Method Peak RAM (MB) Execution Time (s) Chunks Generated Outcome
AeroChunk (Batch 32) 6.62 11.18 1500 Optimal
LangChain (Semantic) 21.56 18.73 4 Failed to Split

Key Takeaway: AeroChunk achieves a ~69% reduction in peak RAM usage and executes significantly faster, demonstrating its suitability for memory-constrained and low-latency applications.

5. Visual Debugger

AeroChunk includes an industry-first tool for visualizing the chunking process. After processing text, call the export_debug_html() method to generate an HTML report. This file provides a transparent, interpretable view of the semantic boundaries identified, making it invaluable for fine-tuning the similarity threshold and other parameters.

# (Continuing from the Quick Start example)

# Generate the HTML report
debug_file = chunker.export_debug_html(
    result=result, 
    output_file="aero_debug_report.html"
)

print(f"Visual debug report saved to: {debug_file}")

6. License

This project is licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aerochunk-0.2.0.tar.gz (6.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aerochunk-0.2.0-py3-none-any.whl (5.8 kB view details)

Uploaded Python 3

File details

Details for the file aerochunk-0.2.0.tar.gz.

File metadata

  • Download URL: aerochunk-0.2.0.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for aerochunk-0.2.0.tar.gz
Algorithm Hash digest
SHA256 71d66b885e0b9bcb3ba24b53ff1960d5fbe9c40aba1fd6a750ba55ae57213eff
MD5 3a403c3a136c376a037ca6fc2853f022
BLAKE2b-256 b566e3da1a11ced1c58653f3333732e4a6fef92248432ae083733ec95a469aed

See more details on using hashes here.

File details

Details for the file aerochunk-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: aerochunk-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 5.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for aerochunk-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 428d1833a838df058e667cdfd16f3b691290429f729a9afa70b237ed104cf2a6
MD5 9079ec74485043acc6815049fe15e53f
BLAKE2b-256 2fb347af01641d7802b542a9c7b6ff5c592747e11ded7b169bbf811dfa009420

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page