A zero-framework, low-latency semantic chunker with visual HTML debugging.
AeroChunk: Stateless Semantic Chunker for Production RAG
1. Overview
AeroChunk is a high-performance, stateless semantic text chunker engineered for production-grade Retrieval-Augmented Generation (RAG) systems. It provides a thread-safe way to segment text into conceptually coherent blocks, with a minimal memory footprint and no external API calls: sentence splitting and embedding both run locally.
The library has been re-architected from a prototype to a production-ready tool, replacing fragile regex-based sentence splitting with a sophisticated NLP-powered pipeline. It is designed for high-concurrency environments (e.g., FastAPI, Django) where statelessness and computational efficiency are critical.
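To make the fragility of regex-based splitting concrete, here is a minimal sketch using only the standard library. The pattern below is an assumption chosen for illustration (the regex used by the earlier prototype is not shown here), and naive_split is a hypothetical helper, not part of AeroChunk.

```python
import re
from typing import List

# Deliberately naive rule (an assumption for illustration): a sentence ends
# at '.', '!' or '?' followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> List[str]:
    """Split text wherever terminal punctuation is followed by whitespace."""
    return [part for part in NAIVE_BOUNDARY.split(text.strip()) if part]


sample = (
    "For example, Mr. Smith noted a 45.5% increase. "
    "These algorithms generalize to unseen data."
)
pieces = naive_split(sample)
for piece in pieces:
    print(repr(piece))
# The abbreviation "Mr." is treated as a sentence end, so the first
# sentence is cut in two:
# 'For example, Mr.'
# 'Smith noted a 45.5% increase.'
# 'These algorithms generalize to unseen data.'
```

The decimal in "45.5%" survives only because no whitespace follows its period; a slightly different pattern would break there as well, which is the kind of case-by-case patching an NLP sentence splitter avoids.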
2. Quick Start
Install the library with its "enterprise" extra, which pulls in spaCy, then download the en_core_web_sm model it requires.
pip install "aerochunk[enterprise]"
python -m spacy download en_core_web_sm
The chunk_text method now returns a ChunkingResult object, providing access to the chunks and other metadata in a stateless manner.
from aerochunk import AeroChunker
text = """Machine learning (ML) is a field of study in artificial intelligence. It is concerned with the development and study of statistical algorithms that can learn from data. For example, Mr. Smith noted a 45.5% increase. These algorithms generalize to unseen data. Recently, artificial neural networks have surpassed many previous approaches in performance."""
# Initialize the chunker for stateless, production use
chunker = AeroChunker(enterprise=True)
# Process the text; returns a ChunkingResult object
result = chunker.chunk_text(text)
# Access the chunks
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i+1}: {chunk}\n")
# The result object also contains the similarity scores and sentence boundaries
# print(result.sentences)
# print(result.similarities)
3. Core Architecture (v0.2.1+ "Enterprise" Overhaul)
The AeroChunk pipeline is a multi-stage process designed for semantic accuracy and computational efficiency.
- NLP Sentence Boundary Detection: The input text is first segmented into sentences using spaCy's sentencizer. This provides robust, context-aware tokenization that correctly handles complex cases like abbreviations ("Mr. Smith"), decimals ("45.5%"), and nested punctuation, eliminating the fragility of regex-based splitting.
- Vectorization: Each sentence is converted into a 384-dimensional vector embedding using a local sentence-transformers model (all-MiniLM-L6-v2 by default).
- Windowed Semantic Similarity: To determine chunk boundaries, a new sentence is not just compared to its immediate predecessor. Instead, its embedding is compared against the rolling mean embedding of the current chunk's sentences. This method respects the progressive narrative flow and ensures that new sentences are evaluated against the broader semantic context of the chunk.
- Chunking with Contextual Overlap: A semantic boundary is declared when the similarity score drops below a threshold. To maintain contextual continuity (e.g., for pronoun resolution in RAG), the overlap_sentences parameter carries over the last N sentences of a completed chunk to the beginning of the next one.
- Structural Guardrails: The chunking process is constrained by min_sentences and max_sentences parameters. These act as a floor and ceiling, preventing the formation of semantically fragmented micro-chunks or excessively long chunks that could overflow an LLM's context window.
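The boundary logic described in the stages above can be sketched in plain Python. This is an illustrative sketch, not AeroChunk's implementation: chunk_indices, cosine, and mean_vector are hypothetical names, cosine similarity is an assumed metric (the text only says "similarity score"), the way the guardrails interact with the threshold is a guess, and toy 2-dimensional vectors stand in for the 384-dimensional embeddings. Only the parameter names min_sentences, max_sentences, and overlap_sentences come from the description.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def chunk_indices(
    embeddings: Sequence[Vector],
    threshold: float = 0.5,
    min_sentences: int = 1,
    max_sentences: int = 10,
    overlap_sentences: int = 0,
) -> List[List[int]]:
    """Group sentence indices into chunks using a rolling-mean comparison."""
    if overlap_sentences >= max_sentences:
        raise ValueError("overlap_sentences must be smaller than max_sentences")
    chunks: List[List[int]] = []
    current: List[int] = []
    for i, embedding in enumerate(embeddings):
        if current:
            # Compare the new sentence with the mean of the whole current chunk,
            # not just with the previous sentence.
            centroid = mean_vector([embeddings[j] for j in current])
            similar = cosine(embedding, centroid) >= threshold
            too_short = len(current) < min_sentences   # floor: keep growing
            too_long = len(current) >= max_sentences   # ceiling: force a split
            if too_long or (not similar and not too_short):
                chunks.append(current)
                # Carry the last N sentences over into the next chunk.
                current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(i)
    if current:
        chunks.append(current)
    return chunks


# Three sentences about one topic, then two about another (toy embeddings).
toy = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
with_overlap = chunk_indices(toy, threshold=0.5, overlap_sentences=1)
print(with_overlap)  # [[0, 1, 2], [2, 3, 4]]
```

With overlap_sentences=1, sentence 2 closes the first chunk and also opens the second, which is what lets a pronoun at the start of the second chunk keep its antecedent. In this sketch the carried-over sentences count toward both guardrails and toward the rolling mean; a real implementation could reasonably choose otherwise.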
4. Performance Benchmarks
AeroChunk's stateless architecture and optimized processing deliver significant performance advantages, especially under high-load conditions. The following benchmark was conducted on a 15,000+ word document.
| Method | Peak RAM (MB) | Execution Time (s) | Chunks Generated | Outcome |
|---|---|---|---|---|
| AeroChunk (Batch 32) | 6.62 | 11.18 | 1500 | Optimal |
| LangChain (Semantic) | 21.56 | 18.73 | 4 | Failed to Split |
Key Takeaway: AeroChunk achieves a ~69% reduction in peak RAM usage (6.62 MB versus 21.56 MB) and finishes in roughly 40% less time (11.18 s versus 18.73 s), demonstrating its suitability for memory-constrained and low-latency applications.
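The headline RAM figure can be reproduced from the table, and the same calculation puts a number on the speed difference. The helper name percent_reduction is hypothetical; the inputs are the values from the benchmark table.

```python
def percent_reduction(baseline: float, candidate: float) -> float:
    """Relative saving of candidate versus baseline, as a percentage."""
    return (baseline - candidate) / baseline * 100.0


# Values taken from the benchmark table (LangChain row is the baseline).
ram_saving = percent_reduction(baseline=21.56, candidate=6.62)
time_saving = percent_reduction(baseline=18.73, candidate=11.18)

print(f"Peak RAM reduction: {ram_saving:.1f}%")         # 69.3%
print(f"Execution time reduction: {time_saving:.1f}%")  # 40.3%
```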
5. Visual Debugger
AeroChunk includes a tool for visualizing the chunking process. After processing text, call the export_debug_html() method to generate an HTML report. The report shows where each semantic boundary was placed, which makes it easier to tune the similarity threshold and other parameters.
# (Continuing from the Quick Start example)
# Generate the HTML report
debug_file = chunker.export_debug_html(
    result=result,
    output_file="aero_debug_report.html"
)
print(f"Visual debug report saved to: {debug_file}")
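For readers who want a feel for what such a report conveys, the sketch below renders sentences and similarity scores as a bare HTML page with a rule at each boundary. It is not the output of export_debug_html(): render_debug_html is a hypothetical function, the markup is invented for illustration, and the alignment of scores to sentences (one score per sentence after the first) is an assumption.

```python
import html
from typing import List, Sequence


def render_debug_html(
    sentences: Sequence[str],
    similarities: Sequence[float],
    threshold: float,
) -> str:
    """Render sentences as paragraphs, with an <hr> where similarity drops.

    Assumes similarities[i] scores sentences[i + 1] against the text before it.
    """
    if sentences and len(similarities) != len(sentences) - 1:
        raise ValueError("expected one similarity score per sentence after the first")
    rows: List[str] = []
    if sentences:
        rows.append(f"<p>{html.escape(sentences[0])}</p>")
    for sentence, score in zip(sentences[1:], similarities):
        if score < threshold:
            # Mark the boundary and record why it was placed here.
            rows.append(
                f'<hr title="boundary: similarity {score:.2f} '
                f'below threshold {threshold:.2f}">'
            )
        rows.append(f'<p title="similarity {score:.2f}">{html.escape(sentence)}</p>')
    return "<!DOCTYPE html>\n<html><body>\n" + "\n".join(rows) + "\n</body></html>"


demo_sentences = [
    "ML is a field of AI.",
    "It studies algorithms that learn from data.",
    "Prices rose <5% last year.",
]
demo_report = render_debug_html(demo_sentences, [0.82, 0.21], threshold=0.5)
print(demo_report)
```

Seeing each score next to the boundary it did or did not trigger is what makes threshold tuning practical: raising the threshold above 0.82 in this toy input would add a second rule.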
6. License
This project is licensed under the MIT License.