A zero-framework, low-latency semantic chunker with visual HTML debugging.
AeroChunk: A Zero-Framework Semantic Chunker for RAG
1. Executive Summary
AeroChunk is a low-latency, zero-framework semantic text chunker engineered for Retrieval-Augmented Generation (RAG) pipelines. The efficacy of RAG systems is critically dependent on the quality of text segmentation. Prevailing methodologies often employ naive, fixed-size chunking (e.g., recursive character splitting), which can lead to significant semantic loss by arbitrarily severing conceptual units. Conversely, solutions that do preserve semantic integrity are frequently encumbered by heavy frameworks (e.g., LangChain, LlamaIndex) or reliant on costly, high-latency cloud APIs.
AeroChunk addresses this trade-off by providing a high-fidelity semantic chunking mechanism that operates entirely on local machine resources. It relies only on sentence-transformers and numpy, which avoids remote API calls, API costs, and framework bloat. It also includes the Visual HTML Debugger, which the project describes as an industry-first tool: it generates an HTML report that provides a transparent, interpretable view of the semantic boundaries identified during the chunking process.
2. Empirical Benchmarks
AeroChunk was benchmarked against two standard LangChain text splitters on a repetitive text block to evaluate performance and chunking quality. The all-MiniLM-L6-v2 model was used for all semantic comparisons to ensure a fair evaluation.
| Method | Execution Time (s) | Chunks Generated | Outcome |
|---|---|---|---|
| AeroChunk | 7.51 | 72 | Optimal |
| LangChain (Recursive) | 0.00 | 51 | Semantic Loss |
| LangChain (Semantic) | 7.53 | 1 | Failed to Split |
Analysis:
- AeroChunk produced semantically coherent chunks at a competitive execution time.
- LangChain's RecursiveCharacterTextSplitter was fast but failed to preserve semantic boundaries, resulting in fragmented and contextually poor chunks.
- LangChain's SemanticChunker failed to identify any valid split points in the document, returning the entire text as a single chunk.
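The benchmark script itself is not included here. As a minimal sketch of how timings and chunk counts like those in the table might be collected, the harness below times any splitter passed in as a callable. The names `time_splitter` and `fixed_size_split` are hypothetical and are not part of AeroChunk or LangChain; the fixed-size splitter is only a stand-in baseline. A very fast splitter can legitimately report 0.00 s when results are rounded to two decimals.

```python
import time
from typing import Callable, List, Tuple


def time_splitter(
    split: Callable[[str], List[str]], text: str, repeats: int = 3
) -> Tuple[float, int]:
    """Return (best wall-clock seconds, chunk count) over `repeats` runs."""
    best = float("inf")
    chunks: List[str] = []
    for _ in range(repeats):
        start = time.perf_counter()
        chunks = split(text)
        best = min(best, time.perf_counter() - start)
    return best, len(chunks)


def fixed_size_split(text: str, size: int = 40) -> List[str]:
    """Hypothetical naive baseline: cut every `size` characters,
    ignoring sentence boundaries entirely."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# A small repetitive sample, standing in for the benchmark document.
sample = "Cats purr when content. Dogs wag their tails. " * 5
seconds, count = time_splitter(fixed_size_split, sample)
print(f"fixed_size_split: {seconds:.6f} s, {count} chunks")
```

Any other splitter with the same `str -> list[str]` shape, for example a bound `chunk_text` method, could be passed to `time_splitter` in the same way.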
3. Architectural Methodology
The AeroChunk pipeline is a four-stage process designed for computational efficiency and semantic accuracy.
- Regex Tokenization: The input text is first segmented into individual sentences using a regular expression that identifies sentence-terminating punctuation (., !, ?).
- Vectorization: Each sentence is then converted into a 384-dimensional vector embedding using a local sentence-transformers model (all-MiniLM-L6-v2 by default).
- Cosine Similarity Analysis: The system computes the cosine similarity between the vector embeddings of adjacent sentences. This score quantifies the semantic relatedness between them.
- Threshold Bounding: A semantic boundary is declared wherever the cosine similarity between two consecutive sentences drops below a predefined threshold (default: 0.5). Sentences are aggregated into a chunk until such a drop is detected, at which point a new chunk begins.
This methodology ensures that chunks are formed from contiguous, semantically related sentences, thereby preserving the conceptual integrity of the source text.
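The four stages above can be sketched roughly as follows. This is an illustration under stated assumptions, not AeroChunk's actual source: the function names, the exact regular expression, and the `embed` parameter are all hypothetical. To keep the example runnable without downloading a model, `toy_embed` is a stand-in that counts a few keywords; it is not a real sentence embedding.

```python
import re
from typing import Callable, List, Sequence

import numpy as np


def split_sentences(text: str) -> List[str]:
    """Stage 1: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def adjacent_cosine(vectors: np.ndarray) -> np.ndarray:
    """Stage 3: cosine similarity between each row and the row after it."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return np.sum(unit[:-1] * unit[1:], axis=1)


def semantic_chunks(
    text: str,
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
    threshold: float = 0.5,
) -> List[str]:
    """Stages 1-4: start a new chunk wherever adjacent similarity < threshold."""
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return sentences
    vectors = np.asarray(embed(sentences), dtype=float)  # Stage 2
    sims = adjacent_cosine(vectors)
    chunks: List[str] = []
    current = [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # Stage 4: similarity drop marks a boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


def toy_embed(sentences: List[str]) -> List[List[int]]:
    """Stand-in embedder (assumption): keyword counts, not a real model."""
    vocab = ["cat", "dog", "stock", "market"]
    return [[s.lower().count(w) for w in vocab] for s in sentences]


text = (
    "The cat chased the dog. The dog barked at the cat. "
    "The stock market fell. Traders left the market early."
)
for chunk in semantic_chunks(text, toy_embed):
    print(chunk)
```

With sentence-transformers installed, `SentenceTransformer("all-MiniLM-L6-v2").encode` could be passed as `embed` in place of `toy_embed`. Note that a simple punctuation-based regex like this one also splits after abbreviations such as "e.g.", which is a known limitation of this style of tokenization.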
4. Installation
Install the package using pip:
pip install aerochunk
5. Usage
Basic Chunking
Instantiate the AeroChunker and pass text to the chunk_text method.
from aerochunk import AeroChunker
# Your text document
text = """Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms. These algorithms can learn from data and generalize to unseen data. Recently, artificial neural networks have been able to surpass many previous approaches in performance."""
# Initialize the chunker
chunker = AeroChunker()
# Process the text
chunks = chunker.chunk_text(text)
# Print the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")
Visual Debugging
After chunking, call the export_debug_html() method to generate a visual report of the chunking decisions. This is useful for fine-tuning the similarity threshold.
# (Continuing from the previous example)
# Generate the HTML report
# This file will show where semantic boundaries were drawn and why.
debug_file = chunker.export_debug_html(output_file="aero_debug_report.html")
print(f"Visual debug report saved to: {debug_file}")
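The layout of the report produced by export_debug_html() is not documented here. Purely to illustrate the idea, the sketch below renders sentences alongside the similarity score to the next sentence and marks where a score falls below the threshold. `render_debug_html` and its parameters are hypothetical and do not reflect AeroChunk's real output format.

```python
import html
from typing import List, Sequence


def render_debug_html(
    sentences: Sequence[str],
    similarities: Sequence[float],
    threshold: float = 0.5,
) -> str:
    """Hypothetical report: one line per sentence, with the similarity
    to the following sentence and a boundary marker when it is low."""
    if len(similarities) != max(len(sentences) - 1, 0):
        raise ValueError("expected one similarity per adjacent sentence pair")
    rows: List[str] = []
    for i, sentence in enumerate(sentences):
        rows.append(f"<p>{html.escape(sentence)}</p>")
        if i < len(similarities):
            score = similarities[i]
            is_boundary = score < threshold
            label = "boundary" if is_boundary else "same chunk"
            color = "#c0392b" if is_boundary else "#27ae60"
            rows.append(
                f'<div style="color:{color}">similarity {score:.2f} ({label})</div>'
            )
    body = "\n".join(rows)
    return f"<!DOCTYPE html>\n<html><body>\n{body}\n</body></html>\n"


# Made-up scores for illustration; real scores would come from the embeddings.
report = render_debug_html(
    ["A < B holds.", "So B > A.", "Lunch is at noon."],
    [0.82, 0.12],
)
print(report)
```

Seeing the scores next to the text in this way makes it easier to judge whether a threshold such as 0.5 is splitting too eagerly or too rarely for a given corpus.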
6. License
This project is licensed under the MIT License.
Download files
File details
Details for the file aerochunk-0.1.1.tar.gz.
File metadata
- Download URL: aerochunk-0.1.1.tar.gz
- Upload date:
- Size: 4.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0793eb8f6e2aa655724acb69a160ba4aeb65ff2f57a87bcd9f3a555c8e5b3499 |
| MD5 | 5af04f44521fa3ba4e9ea2fd200e62cb |
| BLAKE2b-256 | 00ae5a5b42e6b8316dc8ee8201066668cfbea19f322c3498de2bb491ca564ce3 |
File details
Details for the file aerochunk-0.1.1-py3-none-any.whl.
File metadata
- Download URL: aerochunk-0.1.1-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d965e5f66eafb5b4b6e40982f9ae6eea0d17d4aa7be6066ca03239be3728c2b |
| MD5 | 2c253e4e0db4ddf2877754d12989740d |
| BLAKE2b-256 | ca3a9cc0c19e3927c52cbf23b0a79c613e1fd757038b86e17530e4ee6fb14a3c |