A zero-framework, low-latency semantic chunker with visual HTML debugging.

AeroChunk: A Zero-Framework Semantic Chunker for RAG

1. Executive Summary

AeroChunk is a low-latency, zero-framework semantic text chunker engineered for Retrieval-Augmented Generation (RAG) pipelines. The efficacy of RAG systems is critically dependent on the quality of text segmentation. Prevailing methodologies often employ naive, fixed-size chunking (e.g., recursive character splitting), which can lead to significant semantic loss by arbitrarily severing conceptual units. Conversely, solutions that do preserve semantic integrity are frequently encumbered by heavy frameworks (e.g., LangChain, LlamaIndex) or reliant on costly, high-latency cloud APIs.

AeroChunk addresses this dichotomy by providing a high-fidelity semantic chunking mechanism that operates entirely on local machine resources. It relies only on sentence-transformers and numpy, which avoids API costs and framework bloat.

2. Key Features

  • High-Fidelity Semantic Chunking: Preserves conceptual integrity by segmenting text based on semantic relatedness.
  • Zero-Framework & Local Execution: Operates without requiring frameworks like LangChain or LlamaIndex and runs entirely on local resources, ensuring low latency and zero API costs.
  • Low Memory Footprint: Engineered for efficiency; in the 15,000-word stress test reported below, peak RAM was 6.62 MB.
  • Visual HTML Debugger: An industry-first tool that generates an HTML report for a transparent, interpretable view of the semantic boundaries identified during the chunking process.

3. Empirical Benchmarks

Standard Document Analysis

AeroChunk was first benchmarked against two standard LangChain text splitters on a repetitive text block to evaluate baseline performance and chunking quality.

Method                | Execution Time (s) | Chunks Generated | Outcome
AeroChunk             | 7.51               | 72               | Optimal
LangChain (Recursive) | 0.00               | 51               | Semantic Loss
LangChain (Semantic)  | 7.53               | 1                | Failed to Split
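
The README does not say how the timings were collected. As a minimal sketch of one way to gather figures of this kind, the helper below times a single call of any chunking function and counts the chunks it returns. The names `time_chunker` and `fixed_size_split` are hypothetical and exist only for this illustration; the fixed-size splitter is a stand-in, not AeroChunk or LangChain.

```python
import time
from typing import Callable, List, Tuple


def time_chunker(chunk_fn: Callable[[str], List[str]], text: str) -> Tuple[float, int]:
    """Return (elapsed seconds, number of chunks) for one call of chunk_fn."""
    start = time.perf_counter()
    chunks = chunk_fn(text)
    elapsed = time.perf_counter() - start
    return elapsed, len(chunks)


def fixed_size_split(text: str, size: int = 40) -> List[str]:
    """Stand-in splitter (assumption): naive fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# A repetitive text block, similar in spirit to the benchmark input.
sample = "AeroChunk splits text at semantic boundaries. " * 20
elapsed, n_chunks = time_chunker(fixed_size_split, sample)
print(f"{elapsed:.4f} s, {n_chunks} chunks")
```

Any callable that maps a string to a list of chunks can be passed in place of `fixed_size_split`, for example `AeroChunker().chunk_text`.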

High-Load Stress Test

To evaluate performance under load, AeroChunk was benchmarked against LangChain's SemanticChunker on a 15,000-word document. In this run AeroChunk used approximately 69% less peak memory (6.62 MB vs. 21.56 MB) and finished in about 40% less time (11.18 s vs. 18.73 s), while LangChain's SemanticChunker split the document into only 4 chunks.

Method               | Peak RAM (MB) | Execution Time (s) | Chunks Generated | Outcome
AeroChunk (Batch 32) | 6.62          | 11.18              | 1500             | Optimal
LangChain (Semantic) | 21.56         | 18.73              | 4                | Failed to Split
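
The percentage figures quoted for the stress test follow directly from the table. The short sketch below reproduces that arithmetic. It also shows one possible way to capture peak memory with the standard-library `tracemalloc` module. This is an assumption: the README does not state which tool produced its RAM figures, and `tracemalloc` only sees allocations made through Python's allocator. The helper names are hypothetical.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def percent_reduction(baseline: float, candidate: float) -> float:
    """Relative saving of candidate versus baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


def peak_memory_mb(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


# Values taken from the stress-test table above.
ram_saving = percent_reduction(21.56, 6.62)
time_saving = percent_reduction(18.73, 11.18)
print(f"peak RAM: {ram_saving:.1f}% lower, time: {time_saving:.1f}% lower")
# prints "peak RAM: 69.3% lower, time: 40.3% lower"
```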

4. Architectural Methodology

The AeroChunk pipeline is a four-stage process designed for computational efficiency and semantic accuracy.

  1. Regex Tokenization: The input text is first segmented into individual sentences using a regular expression that identifies sentence-terminating punctuation (., !, ?).
  2. Vectorization: Each sentence is then converted into a 384-dimensional vector embedding using a local sentence-transformers model (all-MiniLM-L6-v2 by default).
  3. Cosine Similarity Analysis: The system computes the cosine similarity between the embeddings of each pair of adjacent sentences. This score quantifies how semantically related the two sentences are.
  4. Threshold Bounding: A semantic boundary is declared wherever the cosine similarity between two consecutive sentences drops below a predefined threshold (default: 0.5). Sentences are aggregated into a chunk until such a drop is detected, at which point a new chunk begins.
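
Stages 3 and 4 can be written compactly. The notation here is introduced only for this illustration and does not come from the AeroChunk source: e_i is the embedding of sentence i, s_i is the similarity between sentences i and i+1, and τ is the threshold.

```latex
s_i = \frac{\mathbf{e}_i \cdot \mathbf{e}_{i+1}}
           {\lVert \mathbf{e}_i \rVert \, \lVert \mathbf{e}_{i+1} \rVert},
\qquad
\text{boundary after sentence } i \iff s_i < \tau,
\qquad
\tau = 0.5 \ \text{(default)}
```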

This methodology ensures that chunks are formed from contiguous, semantically related sentences, thereby preserving the conceptual integrity of the source text.
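
The four stages can be sketched in a few lines of numpy. This is an illustrative reimplementation under stated assumptions, not AeroChunk's actual source: the regular expression, the function names, and the keyword-counting `toy_embed` function are all hypothetical. The embedder is passed in as a parameter so the example runs without downloading a model.

```python
import re
from typing import Callable, List, Sequence

import numpy as np

# Stage 1 (assumed regex): split after ., ! or ? followed by whitespace.
# Abbreviations such as "e.g." would be split too; this is a known limitation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def adjacent_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Stage 3: cosine similarity between each sentence and the next one."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return np.sum(unit[:-1] * unit[1:], axis=1)


def semantic_chunks(
    text: str,
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
    threshold: float = 0.5,
) -> List[str]:
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return sentences
    # Stage 2: one embedding row per sentence.
    sims = adjacent_cosine(np.asarray(embed(sentences), dtype=float))
    # Stage 4: start a new chunk wherever similarity drops below the threshold.
    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


def toy_embed(sentences: List[str]) -> List[List[float]]:
    """Stand-in embedder (assumption): counts keywords per topic."""
    topics = [("cat", "dog"), ("stock", "market")]
    return [[float(sum(w in s.lower() for w in t)) for t in topics] for s in sentences]


text = ("The cat sat on the mat. The dog chased the cat! "
        "Stock prices fell today. The market recovered later.")
chunks = semantic_chunks(text, toy_embed)
print(chunks)
```

With the toy embedder, the two animal sentences end up in one chunk and the two finance sentences in another. To try a real model, the `encode` method of a `SentenceTransformer("all-MiniLM-L6-v2")` instance can be passed as `embed`, since it accepts a list of sentences and returns one vector per sentence.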

5. Installation

Install the package using pip:

pip install aerochunk

6. Usage

Basic Chunking

Instantiate the AeroChunker and pass text to the chunk_text method.

from aerochunk import AeroChunker

# Your text document
text = """Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms. These algorithms can learn from data and generalize to unseen data. Recently, artificial neural networks have been able to surpass many previous approaches in performance."""

# Initialize the chunker
chunker = AeroChunker()

# Process the text
chunks = chunker.chunk_text(text)

# Print the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Visual Debugging

After chunking, call the export_debug_html() method to generate a visual report of the chunking decisions. This is useful for fine-tuning the similarity threshold.

# (Continuing from the previous example)

# Generate the HTML report
# This file will show where semantic boundaries were drawn and why.
debug_file = chunker.export_debug_html(output_file="aero_debug_report.html")

print(f"Visual debug report saved to: {debug_file}")
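
When tuning the threshold, it can help to see how the number of chunks changes as the threshold moves. This README does not show which AeroChunker parameter controls the threshold, so the sketch below works on a plain list of adjacent-sentence similarity scores instead. The function name `chunk_count` is hypothetical and the scores are made-up illustrative values, not output from the library.

```python
from typing import Sequence

import numpy as np


def chunk_count(similarities: Sequence[float], threshold: float) -> int:
    """Chunks produced when every score below the threshold starts a new chunk."""
    sims = np.asarray(similarities, dtype=float)
    return int(np.sum(sims < threshold)) + 1


# Illustrative similarity scores between six consecutive sentences.
sims = [0.82, 0.31, 0.64, 0.12, 0.55]

for t in (0.2, 0.5, 0.7):
    print(f"threshold={t}: {chunk_count(sims, t)} chunks")
# threshold=0.2: 2 chunks
# threshold=0.5: 3 chunks
# threshold=0.7: 5 chunks
```

A higher threshold declares more boundaries and yields smaller chunks; a lower one merges more sentences together.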

7. License

This project is licensed under the MIT License.
