
ChunkFlow

Production-grade async text chunking framework for RAG systems


ChunkFlow is an extensible framework for text chunking in Retrieval-Augmented Generation (RAG) systems. Built with production-grade practices, it provides multiple chunking strategies, pluggable embedding providers, and a rich set of evaluation metrics to help you make data-driven decisions.

Why ChunkFlow?

RAG systems process documents at massive scale, and chunking quality directly impacts retrieval accuracy, computational cost, and user experience. Poor chunking causes hallucinations, missed context, and wasted API calls.

ChunkFlow addresses this with:

  • 6+ chunking strategies - From simple fixed-size to revolutionary late chunking
  • Pluggable architecture - Easy integration with any embedding provider
  • Comprehensive evaluation - 12+ metrics including RAGAS-inspired, NDCG, semantic coherence
  • Data-driven comparison - Built-in strategy comparison and ranking framework
  • Production-ready - Async-first, type-safe, structured logging, extensible design

Key Features

Chunking Strategies

  • Fixed-Size - Simple character/token-based splitting (10K+ chunks/sec)
  • Recursive - Hierarchical splitting with natural boundaries (recommended default)
  • Document-Based - Format-aware (Markdown, HTML)
  • Semantic - Embedding-based topic detection with similarity thresholds
  • Late Chunking - Revolutionary context-preserving approach (6-9% accuracy improvement, Jina AI 2024)
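
To make the trade-offs concrete, here is a minimal sketch of the simplest strategy above - fixed-size splitting with overlap. This is an illustration of the general idea only, not ChunkFlow's actual implementation (which also supports token-based sizing):

```python
# Minimal sketch of fixed-size chunking with character overlap.
# Illustrative only - not ChunkFlow's implementation.
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Slide a chunk_size window over text, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk repeats the last `overlap` characters of its predecessor, which helps retrieval recover context that straddles a boundary.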

Embedding Providers

  • OpenAI - text-embedding-3-small/large with automatic cost tracking
  • HuggingFace - Sentence Transformers (local, free, GPU/CPU support)
  • Extensible - Easy to add custom providers via EmbeddingProvider base class
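
As a sketch of what a custom provider might look like: the async `embed_texts` method mirrors the usage examples below, but the exact `EmbeddingProvider` interface (constructor arguments, result wrapper) is an assumption - check the base class before adapting this:

```python
import asyncio
import hashlib

# Hypothetical custom provider. The async embed_texts signature mirrors the
# usage examples in this README; the real EmbeddingProvider base class may
# require additional methods or a result wrapper object.
class ToyHashEmbedder:
    """Deterministic 8-dim pseudo-embeddings from SHA-256 digests (for offline tests)."""

    async def embed_texts(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([b / 255.0 for b in digest[:8]])
        return vectors

# vectors = asyncio.run(ToyHashEmbedder().embed_texts(["hello", "world"]))
```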

Evaluation Metrics

  • Retrieval (4 metrics): NDCG@k, Recall@k, Precision@k, MRR
  • Semantic (4 metrics): Coherence, Boundary Quality, Chunk Stickiness (MoC), Topic Diversity
  • RAG Quality (4 metrics): Context Relevance, Answer Faithfulness, Context Precision, Context Recall (RAGAS-inspired)
  • Framework: Unified EvaluationPipeline + StrategyComparator for comprehensive analysis
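
For reference, NDCG@k - the first retrieval metric above - follows the standard DCG/IDCG formula. The sketch below is the textbook definition; ChunkFlow's implementation may differ in details such as tie handling:

```python
import math

def ndcg_at_k(ranked_relevance: list[float], k: int) -> float:
    """Normalized Discounted Cumulative Gain: DCG of the ranking divided by
    the DCG of the ideal (relevance-sorted) ordering, both truncated at k."""
    def dcg(rels: list[float]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0
```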

Quick Start

Installation

# Basic installation
pip install chunk-flow

# With specific providers
pip install chunk-flow[openai]
pip install chunk-flow[huggingface]

# With API server
pip install chunk-flow[api]

# Everything
pip install chunk-flow[all]

Basic Usage

from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline

# 1. Chunk your document (async API - run inside an async function or via asyncio.run)
chunker = StrategyRegistry.create("recursive", {"chunk_size": 512, "overlap": 100})
result = await chunker.chunk(document)

# 2. Embed chunks
embedder = EmbeddingProviderFactory.create("openai", {"model": "text-embedding-3-small"})
emb_result = await embedder.embed_texts(result.chunks)

# 3. Evaluate quality (semantic metrics - no ground truth needed)
pipeline = EvaluationPipeline(metrics=["semantic_coherence", "boundary_quality", "chunk_stickiness"])
metrics = await pipeline.evaluate(
    chunks=result.chunks,
    embeddings=emb_result.embeddings,
)

print(f"Semantic Coherence: {metrics['semantic_coherence'].score:.4f}")
print(f"Boundary Quality: {metrics['boundary_quality'].score:.4f}")

Strategy Comparison

Compare multiple strategies to find the best for your use case:

from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline, StrategyComparator

# Create strategies to compare
strategies = [
    StrategyRegistry.create("fixed_size", {"chunk_size": 500, "overlap": 50}),
    StrategyRegistry.create("recursive", {"chunk_size": 400, "overlap": 80}),
    StrategyRegistry.create("semantic", {"threshold_percentile": 80}),
]

# Get embedder
embedder = EmbeddingProviderFactory.create("huggingface")

# Set up evaluation pipeline
pipeline = EvaluationPipeline(
    metrics=["ndcg_at_k", "semantic_coherence", "chunk_stickiness"],
)

# Compare strategies
comparison = await pipeline.compare_strategies(
    strategies=strategies,
    text=document,
    ground_truth={"query_embedding": query_emb, "relevant_indices": [0, 2, 5]},
)

# Generate comparison report
report = StrategyComparator.generate_comparison_report(
    {name: data["metric_results"]
     for name, data in comparison["strategies"].items()}
)
print(report)

# See examples/strategy_comparison.py for complete working example

API Server

# Start FastAPI server
uvicorn chunk_flow.api.app:app --reload

# Use the API
curl -X POST "http://localhost:8000/chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your document here...",
    "strategy": "recursive",
    "strategy_config": {"chunk_size": 512}
  }'

Architecture

ChunkFlow follows a clean, extensible architecture:

┌─────────────────────────────────────────────────────────────┐
│                     API Layer (FastAPI)                     │
│  /chunk, /evaluate, /compare, /benchmark, /export           │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                    Orchestration Layer                      │
│  ChunkingPipeline | EvaluationEngine | ResultsAggregator    │
└─────────────────────────────────────────────────────────────┘
                              ↓
        ┌─────────────────────┴─────────────────────┐
        ↓                                           ↓
┌──────────────────┐                    ┌──────────────────────┐
│ Chunking Module  │                    │  Embedding Module    │
│ ---------------- │                    │ -------------------- │
│ • Fixed-Size     │                    │ • OpenAI             │
│ • Recursive      │                    │ • HuggingFace        │
│ • Semantic       │                    │ • Google Vertex      │
│ • Late           │                    │ • Cohere             │
└──────────────────┘                    └──────────────────────┘

Research-Backed

ChunkFlow implements cutting-edge research findings:

  • Late Chunking (Jina AI, 2024): 6-9% improvement in retrieval accuracy
  • Optimal Chunk Sizes (Bhat et al., 2025): 64-128 tokens for facts, 512-1024 for context
  • Semantic Independence (HOPE, 2025): 56% gain in factual correctness
  • MoC Metrics (Zhao et al., 2025): Boundary clarity and chunk stickiness
  • RAGAS (ExplodingGradients, 2023): Reference-free RAG evaluation
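
The core idea behind late chunking is to run the embedding model over the whole document first, so every token vector carries document-wide context, and only then pool token vectors into per-chunk embeddings. The pooling step can be sketched as follows (toy vectors stand in for a real long-context model):

```python
def late_pool(token_embeddings: list[list[float]],
              spans: list[tuple[int, int]]) -> list[list[float]]:
    """Mean-pool already-contextualized token vectors over each (start, end) chunk span."""
    chunks = []
    for start, end in spans:
        window = token_embeddings[start:end]
        dim = len(window[0])
        chunks.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return chunks
```

Because each token vector was computed with the full document in view, a chunk such as "the city" can still embed close to the entity it refers to elsewhere in the document.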

See rag-summery-framework.md for a comprehensive research review.

Documentation

Development

# Clone repository
git clone https://github.com/chunkflow/chunk-flow.git
cd chunk-flow

# Install with dev dependencies
make install-dev

# Run tests
make test

# Format and lint
make format
make lint

# Run full CI locally
make ci

Contributing

ChunkFlow is currently a solo project, and contributions are not being accepted at this time.

See CONTRIBUTING.md for more details.

Roadmap

Phase 1-4: Core Framework ✅ COMPLETED

  • Core chunking strategies (Fixed, Recursive, Document-based)
  • Embedding providers (OpenAI, HuggingFace)
  • Semantic chunking
  • Late chunking implementation
  • Comprehensive evaluation metrics (12 metrics across 3 categories)
  • Evaluation pipeline and comparison framework

Phase 5-6: Analysis & API ✅ COMPLETED

  • ResultsDataFrame with analysis methods
  • Visualization utilities (heatmaps, comparison charts)
  • FastAPI server with all endpoints
  • Docker setup (multi-stage, production-ready)

Phase 7-9: Testing & Release ✅ COMPLETED

  • Comprehensive testing (unit, integration, E2E)
  • Benchmark suite with standard datasets
  • CI/CD pipeline (GitHub Actions)
  • Complete documentation
  • PyPI package release workflow
  • Production deployment guides

v0.1.0 READY FOR RELEASE! 🚀

Future Roadmap (v0.2.0+)

  • Additional providers (Google Vertex, Cohere, Voyage AI)
  • LLM-based chunking (GPT/Claude)
  • Streamlit dashboard
  • Redis caching and PostgreSQL storage
  • Agentic chunking with dynamic boundaries
  • Fine-tuning pipeline for custom strategies
  • Public benchmark datasets (BeIR, MS MARCO)

License

MIT License - see LICENSE file for details.

Citation

If you use ChunkFlow in your research, please cite:

@software{chunkflow2024,
  title = {ChunkFlow: Production-Grade Text Chunking Framework for RAG Systems},
  author = {ChunkFlow Development},
  year = {2024},
  url = {https://github.com/chunkflow/chunk-flow}
}

Acknowledgments

ChunkFlow builds on research from Jina AI, ExplodingGradients, and the broader RAG community. Built with passion for the neglected field of text chunking.



Documentation | GitHub | PyPI
