
ChunkFlow

Production-grade async text chunking framework for RAG systems


ChunkFlow is an extensible framework for text chunking in Retrieval-Augmented Generation (RAG) systems. Built with production-grade practices, it provides multiple chunking strategies, pluggable embedding providers, and a rich set of evaluation metrics to help you make data-driven decisions.

Why ChunkFlow?

RAG systems process documents at massive scale, and chunking quality directly impacts retrieval accuracy, computational cost, and user experience. Poor chunking causes hallucinations, missed context, and wasted API calls.

ChunkFlow addresses this with:

  • 6+ chunking strategies - From simple fixed-size to revolutionary late chunking
  • Pluggable architecture - Easy integration with any embedding provider
  • Comprehensive evaluation - 12+ metrics including RAGAS-inspired, NDCG, semantic coherence
  • Data-driven comparison - Built-in strategy comparison and ranking framework
  • Production-ready - Async-first, type-safe, structured logging, extensible design

Key Features

Chunking Strategies

  • Fixed-Size - Simple character/token-based splitting (10K+ chunks/sec)
  • Recursive - Hierarchical splitting with natural boundaries (recommended default)
  • Document-Based - Format-aware (Markdown, HTML)
  • Semantic - Embedding-based topic detection with similarity thresholds
  • Late Chunking - Revolutionary context-preserving approach (6-9% accuracy improvement, Jina AI 2024)
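
To make the trade-offs concrete, here is a minimal sketch of the simplest strategy above - fixed-size splitting with overlap. This is an illustration of the general idea only, not ChunkFlow's actual implementation (which also supports token-based sizing):

```python
# Minimal sketch of fixed-size chunking with character overlap.
# Illustrative only - not ChunkFlow's implementation.
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Slide a chunk_size window over text, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk repeats the last `overlap` characters of its predecessor, which helps retrieval recover context that straddles a boundary.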

Embedding Providers

  • OpenAI - text-embedding-3-small/large with automatic cost tracking
  • HuggingFace - Sentence Transformers (local, free, GPU/CPU support)
  • Extensible - Easy to add custom providers via EmbeddingProvider base class
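
As a sketch of what a custom provider might look like: the async `embed_texts` method mirrors the usage examples below, but the exact `EmbeddingProvider` interface (constructor arguments, result wrapper) is an assumption - check the base class before adapting this:

```python
import asyncio
import hashlib

# Hypothetical custom provider. The async embed_texts signature mirrors the
# usage examples in this README; the real EmbeddingProvider base class may
# require additional methods or a result wrapper object.
class ToyHashEmbedder:
    """Deterministic 8-dim pseudo-embeddings from SHA-256 digests (for offline tests)."""

    async def embed_texts(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([b / 255.0 for b in digest[:8]])
        return vectors

# vectors = asyncio.run(ToyHashEmbedder().embed_texts(["hello", "world"]))
```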

Evaluation Metrics

  • Retrieval (4 metrics): NDCG@k, Recall@k, Precision@k, MRR
  • Semantic (4 metrics): Coherence, Boundary Quality, Chunk Stickiness (MoC), Topic Diversity
  • RAG Quality (4 metrics): Context Relevance, Answer Faithfulness, Context Precision, Context Recall (RAGAS-inspired)
  • Framework: Unified EvaluationPipeline + StrategyComparator for comprehensive analysis
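
For reference, NDCG@k - the first retrieval metric above - follows the standard DCG/IDCG formula. The sketch below is the textbook definition; ChunkFlow's implementation may differ in details such as tie handling:

```python
import math

def ndcg_at_k(ranked_relevance: list[float], k: int) -> float:
    """Normalized Discounted Cumulative Gain: DCG of the ranking divided by
    the DCG of the ideal (relevance-sorted) ordering, both truncated at k."""
    def dcg(rels: list[float]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0
```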

Quick Start

Installation

# Basic installation
pip install chunk-flow

# With specific providers
pip install chunk-flow[openai]
pip install chunk-flow[huggingface]

# With API server
pip install chunk-flow[api]

# Everything
pip install chunk-flow[all]

Basic Usage

from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline

# 1. Chunk your document (async API - run inside an async function or via asyncio.run)
chunker = StrategyRegistry.create("recursive", {"chunk_size": 512, "overlap": 100})
result = await chunker.chunk(document)

# 2. Embed chunks
embedder = EmbeddingProviderFactory.create("openai", {"model": "text-embedding-3-small"})
emb_result = await embedder.embed_texts(result.chunks)

# 3. Evaluate quality (semantic metrics - no ground truth needed)
pipeline = EvaluationPipeline(metrics=["semantic_coherence", "boundary_quality", "chunk_stickiness"])
metrics = await pipeline.evaluate(
    chunks=result.chunks,
    embeddings=emb_result.embeddings,
)

print(f"Semantic Coherence: {metrics['semantic_coherence'].score:.4f}")
print(f"Boundary Quality: {metrics['boundary_quality'].score:.4f}")

Strategy Comparison

Compare multiple strategies to find the best for your use case:

from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline, StrategyComparator

# Create strategies to compare
strategies = [
    StrategyRegistry.create("fixed_size", {"chunk_size": 500, "overlap": 50}),
    StrategyRegistry.create("recursive", {"chunk_size": 400, "overlap": 80}),
    StrategyRegistry.create("semantic", {"threshold_percentile": 80}),
]

# Get embedder
embedder = EmbeddingProviderFactory.create("huggingface")

# Set up evaluation pipeline
pipeline = EvaluationPipeline(
    metrics=["ndcg_at_k", "semantic_coherence", "chunk_stickiness"],
)

# Compare strategies
comparison = await pipeline.compare_strategies(
    strategies=strategies,
    text=document,
    ground_truth={"query_embedding": query_emb, "relevant_indices": [0, 2, 5]},
)

# Generate comparison report
report = StrategyComparator.generate_comparison_report(
    {name: data["metric_results"]
     for name, data in comparison["strategies"].items()}
)
print(report)

# See examples/strategy_comparison.py for complete working example

API Server

# Start FastAPI server
uvicorn chunk_flow.api.app:app --reload

# Use the API
curl -X POST "http://localhost:8000/chunk" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your document here...",
    "strategy": "recursive",
    "strategy_config": {"chunk_size": 512}
  }'

Architecture

ChunkFlow follows a clean, extensible architecture:

┌─────────────────────────────────────────────────────────────┐
│                     API Layer (FastAPI)                     │
│  /chunk, /evaluate, /compare, /benchmark, /export           │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                    Orchestration Layer                      │
│  ChunkingPipeline | EvaluationEngine | ResultsAggregator    │
└─────────────────────────────────────────────────────────────┘
                              ↓
        ┌─────────────────────┴─────────────────────┐
        ↓                                           ↓
┌──────────────────┐                    ┌──────────────────────┐
│ Chunking Module  │                    │  Embedding Module    │
│ ---------------- │                    │ -------------------- │
│ • Fixed-Size     │                    │ • OpenAI             │
│ • Recursive      │                    │ • HuggingFace        │
│ • Semantic       │                    │ • Google Vertex      │
│ • Late           │                    │ • Cohere             │
└──────────────────┘                    └──────────────────────┘

Research-Backed

ChunkFlow implements cutting-edge research findings:

  • Late Chunking (Jina AI, 2024): 6-9% improvement in retrieval accuracy
  • Optimal Chunk Sizes (Bhat et al., 2025): 64-128 tokens for facts, 512-1024 for context
  • Semantic Independence (HOPE, 2025): 56% gain in factual correctness
  • MoC Metrics (Zhao et al., 2025): Boundary clarity and chunk stickiness
  • RAGAS (ExplodingGradients, 2023): Reference-free RAG evaluation
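
The core idea behind late chunking is to run the embedding model over the whole document first, so every token vector carries document-wide context, and only then pool token vectors into per-chunk embeddings. The pooling step can be sketched as follows (toy vectors stand in for a real long-context model):

```python
def late_pool(token_embeddings: list[list[float]],
              spans: list[tuple[int, int]]) -> list[list[float]]:
    """Mean-pool already-contextualized token vectors over each (start, end) chunk span."""
    chunks = []
    for start, end in spans:
        window = token_embeddings[start:end]
        dim = len(window[0])
        chunks.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return chunks
```

Because each token vector was computed with the full document in view, a chunk such as "the city" can still embed close to the entity it refers to elsewhere in the document.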

See rag-summery-framework.md for a comprehensive research review.

Documentation

Development

# Clone repository
git clone https://github.com/chunkflow/chunk-flow.git
cd chunk-flow

# Install with dev dependencies
make install-dev

# Run tests
make test

# Format and lint
make format
make lint

# Run full CI locally
make ci

Contributing

ChunkFlow is currently a solo project, and contributions are not being accepted at this time.

See CONTRIBUTING.md for more details.

Roadmap

Phase 1-4: Core Framework ✅ COMPLETED

  • Core chunking strategies (Fixed, Recursive, Document-based)
  • Embedding providers (OpenAI, HuggingFace)
  • Semantic chunking
  • Late chunking implementation
  • Comprehensive evaluation metrics (12 metrics across 3 categories)
  • Evaluation pipeline and comparison framework

Phase 5-6: Analysis & API ✅ COMPLETED

  • ResultsDataFrame with analysis methods
  • Visualization utilities (heatmaps, comparison charts)
  • FastAPI server with all endpoints
  • Docker setup (multi-stage, production-ready)

Phase 7-9: Testing & Release ✅ COMPLETED

  • Comprehensive testing (unit, integration, E2E)
  • Benchmark suite with standard datasets
  • CI/CD pipeline (GitHub Actions)
  • Complete documentation
  • PyPI package release workflow
  • Production deployment guides

v0.1.0 READY FOR RELEASE! 🚀

Future Roadmap (v0.2.0+)

  • Additional providers (Google Vertex, Cohere, Voyage AI)
  • LLM-based chunking (GPT/Claude)
  • Streamlit dashboard
  • Redis caching and PostgreSQL storage
  • Agentic chunking with dynamic boundaries
  • Fine-tuning pipeline for custom strategies
  • Public benchmark datasets (BeIR, MS MARCO)

License

MIT License - see LICENSE file for details.

Citation

If you use ChunkFlow in your research, please cite:

@software{chunkflow2024,
  title = {ChunkFlow: Production-Grade Text Chunking Framework for RAG Systems},
  author = {ChunkFlow Development},
  year = {2024},
  url = {https://github.com/chunkflow/chunk-flow}
}

Acknowledgments

ChunkFlow builds on research from Jina AI, ExplodingGradients, and the broader RAG community. Built with passion for the neglected field of text chunking.



Documentation | GitHub | PyPI
