
CLI tool to parse, chunk, and evaluate Markdown documents for RAG pipelines with token-accurate chunking support

Project description

rag-chunk

Current Version: 0.4.0 🎉

CLI tool to parse, chunk, and evaluate Markdown documents for Retrieval-Augmented Generation (RAG) pipelines with token-accurate chunking and semantic intelligence.

Available on PyPI: https://pypi.org/project/rag-chunk/

📚 Documentation | 🐙 GitHub

✨ Features

  • 📄 Parse and clean Markdown files
  • ✂️ 6 Chunking Strategies:
    • fixed-size: Split by fixed word/token count
    • sliding-window: Overlapping chunks for context preservation
    • paragraph: Natural paragraph boundaries
    • recursive-character: LangChain's semantic splitter
    • header: NEW - Markdown header-aware splitting
    • semantic: NEW - Embedding-based semantic boundaries
  • 🧠 Semantic Intelligence: Embedding-based chunking and retrieval
  • 🎯 Token-based chunking with tiktoken (OpenAI models: GPT-3.5, GPT-4, etc.)
  • 🎨 Model selection via --tiktoken-model flag
  • 📊 Advanced Metrics: Precision, Recall, F1-score evaluation
  • 🌈 Beautiful CLI output with Rich tables
  • 📈 Compare all strategies with --strategy all
  • 💾 Export results as JSON or CSV

Demo

rag-chunk demo

🚀 Roadmap

rag-chunk is actively developed! Here's the plan to move from a useful tool to a full-featured chunking workbench.

✅ Version 0.1.0 – Launched

  • Core CLI engine (argparse)
  • Markdown (.md) file parsing
  • Basic chunking strategies: fixed-size, sliding-window, and paragraph (word-based)
  • Evaluation harness: calculate Recall score from a test-file.json
  • Beautiful CLI output (rich tables)
  • Published on PyPI: pip install rag-chunk

✅ Version 0.2.0 – Completed

  • Tiktoken Support: Added --use-tiktoken flag for precise token-based chunking
  • Model Selection: Added --tiktoken-model to choose tokenization model (default: gpt-3.5-turbo)
  • Improved Documentation: Updated README with tiktoken usage examples and comparisons
  • Enhanced Testing: Added comprehensive unit tests for token-based chunking
  • Optional Dependencies: tiktoken available via pip install rag-chunk[tiktoken]

✅ Version 0.3.0 – Released

  • Recursive Character Splitting: Add LangChain's RecursiveCharacterTextSplitter for semantic chunking
    • Install with: pip install rag-chunk[langchain]
    • Strategy: --strategy recursive-character
    • Works with both word-based and tiktoken modes
  • More File Formats: Support .txt files
  • Additional Metrics: Add precision, F1-score, and chunk quality metrics

✅ Version 0.4.0 – Released

  • Header-Aware Chunking: Split by markdown headers while respecting size limits
    • Strategy: --strategy header
    • Preserves document structure with metadata
  • Semantic Chunking: Use sentence embeddings to split at semantic boundaries
    • Strategy: --strategy semantic
    • Install with: pip install rag-chunk[semantic]
    • Powered by sentence-transformers
  • Embedding-Based Retrieval: Semantic similarity matching with --use-embeddings
    • Superior to lexical matching for semantic queries
    • Uses cosine similarity on sentence embeddings
  • Documentation Site: Complete GitHub Pages documentation

📈 Version 1.0.0 – Planned

  • Chunk Size Optimizer: Automated sweep to find optimal chunk size
  • Visualization Dashboard: HTML report with interactive charts
  • Context Augmentation: Add metadata (position, section, summaries) to chunks
  • Export Connectors: Direct integration with vector stores (Pinecone, Weaviate, Chroma)
  • Benchmarking Mode: Statistical comparison with significance testing
  • MLFlow Integration: Track experiments and chunking configurations
  • Performance Optimization: Parallel processing for large document sets


Installation

```bash
# Base installation
pip install rag-chunk

# With all features (recommended)
pip install rag-chunk[all]

# Or install specific features:
pip install rag-chunk[tiktoken]    # Token-based chunking
pip install rag-chunk[semantic]    # Semantic chunking & retrieval
pip install rag-chunk[langchain]   # Recursive character splitting
```

Development mode:

```bash
pip install -e .[all]
```

Quick Start

# Compare all strategies with semantic retrieval
rag-chunk analyze examples/ \
  --strategy all \
  --use-embeddings \
  --test-file examples/questions.json \
  --top-k 3

# Header-aware chunking for technical docs
rag-chunk analyze docs/ --strategy header --chunk-size 300

# Semantic chunking with embeddings
rag-chunk analyze examples/ \
  --strategy semantic \
  --chunk-size 200 \
  --use-embeddings \
  --test-file questions.json

CLI Usage

rag-chunk analyze <folder> [options]

Options

| Option | Description | Default |
|--------|-------------|---------|
| `--strategy` | Chunking strategy: `fixed-size`, `sliding-window`, `paragraph`, `recursive-character`, `header`, `semantic`, or `all` | `fixed-size` |
| `--chunk-size` | Number of words or tokens per chunk | 200 |
| `--overlap` | Number of overlapping words or tokens | 50 |
| `--use-tiktoken` | Use tiktoken for precise token-based chunking | False |
| `--tiktoken-model` | Model for tiktoken encoding | gpt-3.5-turbo |
| `--use-embeddings` | Use semantic embeddings for retrieval (requires sentence-transformers) | False |
| `--test-file` | Path to JSON test file with questions | None |
| `--top-k` | Number of chunks to retrieve per question | 3 |
| `--output` | Output format: `table`, `json`, or `csv` | table |

If --strategy all is chosen, every strategy is run with the supplied chunk-size and overlap where applicable.
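For intuition, the interaction between --chunk-size and --overlap in the sliding-window strategy can be sketched in a few lines of Python (an illustrative sketch, not the package's actual implementation; the function name is hypothetical):

```python
def sliding_window_chunks(words, chunk_size, overlap):
    """Collect overlapping word windows; each window starts
    chunk_size - overlap words after the previous one."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = sliding_window_chunks(words, chunk_size=4, overlap=1)
# Three windows of up to 4 words, each sharing 1 word with its neighbor
```

With chunk_size=4 and overlap=1 the window advances 3 words at a time, so consecutive chunks share their boundary word — the "context preservation" the strategy is designed for.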

Examples

Header-Aware Chunking (NEW)

Split by markdown headers while respecting chunk size limits:

rag-chunk analyze docs/ --strategy header --chunk-size 300

Preserves document structure with metadata about headers and hierarchy levels.
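As a rough sketch of what header-aware splitting does (illustrative only; the function name and section format here are hypothetical, not the tool's internals):

```python
import re

def split_by_headers(markdown: str):
    """Split markdown into sections at lines starting with '#',
    keeping each header's text and level as metadata."""
    sections = []
    current = {"header": None, "level": 0, "lines": []}
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # Start a new section; save the previous one if non-empty
            if current["lines"] or current["header"]:
                sections.append(current)
            current = {"header": m.group(2), "level": len(m.group(1)), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return sections

doc = "# Intro\ntext one\n## Details\ntext two"
secs = split_by_headers(doc)
```

A real implementation would additionally re-split any section that exceeds --chunk-size, which is what "while respecting size limits" refers to.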

Semantic Chunking (NEW)

Use sentence embeddings to split at semantic boundaries:

rag-chunk analyze examples/ --strategy semantic --chunk-size 200

Groups semantically similar sentences together, splitting when topic shifts occur.
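The idea behind a semantic boundary can be sketched with toy 2-D vectors standing in for real sentence-transformers embeddings (illustrative only; the threshold value and function names are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_boundaries(embeddings, threshold=0.5):
    """Indices where similarity between consecutive sentence
    embeddings drops below the threshold (a topic shift)."""
    return [i + 1 for i in range(len(embeddings) - 1)
            if cosine(embeddings[i], embeddings[i + 1]) < threshold]

# Toy "embeddings": the first two sentences point the same way,
# the third points elsewhere, so a boundary falls before it.
emb = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
boundaries = semantic_boundaries(emb)
```

Sentences between two boundaries are then grouped into one chunk, up to the --chunk-size limit.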

Embedding-Based Retrieval (NEW)

Compare lexical vs semantic retrieval:

# Lexical (keyword-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json

# Semantic (embedding-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json --use-embeddings

Semantic retrieval typically achieves higher recall for conceptual queries.
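A minimal sketch of what lexical matching does, and why it can miss inflected forms ("fetches" never matches the query word "fetch"), which is the gap embedding-based retrieval closes (illustrative only, not the tool's scorer):

```python
def lexical_score(question: str, chunk: str) -> float:
    """Fraction of question words that also appear in the chunk
    (simple bag-of-words overlap)."""
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

chunks = ["retrieval fetches relevant documents",
          "generation produces the final answer"]
scores = [lexical_score("what does retrieval fetch", ch) for ch in chunks]
best = scores.index(max(scores))
```

Here only the exact word "retrieval" overlaps; an embedding model would also credit "fetch"/"fetches" as near-synonyms.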

Basic Usage: Generate Chunks Only

Analyze markdown files and generate chunks without evaluation:

rag-chunk analyze examples/ --strategy paragraph

Output:

strategy  | chunks | avg_recall | saved                            
----------+--------+------------+----------------------------------
paragraph | 12     | 0.0        | .chunks/paragraph-20251115-020145
Total text length (chars): 3542

Compare All Strategies

Run all chunking strategies with custom parameters:

rag-chunk analyze examples/ --strategy all --chunk-size 100 --overlap 20 --output table

Output:

strategy           | chunks | avg_recall | avg_precision | avg_f1 | saved                                 
-------------------+--------+------------+---------------+--------+---------------------------------------
fixed-size         | 36     | 0.0        | 0.0           | 0.0    | .chunks/fixed-size-20251115-020156    
sliding-window     | 45     | 0.0        | 0.0           | 0.0    | .chunks/sliding-window-20251115-020156
paragraph          | 12     | 0.0        | 0.0           | 0.0    | .chunks/paragraph-20251115-020156
recursive-character| 28     | 0.0        | 0.0           | 0.0    | .chunks/recursive-character-20251115-020156
header             | 15     | 0.0        | 0.0           | 0.0    | .chunks/header-20251115-020156
semantic           | 22     | 0.0        | 0.0           | 0.0    | .chunks/semantic-20251115-020156
Total text length (chars): 3542

Evaluate with Test File

Measure recall using a test file with questions and relevant phrases:

rag-chunk analyze examples/ --strategy all --chunk-size 150 --overlap 30 --test-file examples/questions.json --top-k 3 --use-embeddings --output table

Output:

strategy           | chunks | avg_recall | avg_precision | avg_f1 | saved                                 
-------------------+--------+------------+---------------+--------+---------------------------------------
fixed-size         | 24     | 0.7812     | 0.7812        | 0.7812 | .chunks/fixed-size-20251115-020203    
sliding-window     | 32     | 0.8542     | 0.8542        | 0.8542 | .chunks/sliding-window-20251115-020203
paragraph          | 12     | 0.9167     | 0.9167        | 0.9167 | .chunks/paragraph-20251115-020203
semantic           | 19     | 0.9583     | 0.9583        | 0.9583 | .chunks/semantic-20251115-020203

Semantic chunking with embedding-based retrieval achieves highest recall (95.83%) by preserving semantic coherence.

Export Results as JSON

rag-chunk analyze examples/ --strategy sliding-window --chunk-size 120 --overlap 40 --test-file examples/questions.json --top-k 5 --output json > results.json

Output structure:

{
  "results": [
    {
      "strategy": "sliding-window",
      "chunks": 38,
      "avg_recall": 0.8958,
      "avg_precision": 0.8958,
      "avg_f1": 0.8958,
      "saved": ".chunks/sliding-window-20251115-020210"
    }
  ],
  "detail": {
    "per_questions": [
      {
        "question": "What are the three main stages of a RAG pipeline?",
        "recall": 1.0,
        "precision": 1.0,
        "f1": 1.0
      },
      {
        "question": "What is the main advantage of RAG over pure generative models?",
        "recall": 0.6667,
        "precision": 0.6667,
        "f1": 0.6667
      }
    ]
  }
}

Export as CSV

rag-chunk analyze examples/ --strategy all --test-file examples/questions.json --output csv

Creates analysis_results.csv with columns: strategy, chunks, avg_recall, avg_precision, avg_f1, saved.

Using Tiktoken for Precise Token-Based Chunking

By default, rag-chunk uses word-based tokenization (whitespace splitting). For precise token-level chunking that matches LLM context limits (e.g., GPT-3.5/GPT-4), use the --use-tiktoken flag.
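The default word-based mode amounts to splitting on whitespace and grouping a fixed number of words (an illustrative sketch, not the package's actual src.chunker code):

```python
def fixed_size_chunks(text: str, chunk_size: int):
    """Word-based fixed-size chunking: split on whitespace and
    join every chunk_size words back into one chunk."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# 450 words with chunk_size=200 -> chunks of 200, 200, and 50 words
text = " ".join(f"w{i}" for i in range(450))
chunks = fixed_size_chunks(text, 200)
```

With --use-tiktoken, the same grouping happens over subword token IDs rather than whitespace words, so chunk boundaries line up with what the model actually counts.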

Installation

pip install rag-chunk[tiktoken]

Usage Examples

Token-based fixed-size chunking:

rag-chunk analyze examples/ --strategy fixed-size --chunk-size 512 --use-tiktoken --output table

This creates chunks of exactly 512 tokens (as counted by tiktoken for GPT models), not 512 words.

Compare word-based vs token-based chunking:

# Word-based (default)
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --output json

# Token-based
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --use-tiktoken --output json

Token-based with sliding window:

rag-chunk analyze examples/ --strategy sliding-window --chunk-size 1024 --overlap 128 --use-tiktoken --test-file examples/questions.json --top-k 3

When to Use Tiktoken

  • ✅ Use tiktoken when:

    • Preparing chunks for OpenAI models (GPT-3.5, GPT-4)
    • You need to respect strict token limits (e.g., 8k, 16k context windows)
    • Comparing chunking strategies with token-accurate measurements
    • Your documents contain special characters, emojis, or non-ASCII text
  • ⚠️ Use word-based (default) when:

    • Quick prototyping and testing
    • Working with well-formatted English text
    • You don't need exact token counts
    • You want to avoid the tiktoken dependency

Token Counting

You can also use tiktoken in your own scripts:

from src.chunker import count_tokens

text = "Your document text here..."

# Word-based count
word_count = count_tokens(text, use_tiktoken=False)
print(f"Words: {word_count}")

# Token-based count (requires tiktoken installed)
token_count = count_tokens(text, use_tiktoken=True)
print(f"Tokens: {token_count}")

Test File Format

JSON file with a questions array (or direct array at top level):

{
  "questions": [
    {
      "question": "What are the three main stages of a RAG pipeline?",
      "relevant": ["indexing", "retrieval", "generation"]
    },
    {
      "question": "What is the main advantage of RAG over pure generative models?",
      "relevant": ["grounding", "retrieved documents", "hallucinate"]
    }
  ]
}
  • question: The query text used for chunk retrieval
  • relevant: List of phrases/terms that should appear in relevant chunks

Recall calculation: For each question, the tool retrieves top-k chunks using lexical similarity and checks how many relevant phrases appear in those chunks. Recall = (found phrases) / (total relevant phrases). Average recall is computed across all questions.
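That calculation can be sketched as follows (illustrative only; the function name is hypothetical):

```python
def recall_for_question(retrieved_chunks, relevant_phrases):
    """Recall = found phrases / total relevant phrases, where a phrase
    counts as found if it appears in any retrieved chunk (case-insensitive)."""
    joined = " ".join(retrieved_chunks).lower()
    found = sum(1 for p in relevant_phrases if p.lower() in joined)
    return found / len(relevant_phrases)

chunks = ["A RAG pipeline has indexing, retrieval, and generation stages."]
recall = recall_for_question(chunks, ["indexing", "retrieval", "generation"])
partial = recall_for_question(chunks, ["grounding", "generation"])
```

Averaging this value over every question in the test file gives the avg_recall column in the output tables.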

Understanding the Output

Chunks

Number of chunks created by the strategy. More chunks = finer granularity but higher indexing cost.

Metrics

  • Average Recall: Percentage of relevant phrases successfully retrieved (0.0 to 1.0). Higher is better.
  • Average Precision: Ratio of relevant content in retrieved chunks (0.0 to 1.0). Higher is better.
  • Average F1-Score: Harmonic mean of precision and recall (0.0 to 1.0). Balanced measure of quality.
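The F1 relationship can be checked in a few lines (illustrative sketch):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.75, 0.60)
```

Because it is a harmonic mean, F1 sits closer to the lower of the two inputs, so a strategy cannot hide poor recall behind high precision (or vice versa).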

Interpreting scores:

  • > 0.85: Excellent - strategy works very well for your content
  • 0.70 - 0.85: Good - acceptable for most use cases
  • 0.50 - 0.70: Fair - consider adjusting chunk size or strategy
  • < 0.50: Poor - important information being lost or fragmented

Saved Location

Directory where chunks are written as individual .txt files for inspection.

Choosing the Right Strategy

| Strategy | Best For | Pros | Cons |
|----------|----------|------|------|
| Fixed-Size | Consistent chunk sizes, simple docs | Fast, deterministic | May break semantic boundaries |
| Sliding-Window | Preventing boundary loss | Preserves context at edges | Redundancy, more chunks |
| Paragraph | Well-structured docs | Preserves semantic coherence | Variable chunk sizes |
| Recursive-Character | General purpose | Good balance, semantic-aware | Requires LangChain |
| Header ⭐ | Technical docs, markdown | Preserves document structure | Requires header markup |
| Semantic ⭐ | Maximum retrieval quality | Best semantic coherence | Requires embeddings, slower |

Recommendations

  • Technical documentation: Use header strategy to preserve structure
  • Knowledge bases: Use semantic for best retrieval quality
  • General content: Start with recursive-character or paragraph
  • Token-limited models: Enable --use-tiktoken for accurate counting
  • Evaluation: Always use --use-embeddings with test files for better semantic matching

| Strategy | Best For | Chunk Size Recommendation |
|----------|----------|---------------------------|
| fixed-size | Uniform processing, consistent latency | 150-250 words |
| sliding-window | Preserving context at boundaries, dense text | 120-200 words, 20-30% overlap |
| paragraph | Well-structured docs with clear sections | N/A (variable) |

General guidelines:

  1. Start with paragraph for markdown with clear structure
  2. Use sliding-window if paragraphs are too long (>300 words)
  3. Use fixed-size as baseline for comparison
  4. Always test with representative questions from your domain
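Guideline 1's paragraph splitting is essentially a blank-line split (an illustrative sketch, not the tool's parser):

```python
def paragraph_chunks(text: str):
    """Split text on blank lines into paragraph chunks,
    dropping empty fragments left by extra blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird."
paras = paragraph_chunks(doc)
```

If any resulting paragraph exceeds roughly 300 words, guideline 2 suggests switching to sliding-window instead.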

Extending

Add a new chunking strategy:

  1. Implement a function in src/chunker.py:

     from typing import Dict, List

     def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
         chunks = []
         # Your logic here
         chunks.append({"id": 0, "text": "chunk text"})
         return chunks

  2. Register it in STRATEGIES:

     STRATEGIES = {
         "custom": lambda text, chunk_size=200, overlap=0: my_custom_chunks(text, chunk_size, overlap),
         ...
     }

  3. Use it via the CLI:

     rag-chunk analyze docs/ --strategy custom --chunk-size 180

Project Structure

rag-chunk/
├── src/
│   ├── __init__.py
│   ├── parser.py       # Markdown parsing and cleaning
│   ├── chunker.py      # Chunking strategies
│   ├── scorer.py       # Retrieval and recall evaluation
│   └── cli.py          # Command-line interface
├── tests/
│   └── test_basic.py   # Unit tests
├── examples/
│   ├── rag_introduction.md
│   ├── chunking_strategies.md
│   ├── evaluation_metrics.md
│   └── questions.json
├── .chunks/            # Generated chunks (gitignored)
├── pyproject.toml
├── README.md
└── .gitignore

License

MIT

Note on Tokenization

By default, --chunk-size and --overlap count words (whitespace-based tokenization). This keeps the tool simple and dependency-free.

For precise token-level chunking that matches LLM token counts (e.g., OpenAI GPT models using subword tokenization), use the --use-tiktoken flag after installing the optional dependency:

pip install rag-chunk[tiktoken]
rag-chunk analyze docs/ --strategy fixed-size --chunk-size 512 --use-tiktoken

See the Using Tiktoken section for more details.



Download files

Download the file for your platform.

Source Distribution

rag_chunk-0.4.0.tar.gz (17.6 kB)


Built Distribution


rag_chunk-0.4.0-py3-none-any.whl (17.5 kB)


File details

Details for the file rag_chunk-0.4.0.tar.gz.

File metadata

  • Download URL: rag_chunk-0.4.0.tar.gz
  • Size: 17.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for rag_chunk-0.4.0.tar.gz

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 6cd9c405372128d55fba30e1a6d4bf9b79bbde14b930e4650ac9952ca04d94db |
| MD5 | eb82175126a80b22a09881c9f3452efa |
| BLAKE2b-256 | aa3a99ede533d9ad64cb99f07f0249a083b70d3d7473539d8f1c913d87cd594a |


File details

Details for the file rag_chunk-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: rag_chunk-0.4.0-py3-none-any.whl
  • Size: 17.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for rag_chunk-0.4.0-py3-none-any.whl

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 5ffe7e4e0d37c7657a2f8d2db3f4c4ba7d830c7123484adea9af6c7150848aae |
| MD5 | 3777d87ff0b54da0c6c175ff9c45ebc6 |
| BLAKE2b-256 | 0a4b982638331f3bbcffa7e06a1ec69c3108154f51d78c23da01b1cc8725bd16 |

