# rag-chunk

Current version: 0.4.0

CLI tool to parse, chunk, and evaluate Markdown documents for Retrieval-Augmented Generation (RAG) pipelines, with token-accurate chunking and semantic intelligence.

Available on PyPI: https://pypi.org/project/rag-chunk/

Documentation | GitHub
## Features

- Parse and clean Markdown files
- Six chunking strategies:
  - `fixed-size`: split by fixed word/token count
  - `sliding-window`: overlapping chunks for context preservation
  - `paragraph`: natural paragraph boundaries
  - `recursive-character`: LangChain's semantic splitter
  - `header`: NEW, markdown header-aware splitting
  - `semantic`: NEW, embedding-based semantic boundaries
- Semantic intelligence: embedding-based chunking and retrieval
- Token-based chunking with tiktoken (OpenAI models: GPT-3.5, GPT-4, etc.)
- Model selection via the `--tiktoken-model` flag
- Advanced metrics: precision, recall, and F1-score evaluation
- Beautiful CLI output with Rich tables
- Compare all strategies with `--strategy all`
- Export results as JSON or CSV
## Demo

## Roadmap

rag-chunk is actively developed! Here's the plan to move from a useful tool to a full-featured chunking workbench.
### Version 0.1.0 (launched)

- Core CLI engine (`argparse`)
- Markdown (`.md`) file parsing
- Basic chunking strategies: `fixed-size`, `sliding-window`, and `paragraph` (word-based)
- Evaluation harness: calculate recall score from a `test-file.json`
- Beautiful CLI output (`rich` tables)
- Published on PyPI: `pip install rag-chunk`
### Version 0.2.0 (completed)

- Tiktoken support: added the `--use-tiktoken` flag for precise token-based chunking
- Model selection: added `--tiktoken-model` to choose the tokenization model (default: `gpt-3.5-turbo`)
- Improved documentation: updated README with tiktoken usage examples and comparisons
- Enhanced testing: added comprehensive unit tests for token-based chunking
- Optional dependencies: tiktoken available via `pip install rag-chunk[tiktoken]`
### Version 0.3.0 (released)

- Recursive character splitting: added LangChain's `RecursiveCharacterTextSplitter` for semantic chunking
  - Install with: `pip install rag-chunk[langchain]`
  - Strategy: `--strategy recursive-character`
  - Works with both word-based and tiktoken modes
- More file formats: support for `.txt` files
- Additional metrics: precision, F1-score, and chunk quality metrics
### Version 0.4.0 (released)

- Header-aware chunking: split by markdown headers while respecting size limits
  - Strategy: `--strategy header`
  - Preserves document structure with metadata
- Semantic chunking: use sentence embeddings to split at semantic boundaries
  - Strategy: `--strategy semantic`
  - Install with: `pip install rag-chunk[semantic]`
  - Powered by sentence-transformers
- Embedding-based retrieval: semantic similarity matching with `--use-embeddings`
  - Superior to lexical matching for semantic queries
  - Uses cosine similarity on sentence embeddings
- Documentation site: complete GitHub Pages documentation
### Version 1.0.0 (planned)

- Chunk size optimizer: automated sweep to find the optimal chunk size
- Visualization dashboard: HTML report with interactive charts
- Context augmentation: add metadata (position, section, summaries) to chunks
- Export connectors: direct integration with vector stores (Pinecone, Weaviate, Chroma)
- Benchmarking mode: statistical comparison with significance testing
- MLflow integration: track experiments and chunking configurations
- Performance optimization: parallel processing for large document sets
## Installation

```bash
# Base installation
pip install rag-chunk

# With all features (recommended)
pip install rag-chunk[all]

# Or install specific features:
pip install rag-chunk[tiktoken]   # Token-based chunking
pip install rag-chunk[semantic]   # Semantic chunking & retrieval
pip install rag-chunk[langchain]  # Recursive character splitting
```

Development mode:

```bash
pip install -e .[all]
```
## Quick Start

```bash
# Compare all strategies with semantic retrieval
rag-chunk analyze examples/ \
    --strategy all \
    --use-embeddings \
    --test-file examples/questions.json \
    --top-k 3

# Header-aware chunking for technical docs
rag-chunk analyze docs/ --strategy header --chunk-size 300

# Semantic chunking with embeddings
rag-chunk analyze examples/ \
    --strategy semantic \
    --chunk-size 200 \
    --use-embeddings \
    --test-file questions.json
```
## CLI Usage

```bash
rag-chunk analyze <folder> [options]
```

### Options

| Option | Description | Default |
|---|---|---|
| `--strategy` | Chunking strategy: `fixed-size`, `sliding-window`, `paragraph`, `recursive-character`, `header`, `semantic`, or `all` | `fixed-size` |
| `--chunk-size` | Number of words or tokens per chunk | `200` |
| `--overlap` | Number of overlapping words or tokens | `50` |
| `--use-tiktoken` | Use tiktoken for precise token-based chunking | `False` |
| `--tiktoken-model` | Model for tiktoken encoding | `gpt-3.5-turbo` |
| `--use-embeddings` | Use semantic embeddings for retrieval (requires sentence-transformers) | `False` |
| `--test-file` | Path to a JSON test file with questions | `None` |
| `--top-k` | Number of chunks to retrieve per question | `3` |
| `--output` | Output format: `table`, `json`, or `csv` | `table` |

If `--strategy all` is chosen, every strategy is run with the supplied chunk size and overlap where applicable.
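To make the `--chunk-size`/`--overlap` interaction concrete: a sliding window with chunk size C and overlap O advances C minus O items per step, so O items are shared between neighboring chunks. A minimal word-based sketch (illustrative only, not the tool's actual implementation):

```python
def sliding_window_chunks(words, chunk_size=200, overlap=50):
    """Overlapping chunks: each new chunk starts chunk_size - overlap
    words after the previous one, so `overlap` words are shared."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

# 10 words, chunks of 4 with overlap 2 -> chunks start at words 0, 2, 4, 6, 8
demo = sliding_window_chunks(list(range(10)), chunk_size=4, overlap=2)
```

Note the trailing chunk may be shorter than `chunk_size`; real implementations often merge or pad it.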
## Examples

### Header-Aware Chunking (NEW)

Split by markdown headers while respecting chunk size limits:

```bash
rag-chunk analyze docs/ --strategy header --chunk-size 300
```

This preserves document structure, with metadata about headers and hierarchy levels.
### Semantic Chunking (NEW)

Use sentence embeddings to split at semantic boundaries:

```bash
rag-chunk analyze examples/ --strategy semantic --chunk-size 200
```

This groups semantically similar sentences together, splitting when topic shifts occur.
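The underlying idea can be sketched with plain vectors: embed consecutive sentences, then start a new chunk wherever the cosine similarity between neighbors drops below a threshold. The 2-D vectors below are toy stand-ins for real sentence-transformers embeddings, and the threshold is illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_boundaries(embeddings, threshold=0.5):
    """Indices where a new chunk should start: wherever similarity
    between consecutive sentence embeddings falls below the threshold."""
    return [i for i in range(1, len(embeddings))
            if cosine(embeddings[i - 1], embeddings[i]) < threshold]

# Three sentences on one topic, then a topic shift at index 3.
vectors = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.1, 1.0)]
```

Here `semantic_boundaries(vectors)` flags index 3 as the start of a new chunk.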
### Embedding-Based Retrieval (NEW)

Compare lexical vs. semantic retrieval:

```bash
# Lexical (keyword-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json

# Semantic (embedding-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json --use-embeddings
```

Semantic retrieval typically achieves higher recall for conceptual queries.
### Basic Usage: Generate Chunks Only

Analyze markdown files and generate chunks without evaluation:

```bash
rag-chunk analyze examples/ --strategy paragraph
```

Output:

```
strategy  | chunks | avg_recall | saved
----------+--------+------------+----------------------------------
paragraph | 12     | 0.0        | .chunks/paragraph-20251115-020145

Total text length (chars): 3542
```
### Compare All Strategies

Run all chunking strategies with custom parameters:

```bash
rag-chunk analyze examples/ --strategy all --chunk-size 100 --overlap 20 --output table
```

Output:

```
strategy            | chunks | avg_recall | avg_precision | avg_f1 | saved
--------------------+--------+------------+---------------+--------+---------------------------------------
fixed-size          | 36     | 0.0        | 0.0           | 0.0    | .chunks/fixed-size-20251115-020156
sliding-window      | 45     | 0.0        | 0.0           | 0.0    | .chunks/sliding-window-20251115-020156
paragraph           | 12     | 0.0        | 0.0           | 0.0    | .chunks/paragraph-20251115-020156
recursive-character | 28     | 0.0        | 0.0           | 0.0    | .chunks/recursive-character-20251115-020156
header              | 15     | 0.0        | 0.0           | 0.0    | .chunks/header-20251115-020156
semantic            | 22     | 0.0        | 0.0           | 0.0    | .chunks/semantic-20251115-020156

Total text length (chars): 3542
```
### Evaluate with Test File

Measure recall using a test file with questions and relevant phrases:

```bash
rag-chunk analyze examples/ --strategy all --chunk-size 150 --overlap 30 --test-file examples/questions.json --top-k 3 --use-embeddings --output table
```

Output:

```
strategy       | chunks | avg_recall | avg_precision | avg_f1 | saved
---------------+--------+------------+---------------+--------+---------------------------------------
fixed-size     | 24     | 0.7812     | 0.7812        | 0.7812 | .chunks/fixed-size-20251115-020203
sliding-window | 32     | 0.8542     | 0.8542        | 0.8542 | .chunks/sliding-window-20251115-020203
paragraph      | 12     | 0.9167     | 0.9167        | 0.9167 | .chunks/paragraph-20251115-020203
semantic       | 19     | 0.9583     | 0.9583        | 0.9583 | .chunks/semantic-20251115-020203
```

Semantic chunking with embedding-based retrieval achieves the highest recall (95.83%) by preserving semantic coherence.
### Export Results as JSON

```bash
rag-chunk analyze examples/ --strategy sliding-window --chunk-size 120 --overlap 40 --test-file examples/questions.json --top-k 5 --output json > results.json
```

Output structure:

```json
{
  "results": [
    {
      "strategy": "sliding-window",
      "chunks": 38,
      "avg_recall": 0.8958,
      "avg_precision": 0.8958,
      "avg_f1": 0.8958,
      "saved": ".chunks/sliding-window-20251115-020210"
    }
  ],
  "detail": {
    "per_questions": [
      {
        "question": "What are the three main stages of a RAG pipeline?",
        "recall": 1.0,
        "precision": 1.0,
        "f1": 1.0
      },
      {
        "question": "What is the main advantage of RAG over pure generative models?",
        "recall": 0.6667,
        "precision": 0.6667,
        "f1": 0.6667
      }
    ]
  }
}
```
### Export as CSV

```bash
rag-chunk analyze examples/ --strategy all --test-file examples/questions.json --output csv
```

Creates `analysis_results.csv` with columns: `strategy`, `chunks`, `avg_recall`, `avg_precision`, `avg_f1`, `saved`.
## Using Tiktoken for Precise Token-Based Chunking

By default, rag-chunk uses word-based tokenization (whitespace splitting). For precise token-level chunking that matches LLM context limits (e.g., GPT-3.5/GPT-4), use the `--use-tiktoken` flag.

### Installation

```bash
pip install rag-chunk[tiktoken]
```

### Usage Examples

Token-based fixed-size chunking:

```bash
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 512 --use-tiktoken --output table
```

This creates chunks of exactly 512 tokens (as counted by tiktoken for GPT models), not 512 words.
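The distinction matters because subword tokenizers usually produce more tokens than whitespace splitting, so a 512-word chunk can overflow a 512-token budget. A crude illustration with a rule-based stand-in (real BPE vocabularies like tiktoken's are learned, not fixed-length):

```python
def word_count(text):
    """Whitespace-based count, the tool's default."""
    return len(text.split())

def toy_subword_count(text, piece_len=4):
    """Toy stand-in for a subword tokenizer: break each word into
    fixed-length pieces. Only for illustrating word vs. token counts."""
    return sum(-(-len(word) // piece_len) for word in text.split())

text = "internationalization complicates tokenization"
# 3 words, but 5 + 3 + 3 = 11 four-character pieces
```

Long, rare, or non-ASCII words inflate the gap further, which is why token-accurate counting matters near strict context limits.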
Compare word-based vs. token-based chunking:

```bash
# Word-based (default)
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --output json

# Token-based
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --use-tiktoken --output json
```

Token-based chunking with a sliding window:

```bash
rag-chunk analyze examples/ --strategy sliding-window --chunk-size 1024 --overlap 128 --use-tiktoken --test-file examples/questions.json --top-k 3
```
When to Use Tiktoken
-
โ Use tiktoken when:
- Preparing chunks for OpenAI models (GPT-3.5, GPT-4)
- You need to respect strict token limits (e.g., 8k, 16k context windows)
- Comparing chunking strategies with token-accurate measurements
- Your documents contain special characters, emojis, or non-ASCII text
-
โ ๏ธ Use word-based (default) when:
- Quick prototyping and testing
- Working with well-formatted English text
- Don't need exact token counts
- Want to avoid the tiktoken dependency
### Token Counting

You can also use the token counter in your own scripts:

```python
from src.chunker import count_tokens

text = "Your document text here..."

# Word-based count
word_count = count_tokens(text, use_tiktoken=False)
print(f"Words: {word_count}")

# Token-based count (requires tiktoken installed)
token_count = count_tokens(text, use_tiktoken=True)
print(f"Tokens: {token_count}")
```
## Test File Format

A JSON file with a `questions` array (or a direct array at the top level):

```json
{
  "questions": [
    {
      "question": "What are the three main stages of a RAG pipeline?",
      "relevant": ["indexing", "retrieval", "generation"]
    },
    {
      "question": "What is the main advantage of RAG over pure generative models?",
      "relevant": ["grounding", "retrieved documents", "hallucinate"]
    }
  ]
}
```

- `question`: the query text used for chunk retrieval
- `relevant`: list of phrases/terms that should appear in relevant chunks

Recall calculation: for each question, the tool retrieves the top-k chunks using lexical similarity and checks how many relevant phrases appear in those chunks. Recall = (found phrases) / (total relevant phrases). Average recall is computed across all questions.
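That calculation can be sketched directly (a simplified version for illustration; the tool's own evaluation lives in `src/scorer.py`):

```python
def phrase_recall(retrieved_chunks, relevant_phrases):
    """Fraction of relevant phrases found anywhere in the retrieved chunks."""
    haystack = " ".join(retrieved_chunks).lower()
    found = sum(1 for phrase in relevant_phrases if phrase.lower() in haystack)
    return found / len(relevant_phrases)

chunks = ["A RAG pipeline has indexing and retrieval stages."]
phrase_recall(chunks, ["indexing", "retrieval", "generation"])  # 2 of 3 found
```

So a question scores 0.6667 when two of its three relevant phrases appear in the retrieved chunks, matching the per-question scores in the JSON output above.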
## Understanding the Output

### Chunks

The number of chunks created by the strategy. More chunks means finer granularity but higher indexing cost.

### Metrics

- Average recall: fraction of relevant phrases successfully retrieved (0.0 to 1.0). Higher is better.
- Average precision: ratio of relevant content in the retrieved chunks (0.0 to 1.0). Higher is better.
- Average F1-score: harmonic mean of precision and recall (0.0 to 1.0). A balanced measure of quality.
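The harmonic mean pulls the F1-score toward the weaker of the two inputs, so a strategy can't hide poor recall behind high precision. As a quick check:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_score(1.0, 1.0)  # 1.0
f1_score(1.0, 0.5)  # 0.6667: dragged toward the weaker score
```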
Interpreting scores:

- Above 0.85: excellent; the strategy works very well for your content
- 0.70 to 0.85: good; acceptable for most use cases
- 0.50 to 0.70: fair; consider adjusting chunk size or strategy
- Below 0.50: poor; important information is being lost or fragmented

### Saved Location

The directory where chunks are written as individual `.txt` files for inspection.
## Choosing the Right Strategy

| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Fixed-Size | Consistent chunk sizes, simple docs | Fast, deterministic | May break semantic boundaries |
| Sliding-Window | Preventing boundary loss | Preserves context at edges | Redundancy, more chunks |
| Paragraph | Well-structured docs | Preserves semantic coherence | Variable chunk sizes |
| Recursive-Character | General purpose | Good balance, semantic-aware | Requires LangChain |
| Header (NEW) | Technical docs, markdown | Preserves document structure | Requires header markup |
| Semantic (NEW) | Maximum retrieval quality | Best semantic coherence | Requires embeddings, slower |
### Recommendations

- Technical documentation: use the `header` strategy to preserve structure
- Knowledge bases: use `semantic` for the best retrieval quality
- General content: start with `recursive-character` or `paragraph`
- Token-limited models: enable `--use-tiktoken` for accurate counting
- Evaluation: always use `--use-embeddings` with test files for better semantic matching
| Strategy | Best For | Chunk Size Recommendation |
|---|---|---|
| fixed-size | Uniform processing, consistent latency | 150-250 words |
| sliding-window | Preserving context at boundaries, dense text | 120-200 words, 20-30% overlap |
| paragraph | Well-structured docs with clear sections | N/A (variable) |
General guidelines:

- Start with `paragraph` for markdown with clear structure
- Use `sliding-window` if paragraphs are too long (over 300 words)
- Use `fixed-size` as a baseline for comparison
- Always test with representative questions from your domain
## Extending

Add a new chunking strategy:

1. Implement a function in `src/chunker.py`:

```python
from typing import Dict, List

def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    chunks = []
    # Your logic here
    chunks.append({"id": 0, "text": "chunk text"})
    return chunks
```

2. Register it in `STRATEGIES`:

```python
STRATEGIES = {
    "custom": lambda text, chunk_size=200, overlap=0: my_custom_chunks(text, chunk_size, overlap),
    ...
}
```

3. Use it via the CLI:

```bash
rag-chunk analyze docs/ --strategy custom --chunk-size 180
```
## Project Structure

```
rag-chunk/
├── src/
│   ├── __init__.py
│   ├── parser.py       # Markdown parsing and cleaning
│   ├── chunker.py      # Chunking strategies
│   ├── scorer.py       # Retrieval and recall evaluation
│   └── cli.py          # Command-line interface
├── tests/
│   └── test_basic.py   # Unit tests
├── examples/
│   ├── rag_introduction.md
│   ├── chunking_strategies.md
│   ├── evaluation_metrics.md
│   └── questions.json
├── .chunks/            # Generated chunks (gitignored)
├── pyproject.toml
├── README.md
└── .gitignore
```
## License

MIT

## Note on Tokenization

By default, `--chunk-size` and `--overlap` count words (whitespace-based tokenization). This keeps the tool simple and dependency-free.

For precise token-level chunking that matches LLM token counts (e.g., OpenAI GPT models using subword tokenization), use the `--use-tiktoken` flag after installing the optional dependency:

```bash
pip install rag-chunk[tiktoken]
rag-chunk analyze docs/ --strategy fixed-size --chunk-size 512 --use-tiktoken
```

See the Using Tiktoken section for more details.