Intelligent Markdown chunking library for RAG systems
# Chunkana
A semantic Markdown chunker that preserves document structure for RAG and LLM pipelines. Never breaks code blocks, tables, or headers—every chunk stays semantically complete.
## Quick Start

```bash
pip install chunkana
```

````python
from chunkana import chunk_markdown

text = """
# My Document
## Section One
Some content here.
## Section Two
More content with code:
```python
def hello():
    print("Hello!")
```
"""

chunks = chunk_markdown(text)
for chunk in chunks:
    print(f"Lines {chunk.start_line}-{chunk.end_line}: {chunk.metadata['header_path']}")
    print(f"Content: {chunk.content[:100]}...")
````
## Why Chunkana?
**Problem**: Traditional splitters break Markdown structure, fragmenting code blocks, tables, and lists.
**Solution**: Chunkana preserves semantic boundaries while providing rich metadata for retrieval:
- ✅ **Never breaks** code blocks, tables, or LaTeX formulas
- ✅ **Preserves hierarchy** with header paths like `/Introduction/Overview`
- ✅ **Rich metadata** for filtering, ranking, and context
- ✅ **Streaming support** for large documents
- ✅ **Multiple output formats** (JSON, Dify-compatible, etc.)
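To make the problem concrete, here is a minimal sketch of fence-aware splitting using only the standard library. It is not Chunkana's implementation; the names `split_blocks`, `pack_chunks`, and `FENCE` are hypothetical and exist only for this illustration. The idea is to split on blank lines only while outside a fenced code block, then pack whole blocks into chunks so that a code block is never cut in the middle.

```python
FENCE = "`" * 3  # a Markdown code fence, built indirectly so this example nests cleanly


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, except for blank lines inside a fenced code block."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        if not line.strip() and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_chunks(blocks: list[str], max_chars: int) -> list[str]:
    """Greedily pack whole blocks; an oversized block is kept whole, never cut."""
    chunks: list[str] = []
    buffer = ""
    for block in blocks:
        candidate = f"{buffer}\n\n{block}" if buffer else block
        if buffer and len(candidate) > max_chars:
            chunks.append(buffer)
            buffer = block
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks


# Hand-written sample document: a paragraph, a code block with a blank line, a paragraph.
doc = "\n".join([
    "Intro paragraph.",
    "",
    FENCE + "python",
    "def f():",
    "",
    "    return 1",
    FENCE,
    "",
    "Closing paragraph.",
])
chunks = pack_chunks(split_blocks(doc), max_chars=30)
for chunk in chunks:
    print(repr(chunk))
```

A splitter that cuts at a fixed character count would separate `def f():` from its body here. The sketch keeps the code block in one chunk even though it exceeds `max_chars`, trading a strict size limit for structural integrity.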
## Key Features
- **Semantic preservation**: Headers, lists, tables, code blocks, and LaTeX stay intact
- **Smart strategies**: Auto-selects optimal chunking approach per document
- **Hierarchical navigation**: Build chunk trees for section-aware retrieval
- **Overlap metadata**: Context continuity without content duplication
- **Memory efficient**: Stream large files without loading everything into RAM
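The header paths mentioned above can be pictured as a stack of open headings. The following sketch derives a path such as `/Introduction/Overview` for each heading in a document. It is an illustration under assumptions, not Chunkana's code: `header_paths`, `HEADING`, and `FENCE` are hypothetical names, only ATX (`#`) headings are handled, and whether Chunkana includes the top-level heading in its paths is not confirmed here.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")  # ATX headings only


def header_paths(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, header_path) for every heading, 1-indexed."""
    stack: list[tuple[int, str]] = []  # (level, title) of the open ancestors
    paths: list[tuple[int, str]] = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        match = None if in_fence else HEADING.match(line)
        if match is None:
            continue  # body text, or a '#' comment inside a code block
        level, title = len(match.group(1)), match.group(2)
        # Close every heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append((number, "/" + "/".join(t for _, t in stack)))
    return paths


sample = "# Introduction\n## Overview\ntext\n## Details\n# Appendix"
paths = header_paths(sample)
for line_number, path in paths:
    print(line_number, path)
```

Storing the full path with each chunk lets a retriever filter by section prefix without re-parsing the source document.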
## Usage Examples
### Basic Configuration
```python
from chunkana import chunk_markdown, ChunkConfig

config = ChunkConfig(
    max_chunk_size=2048,
    min_chunk_size=256,
    overlap_size=100,
)
chunks = chunk_markdown(text, config)
```

### Hierarchical Chunking

```python
from chunkana import MarkdownChunker, ChunkConfig

chunker = MarkdownChunker(ChunkConfig(validate_invariants=True))
result = chunker.chunk_hierarchical(text)

# Get leaf chunks for indexing
flat_chunks = result.get_flat_chunks()

# Navigate the hierarchy
root = result.get_chunk(result.root_id)
children = result.get_children(result.root_id)
```

### Streaming Large Documents

```python
from chunkana import MarkdownChunker

chunker = MarkdownChunker()
for chunk in chunker.chunk_file_streaming("large_document.md"):
    print(f"Chunk {chunk.metadata['chunk_index']}: {chunk.size} chars")
```
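
For intuition about why streaming keeps memory use low, the sketch below reads a document line by line and yields one top-level section at a time, so only the current section is held in memory. This is a simplified illustration, not how `chunk_file_streaming` is implemented; `iter_sections` and `FENCE` are hypothetical names, and an in-memory `io.StringIO` stands in for a file opened with `open()`.

```python
import io
from typing import Iterator, TextIO

FENCE = "`" * 3  # Markdown code fence marker


def iter_sections(stream: TextIO) -> Iterator[str]:
    """Yield one top-level section at a time from a line-iterable stream."""
    buffer: list[str] = []
    in_fence = False
    for line in stream:  # text streams iterate lazily, one line at a time
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        starts_section = line.startswith("# ") and not in_fence
        if starts_section and buffer:
            yield "".join(buffer)  # hand off the finished section
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)  # final section


stream = io.StringIO("# A\ntext\n# B\nmore\n")
sections = list(iter_sections(stream))
for section in sections:
    print(repr(section))
```

Because the function is a generator, a caller can process or index each section before the next one is read from disk.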

### Output Formats

```python
from chunkana import chunk_markdown
from chunkana.renderers import render_json, render_dify_style

chunks = chunk_markdown(text)

# JSON format
json_output = render_json(chunks)

# Dify-compatible format
dify_output = render_dify_style(chunks)
```

## Metadata Schema

Each chunk includes rich metadata for retrieval:

```json
{
  "content": "# Section\nContent here...",
  "start_line": 1,
  "end_line": 10,
  "size": 156,
  "metadata": {
    "chunk_index": 0,
    "content_type": "section",
    "header_path": "/Introduction/Overview",
    "header_level": 2,
    "strategy": "structural",
    "has_code": false,
    "overlap_size": 100
  }
}
```
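
As an illustration of how this metadata can drive retrieval, the sketch below filters chunk records by section prefix and by whether they contain code. It works on plain dictionaries shaped like the schema above rather than on Chunkana objects. The `select_chunks` helper is hypothetical, and the three records are hand-written sample data, not output from the library.

```python
# Hand-written records that follow the schema above (illustrative data only).
records = [
    {"content": "# Overview\nText...", "start_line": 1, "end_line": 10, "size": 156,
     "metadata": {"chunk_index": 0, "header_path": "/Introduction/Overview",
                  "has_code": False}},
    {"content": "# Setup\nCode...", "start_line": 11, "end_line": 30, "size": 412,
     "metadata": {"chunk_index": 1, "header_path": "/Introduction/Setup",
                  "has_code": True}},
    {"content": "# API\nCode...", "start_line": 31, "end_line": 60, "size": 890,
     "metadata": {"chunk_index": 2, "header_path": "/Reference/API",
                  "has_code": True}},
]


def select_chunks(records, path_prefix="/", has_code=None):
    """Keep records under a header-path prefix, optionally filtered by has_code."""
    selected = []
    for record in records:
        meta = record["metadata"]
        if not meta["header_path"].startswith(path_prefix):
            continue
        if has_code is not None and meta["has_code"] != has_code:
            continue
        selected.append(record)
    return selected


intro = select_chunks(records, path_prefix="/Introduction")
intro_code = select_chunks(records, path_prefix="/Introduction", has_code=True)
print([r["metadata"]["chunk_index"] for r in intro])
print([r["metadata"]["chunk_index"] for r in intro_code])
```

The same pattern applies to a vector store that supports metadata filters: restrict the candidate set by `header_path` or `has_code` first, then rank by similarity.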

## Requirements

- Python 3.12+
- No external dependencies for core functionality
- Optional: `pip install "chunkana[docs]"` for documentation tools

## Integrations

- Dify: Direct compatibility with Dify workflows
- n8n: Automation pipeline integration
- Windmill: Batch processing workflows

## Documentation

- Quick Start Guide - Get started in minutes
- Configuration - All configuration options
- Strategies - How chunking strategies work
- Renderers - Output format options
- API Reference - Complete API documentation

## Contributing

We welcome contributions! See CONTRIBUTING.md for:
- Development setup
- Code style guidelines
- Testing procedures
- Pull request process

## License

MIT License - see LICENSE for details.
Need help? Check the documentation or open an issue.
File details
Details for the file chunkana-0.1.6.tar.gz.
File metadata
- Download URL: chunkana-0.1.6.tar.gz
- Upload date:
- Size: 1.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `b04e73b0f3c5351b642557f3447760516f51b1969661623f8f80d64fd8ace512` |
| MD5 | `64e6de754c187c48f92d0768a6726ad5` |
| BLAKE2b-256 | `09e3cbddee5ae4050438d1c70a22f8ab1a6ff360421fb25750fbb67b1a73442c` |
File details
Details for the file chunkana-0.1.6-py3-none-any.whl.
File metadata
- Download URL: chunkana-0.1.6-py3-none-any.whl
- Upload date:
- Size: 94.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `3dc6d61265aab8e544cf32dfc1a9b668624559b37f194c48e55ee9892ecc14f3` |
| MD5 | `59391aec1f943109c2063d5d43cb9ce6` |
| BLAKE2b-256 | `126be7443960c99e2f9daf073ea8ebd5b5b391fa8d39ee54ae6c75d302cab613` |