Intelligent Markdown chunking library for RAG systems

# Chunkana

A semantic Markdown chunker that preserves document structure for RAG and LLM pipelines. It never splits code blocks, tables, or headers, so every chunk stays semantically complete.

## Quick Start

```bash
pip install chunkana
```

````python
from chunkana import chunk_markdown

text = """
# My Document

## Section One
Some content here.

## Section Two
More content with code:

```python
def hello():
    print("Hello!")
```
"""

chunks = chunk_markdown(text)
for chunk in chunks:
    print(f"Lines {chunk.start_line}-{chunk.end_line}: {chunk.metadata['header_path']}")
    print(f"Content: {chunk.content[:100]}...")
````

## Why Chunkana?

**Problem**: Traditional splitters break Markdown structure, fragmenting code blocks, tables, and lists.

**Solution**: Chunkana preserves semantic boundaries while providing rich metadata for retrieval:

- ✅ **Never breaks** code blocks, tables, or LaTeX formulas
- ✅ **Preserves hierarchy** with header paths like `/Introduction/Overview`
- ✅ **Rich metadata** for filtering, ranking, and context
- ✅ **Streaming support** for large documents
- ✅ **Multiple output formats** (JSON, Dify-compatible, etc.)
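
To make the problem concrete, the sketch below contrasts a naive fixed-width splitter with a fence-aware one. It uses only the standard library and is not Chunkana's implementation; `naive_split` and `fence_aware_split` are hypothetical helpers introduced for illustration, and the fence-aware version only guards fenced code blocks, not tables or LaTeX.

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


def naive_split(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring Markdown structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fence_aware_split(text: str, size: int) -> list[str]:
    """Greedy line-based splitter that never cuts inside a fenced code block.

    A chunk may exceed `size` when a code block pushes it past the limit.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    in_fence = False
    for line in text.splitlines(keepends=True):
        # Start a new chunk before this line, but only outside a fence.
        if not in_fence and current and current_len + len(line) > size:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
    if current:
        chunks.append("".join(current))
    return chunks


doc = (
    "# Title\n\n"
    "Intro paragraph.\n\n"
    f"{FENCE}python\n"
    "def hello():\n"
    "    print('Hello!')\n"
    f"{FENCE}\n\n"
    "Closing paragraph.\n"
)

naive = naive_split(doc, 40)
aware = fence_aware_split(doc, 40)

# A chunk with an odd number of fence markers contains a broken code block.
broken_naive = [c for c in naive if c.count(FENCE) % 2]
broken_aware = [c for c in aware if c.count(FENCE) % 2]
print(len(broken_naive), len(broken_aware))
```

The naive splitter leaves the opening and closing fence in different chunks, while the fence-aware one keeps the whole block together at the cost of an oversized chunk.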

## Key Features

- **Semantic preservation**: Headers, lists, tables, code blocks, and LaTeX stay intact
- **Smart strategies**: Auto-selects optimal chunking approach per document
- **Hierarchical navigation**: Build chunk trees for section-aware retrieval
- **Overlap metadata**: Context continuity without content duplication
- **Memory efficient**: Stream large files without loading everything into RAM
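
One way to read "overlap metadata" is that neighboring context travels alongside a chunk rather than being copied into its content. The sketch below illustrates that idea with plain dictionaries; the helper `attach_overlap` and the keys `previous_content` and `next_content` are assumptions made for this example and may not match the names Chunkana uses.

```python
def attach_overlap(contents: list[str], overlap_size: int) -> list[dict]:
    """Wrap each text in a dict and record neighbor context in metadata.

    Each chunk's own `content` is left untouched, so joining the contents
    reproduces the source without repeated text.
    """
    chunks = [{"content": c, "metadata": {}} for c in contents]
    if overlap_size <= 0:
        return chunks
    for i, chunk in enumerate(chunks):
        if i > 0:
            # Tail of the previous chunk gives leading context.
            chunk["metadata"]["previous_content"] = contents[i - 1][-overlap_size:]
        if i < len(chunks) - 1:
            # Head of the next chunk gives trailing context.
            chunk["metadata"]["next_content"] = contents[i + 1][:overlap_size]
    return chunks


parts = ["Alpha section text. ", "Beta section text. ", "Gamma section text."]
chunks = attach_overlap(parts, overlap_size=5)

for chunk in chunks:
    print(repr(chunk["content"]), chunk["metadata"])
```

A retriever can then decide at query time whether to show the extra context, and the indexed text stays free of duplicates.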

## Usage Examples

### Basic Configuration

```python
from chunkana import chunk_markdown, ChunkConfig

config = ChunkConfig(
    max_chunk_size=2048,
    min_chunk_size=256,
    overlap_size=100,
)

# `text` is the Markdown string defined in the Quick Start
chunks = chunk_markdown(text, config)
```

### Hierarchical Chunking

```python
from chunkana import MarkdownChunker, ChunkConfig

chunker = MarkdownChunker(ChunkConfig(validate_invariants=True))
result = chunker.chunk_hierarchical(text)

# Get leaf chunks for indexing
flat_chunks = result.get_flat_chunks()

# Navigate the hierarchy
root = result.get_chunk(result.root_id)
children = result.get_children(result.root_id)
```
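
For intuition about what a chunk hierarchy looks like, here is a minimal standard-library sketch that builds a section tree from Markdown headings. It is an illustration under stated assumptions, not Chunkana's data model: `Node`, `build_tree`, `header_path`, and `leaf_ids` are hypothetical names, and the sketch ignores headings that appear inside code fences.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


@dataclass
class Node:
    node_id: int
    title: str
    level: int
    parent_id: Optional[int] = None
    children: list[int] = field(default_factory=list)


def build_tree(text: str) -> dict[int, Node]:
    """Return nodes keyed by id; node 0 is a synthetic document root."""
    nodes = {0: Node(node_id=0, title="<root>", level=0)}
    stack = [0]  # ids of the currently open sections, outermost first
    for line in text.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Close every open section at the same or a deeper level.
        while nodes[stack[-1]].level >= level:
            stack.pop()
        node_id = len(nodes)
        nodes[node_id] = Node(node_id, match.group(2), level, parent_id=stack[-1])
        nodes[stack[-1]].children.append(node_id)
        stack.append(node_id)
    return nodes


def header_path(nodes: dict[int, Node], node_id: int) -> str:
    """Join the titles from the root down to `node_id` with slashes."""
    parts = []
    while node_id != 0:
        parts.append(nodes[node_id].title)
        node_id = nodes[node_id].parent_id
    return "/" + "/".join(reversed(parts))


def leaf_ids(nodes: dict[int, Node]) -> list[int]:
    """Ids of sections without subsections, in document order."""
    return [n.node_id for n in nodes.values() if n.node_id != 0 and not n.children]


doc = "# Guide\n\n## Introduction\n\n### Overview\n\n## Usage\n"
tree = build_tree(doc)
print([header_path(tree, i) for i in leaf_ids(tree)])
```

Leaves are what you would typically embed and index, while parent nodes let a retriever widen the context to the enclosing section.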

### Streaming Large Documents

```python
from chunkana import MarkdownChunker

chunker = MarkdownChunker()
for chunk in chunker.chunk_file_streaming("large_document.md"):
    print(f"Chunk {chunk.metadata['chunk_index']}: {chunk.size} chars")
```
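
The general idea behind streaming can be sketched with a generator that reads one line at a time and emits a section whenever a new heading starts outside a code fence. This is a simplified illustration using only the standard library, not how `chunk_file_streaming` is implemented; `stream_sections` is a hypothetical helper, and the dictionary keys merely mirror the field names used elsewhere in this README.

```python
import io
import re
from typing import Iterable, Iterator

FENCE = "`" * 3  # built at runtime so the example nests cleanly in Markdown
HEADING = re.compile(r"^#{1,6}\s")


def stream_sections(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one section at a time; only the current section is buffered."""
    buffer: list[str] = []
    start_line = 1
    chunk_index = 0
    in_fence = False
    for line_no, line in enumerate(lines, start=1):
        is_fence = line.lstrip().startswith(FENCE)
        # A heading outside a code fence closes the section collected so far.
        if buffer and not in_fence and HEADING.match(line):
            yield {
                "chunk_index": chunk_index,
                "start_line": start_line,
                "end_line": line_no - 1,
                "content": "".join(buffer),
            }
            chunk_index += 1
            buffer, start_line = [], line_no
        buffer.append(line)
        if is_fence:
            in_fence = not in_fence
    if buffer:
        yield {
            "chunk_index": chunk_index,
            "start_line": start_line,
            "end_line": start_line + len(buffer) - 1,
            "content": "".join(buffer),
        }


source = io.StringIO(
    "# Title\n"
    "Intro.\n"
    "## Code\n"
    f"{FENCE}python\n"
    "# a comment, not a heading\n"
    f"{FENCE}\n"
    "## End\n"
    "Done.\n"
)
sections = list(stream_sections(source))
for section in sections:
    print(section["chunk_index"], section["start_line"], section["end_line"])
```

With a real file you would pass an open text file handle instead of the `io.StringIO` object, since iterating a file also yields one line at a time.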

### Output Formats

```python
from chunkana import chunk_markdown
from chunkana.renderers import render_json, render_dify_style

chunks = chunk_markdown(text)

# JSON format
json_output = render_json(chunks)

# Dify-compatible format
dify_output = render_dify_style(chunks)
```

## Metadata Schema

Each chunk includes rich metadata for retrieval:

```json
{
    "content": "# Section\nContent here...",
    "start_line": 1,
    "end_line": 10,
    "size": 156,
    "metadata": {
        "chunk_index": 0,
        "content_type": "section",
        "header_path": "/Introduction/Overview",
        "header_level": 2,
        "strategy": "structural",
        "has_code": false,
        "overlap_size": 100
    }
}
```
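
As an example of how this metadata can drive retrieval, the sketch below filters chunk dictionaries shaped like the schema above by section and by whether they contain code. `select_chunks` is a hypothetical helper written for this example, and the sample records are made up; only the `header_path` and `has_code` keys come from the schema.

```python
from typing import Optional


def select_chunks(
    chunks: list[dict],
    *,
    path_prefix: str = "/",
    require_code: Optional[bool] = None,
) -> list[dict]:
    """Keep chunks under `path_prefix`, optionally filtered on `has_code`."""
    prefix = path_prefix.rstrip("/")
    selected = []
    for chunk in chunks:
        meta = chunk["metadata"]
        path = meta["header_path"]
        # Match the section itself or anything nested below it.
        if path != prefix and not path.startswith(prefix + "/"):
            continue
        if require_code is not None and meta["has_code"] != require_code:
            continue
        selected.append(chunk)
    return selected


# Made-up sample records that follow the schema's key names.
sample = [
    {"content": "Overview text", "metadata": {"chunk_index": 0, "header_path": "/Introduction/Overview", "has_code": False}},
    {"content": "Streaming code", "metadata": {"chunk_index": 1, "header_path": "/Usage/Streaming", "has_code": True}},
    {"content": "Output notes", "metadata": {"chunk_index": 2, "header_path": "/Usage/Output", "has_code": False}},
    {"content": "Unrelated", "metadata": {"chunk_index": 3, "header_path": "/UsageNotes", "has_code": True}},
]

usage = select_chunks(sample, path_prefix="/Usage")
usage_code = select_chunks(sample, path_prefix="/Usage", require_code=True)
print([c["metadata"]["chunk_index"] for c in usage])
print([c["metadata"]["chunk_index"] for c in usage_code])
```

Matching on a path segment boundary keeps `/UsageNotes` out of a query for `/Usage`.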

## Requirements

- Python 3.12+
- No external dependencies for core functionality
- Optional: `pip install "chunkana[docs]"` for documentation tools

## Integrations

- **Dify**: Direct compatibility with Dify workflows
- **n8n**: Automation pipeline integration
- **Windmill**: Batch processing workflows

## Contributing

We welcome contributions! See CONTRIBUTING.md for:

- Development setup
- Code style guidelines
- Testing procedures
- Pull request process

## License

MIT License - see LICENSE for details.


Need help? Check the documentation or open an issue.
