A robust Markdown chunking library that preserves structure and context

Markdown Chunker

A robust Python library for intelligently chunking Markdown documents while preserving structural integrity and maintaining context.

📋 Overview

Markdown Chunker helps you divide large Markdown files into smaller, more manageable chunks while preserving the structure, meaning, and context of the original document. It's ideal for:

Integrating long documents with AI models that have token limits
Creating semantic chunks for vector databases
Preparing content for efficient processing by NLP systems
Splitting documents for parallel processing while maintaining integrity

✨ Features

🧠 Smart Chunking: Splits Markdown documents intelligently, preserving structure and meaning
🔍 Content-Aware: Handles various Markdown elements with specialized intelligence:
- Headings (never split)
- Tables (split with headers preserved)
- Code blocks (kept intact)
- Lists (split between items)
- Blockquotes (split at paragraph boundaries)
- Footnotes (kept with their references when possible)
- YAML Front Matter (kept intact)
- HTML (preserves tag structure)
🔄 Automatic Header/Footer Detection: Identifies and removes repeating headers and footers
🚫 Duplicate Prevention: Automatically detects and removes duplicate chunks
⚙️ Configurable Size Constraints: Customize minimum and maximum chunk sizes
🏗️ Structure Preservation: Maintains Markdown syntax and document structure
📝 Metadata Generation: Optionally adds metadata to each chunk as YAML front matter
⚡ Parallel Processing: Efficiently processes large documents using multiple cores
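
To make "split with headers preserved" concrete, here is a minimal sketch of the idea for tables. It is not the library's implementation: `split_table` and `max_rows` are hypothetical names, and the sketch assumes a simple pipe table whose first two lines are the header row and the delimiter row.

```python
def split_table(table_md: str, max_rows: int) -> list[str]:
    """Split a pipe table into pieces of at most max_rows body rows.

    The header and delimiter rows are repeated at the top of every piece,
    so each piece is still a valid table on its own.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    lines = [line for line in table_md.strip().splitlines() if line.strip()]
    if len(lines) < 3:
        # Nothing to split: header only, or not a table at all.
        return ["\n".join(lines)]
    header, delimiter, body = lines[0], lines[1], lines[2:]
    pieces = []
    for start in range(0, len(body), max_rows):
        rows = body[start:start + max_rows]
        pieces.append("\n".join([header, delimiter, *rows]))
    return pieces


table = """
| Name | Qty |
|------|-----|
| a    | 1   |
| b    | 2   |
| c    | 3   |
"""

parts = split_table(table, max_rows=2)
for part in parts:
    print(part)
    print()
```

Each printed piece starts with the same two header lines, which is what lets a downstream consumer interpret the rows without seeing the rest of the table.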

🔧 Installation

# Install from PyPI (recommended)
pip install markdown-chunker

# Install the development version directly from GitHub
pip install git+https://github.com/hadjebi/markdown_chunker.git

🚀 Quick Start

from markdown_chunker import MarkdownChunkingStrategy

# Create a chunking strategy with default configuration
strategy = MarkdownChunkingStrategy(add_metadata=True)

# Or customize the parameters
strategy = MarkdownChunkingStrategy(
    min_chunk_len=512,    # Minimum chunk size (default: 512)
    soft_max_len=1024,    # Preferred maximum chunk size (default: 1024)
    hard_max_len=2048,    # Absolute maximum chunk size (default: 2048)
    detect_headers_footers=True,  # Detect and remove repeating headers/footers
    remove_duplicates=True,       # Remove duplicate chunks
    add_metadata=True             # Add metadata to each chunk as YAML front matter
)

# Chunk a Markdown document
with open('document.md', 'r', encoding='utf-8') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Process the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 80)
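
The three size parameters are easier to reason about with a toy example. The following is a simplified greedy packer written for illustration only; it is an assumption about how such limits can interact, not the algorithm the library uses. `pack_blocks` is a hypothetical helper that works on text blocks that have already been separated.

```python
def pack_blocks(blocks, min_chunk_len=512, soft_max_len=1024, hard_max_len=2048):
    """Greedily pack text blocks into chunks (illustrative sketch only).

    - Blocks are appended while the chunk stays within soft_max_len.
    - A chunk still shorter than min_chunk_len may grow past soft_max_len,
      as long as it stays within hard_max_len.
    - A single block longer than hard_max_len is passed through unsplit here.
    """
    chunks = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= soft_max_len:
            current = candidate
        elif len(current) < min_chunk_len and len(candidate) <= hard_max_len:
            # Current chunk is too small to stand alone: exceed the soft limit.
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks


blocks = ["a" * 20, "b" * 50, "c" * 40, "d" * 10]
chunks = pack_blocks(blocks, min_chunk_len=30, soft_max_len=60, hard_max_len=100)
print([len(chunk) for chunk in chunks])
```

In this run the first block is below the minimum, so it absorbs the second block even though the result passes the soft limit; the remaining two blocks fit together under the soft limit.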

🖥️ Command Line Interface

The package includes a command-line interface for chunking Markdown files:

# Use the markdown-chunker command after installing with pip
markdown-chunker --add-metadata examples/sample.md

# Specify a custom output directory
markdown-chunker --add-metadata examples/sample.md custom_output_dir

# Customize parameters
markdown-chunker --add-metadata --min-chunk-len=256 --soft-max-len=512 --hard-max-len=1024 examples/sample.md

# Enable parallel processing for large documents
markdown-chunker --add-metadata --parallel --max-workers=4 examples/large_document.md

Available Options:

--min-chunk-len: Minimum chunk length in characters (default: 512)
--soft-max-len: Soft maximum chunk length in characters (default: 1024)
--hard-max-len: Hard maximum chunk length in characters (default: 2048)
--no-headers-footers: Disable header and footer detection
--no-duplicates: Disable duplicate detection
--add-metadata: Add metadata to each chunk as YAML front matter
--document-title: Specify a document title for metadata (auto-detected if not provided)
--parallel: Enable parallel processing for large documents
--max-workers: Maximum number of worker processes for parallel processing
--verbose: Enable verbose output
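
For readers who want to see how these options might map onto the Python API, here is a rough `argparse` sketch. It only mirrors the flags and defaults listed above; the positional names `input_file` and `output_dir`, and the mapping onto constructor keywords, are assumptions rather than the actual CLI source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(prog="markdown-chunker")
    parser.add_argument("input_file", help="Markdown file to chunk")
    parser.add_argument("output_dir", nargs="?", default=None,
                        help="Optional output directory")
    parser.add_argument("--min-chunk-len", type=int, default=512)
    parser.add_argument("--soft-max-len", type=int, default=1024)
    parser.add_argument("--hard-max-len", type=int, default=2048)
    parser.add_argument("--no-headers-footers", action="store_true")
    parser.add_argument("--no-duplicates", action="store_true")
    parser.add_argument("--add-metadata", action="store_true")
    parser.add_argument("--document-title", default=None)
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--max-workers", type=int, default=None)
    parser.add_argument("--verbose", action="store_true")
    return parser


args = build_parser().parse_args([
    "--add-metadata", "--min-chunk-len=256", "--soft-max-len=512",
    "--hard-max-len=1024", "examples/sample.md",
])

# Assumed mapping from flags to the constructor keywords shown in Quick Start.
strategy_kwargs = {
    "min_chunk_len": args.min_chunk_len,
    "soft_max_len": args.soft_max_len,
    "hard_max_len": args.hard_max_len,
    "detect_headers_footers": not args.no_headers_footers,
    "remove_duplicates": not args.no_duplicates,
    "add_metadata": args.add_metadata,
}
print(strategy_kwargs)
```

Note that the two `--no-*` flags are negative switches, so they invert into the positive `detect_headers_footers` and `remove_duplicates` keywords.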

🔬 Chunking Strategy

The library implements a sophisticated chunking strategy that follows these rules:

Structure Preservation
- Headings are never split
- Code blocks are kept intact
- Tables are split only when necessary, with headers preserved
- Lists are split between items to maintain structure
- Blockquotes are split at paragraph boundaries
- Footnotes are kept with their references when possible
- HTML tags are preserved in their structure
Size Management
- Chunks are kept between min_chunk_len and soft_max_len when possible
- hard_max_len is the absolute upper bound on chunk size
- Small chunks are merged when below min_chunk_len
Header/Footer Handling
- Automatically detects repeating headers and footers
- Removes redundant elements while preserving unique content
- Uses pattern matching to identify common elements
Duplicate Prevention
- Detects and removes duplicate chunks
- Preserves the first occurrence of duplicate content
- Uses MD5 hashing for efficient comparison
Metadata Generation
- Optionally adds metadata to each chunk as YAML front matter
- Includes document information (title, source)
- Provides chunk details (id, position, next/previous chunks)
- Maintains heading hierarchy information
- Identifies content types (tables, lists, code blocks, etc.)
- Preserves and merges with existing YAML front matter
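
The duplicate-prevention rule (keep the first occurrence, compare by MD5 hash) can be sketched in a few lines. This is a standalone illustration, not the library's code: `remove_duplicate_chunks` is a hypothetical name, and stripping surrounding whitespace before hashing is an assumption of this sketch.

```python
import hashlib


def remove_duplicate_chunks(chunks):
    """Return chunks in order, dropping any whose content was already seen."""
    seen = set()
    unique = []
    for chunk in chunks:
        # MD5 is used only as a fast fingerprint here, not for security.
        data = chunk.strip().encode("utf-8")
        digest = hashlib.md5(data, usedforsecurity=False).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)  # first occurrence wins
    return unique


chunks = ["# A\n\ntext", "# B\n\nmore", "# A\n\ntext\n"]
unique_chunks = remove_duplicate_chunks(chunks)
print(len(unique_chunks))
```

Storing digests instead of full chunk text keeps the `seen` set small even when individual chunks are large.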

📝 Examples

Basic Usage

from markdown_chunker import MarkdownChunkingStrategy

strategy = MarkdownChunkingStrategy(add_metadata=True)

# Simple document with various elements
content = """
# Main Title

## Section 1

This is a paragraph with some content.

```python
def example():
    return "Hello, World!"
```

1. First item
2. Second item
   - Subitem
   - Another subitem

> Important quote
> spanning multiple lines
"""

chunks = strategy.chunk_markdown(content)

Custom Configuration

from markdown_chunker import MarkdownChunkingStrategy

# Create a strategy with custom parameters
strategy = MarkdownChunkingStrategy(
    min_chunk_len=100,
    soft_max_len=200,
    hard_max_len=300,
    detect_headers_footers=False,  # Disable header/footer detection
    add_metadata=True              # Add metadata to each chunk
)

chunks = strategy.chunk_markdown(content)

Added Metadata

from markdown_chunker import MarkdownChunkingStrategy

# Create a strategy with added metadata
strategy = MarkdownChunkingStrategy(
    add_metadata=True,
    document_title="My Document",
    source_document="document.md"
)

with open('document.md', 'r') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Each chunk will include YAML front matter like:
'''
---
chunk:
  id: 1
  total: 10
  previous: null
  next: 2
  length: 1024
  position: 10%
document:
  title: My Document
  source: document.md
content:
  types:
  - heading
  - paragraph
  word_count: 180
  characters: 1024
headings:
  main: Section 1
  all:
  - Main Title
  - Section 1
---

# Section 1

This is the beginning of section 1...
'''
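
When consuming chunks downstream, you will often want to separate the front matter from the Markdown body. The helper below is a minimal standard-library sketch; `split_front_matter` is a hypothetical function, not part of the package, and it only assumes the `---` delimiters shown in the example above.

```python
def split_front_matter(chunk: str):
    """Return (front_matter_text, body).

    front_matter_text is None when the chunk does not start with a
    '---' delimited block; in that case body is the unchanged chunk.
    """
    lines = chunk.lstrip("\n").splitlines()
    if not lines or lines[0].strip() != "---":
        return None, chunk
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front_matter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return front_matter, body
    # Opening delimiter without a closing one: treat as plain content.
    return None, chunk


chunk = """---
chunk:
  id: 1
  total: 10
document:
  title: My Document
---

# Section 1

This is the beginning of section 1...
"""

front_matter, body = split_front_matter(chunk)
print(front_matter)
print(body)
```

The returned `front_matter` string can then be handed to a YAML parser of your choice (for example PyYAML's `yaml.safe_load`) to get a dictionary.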

Processing Large Documents

from markdown_chunker import MarkdownChunkingStrategy
import os

# Create a strategy with parallel processing for large documents
strategy = MarkdownChunkingStrategy(
    parallel_processing=True,
    max_workers=4,  # Number of worker processes
    add_metadata=True  # Include metadata in each chunk
)

# Process a large document
with open('large_document.md', 'r') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Save chunks to files
os.makedirs('output', exist_ok=True)
for i, chunk in enumerate(chunks):
    with open(f'output/chunk_{i+1:03d}.md', 'w') as f:
        f.write(chunk)

📂 Sample Output

Example output directories are included in the repository:

examples/outputs/basic_example/: Basic chunking with metadata
examples/outputs/metadata_example/: Chunking with enhanced metadata
examples/outputs/custom_params_example/: Chunking with custom size parameters and metadata
examples/outputs/bmw_example/: Chunking of a large document (BMW Annual Report) with parallel processing and metadata

🔍 Advanced Usage

Parallel Processing

For large documents, you can enable parallel processing, which spreads the work across multiple worker processes:

strategy = MarkdownChunkingStrategy(
    parallel_processing=True,
    max_workers=4  # Number of worker processes
)

Custom Content Handlers

The library is designed to be extensible. You can create custom content handlers for specialized Markdown elements:

from markdown_chunker import ContentHandler
from markdown_chunker.utils import is_special_element

class CustomElementHandler(ContentHandler):
    def can_handle(self, content):
        return is_special_element(content)
        
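    def split(self, content, max_length):
        # Placeholder logic: cut the content into pieces of at most max_length characters.
        # Replace this with splitting that respects your element's structure.
        split_parts = [content[i:i + max_length] for i in range(0, len(content), max_length)]
        return split_parts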

# Add to strategy
strategy = MarkdownChunkingStrategy(add_metadata=True)
strategy.content_handlers.append(CustomElementHandler())
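
To show the handler pattern end to end without depending on the package, here is a self-contained sketch. Everything in it is hypothetical: the `ContentHandler` class is a stand-in written for this example (the real base class may differ), `AdmonitionHandler` targets an invented `:::` block syntax, and the "first matching handler wins" dispatch in `split_with_handlers` is an assumption.

```python
class ContentHandler:
    """Stand-in base class for this sketch; the real interface may differ."""

    def can_handle(self, content: str) -> bool:
        raise NotImplementedError

    def split(self, content: str, max_length: int) -> list[str]:
        raise NotImplementedError


class AdmonitionHandler(ContentHandler):
    """Handles hypothetical ':::' blocks, splitting only at blank lines."""

    def can_handle(self, content):
        return content.lstrip().startswith(":::")

    def split(self, content, max_length):
        parts, current = [], ""
        for paragraph in content.split("\n\n"):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > max_length:
                parts.append(current)
                current = paragraph
            else:
                current = candidate
        if current:
            parts.append(current)
        return parts


def split_with_handlers(content, handlers, max_length):
    """Use the first handler that accepts the content; otherwise keep it whole."""
    for handler in handlers:
        if handler.can_handle(content):
            return handler.split(content, max_length)
    return [content]


handlers = [AdmonitionHandler()]
block = ":::note\nFirst paragraph.\n\nSecond paragraph.\n\nThird.\n:::"
parts = split_with_handlers(block, handlers, max_length=30)
print(parts)
```

Keeping `can_handle` cheap and side-effect free matters in this design, because it may be called for every block in the document.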

🤝 Contributing

Contributions are welcome! Here's how you can help:

1. Fork the repository
2. Create a feature branch: git checkout -b new-feature
3. Make your changes and commit: git commit -m 'Add new feature'
4. Push to your branch: git push origin new-feature
5. Create a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Documentation

For complete documentation, see the docs directory.

🙏 Acknowledgements

Inspired by the needs of AI developers working with large documents
Built upon the shoulders of the Python Markdown ecosystem

Release history

0.1.3 (Feb 25, 2025)
0.1.2 (Feb 25, 2025, this version)
0.1.1 (Feb 25, 2025)
0.1.0 (Feb 25, 2025)

Download files

Source Distribution

markdown_chunker-0.1.2.tar.gz (90.5 kB), uploaded Feb 25, 2025

Built Distribution

markdown_chunker-0.1.2-py3-none-any.whl (27.3 kB), uploaded Feb 25, 2025, Python 3

File details

Details for the file markdown_chunker-0.1.2.tar.gz.

File metadata

Download URL: markdown_chunker-0.1.2.tar.gz
Upload date: Feb 25, 2025
Size: 90.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for markdown_chunker-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`537261ef7d8471c82bb49998c1c9911ae02ed9e34c47de267fcba8ecd03dd858`
MD5	`f6a1be6d83118b1d50cc0665a9c7d23b`
BLAKE2b-256	`6aeccfc6a28961b622eebb31a4bac1335d5100c3da4364466cdcec78483069dc`

File details

Details for the file markdown_chunker-0.1.2-py3-none-any.whl.

File metadata

Download URL: markdown_chunker-0.1.2-py3-none-any.whl
Upload date: Feb 25, 2025
Size: 27.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for markdown_chunker-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`69c5523ef766392f3480575d438a8e184a0ee7a1492a7fe25d6a1271bedb64ed`
MD5	`c85b897537db4e59af13b168d245ba7f`
BLAKE2b-256	`36166ee7c68d84d59d65639c88b78d62c29f2ca85e02cdecd13c91e059c13c3f`
