A robust Markdown chunking library that preserves structure and context

Markdown Chunker

A robust Python library for intelligently chunking Markdown documents while preserving structural integrity and maintaining context.

📋 Overview

Markdown Chunker helps you divide large Markdown files into smaller, more manageable chunks while preserving the structure, meaning, and context of the original document. It's ideal for:

  • Integrating long documents with AI models that have token limits
  • Creating semantic chunks for vector databases
  • Preparing content for efficient processing by NLP systems
  • Splitting documents for parallel processing while maintaining integrity

✨ Features

  • 🧠 Smart Chunking: Splits Markdown documents intelligently, preserving structure and meaning
  • 🔍 Content-Aware: Handles various Markdown elements with specialized intelligence:
    • Headings (never split)
    • Tables (split with headers preserved)
    • Code blocks (kept intact)
    • Lists (split between items)
    • Blockquotes (split at paragraph boundaries)
    • Footnotes (kept with their references when possible)
    • YAML Front Matter (kept intact)
    • HTML (preserves tag structure)
  • 🔄 Automatic Header/Footer Detection: Identifies and removes repeating headers and footers
  • 🚫 Duplicate Prevention: Automatically detects and removes duplicate chunks
  • ⚙️ Configurable Size Constraints: Customize minimum and maximum chunk sizes
  • 🏗️ Structure Preservation: Maintains Markdown syntax and document structure
  • 📝 Metadata Generation: Optionally adds metadata to each chunk as YAML front matter
  • ⚡ Parallel Processing: Efficiently processes large documents using multiple cores

🔧 Installation

# Install from PyPI (recommended)
pip install markdown-chunker

# Install the development version directly from GitHub
pip install git+https://github.com/hadjebi/markdown_chunker.git

🚀 Quick Start

from markdown_chunker import MarkdownChunkingStrategy

# Create a chunking strategy with the default size limits and metadata enabled
strategy = MarkdownChunkingStrategy(add_metadata=True)

# Or customize the parameters
strategy = MarkdownChunkingStrategy(
    min_chunk_len=512,    # Minimum chunk size (default: 512)
    soft_max_len=1024,    # Preferred maximum chunk size (default: 1024)
    hard_max_len=2048,    # Absolute maximum chunk size (default: 2048)
    detect_headers_footers=True,  # Detect and remove repeating headers/footers
    remove_duplicates=True,       # Remove duplicate chunks
    add_metadata=True             # Add metadata in each chunk as YAML front matter
)

# Chunk a Markdown document
with open('document.md', 'r', encoding='utf-8') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Process the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 80)

🖥️ Command Line Interface

The package includes a command-line interface for chunking Markdown files:

# Use the markdown-chunker command after installing with pip
markdown-chunker --add-metadata examples/sample.md

# Specify a custom output directory
markdown-chunker --add-metadata examples/sample.md custom_output_dir

# Customize parameters
markdown-chunker --add-metadata --min-chunk-len=256 --soft-max-len=512 --hard-max-len=1024 examples/sample.md

# Enable parallel processing for large documents
markdown-chunker --add-metadata --parallel --max-workers=4 examples/large_document.md

Available Options:

  • --min-chunk-len: Minimum chunk length in characters (default: 512)
  • --soft-max-len: Soft maximum chunk length in characters (default: 1024)
  • --hard-max-len: Hard maximum chunk length in characters (default: 2048)
  • --no-headers-footers: Disable header and footer detection
  • --no-duplicates: Disable duplicate detection
  • --add-metadata: Add metadata in each chunk as YAML front matter
  • --document-title: Specify a document title for metadata (auto-detected if not provided)
  • --parallel: Enable parallel processing for large documents
  • --max-workers: Maximum number of worker processes for parallel processing
  • --verbose: Enable verbose output
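
If you drive the CLI from a Python script, the options above can be assembled into an argument list. The following is a minimal sketch: `build_chunker_command` is a hypothetical helper, not part of the package, and it assumes the flags accept the `--name=value` form shown in the examples above.

```python
def build_chunker_command(input_path, output_dir=None, **options):
    """Build an argument list for the markdown-chunker CLI (hypothetical helper)."""
    command = ["markdown-chunker"]
    for name, value in options.items():
        # Turn a keyword such as min_chunk_len into the flag --min-chunk-len
        flag = "--" + name.replace("_", "-")
        if value is True:
            command.append(flag)  # boolean switch such as --parallel
        elif value is not False and value is not None:
            command.append(f"{flag}={value}")
    command.append(input_path)
    if output_dir is not None:
        command.append(output_dir)
    return command


command = build_chunker_command(
    "examples/sample.md",
    add_metadata=True,
    min_chunk_len=256,
    soft_max_len=512,
    hard_max_len=1024,
)
print(command)

# To run it (requires the package to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.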

🔬 Chunking Strategy

The library implements a sophisticated chunking strategy that follows these rules:

  1. Structure Preservation

    • Headings are never split
    • Code blocks are kept intact
    • Tables are split only when necessary, with headers preserved
    • Lists are split between items to maintain structure
    • Blockquotes are split at paragraph boundaries
    • Footnotes are kept with their references when possible
    • HTML tag structure is preserved
  2. Size Management

    • Chunks are kept between min_chunk_len and soft_max_len when possible
    • No chunk is allowed to exceed hard_max_len
    • Small chunks are merged when below min_chunk_len
  3. Header/Footer Handling

    • Automatically detects repeating headers and footers
    • Removes redundant elements while preserving unique content
    • Uses pattern matching to identify common elements
  4. Duplicate Prevention

    • Detects and removes duplicate chunks
    • Preserves the first occurrence of duplicate content
    • Uses MD5 hashing for efficient comparison
  5. Metadata Generation

    • Optionally adds metadata to each chunk as YAML front matter
    • Includes document information (title, source)
    • Provides chunk details (id, position, next/previous chunks)
    • Maintains heading hierarchy information
    • Identifies content types (tables, lists, code blocks, etc.)
    • Preserves and merges with existing YAML front matter
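
As a rough illustration of rules 2 and 4, the sketch below packs already-split blocks greedily into chunks and then drops duplicates by MD5 digest. It is not the library's implementation: `pack_blocks` and `drop_duplicates` are hypothetical helpers, blocks are assumed to be plain paragraphs, and an oversized block is simply cut at `hard_max_len` rather than split along Markdown structure.

```python
import hashlib


def pack_blocks(blocks, min_chunk_len=512, soft_max_len=1024, hard_max_len=2048):
    """Greedily pack text blocks into chunks (illustrative only)."""
    chunks = []
    current = ""
    for block in blocks:
        # Cut any single block that is longer than the hard limit.
        pieces = [block[i:i + hard_max_len] for i in range(0, len(block), hard_max_len)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= soft_max_len:
                # Still within the preferred size: keep growing the chunk.
                current = candidate
            elif len(current) < min_chunk_len and len(candidate) <= hard_max_len:
                # The chunk is too small to emit, so let it pass the soft limit.
                current = candidate
            else:
                # Emit the chunk and start a new one with this piece.
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def drop_duplicates(chunks):
    """Keep the first occurrence of each chunk, compared by MD5 digest."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.md5(chunk.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


blocks = ["a" * 300, "b" * 300, "c" * 600, "a" * 300]
chunks = pack_blocks(blocks)
print([len(c) for c in chunks])  # [602, 902]
print(drop_duplicates(["intro", "body", "intro"]))  # ['intro', 'body']
```

MD5 serves here only as a fast fingerprint for equality checks, not as a security measure. The sketch also leaves a short final chunk as it is, whereas the rules above describe merging small chunks where possible.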

📝 Examples

Basic Usage

from markdown_chunker import MarkdownChunkingStrategy

strategy = MarkdownChunkingStrategy(add_metadata=True)

# Simple document with various elements
content = """
# Main Title

## Section 1

This is a paragraph with some content.

```python
def example():
    return "Hello, World!"
```

1. First item
2. Second item
   - Subitem
   - Another subitem

> Important quote
> spanning multiple lines
"""

chunks = strategy.chunk_markdown(content)

Custom Configuration

from markdown_chunker import MarkdownChunkingStrategy

# Create a strategy with custom parameters
strategy = MarkdownChunkingStrategy(
    min_chunk_len=100,
    soft_max_len=200,
    hard_max_len=300,
    detect_headers_footers=False,  # Disable header/footer detection
    add_metadata=True              # Enable metadata adding
)

chunks = strategy.chunk_markdown(content)

Added Metadata

from markdown_chunker import MarkdownChunkingStrategy

# Create a strategy with added metadata
strategy = MarkdownChunkingStrategy(
    add_metadata=True,
    document_title="My Document",
    source_document="document.md"
)

with open('document.md', 'r', encoding='utf-8') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Each chunk will include YAML front matter like:
'''
---
chunk:
  id: 1
  total: 10
  previous: null
  next: 2
  length: 1024
  position: 10%
document:
  title: My Document
  source: document.md
content:
  types:
  - heading
  - paragraph
  word_count: 180
  characters: 1024
headings:
  main: Section 1
  all:
  - Main Title
  - Section 1
---

# Section 1

This is the beginning of section 1...
'''
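
When consuming chunks downstream, you may want to separate the front matter from the Markdown body. The sketch below does this with plain string operations. `split_front_matter` is a hypothetical helper, not part of the library, and it assumes the front matter is delimited by `---` lines exactly as in the example above.

```python
def split_front_matter(chunk):
    """Return (front_matter_text, body) for a chunk that may start with YAML front matter."""
    text = chunk.lstrip("\n")
    if not text.startswith("---\n"):
        return "", chunk
    # Look for the closing delimiter line that follows the opening one.
    end = text.find("\n---\n", 3)
    if end == -1:
        return "", chunk
    front_matter = text[4:end]
    body = text[end + len("\n---\n"):].lstrip("\n")
    return front_matter, body


sample = "---\nchunk:\n  id: 1\n---\n\n# Section 1\n"
front_matter, body = split_front_matter(sample)
print(front_matter)
print(body)
```

The returned front matter text can then be handed to a YAML parser, for example PyYAML's `yaml.safe_load`, if you need the values as a dictionary.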

Processing Large Documents

from markdown_chunker import MarkdownChunkingStrategy
import os

# Create a strategy with parallel processing for large documents
strategy = MarkdownChunkingStrategy(
    parallel_processing=True,
    max_workers=4,  # Number of worker processes
    add_metadata=True  # Include metadata in each chunk
)

# Process a large document
with open('large_document.md', 'r', encoding='utf-8') as f:
    content = f.read()

chunks = strategy.chunk_markdown(content)

# Save chunks to files
os.makedirs('output', exist_ok=True)
for i, chunk in enumerate(chunks):
    with open(f'output/chunk_{i+1:03d}.md', 'w', encoding='utf-8') as f:
        f.write(chunk)

📂 Sample Output

Example output directories are included in the repository:

  • examples/outputs/basic_example/: Basic chunking with metadata
  • examples/outputs/metadata_example/: Chunking with enhanced metadata
  • examples/outputs/custom_params_example/: Chunking with custom size parameters and metadata
  • examples/outputs/bmw_example/: Chunking of a large document (BMW Annual Report) with parallel processing and metadata

🔍 Advanced Usage

Parallel Processing

For large documents, you can enable parallel processing to spread the work across multiple worker processes:

strategy = MarkdownChunkingStrategy(
    parallel_processing=True,
    max_workers=4  # Number of worker processes
)
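
One way to choose `max_workers` is to derive it from the number of available CPU cores. This is a minimal sketch: `pick_max_workers` is a hypothetical helper, and leaving one core free is an assumption rather than a recommendation from the library.

```python
import os


def pick_max_workers(reserve=1):
    """Return a worker count that leaves `reserve` cores free, but never less than 1."""
    cpu_total = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, cpu_total - reserve)


print(pick_max_workers())

# Usage with the strategy (requires the package to be installed):
# strategy = MarkdownChunkingStrategy(
#     parallel_processing=True,
#     max_workers=pick_max_workers(),
# )
```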

Custom Content Handlers

The library is designed to be extensible. You can create custom content handlers for specialized Markdown elements:

from markdown_chunker import ContentHandler
from markdown_chunker.utils import is_special_element

class CustomElementHandler(ContentHandler):
    def can_handle(self, content):
        return is_special_element(content)
        
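    def split(self, content, max_length):
        # Placeholder logic: cut the content into max_length-sized pieces.
        # Replace this with splitting rules suited to your element.
        split_parts = [
            content[i:i + max_length]
            for i in range(0, len(content), max_length)
        ]
        return split_parts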

# Add to strategy
strategy = MarkdownChunkingStrategy(add_metadata=True)
strategy.content_handlers.append(CustomElementHandler())

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch: git checkout -b new-feature
  3. Make your changes and commit: git commit -m 'Add new feature'
  4. Push to your branch: git push origin new-feature
  5. Create a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Documentation

For complete documentation, see the docs directory.

🙏 Acknowledgements

  • Inspired by the needs of AI developers working with large documents
  • Built upon the shoulders of the Python Markdown ecosystem
