Markdown Chunker
A robust Python library for intelligently chunking Markdown documents while preserving structural integrity and maintaining context.
📋 Overview
Markdown Chunker helps you divide large Markdown files into smaller, more manageable chunks while preserving the structure, meaning, and context of the original document. It's ideal for:
- Integrating long documents with AI models that have token limits
- Creating semantic chunks for vector databases
- Preparing content for efficient processing by NLP systems
- Splitting documents for parallel processing while maintaining integrity
✨ Features
- 🧠 Smart Chunking: Splits Markdown documents intelligently, preserving structure and meaning
- 🔍 Content-Aware: Applies element-specific rules to each type of Markdown element:
  - Headings (never split)
  - Tables (split with headers preserved)
  - Code blocks (kept intact)
  - Lists (split between items)
  - Blockquotes (split at paragraph boundaries)
  - Footnotes (kept with their references when possible)
  - YAML Front Matter (kept intact)
  - HTML (preserves tag structure)
- 🔄 Automatic Header/Footer Detection: Identifies and removes repeating headers and footers
- 🚫 Duplicate Prevention: Automatically detects and removes duplicate chunks
- ⚙️ Configurable Size Constraints: Customize minimum and maximum chunk sizes
- 🏗️ Structure Preservation: Maintains Markdown syntax and document structure
- 📝 Metadata Generation: Optionally adds metadata to each chunk as YAML front matter
- ⚡ Parallel Processing: Efficiently processes large documents using multiple cores
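To make one of the element-specific rules above concrete, here is a minimal sketch of splitting a Markdown table while repeating its header in every part. The function name `split_table` and the `max_rows` parameter are hypothetical and chosen for this illustration only; the library's own table handling may differ (for example, it may split by character length rather than by row count).

```python
def split_table(table: str, max_rows: int) -> list[str]:
    """Split a Markdown table into parts of at most max_rows body rows,
    repeating the header and separator rows in each part."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    lines = [line for line in table.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        # Not a table with a header and separator row; leave it untouched.
        return [table.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]
    if not rows:
        return ["\n".join([header, separator])]
    parts = []
    for start in range(0, len(rows), max_rows):
        # Every part starts with the same header and separator rows.
        parts.append("\n".join([header, separator, *rows[start:start + max_rows]]))
    return parts


table = """
| Name | Qty |
|------|-----|
| a    | 1   |
| b    | 2   |
| c    | 3   |
"""
parts = split_table(table, max_rows=2)
for part in parts:
    print(part)
    print()
```

Each resulting part is a valid table on its own, so a reader or model that sees only one chunk still knows what the columns mean.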
🔧 Installation
# Install from PyPI (recommended)
pip install markdown-chunker
# Install the development version directly from GitHub
pip install git+https://github.com/hadjebi/markdown_chunker.git
🚀 Quick Start
from markdown_chunker import MarkdownChunkingStrategy
# Create a chunking strategy with default configuration
strategy = MarkdownChunkingStrategy(add_metadata=True)
# Or customize the parameters
strategy = MarkdownChunkingStrategy(
    min_chunk_len=512,            # Minimum chunk size (default: 512)
    soft_max_len=1024,            # Preferred maximum chunk size (default: 1024)
    hard_max_len=2048,            # Absolute maximum chunk size (default: 2048)
    detect_headers_footers=True,  # Detect and remove repeating headers/footers
    remove_duplicates=True,       # Remove duplicate chunks
    add_metadata=True             # Add metadata to each chunk as YAML front matter
)
# Chunk a Markdown document
with open('document.md', 'r') as f:
    content = f.read()
chunks = strategy.chunk_markdown(content)
# Process the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 80)
🖥️ Command Line Interface
The package includes a command-line interface for chunking Markdown files:
# Use the markdown-chunker command after installing with pip
markdown-chunker --add-metadata examples/sample.md
# Specify a custom output directory
markdown-chunker --add-metadata examples/sample.md custom_output_dir
# Customize parameters
markdown-chunker --add-metadata --min-chunk-len=256 --soft-max-len=512 --hard-max-len=1024 examples/sample.md
# Enable parallel processing for large documents
markdown-chunker --add-metadata --parallel --max-workers=4 examples/large_document.md
Available Options:
- --min-chunk-len: Minimum chunk length in characters (default: 512)
- --soft-max-len: Soft maximum chunk length in characters (default: 1024)
- --hard-max-len: Hard maximum chunk length in characters (default: 2048)
- --no-headers-footers: Disable header and footer detection
- --no-duplicates: Disable duplicate detection
- --add-metadata: Add metadata to each chunk as YAML front matter
- --document-title: Specify a document title for metadata (auto-detected if not provided)
- --parallel: Enable parallel processing for large documents
- --max-workers: Maximum number of worker processes for parallel processing
- --verbose: Enable verbose output
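To show how these options combine, the following sketch builds an `argparse` parser that mirrors the flags and defaults listed above. It is a hypothetical stand-in, not the package's actual CLI code: the positional argument names, the `dest` names, and the mapping of the `--no-*` flags onto boolean settings are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names and defaults come from the option list,
    # everything else (dest names, positional names) is an assumption.
    parser = argparse.ArgumentParser(prog="markdown-chunker")
    parser.add_argument("input_file")
    parser.add_argument("output_dir", nargs="?", default=None)
    parser.add_argument("--min-chunk-len", type=int, default=512)
    parser.add_argument("--soft-max-len", type=int, default=1024)
    parser.add_argument("--hard-max-len", type=int, default=2048)
    # The --no-* flags switch off behaviour that is on by default.
    parser.add_argument("--no-headers-footers", dest="detect_headers_footers",
                        action="store_false")
    parser.add_argument("--no-duplicates", dest="remove_duplicates",
                        action="store_false")
    parser.add_argument("--add-metadata", action="store_true")
    parser.add_argument("--document-title", default=None)
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--max-workers", type=int, default=None)
    parser.add_argument("--verbose", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--add-metadata", "--min-chunk-len=256", "--no-duplicates", "examples/sample.md"]
)
print(args.min_chunk_len, args.remove_duplicates, args.add_metadata, args.output_dir)
```

Options that are not passed keep the defaults from the list, so in this invocation only the minimum chunk length and duplicate removal differ from a default run with metadata enabled.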
🔬 Chunking Strategy
The library's chunking strategy follows these rules:
- Structure Preservation
  - Headings are never split
  - Code blocks are kept intact
  - Tables are split only when necessary, with headers preserved
  - Lists are split between items to maintain structure
  - Blockquotes are split at paragraph boundaries
  - Footnotes are kept with their references when possible
  - HTML tag structure is preserved
- Size Management
  - Chunks are kept between min_chunk_len and soft_max_len when possible
  - Chunks never exceed hard_max_len
  - Chunks smaller than min_chunk_len are merged
- Header/Footer Handling
  - Automatically detects repeating headers and footers
  - Removes redundant elements while preserving unique content
  - Uses pattern matching to identify common elements
- Duplicate Prevention
  - Detects and removes duplicate chunks
  - Preserves the first occurrence of duplicate content
  - Uses MD5 hashing for efficient comparison
- Metadata
  - Optionally adds metadata to each chunk as YAML front matter
  - Includes document information (title, source)
  - Provides chunk details (id, position, next/previous chunks)
  - Maintains heading hierarchy information
  - Identifies content types (tables, lists, code blocks, etc.)
  - Preserves and merges with existing YAML front matter
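The duplicate-prevention rule can be sketched in a few lines: hash each chunk with MD5, remember the digests already seen, and keep only the first chunk for each digest. The function name `remove_duplicate_chunks` is hypothetical, and ignoring surrounding whitespace before hashing is an assumption of this sketch rather than documented library behaviour.

```python
import hashlib


def remove_duplicate_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing by MD5 digest."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Assumption: surrounding whitespace is ignored when comparing chunks.
        digest = hashlib.md5(chunk.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


chunks = ["# Intro\nHello", "## Details\nMore text", "# Intro\nHello\n"]
print(remove_duplicate_chunks(chunks))
```

Comparing fixed-size digests instead of full chunk text keeps the set of seen values small, which matters when a document produces many long chunks. MD5 is used here only as a fast fingerprint, not for security.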
📝 Examples
Basic Usage
from markdown_chunker import MarkdownChunkingStrategy
strategy = MarkdownChunkingStrategy(add_metadata=True)
# Simple document with various elements
content = """
# Main Title

## Section 1

This is a paragraph with some content.

```python
def example():
    return "Hello, World!"
```

- First item
- Second item
  - Subitem
  - Another subitem

> Important quote
> spanning multiple lines
"""
chunks = strategy.chunk_markdown(content)
Custom Configuration
from markdown_chunker import MarkdownChunkingStrategy
# Create a strategy with custom parameters
strategy = MarkdownChunkingStrategy(
    min_chunk_len=100,
    soft_max_len=200,
    hard_max_len=300,
    detect_headers_footers=False,  # Disable header/footer detection
    add_metadata=True              # Enable metadata
)
chunks = strategy.chunk_markdown(content)
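To give a feel for how the three size parameters interact, here is a simplified, paragraph-only sketch. It is not the library's algorithm: `pack_paragraphs` is a hypothetical helper that greedily fills chunks up to the soft maximum, hard-splits any single paragraph longer than the hard maximum, and merges a short final chunk into its predecessor when the result still fits under the hard maximum. It assumes `soft_max_len <= hard_max_len` and uses small limits so the effect is easy to see.

```python
def pack_paragraphs(text: str, min_chunk_len: int, soft_max_len: int,
                    hard_max_len: int) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into size-bounded chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    # A paragraph longer than the hard maximum is cut into fixed-size pieces.
    pieces = []
    for paragraph in paragraphs:
        for start in range(0, len(paragraph), hard_max_len):
            pieces.append(paragraph[start:start + hard_max_len])

    # Fill each chunk until adding the next piece would pass the soft maximum.
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + "\n\n" + piece
        if current and len(candidate) > soft_max_len:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Merge a too-small final chunk into the previous one if the hard limit allows.
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk_len:
        merged = chunks[-2] + "\n\n" + chunks[-1]
        if len(merged) <= hard_max_len:
            chunks[-2:] = [merged]
    return chunks


text = "\n\n".join(["a" * 38, "b" * 38, "c" * 5])
chunks = pack_paragraphs(text, min_chunk_len=20, soft_max_len=40, hard_max_len=60)
print([len(chunk) for chunk in chunks])
```

In this run the last paragraph is too short to stand alone, so it is merged into the second chunk. That chunk ends up longer than the soft maximum but still within the hard maximum, which is the role the two limits play: the soft limit is a target, the hard limit is a ceiling.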
Added Metadata
from markdown_chunker import MarkdownChunkingStrategy
# Create a strategy with added metadata
strategy = MarkdownChunkingStrategy(
    add_metadata=True,
    document_title="My Document",
    source_document="document.md"
)
with open('document.md', 'r') as f:
    content = f.read()
chunks = strategy.chunk_markdown(content)
# Each chunk will include YAML front matter like:
'''
---
chunk:
  id: 1
  total: 10
  previous: null
  next: 2
  length: 1024
  position: 10%
document:
  title: My Document
  source: document.md
content:
  types:
    - heading
    - paragraph
  word_count: 180
  characters: 1024
headings:
  main: Section 1
  all:
    - Main Title
    - Section 1
---
# Section 1
This is the beginning of section 1...
'''
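When consuming chunks downstream, it is often useful to separate the metadata from the Markdown body. The sketch below does this with the standard library only; `split_front_matter` is a hypothetical helper, not part of the package, and it assumes the front matter is delimited by `---` lines as in the example above.

```python
def split_front_matter(chunk: str) -> tuple[str, str]:
    """Return (front_matter, body); front_matter is '' when the chunk has none."""
    lines = chunk.lstrip().splitlines()
    if not lines or lines[0].strip() != "---":
        return "", chunk
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front_matter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return front_matter, body
    # No closing delimiter was found, so treat everything as body.
    return "", chunk


chunk = """---
chunk:
  id: 1
  total: 10
document:
  title: My Document
---
# Section 1
This is the beginning of section 1...
"""
front_matter, body = split_front_matter(chunk)
print(front_matter)
print(body)
```

The returned front-matter text can then be passed to a YAML parser, for example PyYAML's `yaml.safe_load`, to get the metadata as a dictionary.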
Processing Large Documents
from markdown_chunker import MarkdownChunkingStrategy
import os
# Create a strategy with parallel processing for large documents
strategy = MarkdownChunkingStrategy(
    parallel_processing=True,
    max_workers=4,      # Number of worker processes
    add_metadata=True   # Include metadata in each chunk
)
# Process a large document
with open('large_document.md', 'r') as f:
    content = f.read()
chunks = strategy.chunk_markdown(content)
# Save chunks to files
os.makedirs('output', exist_ok=True)
for i, chunk in enumerate(chunks):
    with open(f'output/chunk_{i+1:03d}.md', 'w') as f:
        f.write(chunk)
📂 Sample Output
Example output directories are included in the repository:
- examples/outputs/basic_example/: Basic chunking with metadata
- examples/outputs/metadata_example/: Chunking with enhanced metadata
- examples/outputs/custom_params_example/: Chunking with custom size parameters and metadata
- examples/outputs/bmw_example/: Chunking of a large document (BMW Annual Report) with parallel processing and metadata
🔍 Advanced Usage
Parallel Processing
For large documents, you can enable parallel processing, which spreads the work across multiple worker processes and can reduce processing time:
strategy = MarkdownChunkingStrategy(
    parallel_processing=True,
    max_workers=4  # Number of worker processes
)
Custom Content Handlers
The library is designed to be extensible. You can create custom content handlers for specialized Markdown elements:
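from markdown_chunker import ContentHandler, MarkdownChunkingStrategy
from markdown_chunker.utils import is_special_element
class CustomElementHandler(ContentHandler):
    def can_handle(self, content):
        # Claim only the content this handler knows how to split
        return is_special_element(content)
    def split(self, content, max_length):
        # Placeholder splitting logic: cut into pieces of at most max_length characters.
        # Replace this with rules that suit your element type.
        split_parts = [content[i:i + max_length] for i in range(0, len(content), max_length)]
        return split_parts
# Add to strategy
strategy = MarkdownChunkingStrategy(add_metadata=True)
strategy.content_handlers.append(CustomElementHandler())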
🤝 Contributing
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch: git checkout -b new-feature
- Make your changes and commit: git commit -m 'Add new feature'
- Push to your branch: git push origin new-feature
- Create a pull request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
📚 Documentation
For complete documentation, see the docs directory.
🙏 Acknowledgements
- Inspired by the needs of AI developers working with large documents
- Built upon the shoulders of the Python Markdown ecosystem
File details
Details for the file markdown_chunker-0.1.2.tar.gz.
File metadata
- Download URL: markdown_chunker-0.1.2.tar.gz
- Upload date:
- Size: 90.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 537261ef7d8471c82bb49998c1c9911ae02ed9e34c47de267fcba8ecd03dd858 |
| MD5 | f6a1be6d83118b1d50cc0665a9c7d23b |
| BLAKE2b-256 | 6aeccfc6a28961b622eebb31a4bac1335d5100c3da4364466cdcec78483069dc |
File details
Details for the file markdown_chunker-0.1.2-py3-none-any.whl.
File metadata
- Download URL: markdown_chunker-0.1.2-py3-none-any.whl
- Upload date:
- Size: 27.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 69c5523ef766392f3480575d438a8e184a0ee7a1492a7fe25d6a1271bedb64ed |
| MD5 | c85b897537db4e59af13b168d245ba7f |
| BLAKE2b-256 | 36166ee7c68d84d59d65639c88b78d62c29f2ca85e02cdecd13c91e059c13c3f |