Convert documents to markdown, chunk them intelligently, and export structured data
MarkitDown Chunker
A Python package that converts documents to markdown, splits the markdown into structure-aware chunks, and exports the chunks as structured JSON. It is built as an add-on to the markitdown package and uses LangChain text splitters for chunking.
✨ Features
- 📄 Multi-format Support: Convert PDF, DOCX, PPTX, XLSX, HTML, RTF, ODT, and more to markdown
- 🖼️ Image Extraction: Automatically extract images from PDF, DOCX, PPTX files (requires optional dependencies)
- 🎨 Image Summarization: Optional AI-powered image descriptions for better context
- ✂️ Smart Chunking: Markdown-aware text splitting that respects document structure
- 📊 Structured Export: Export chunks with metadata to JSON format
- 🔧 Flexible Pipeline: Run individual steps or complete pipeline as needed
- 🎯 CLI & Python API: Use from command line or integrate into your Python applications
📦 Installation
Basic Installation
pip install markitdown-chunker
With Image Extraction Support
To extract images from PDF, DOCX, and PPTX files:
pip install "markitdown-chunker[images]"
See Image Extraction Guide for details.
From Source
git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e .
# Or with image support:
pip install -e ".[images]"
🚀 Quick Start
Command Line Interface
# Convert, chunk, and export (full pipeline)
markitdown-chunker input.pdf output/
# Convert only
markitdown-chunker document.docx output/ --convert-only
# Chunk existing markdown
markitdown-chunker document.md output/ --chunk-only
# Custom chunk size and overlap
markitdown-chunker input.pdf output/ --chunk-size 2000 --overlap 400
# List supported formats
markitdown-chunker --list-formats
Python API
Complete Pipeline
from markitdown_chunker import MarkitDownProcessor
# Initialize processor with custom settings
processor = MarkitDownProcessor(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True
)
# Process a document (all steps)
result = processor.process(
    file_path="document.pdf",
    output_dir="output/"
)
print(f"Markdown saved to: {result['conversion']['markdown_path']}")
print(f"Created {len(result['chunking']['chunks'])} chunks")
print(f"JSON exported to: {result['export']['json_path']}")
Step-by-Step Processing
from markitdown_chunker import MarkdownConverter, DocumentChunker, JSONExporter
# Step 1: Convert to Markdown
converter = MarkdownConverter()
conversion_result = converter.convert(
    file_path="document.pdf",
    output_dir="output/",
    save_images=True
)
# Step 2: Chunk the markdown
chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = chunker.chunk_file(
    markdown_path=conversion_result['markdown_path']
)
# Step 3: Export to JSON
exporter = JSONExporter()
json_path = exporter.export(
    chunks=chunks,
    output_path="output/chunks.json"
)
📚 Supported File Formats
- Documents: PDF, DOCX, DOC, RTF, ODT, TXT, MD
- Presentations: PPTX, PPT, ODP
- Spreadsheets: XLSX, XLS, ODS
- Web: HTML, HTM
Note: Audio/video files (MP3, MP4, etc.) require ffmpeg. See docs/FFMPEG_AUDIO.md for details.
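A pre-flight check against the list above could look like the following. This is a minimal sketch: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names introduced here and are not part of the package API. The authoritative list is whatever `markitdown-chunker --list-formats` prints.

```python
from pathlib import Path

# Extensions copied from the format list above. The set and the function name
# are illustrative only; they are not part of markitdown-chunker.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".rtf", ".odt", ".txt", ".md",  # documents
    ".pptx", ".ppt", ".odp",                                  # presentations
    ".xlsx", ".xls", ".ods",                                  # spreadsheets
    ".html", ".htm",                                          # web
}


def is_supported(file_path: str) -> bool:
    """Return True if the file extension appears in the list above."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ["report.PDF", "slides.pptx", "song.mp3"]:
        print(name, is_supported(name))
```

The comparison is case-insensitive, so `report.PDF` passes. Audio and video extensions are left out on purpose because they need ffmpeg.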
📂 Output Directory Structure
After processing a document, the output directory will contain:
output/
├── document.md              # Converted markdown file
├── document_chunks.json     # Chunks with metadata and statistics
└── images/                  # Extracted images (if any)
    ├── page1_img1.png
    ├── page2_img1.jpg
    ├── page3_img1.png
    └── page3_img2.jpg
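If you need to locate these files from your own code, the layout can be derived from the input file name. The sketch below assumes the naming follows the tree shown here (the input file's stem is reused for both output files); `expected_outputs` is a hypothetical helper, not a function of the package.

```python
from pathlib import Path


def expected_outputs(input_file: str, output_dir: str) -> dict:
    """Build the paths from the tree above for a given input file."""
    stem = Path(input_file).stem  # "document.pdf" -> "document"
    out = Path(output_dir)
    return {
        "markdown": out / f"{stem}.md",
        "chunks_json": out / f"{stem}_chunks.json",
        "images_dir": out / "images",  # present only if images were extracted
    }


if __name__ == "__main__":
    for label, path in expected_outputs("document.pdf", "output/").items():
        print(f"{label}: {path}")
```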
Example Output Files
document.md - Markdown conversion with image references:
# Document Title
Document content converted to markdown format...
## Extracted Images


document_chunks.json - Structured chunk data:
{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Document content chunk...",
      "metadata": {
        "Header 1": "Introduction",
        "chunk_index": 0,
        "source_file": "output/document.md"
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 48392,
    "avg_chunk_size": 1152.19,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-10T10:30:45.123456"
}
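Because the export is plain JSON, it can be consumed with the standard library alone. The following sketch groups chunk texts by their `Header 1` metadata value. `group_by_header` and the `(no header)` fallback are assumptions made for illustration; the source does not say whether every chunk carries a header key.

```python
import json
from collections import defaultdict


def group_by_header(export: dict, header_key: str = "Header 1") -> dict:
    """Map each header value to the list of chunk texts filed under it."""
    groups = defaultdict(list)
    for chunk in export["chunks"]:
        # Assumption: a chunk may lack the header key, so fall back to a label.
        header = chunk["metadata"].get(header_key, "(no header)")
        groups[header].append(chunk["text"])
    return dict(groups)


# A trimmed-down export in the shape shown above (illustrative data).
sample = json.loads("""
{
  "chunks": [
    {"text": "Document content chunk...",
     "metadata": {"Header 1": "Introduction", "chunk_index": 0}},
    {"text": "Untitled text",
     "metadata": {"chunk_index": 1}}
  ]
}
""")

if __name__ == "__main__":
    print(group_by_header(sample))
```

For a real run, replace `sample` with the result of `json.load` on `output/document_chunks.json`.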
images/ - Extracted images with organized naming:
- PDF images: page{N}_img{M}.{ext} (e.g., page1_img1.png)
- DOCX images: docx_img{N}.{ext} (e.g., docx_img1.jpg)
- PPTX images: slide{N}_img{M}.{ext} (e.g., slide1_img1.png)
💡 Tip: The images directory is only created if the document contains images and save_images=True (the default).
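The naming patterns are regular enough to parse back into page, slide, and image numbers. The sketch below is one way to do that with regular expressions. `parse_image_name` and the returned dictionary keys are hypothetical, and the patterns are taken only from the three templates listed here.

```python
import re
from typing import Optional

# One pattern per template listed above. "container" is the page or slide
# number; DOCX names have no such number.
_PATTERNS = [
    ("pdf", re.compile(r"^page(?P<container>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$")),
    ("pptx", re.compile(r"^slide(?P<container>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$")),
    ("docx", re.compile(r"^docx_img(?P<index>\d+)\.(?P<ext>\w+)$")),
]


def parse_image_name(filename: str) -> Optional[dict]:
    """Return source type, page/slide number, image index and extension."""
    for source, pattern in _PATTERNS:
        match = pattern.match(filename)
        if match:
            parts = match.groupdict()
            container = parts.get("container")
            return {
                "source": source,
                "container": int(container) if container else None,
                "index": int(parts["index"]),
                "ext": parts["ext"],
            }
    return None  # the name follows none of the templates


if __name__ == "__main__":
    for name in ["page1_img1.png", "docx_img3.jpg", "slide2_img4.png", "logo.png"]:
        print(name, parse_image_name(name))
```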
🎛️ Configuration Options
Chunking Parameters
processor = MarkitDownProcessor(
    chunk_size=1000,             # Maximum characters per chunk
    chunk_overlap=200,           # Overlap between consecutive chunks
    use_markdown_splitter=True,  # Use markdown-aware splitting
    json_indent=2                # JSON formatting
)
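To see how `chunk_size` and `chunk_overlap` interact, consider a plain character window. This is a deliberately simplified stand-in, not the package's splitter: the LangChain splitters it relies on prefer natural boundaries such as headers and paragraphs, so real chunk boundaries will differ. `sliding_window` is a hypothetical name.

```python
def sliding_window(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    text = "".join(chr(97 + i % 26) for i in range(2600))
    chunks = sliding_window(text, chunk_size=1000, chunk_overlap=200)
    print([len(c) for c in chunks])
```

With the defaults, each window starts 800 characters after the previous one, so a 2,600-character text yields three chunks, and the last 200 characters of one chunk reappear at the start of the next.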
Processing Options
result = processor.process(
    file_path="input.pdf",
    output_dir="output/",
    save_images=True,                     # Save extracted images
    include_image_summaries=False,        # Set True to add image summaries to chunks
    image_summarizer=my_summarizer_func,  # Your own callable: image path -> summary text
    skip_conversion=False,                # Set True when the input is already markdown
    skip_chunking=False,                  # Set True to convert only
    skip_export=False                     # Set True to skip the JSON export
)
🔬 Advanced Usage
Custom Image Summarization
from markitdown_chunker import MarkitDownProcessor

def summarize_image(image_path: str) -> str:
    """Your custom image summarization logic."""
    # Example: use a vision AI model (my_vision_model is a placeholder module)
    from my_vision_model import analyze_image
    return analyze_image(image_path)

processor = MarkitDownProcessor()
result = processor.process(
    file_path="document.pdf",
    output_dir="output/",
    include_image_summaries=True,
    image_summarizer=summarize_image
)
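The only contract shown above is a callable that takes an image path and returns a string. For trying out the pipeline without a vision model, a placeholder that satisfies the same signature may be enough. The sketch below is such a stand-in; `filename_summarizer` is a hypothetical name and the output wording is arbitrary.

```python
import os


def filename_summarizer(image_path: str) -> str:
    """Describe an image using only its file name and size on disk."""
    name = os.path.basename(image_path)
    try:
        size_kb = os.path.getsize(image_path) / 1024
    except OSError:
        # Missing or unreadable file: still return a string so the caller
        # always receives a summary.
        return f"Image '{name}' (file not readable)"
    return f"Image '{name}' ({size_kb:.1f} KB)"


if __name__ == "__main__":
    print(filename_summarizer("output/images/page1_img1.png"))
```

It would be passed the same way as `summarize_image`, that is, as `image_summarizer=filename_summarizer`.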
Batch Processing
from markitdown_chunker import MarkitDownProcessor

processor = MarkitDownProcessor()
files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
results = processor.process_batch(
    file_paths=files,
    output_dir="output/"
)
for result in results:
    if "error" in result:
        print(f"Failed: {result['input_file']} - {result['error']}")
    else:
        print(f"Success: {result['input_file']}")
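For larger batches it can be handy to separate failures from successes before reporting. The sketch below relies only on the two keys used in the loop above (`input_file` and `error`). The sample dictionaries are made up for illustration and omit whatever other keys real results contain; `partition_results` is a hypothetical helper.

```python
def partition_results(results: list) -> tuple:
    """Split batch results into (succeeded, failed) by the presence of 'error'."""
    succeeded = [r for r in results if "error" not in r]
    failed = [r for r in results if "error" in r]
    return succeeded, failed


# Illustrative data only: real results carry more keys than these.
sample_results = [
    {"input_file": "doc1.pdf"},
    {"input_file": "doc2.docx", "error": "conversion failed"},
    {"input_file": "doc3.pptx"},
]

if __name__ == "__main__":
    ok, bad = partition_results(sample_results)
    print(f"{len(ok)} succeeded, {len(bad)} failed")
    for item in bad:
        print(f"  {item['input_file']}: {item['error']}")
```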
Individual Step Processing
from markitdown_chunker import MarkitDownProcessor

processor = MarkitDownProcessor()
# Only convert to markdown
conversion = processor.convert_only(
    file_path="document.pdf",
    output_dir="output/"
)
# Only chunk existing markdown
chunks = processor.chunk_only(
    markdown_path="document.md"
)
# Only export chunks
processor.export_only(
    chunks=chunks,
    output_path="output/chunks.json",
    source_info={"source": "document.md"}
)
Custom Markdown Header Splitting
from markitdown_chunker import DocumentChunker
chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True,
    headers_to_split_on=[
        ("#", "Title"),
        ("##", "Section"),
        ("###", "Subsection"),
        ("####", "Paragraph")
    ]
)
chunks = chunker.chunk_file("document.md")
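Each pair in `headers_to_split_on` maps a markdown header marker to the metadata key under which that header's text is recorded. The pure-Python sketch below illustrates the idea. It is a simplified approximation written for this README, not the LangChain implementation the package uses: it ignores code fences and does no size-based splitting. `split_by_headers` is a hypothetical name.

```python
def split_by_headers(markdown: str, headers_to_split_on: list) -> list:
    """Split markdown at the given header levels and label each section."""
    sections = []
    current_meta = {}   # header labels that apply to the section being built
    current_lines = []  # body lines collected since the last header

    def flush():
        text = "\n".join(current_lines).strip()
        if text:
            sections.append({"text": text, "metadata": dict(current_meta)})
        current_lines.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, name in headers_to_split_on:
            # The trailing space keeps "#" from matching "## ..." lines.
            if stripped.startswith(marker + " "):
                flush()
                # A new header invalidates labels at the same or deeper levels.
                for other_marker, other_name in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        current_meta.pop(other_name, None)
                current_meta[name] = stripped[len(marker):].strip()
                break
        else:
            current_lines.append(line)
    flush()
    return sections


if __name__ == "__main__":
    doc = "# Guide\nintro\n## Setup\nstep one\n## Usage\nstep two\n# Appendix\nnotes"
    for section in split_by_headers(doc, [("#", "Title"), ("##", "Section")]):
        print(section)
```

With custom names such as `"Title"` and `"Section"`, the chunk metadata would carry those keys in place of names like `"Header 1"` and `"Header 2"`.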
📤 Output Format
JSON Structure
{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "output_dir": "output/",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Chunk content here...",
      "metadata": {
        "Header 1": "Introduction",
        "Header 2": "Overview",
        "sub_chunk_index": 0,
        "total_sub_chunks": 1,
        "source_file": "output/document.md",
        "chunk_size_config": 1000,
        "chunk_overlap_config": 200
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 48392,
    "avg_chunk_size": 1152.19,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-09T10:30:45.123456"
}
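The `statistics` fields can be recomputed from the `chunks` array, which is useful after filtering or editing an export. The sketch below assumes that sizes are character counts of each chunk's `text` field and that the average is rounded to two decimals. Both are inferred from the field names and the sample values, not stated by the source. `chunk_statistics` is a hypothetical helper.

```python
def chunk_statistics(chunks: list) -> dict:
    """Recompute the statistics block from a list of chunk dictionaries."""
    sizes = [len(chunk["text"]) for chunk in chunks]
    if not sizes:
        return {
            "total_characters": 0,
            "avg_chunk_size": 0.0,
            "min_chunk_size": 0,
            "max_chunk_size": 0,
        }
    total = sum(sizes)
    return {
        "total_characters": total,
        "avg_chunk_size": round(total / len(sizes), 2),
        "min_chunk_size": min(sizes),
        "max_chunk_size": max(sizes),
    }


if __name__ == "__main__":
    demo = [{"text": "a" * 10}, {"text": "b" * 25}]
    print(chunk_statistics(demo))
```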
🛠️ CLI Reference
markitdown-chunker [-h] [--convert-only | --chunk-only | --no-export]
                   [--chunk-size CHUNK_SIZE] [--overlap OVERLAP]
                   [--no-markdown-splitter] [--no-images]
                   [--include-image-summaries] [--json-indent JSON_INDENT]
                   [--list-formats] [--version] [-v]
                   input output
Positional Arguments:
  input                      Input file path
  output                     Output directory
Optional Arguments:
  -h, --help                 Show help message
  --convert-only             Only convert to markdown
  --chunk-only               Only chunk existing markdown
  --no-export                Skip JSON export
  --chunk-size SIZE          Maximum chunk size (default: 1000)
  --overlap SIZE             Chunk overlap (default: 200)
  --no-markdown-splitter     Disable markdown-aware splitting
  --no-images                Don't save extracted images
  --include-image-summaries  Add image summaries to chunks
  --json-indent N            JSON indentation (default: 2)
  --list-formats             List supported formats
  --version                  Show version
  -v, --verbose              Enable verbose output
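In the usage line, `[--convert-only | --chunk-only | --no-export]` means that at most one of the three flags may be given. The sketch below shows how such an interface could be declared with `argparse`. It covers only a subset of the options and is not the package's actual source; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented CLI, for illustration only."""
    parser = argparse.ArgumentParser(prog="markitdown-chunker")

    # At most one of these three flags may be given per invocation.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--convert-only", action="store_true",
                      help="Only convert to markdown")
    mode.add_argument("--chunk-only", action="store_true",
                      help="Only chunk existing markdown")
    mode.add_argument("--no-export", action="store_true",
                      help="Skip JSON export")

    parser.add_argument("--chunk-size", type=int, default=1000,
                        help="Maximum chunk size (default: 1000)")
    parser.add_argument("--overlap", type=int, default=200,
                        help="Chunk overlap (default: 200)")

    parser.add_argument("input", help="Input file path")
    parser.add_argument("output", help="Output directory")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["input.pdf", "output/", "--chunk-size", "2000", "--overlap", "400"]
    )
    print(args.chunk_size, args.overlap, args.convert_only)
```

Passing two of the mode flags together makes `argparse` exit with a usage error.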
🧪 Development
Setup Development Environment
git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e ".[dev]"
Run Tests
pytest tests/
pytest --cov=markitdown_chunker tests/
Code Formatting
black markitdown_chunker/
flake8 markitdown_chunker/
mypy markitdown_chunker/
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built on top of markitdown by Microsoft
- Uses LangChain text splitters for intelligent chunking
📞 Support
- 🐛 Report a bug
- 💡 Request a feature
- 📖 Documentation
- 🖼️ Image Extraction Guide
- 🎵 Audio/Video Processing Guide
🗺️ Roadmap
- Support for more document formats
- Advanced chunking strategies (semantic, sentence-based)
- Integration with vector databases
- Web UI for document processing
- Cloud storage integration (S3, GCS, Azure)
- Parallel batch processing
- Custom output formats (CSV, Parquet, etc.)
Made with ❤️ by the MarkitDown Chunker community