Convert documents to markdown, chunk them intelligently, and export structured data

MarkitDown Chunker

MarkitDown Chunker is a Python package that converts documents to markdown, splits the markdown into structure-aware chunks, and exports the chunks with metadata as JSON. It is built as an add-on to the markitdown package and uses LangChain text splitters for chunking.

✨ Features

  • 📄 Multi-format Support: Convert PDF, DOCX, PPTX, XLSX, HTML, RTF, ODT, and more to markdown
  • 🖼️ Image Extraction: Automatically extract images from PDF, DOCX, and PPTX files (requires optional dependencies)
  • 🎨 Image Summarization: Optional AI-powered image descriptions for better context
  • ✂️ Smart Chunking: Markdown-aware text splitting that respects document structure
  • 📊 Structured Export: Export chunks with metadata to JSON format
  • 🔧 Flexible Pipeline: Run individual steps or complete pipeline as needed
  • 🎯 CLI & Python API: Use from command line or integrate into your Python applications

📦 Installation

Basic Installation

pip install markitdown-chunker

With Image Extraction Support

To extract images from PDF, DOCX, and PPTX files:

pip install "markitdown-chunker[images]"

See Image Extraction Guide for details.

From Source

git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e .
# Or with image support:
pip install -e ".[images]"

🚀 Quick Start

Command Line Interface

# Convert, chunk, and export (full pipeline)
markitdown-chunker input.pdf output/

# Convert only
markitdown-chunker document.docx output/ --convert-only

# Chunk existing markdown
markitdown-chunker document.md output/ --chunk-only

# Custom chunk size and overlap
markitdown-chunker input.pdf output/ --chunk-size 2000 --overlap 400

# List supported formats
markitdown-chunker --list-formats

Python API

Complete Pipeline

from markitdown_chunker import MarkitDownProcessor

# Initialize processor with custom settings
processor = MarkitDownProcessor(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True
)

# Process a document (all steps)
result = processor.process(
    file_path="document.pdf",
    output_dir="output/"
)

print(f"Markdown saved to: {result['conversion']['markdown_path']}")
print(f"Created {len(result['chunking']['chunks'])} chunks")
print(f"JSON exported to: {result['export']['json_path']}")

Step-by-Step Processing

from markitdown_chunker import MarkdownConverter, DocumentChunker, JSONExporter

# Step 1: Convert to Markdown
converter = MarkdownConverter()
conversion_result = converter.convert(
    file_path="document.pdf",
    output_dir="output/",
    save_images=True
)

# Step 2: Chunk the markdown
chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = chunker.chunk_file(
    markdown_path=conversion_result['markdown_path']
)

# Step 3: Export to JSON
exporter = JSONExporter()
json_path = exporter.export(
    chunks=chunks,
    output_path="output/chunks.json"
)

📚 Supported File Formats

  • Documents: PDF, DOCX, DOC, RTF, ODT, TXT, MD
  • Presentations: PPTX, PPT, ODP
  • Spreadsheets: XLSX, XLS, ODS
  • Web: HTML, HTM

Note: Audio/video files (MP3, MP4, etc.) require ffmpeg. See docs/FFMPEG_AUDIO.md for details.

📂 Output Directory Structure

After processing a document, the output directory will contain:

output/
├── document.md                    # Converted markdown file
├── document_chunks.json           # Chunks with metadata and statistics
└── images/                        # Extracted images (if any)
    ├── page1_img1.png
    ├── page2_img1.jpg
    ├── page3_img1.png
    └── page3_img2.jpg
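
The layout above can be expressed as a small path helper, which is handy when a script needs to locate the outputs for a given input. This is a minimal sketch that assumes the naming shown in the tree (`<stem>.md`, `<stem>_chunks.json`, `images/`); `expected_outputs` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def expected_outputs(input_file: str, output_dir: str) -> dict:
    """Return the output paths implied by the documented directory layout."""
    stem = Path(input_file).stem  # "document.pdf" -> "document"
    out = Path(output_dir)
    return {
        "markdown": out / f"{stem}.md",
        "chunks_json": out / f"{stem}_chunks.json",
        "images_dir": out / "images",  # only present if images were extracted
    }


paths = expected_outputs("document.pdf", "output")
for name, path in paths.items():
    print(f"{name}: {path.as_posix()}")
```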

Example Output Files

document.md - Markdown conversion with image references:

# Document Title

Document content converted to markdown format...

## Extracted Images

![Page 1 Image 1](images/page1_img1.png)

![Page 2 Image 1](images/page2_img1.jpg)

document_chunks.json - Structured chunk data:

{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Document content chunk...",
      "metadata": {
        "Header 1": "Introduction",
        "chunk_index": 0,
        "source_file": "output/document.md"
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 36120,
    "avg_chunk_size": 860.0,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-10T10:30:45.123456"
}

images/ - Extracted images with organized naming:

  • PDF images: page{N}_img{M}.{ext} (e.g., page1_img1.png)
  • DOCX images: docx_img{N}.{ext} (e.g., docx_img1.jpg)
  • PPTX images: slide{N}_img{M}.{ext} (e.g., slide1_img1.png)

💡 Tip: The images directory is only created if the document contains images and save_images=True (default).
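
If downstream code needs to map an extracted image back to its page or slide, the naming scheme above can be parsed with a few regular expressions. The sketch below assumes file names follow the three documented patterns exactly; `parse_image_name` and its return shape are illustrative, not a package API.

```python
import re
from typing import Optional

# One pattern per documented naming scheme.
_PATTERNS = {
    "pdf": re.compile(r"^page(\d+)_img(\d+)\.(\w+)$"),
    "pptx": re.compile(r"^slide(\d+)_img(\d+)\.(\w+)$"),
    "docx": re.compile(r"^docx_img(\d+)\.(\w+)$"),
}


def parse_image_name(filename: str) -> Optional[dict]:
    """Split an extracted image name into source type, page/slide, index, and extension."""
    for source, pattern in _PATTERNS.items():
        match = pattern.match(filename)
        if not match:
            continue
        groups = match.groups()
        if source == "docx":
            # DOCX names carry no page number, only a running image index.
            return {"source": source, "container": None,
                    "index": int(groups[0]), "ext": groups[1]}
        return {"source": source, "container": int(groups[0]),
                "index": int(groups[1]), "ext": groups[2]}
    return None  # name does not follow any documented pattern


for name in ["page3_img2.jpg", "docx_img1.jpg", "slide1_img1.png", "logo.png"]:
    print(name, "->", parse_image_name(name))
```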

🎛️ Configuration Options

Chunking Parameters

processor = MarkitDownProcessor(
    chunk_size=1000,           # Maximum characters per chunk
    chunk_overlap=200,          # Overlap between consecutive chunks
    use_markdown_splitter=True, # Use markdown-aware splitting
    json_indent=2              # JSON formatting
)
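
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a plain sliding-window splitter. It is only an illustration of the two parameters: the package itself relies on LangChain text splitters, which prefer to break at natural separators, so real chunk boundaries will differ. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


for chunk in sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(repr(chunk))
```

A larger overlap repeats more context between neighbouring chunks at the cost of producing more chunks for the same input.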

Processing Options

result = processor.process(
    file_path="input.pdf",
    output_dir="output/",
    save_images=True,                    # Save extracted images
    include_image_summaries=False,       # Set True to add image summaries to chunks
    image_summarizer=my_summarizer_func, # Your own callable: (image_path: str) -> str
    skip_conversion=False,               # Set True if the input is already markdown
    skip_chunking=False,                 # Set True to stop after conversion
    skip_export=False                    # Set True to skip the JSON export
)

🔬 Advanced Usage

Custom Image Summarization

def summarize_image(image_path: str) -> str:
    """Your custom image summarization logic."""
    # Example: Use vision AI model
    from my_vision_model import analyze_image
    return analyze_image(image_path)

processor = MarkitDownProcessor()
result = processor.process(
    file_path="document.pdf",
    output_dir="output/",
    include_image_summaries=True,
    image_summarizer=summarize_image
)
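
For testing the pipeline without a vision model, a placeholder summarizer with the same `(image_path: str) -> str` signature can stand in. The sketch below uses only the standard library and describes the file rather than its contents; the wording of the returned string is an arbitrary choice.

```python
import tempfile
from pathlib import Path


def summarize_image(image_path: str) -> str:
    """Placeholder summarizer: report file name, type, and size instead of visual content."""
    path = Path(image_path)
    if not path.is_file():
        return f"[image unavailable: {path.name}]"
    size_kb = path.stat().st_size / 1024
    kind = path.suffix.lstrip(".").upper()
    return f"Image {path.name} ({kind}, {size_kb:.1f} KB)"


# Demonstrate on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page1_img1.png"
    sample.write_bytes(b"\x00" * 2048)
    print(summarize_image(str(sample)))
```

Swapping in a real model later only requires replacing the function body, since the callable's signature stays the same.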

Batch Processing

processor = MarkitDownProcessor()

files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
results = processor.process_batch(
    file_paths=files,
    output_dir="output/"
)

for result in results:
    if "error" in result:
        print(f"Failed: {result['input_file']} - {result['error']}")
    else:
        print(f"Success: {result['input_file']}")
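
For larger batches it can help to separate successes from failures before reporting or retrying. This sketch relies only on the two keys used in the loop above (`input_file`, and `error` on failure); `split_batch_results` is a hypothetical helper, and the sample list is hand-written for illustration rather than real output.

```python
def split_batch_results(results: list) -> tuple:
    """Return (succeeded_files, {failed_file: error_message}) from batch results."""
    succeeded = [r["input_file"] for r in results if "error" not in r]
    failed = {r["input_file"]: r["error"] for r in results if "error" in r}
    return succeeded, failed


# Hand-written sample shaped like the results iterated over above.
sample_results = [
    {"input_file": "doc1.pdf"},
    {"input_file": "doc2.docx", "error": "unsupported format"},
    {"input_file": "doc3.pptx"},
]

ok, failed = split_batch_results(sample_results)
print(f"{len(ok)} succeeded, {len(failed)} failed")
for name, reason in failed.items():
    print(f"  retry candidate: {name} ({reason})")
```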

Individual Step Processing

processor = MarkitDownProcessor()

# Only convert to markdown
conversion = processor.convert_only(
    file_path="document.pdf",
    output_dir="output/"
)

# Only chunk existing markdown
chunks = processor.chunk_only(
    markdown_path="document.md"
)

# Only export chunks
processor.export_only(
    chunks=chunks,
    output_path="output/chunks.json",
    source_info={"source": "document.md"}
)

Custom Markdown Header Splitting

from markitdown_chunker import DocumentChunker

chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True,
    headers_to_split_on=[
        ("#", "Title"),
        ("##", "Section"),
        ("###", "Subsection"),
        ("####", "Paragraph")
    ]
)

chunks = chunker.chunk_file("document.md")
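
The `(marker, label)` pairs control which heading levels start a new section and under which metadata key each heading text is recorded. The sketch below approximates that idea in plain Python to show how labels end up in chunk metadata. It is not the package's or LangChain's implementation, and it ignores edge cases such as `#` lines inside code blocks.

```python
def split_by_headers(markdown: str, headers_to_split_on: list) -> list:
    """Split markdown at the given heading markers and tag each section with its headings."""
    # Check longer markers first so "##" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections = []
    active = {}  # label -> heading text currently in effect
    buffer = []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            sections.append({"text": text, "metadata": dict(active)})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, label in markers:
            if stripped.startswith(marker + " "):
                flush()
                # A new heading closes all headings at the same or a deeper level.
                for other_marker, other_label in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        active.pop(other_label, None)
                active[label] = stripped[len(marker):].strip()
                break
        else:
            buffer.append(line)  # ordinary content line
    flush()
    return sections


doc = "# Guide\nintro\n## Setup\nstep one\n## Usage\nstep two\n"
for section in split_by_headers(doc, [("#", "Title"), ("##", "Section")]):
    print(section)
```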

📤 Output Format

JSON Structure

{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "output_dir": "output/",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Chunk content here...",
      "metadata": {
        "Header 1": "Introduction",
        "Header 2": "Overview",
        "sub_chunk_index": 0,
        "total_sub_chunks": 1,
        "source_file": "output/document.md",
        "chunk_size_config": 1000,
        "chunk_overlap_config": 200
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 36120,
    "avg_chunk_size": 860.0,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-09T10:30:45.123456"
}
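
Because the export is plain JSON, it can be consumed with the standard library alone. The sketch below filters chunks by a header value and recomputes the statistics block. It assumes chunk size means the character length of `text`, which the source does not state explicitly. The inline sample is hand-written, and both helper names are hypothetical.

```python
import json

# Hand-written sample following the documented structure (trimmed for brevity).
SAMPLE_EXPORT = """
{
  "chunks": [
    {"text": "Intro paragraph.", "metadata": {"Header 1": "Introduction"}},
    {"text": "Overview text, a bit longer.",
     "metadata": {"Header 1": "Introduction", "Header 2": "Overview"}},
    {"text": "Install steps.", "metadata": {"Header 1": "Setup"}}
  ],
  "total_chunks": 3
}
"""


def chunk_statistics(chunks: list) -> dict:
    """Recompute size statistics, assuming size is the character length of each chunk's text."""
    sizes = [len(chunk["text"]) for chunk in chunks]
    if not sizes:
        return {"total_characters": 0, "avg_chunk_size": 0.0,
                "min_chunk_size": 0, "max_chunk_size": 0}
    return {
        "total_characters": sum(sizes),
        "avg_chunk_size": round(sum(sizes) / len(sizes), 2),
        "min_chunk_size": min(sizes),
        "max_chunk_size": max(sizes),
    }


def chunks_under(chunks: list, header_key: str, value: str) -> list:
    """Select chunks whose metadata has header_key equal to value."""
    return [c for c in chunks if c.get("metadata", {}).get(header_key) == value]


export = json.loads(SAMPLE_EXPORT)
print(chunk_statistics(export["chunks"]))
print(len(chunks_under(export["chunks"], "Header 1", "Introduction")), "chunks under Introduction")
```

In practice, replace `json.loads(SAMPLE_EXPORT)` with `json.load(open(path))` on the exported file.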

🛠️ CLI Reference

markitdown-chunker [-h] [--convert-only | --chunk-only | --no-export]
                    [--chunk-size CHUNK_SIZE] [--overlap OVERLAP]
                    [--no-markdown-splitter] [--no-images]
                    [--include-image-summaries] [--json-indent JSON_INDENT]
                    [--list-formats] [--version] [-v]
                    input output

Positional Arguments:
  input                 Input file path
  output                Output directory

Optional Arguments:
  -h, --help            Show help message
  --convert-only        Only convert to markdown
  --chunk-only          Only chunk existing markdown
  --no-export           Skip JSON export
  --chunk-size CHUNK_SIZE
                        Maximum chunk size (default: 1000)
  --overlap OVERLAP     Chunk overlap (default: 200)
  --no-markdown-splitter
                        Disable markdown-aware splitting
  --no-images           Don't save extracted images
  --include-image-summaries
                        Add image summaries to chunks
  --json-indent JSON_INDENT
                        JSON indentation (default: 2)
  --list-formats        List supported formats
  --version             Show version
  -v, --verbose         Enable verbose output
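
The synopsis above puts `--convert-only`, `--chunk-only`, and `--no-export` in one bracketed group, which means at most one of them may be given. The `argparse` sketch below reconstructs that behaviour from the documented flags to show how the options combine. It is not the package's actual parser, and it leaves out `--list-formats`, `--version`, and `-v` for brevity.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented option set (a reconstruction, not the package's source)."""
    parser = argparse.ArgumentParser(prog="markitdown-chunker")

    # At most one of the three mode flags may be supplied.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--convert-only", action="store_true")
    mode.add_argument("--chunk-only", action="store_true")
    mode.add_argument("--no-export", action="store_true")

    parser.add_argument("--chunk-size", type=int, default=1000)
    parser.add_argument("--overlap", type=int, default=200)
    parser.add_argument("--no-markdown-splitter", action="store_true")
    parser.add_argument("--no-images", action="store_true")
    parser.add_argument("--include-image-summaries", action="store_true")
    parser.add_argument("--json-indent", type=int, default=2)

    parser.add_argument("input", help="Input file path")
    parser.add_argument("output", help="Output directory")
    return parser


args = build_parser().parse_args(
    ["input.pdf", "output/", "--chunk-size", "2000", "--overlap", "400"]
)
print(args.chunk_size, args.overlap, args.convert_only)
```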

🧪 Development

Setup Development Environment

git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e ".[dev]"

Run Tests

pytest tests/
pytest --cov=markitdown_chunker tests/

Code Formatting

black markitdown_chunker/
flake8 markitdown_chunker/
mypy markitdown_chunker/

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built on top of markitdown by Microsoft
  • Uses LangChain text splitters for intelligent chunking

🗺️ Roadmap

  • Support for more document formats
  • Advanced chunking strategies (semantic, sentence-based)
  • Integration with vector databases
  • Web UI for document processing
  • Cloud storage integration (S3, GCS, Azure)
  • Parallel batch processing
  • Custom output formats (CSV, Parquet, etc.)

Made with ❤️ by the MarkitDown Chunker community
