Convert documents to markdown, chunk them intelligently, and export structured data

MarkitDown Chunker

A Python package that converts documents to markdown, splits the markdown into structure-aware chunks, and exports the chunks with their metadata as JSON. It is built as an add-on to the markitdown package and uses LangChain text splitters for chunking.

✨ Features

📄 Multi-format Support: Convert PDF, DOCX, PPTX, XLSX, HTML, RTF, ODT, and more to markdown
🖼️ Image Extraction: Automatically extract images from PDF, DOCX, PPTX files (requires optional dependencies)
🎨 Image Summarization: Optional AI-powered image descriptions for better context
✂️ Smart Chunking: Markdown-aware text splitting that respects document structure
📊 Structured Export: Export chunks with metadata to JSON format
🔧 Flexible Pipeline: Run individual steps or complete pipeline as needed
🎯 CLI & Python API: Use from command line or integrate into your Python applications

📦 Installation

Basic Installation

pip install markitdown-chunker

With Image Extraction Support

To extract images from PDF, DOCX, and PPTX files:

pip install "markitdown-chunker[images]"

See Image Extraction Guide for details.

From Source

git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e .
# Or with image support:
pip install -e ".[images]"

🚀 Quick Start

Command Line Interface

# Convert, chunk, and export (full pipeline)
markitdown-chunker input.pdf output/

# Convert only
markitdown-chunker document.docx output/ --convert-only

# Chunk existing markdown
markitdown-chunker document.md output/ --chunk-only

# Custom chunk size and overlap
markitdown-chunker input.pdf output/ --chunk-size 2000 --overlap 400

# List supported formats
markitdown-chunker --list-formats

Python API

Complete Pipeline

from markitdown_chunker import MarkitDownProcessor

# Initialize processor with custom settings
processor = MarkitDownProcessor(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True
)

# Process a document (all steps)
result = processor.process(
    file_path="document.pdf",
    output_dir="output/"
)

print(f"Markdown saved to: {result['conversion']['markdown_path']}")
print(f"Created {len(result['chunking']['chunks'])} chunks")
print(f"JSON exported to: {result['export']['json_path']}")

Step-by-Step Processing

from markitdown_chunker import MarkdownConverter, DocumentChunker, JSONExporter

# Step 1: Convert to Markdown
converter = MarkdownConverter()
conversion_result = converter.convert(
    file_path="document.pdf",
    output_dir="output/",
    save_images=True
)

# Step 2: Chunk the markdown
chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = chunker.chunk_file(
    markdown_path=conversion_result['markdown_path']
)

# Step 3: Export to JSON
exporter = JSONExporter()
json_path = exporter.export(
    chunks=chunks,
    output_path="output/chunks.json"
)

📚 Supported File Formats

Documents: PDF, DOCX, DOC, RTF, ODT, TXT, MD
Presentations: PPTX, PPT, ODP
Spreadsheets: XLSX, XLS, ODS
Web: HTML, HTM

Note: Audio/video files (MP3, MP4, etc.) require ffmpeg. See docs/FFMPEG_AUDIO.md for details.
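
Before queuing files for a batch run, it can help to filter them by extension. The sketch below copies the extensions from the list above into a lookup table; the function name `format_category` and the table itself are illustrative and not part of the package, whose own list (`markitdown-chunker --list-formats`) remains authoritative.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the format list above. This table is an
# illustration only; the package's own registry is authoritative.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".docx", ".doc", ".rtf", ".odt", ".txt", ".md"},
    "presentations": {".pptx", ".ppt", ".odp"},
    "spreadsheets": {".xlsx", ".xls", ".ods"},
    "web": {".html", ".htm"},
}


def format_category(file_path: str) -> Optional[str]:
    """Return the category for a file's extension, or None if unlisted."""
    suffix = Path(file_path).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ["report.PDF", "deck.pptx", "clip.mp3"]:
        print(name, "->", format_category(name))
```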

📂 Output Directory Structure

After processing a document, the output directory will contain:

output/
├── document.md                    # Converted markdown file
├── document_chunks.json           # Chunks with metadata and statistics
└── images/                        # Extracted images (if any)
    ├── page1_img1.png
    ├── page2_img1.jpg
    ├── page3_img1.png
    └── page3_img2.jpg
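
If a downstream script needs to locate these files, the layout above can be expressed with pathlib. This is a minimal sketch that assumes the output names are derived from the input file's stem, as the example tree suggests; `expected_outputs` is a hypothetical helper, not a function of the package.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(input_file: str, output_dir: str) -> Dict[str, Path]:
    """Guess the output paths for an input file.

    Assumption: names follow the example layout, i.e. <stem>.md,
    <stem>_chunks.json and an images/ directory inside output_dir.
    """
    stem = Path(input_file).stem
    out = Path(output_dir)
    return {
        "markdown": out / f"{stem}.md",
        "chunks_json": out / f"{stem}_chunks.json",
        "images_dir": out / "images",
    }


if __name__ == "__main__":
    for label, path in expected_outputs("reports/document.pdf", "output").items():
        print(f"{label}: {path}")
```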

Example Output Files

document.md - Markdown conversion with image references:

# Document Title

Document content converted to markdown format...

## Extracted Images

![Page 1 Image 1](images/page1_img1.png)

![Page 2 Image 1](images/page2_img1.jpg)

document_chunks.json - Structured chunk data:

{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Document content chunk...",
      "metadata": {
        "Header 1": "Introduction",
        "chunk_index": 0,
        "source_file": "output/document.md"
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 48392,
    "avg_chunk_size": 1152.19,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-10T10:30:45.123456"
}
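
The statistics block can be recomputed from the chunk list, which is handy as a sanity check after editing or filtering chunks. The sketch below assumes the sizes are character counts of each chunk's "text" field; that is an inference from the field names, not confirmed behavior of the exporter, and `chunk_statistics` is a hypothetical helper. Under this assumption the average always lies between the minimum and maximum, so the figures in the sample above are best read as placeholders.

```python
import json
from typing import Dict, List


def chunk_statistics(chunks: List[dict]) -> Dict[str, float]:
    """Recompute size statistics from the 'text' field of each chunk."""
    sizes = [len(chunk["text"]) for chunk in chunks]
    if not sizes:
        return {
            "total_characters": 0,
            "avg_chunk_size": 0.0,
            "min_chunk_size": 0,
            "max_chunk_size": 0,
        }
    return {
        "total_characters": sum(sizes),
        "avg_chunk_size": round(sum(sizes) / len(sizes), 2),
        "min_chunk_size": min(sizes),
        "max_chunk_size": max(sizes),
    }


if __name__ == "__main__":
    # A tiny in-memory payload shaped like document_chunks.json.
    payload = json.loads('{"chunks": [{"text": "abcd"}, {"text": "ab"}]}')
    print(chunk_statistics(payload["chunks"]))
```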

images/ - Extracted images with organized naming:

PDF images: page{N}_img{M}.{ext} (e.g., page1_img1.png)
DOCX images: docx_img{N}.{ext} (e.g., docx_img1.jpg)
PPTX images: slide{N}_img{M}.{ext} (e.g., slide1_img1.png)

💡 Tip: The images directory is only created if the document contains images and save_images=True (default).
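
Because the names follow fixed patterns, the page or slide number can be recovered from a filename. The following sketch encodes the three patterns listed above as regular expressions; `parse_image_name` and the returned keys are illustrative choices, not part of the package.

```python
import re
from typing import Optional

# One pattern per naming scheme listed above.
_PATTERNS = {
    "pdf": re.compile(r"^page(?P<container>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$"),
    "pptx": re.compile(r"^slide(?P<container>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$"),
    "docx": re.compile(r"^docx_img(?P<index>\d+)\.(?P<ext>\w+)$"),
}


def parse_image_name(filename: str) -> Optional[dict]:
    """Split an extracted-image filename into its parts, or return None."""
    for source, pattern in _PATTERNS.items():
        match = pattern.match(filename)
        if match:
            groups = match.groupdict()
            container = groups.get("container")  # absent for DOCX names
            return {
                "source": source,
                "container": int(container) if container else None,
                "index": int(groups["index"]),
                "ext": groups["ext"],
            }
    return None


if __name__ == "__main__":
    for name in ["page3_img2.jpg", "slide1_img1.png", "docx_img1.jpg", "logo.png"]:
        print(name, "->", parse_image_name(name))
```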

🎛️ Configuration Options

Chunking Parameters

processor = MarkitDownProcessor(
    chunk_size=1000,           # Maximum characters per chunk
    chunk_overlap=200,          # Overlap between consecutive chunks
    use_markdown_splitter=True, # Use markdown-aware splitting
    json_indent=2              # JSON formatting
)
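
To see how chunk_size and chunk_overlap interact, consider a plain character window that advances by chunk_size minus chunk_overlap on each step. This is a deliberately naive illustration: the package's splitter is markdown-aware and will place boundaries differently, and `sliding_window_chunks` is a hypothetical function written only for this example.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000,
                          chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    # Each chunk repeats the last character of the previous one.
    print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

A larger overlap preserves more context across boundaries at the cost of more chunks and more duplicated text.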

Processing Options

result = processor.process(
    file_path="input.pdf",
    output_dir="output/",
    save_images=True,                    # Save extracted images
    include_image_summaries=False,       # Add image summaries to chunks
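    image_summarizer=my_summarizer_func, # Your own callable: image path -> description
    skip_conversion=False,               # Set True to skip conversion (input is already markdown)
    skip_chunking=False,                 # Set True to stop after conversion (no chunking)
    skip_export=False                    # Set True to skip the JSON export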
)

🔬 Advanced Usage

Custom Image Summarization

def summarize_image(image_path: str) -> str:
    """Your custom image summarization logic."""
    # Example: Use vision AI model
    from my_vision_model import analyze_image
    return analyze_image(image_path)

processor = MarkitDownProcessor()
result = processor.process(
    file_path="document.pdf",
    output_dir="output/",
    include_image_summaries=True,
    image_summarizer=summarize_image
)
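
When no vision model is available, a stand-in summarizer makes it possible to exercise the pipeline end to end. The sketch below follows the signature shown above (an image path in, a string out) and uses only the standard library; `describe_image_file` is a hypothetical name, and its output is file metadata rather than a real description of the image.

```python
from pathlib import Path


def describe_image_file(image_path: str) -> str:
    """Return a short placeholder description built from file metadata."""
    path = Path(image_path)
    size_kb = path.stat().st_size / 1024
    file_type = path.suffix.lstrip(".").upper() or "UNKNOWN"
    return f"Image {path.name} ({file_type}, {size_kb:.1f} KB)"


if __name__ == "__main__":
    import tempfile

    # Write a dummy 2 KB file so the example runs without a real image.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "page1_img1.png"
        sample.write_bytes(b"\0" * 2048)
        print(describe_image_file(str(sample)))
```

Such a function could be passed as image_summarizer in place of summarize_image while testing.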

Batch Processing

processor = MarkitDownProcessor()

files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
results = processor.process_batch(
    file_paths=files,
    output_dir="output/"
)

for result in results:
    if "error" in result:
        print(f"Failed: {result['input_file']} - {result['error']}")
    else:
        print(f"Success: {result['input_file']}")
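
For larger batches it is often more convenient to separate the results once and retry only the failures. This sketch relies on the two keys used in the loop above ("error" and "input_file"); `partition_results` is an illustrative helper, not a package function.

```python
from typing import List, Tuple


def partition_results(results: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split batch results into (succeeded, failed) using the 'error' key."""
    succeeded, failed = [], []
    for result in results:
        (failed if "error" in result else succeeded).append(result)
    return succeeded, failed


if __name__ == "__main__":
    # Hand-written sample results shaped like the ones in the loop above.
    sample = [
        {"input_file": "doc1.pdf"},
        {"input_file": "doc2.docx", "error": "unsupported encoding"},
        {"input_file": "doc3.pptx"},
    ]
    ok, bad = partition_results(sample)
    retry_paths = [result["input_file"] for result in bad]
    print(f"{len(ok)} succeeded, {len(bad)} failed; retry: {retry_paths}")
```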

Individual Step Processing

processor = MarkitDownProcessor()

# Only convert to markdown
conversion = processor.convert_only(
    file_path="document.pdf",
    output_dir="output/"
)

# Only chunk existing markdown
chunks = processor.chunk_only(
    markdown_path="document.md"
)

# Only export chunks
processor.export_only(
    chunks=chunks,
    output_path="output/chunks.json",
    source_info={"source": "document.md"}
)

Custom Markdown Header Splitting

from markitdown_chunker import DocumentChunker

chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True,
    headers_to_split_on=[
        ("#", "Title"),
        ("##", "Section"),
        ("###", "Subsection"),
        ("####", "Paragraph")
    ]
)

chunks = chunker.chunk_file("document.md")
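
The pairs in headers_to_split_on map a markdown header prefix to the metadata key under which that header's title is stored. The pure-Python sketch below illustrates the idea; it is not the package's or LangChain's implementation, `split_on_headers` is a hypothetical function, and it ignores edge cases such as `#` lines inside fenced code.

```python
from typing import Dict, List, Tuple


def split_on_headers(markdown: str,
                     headers_to_split_on: List[Tuple[str, str]]) -> List[dict]:
    """Split markdown at the given headers and tag each section with them."""
    sections: List[dict] = []
    active: Dict[int, Tuple[str, str]] = {}  # header level -> (label, title)
    buffer: List[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            metadata = {label: title for _, (label, title) in sorted(active.items())}
            sections.append({"text": text, "metadata": metadata})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for prefix, label in headers_to_split_on:
            if stripped.startswith(prefix + " "):
                flush()
                level = len(prefix)
                # A new header closes headers at the same or deeper level.
                for old_level in [lvl for lvl in active if lvl >= level]:
                    del active[old_level]
                active[level] = (label, stripped[len(prefix):].strip())
                break
        else:
            buffer.append(line)  # ordinary content line
    flush()
    return sections


if __name__ == "__main__":
    doc = "# Guide\nintro\n## Setup\nstep one\n# Appendix\nnotes"
    for section in split_on_headers(doc, [("#", "Title"), ("##", "Section")]):
        print(section)
```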

📤 Output Format

JSON Structure

{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "output_dir": "output/",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Chunk content here...",
      "metadata": {
        "Header 1": "Introduction",
        "Header 2": "Overview",
        "sub_chunk_index": 0,
        "total_sub_chunks": 1,
        "source_file": "output/document.md",
        "chunk_size_config": 1000,
        "chunk_overlap_config": 200
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 48392,
    "avg_chunk_size": 1152.19,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-09T10:30:45.123456"
}
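
The header keys in each chunk's metadata make it straightforward to regroup chunks by section, for example before indexing them. The sketch below assumes the "Header 1" key shown above and a "text" field on every chunk; `group_by_header` and the "(no header)" fallback label are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_header(chunks: List[dict],
                    header_key: str = "Header 1") -> Dict[str, List[str]]:
    """Group chunk texts by the value of one header key in their metadata."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for chunk in chunks:
        title = chunk.get("metadata", {}).get(header_key, "(no header)")
        groups[title].append(chunk["text"])
    return dict(groups)


if __name__ == "__main__":
    # Hand-written chunks shaped like the entries in the JSON above.
    sample = [
        {"text": "First part", "metadata": {"Header 1": "Introduction"}},
        {"text": "Second part", "metadata": {"Header 1": "Introduction"}},
        {"text": "Loose text", "metadata": {}},
    ]
    for title, texts in group_by_header(sample).items():
        print(f"{title}: {len(texts)} chunk(s)")
```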

🛠️ CLI Reference

markitdown-chunker [-h] [--convert-only | --chunk-only | --no-export]
                    [--chunk-size CHUNK_SIZE] [--overlap OVERLAP]
                    [--no-markdown-splitter] [--no-images]
                    [--include-image-summaries] [--json-indent JSON_INDENT]
                    [--list-formats] [--version] [-v]
                    input output

Positional Arguments:
  input                 Input file path
  output                Output directory

Optional Arguments:
  -h, --help            Show help message
  --convert-only        Only convert to markdown
  --chunk-only          Only chunk existing markdown
  --no-export           Skip JSON export
  --chunk-size SIZE     Maximum chunk size (default: 1000)
  --overlap SIZE        Chunk overlap (default: 200)
  --no-markdown-splitter Disable markdown-aware splitting
  --no-images           Don't save extracted images
  --include-image-summaries
                        Add image summaries to chunks
  --json-indent N       JSON indentation (default: 2)
  --list-formats        List supported formats
  --version             Show version
  -v, --verbose         Enable verbose output
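
When driving the CLI from another Python script, building the argument list programmatically avoids quoting mistakes. This sketch uses only flags listed above and mirrors the mutual exclusion of --convert-only and --chunk-only from the usage line; `build_command` is a hypothetical helper and does not run anything itself.

```python
from typing import List, Optional


def build_command(input_path: str, output_dir: str,
                  chunk_size: Optional[int] = None,
                  overlap: Optional[int] = None,
                  convert_only: bool = False,
                  chunk_only: bool = False,
                  no_images: bool = False) -> List[str]:
    """Assemble an argument list for the markitdown-chunker CLI."""
    if convert_only and chunk_only:
        raise ValueError("--convert-only and --chunk-only are mutually exclusive")

    command = ["markitdown-chunker", input_path, output_dir]
    if convert_only:
        command.append("--convert-only")
    if chunk_only:
        command.append("--chunk-only")
    if chunk_size is not None:
        command += ["--chunk-size", str(chunk_size)]
    if overlap is not None:
        command += ["--overlap", str(overlap)]
    if no_images:
        command.append("--no-images")
    return command


if __name__ == "__main__":
    print(build_command("input.pdf", "output/", chunk_size=2000, overlap=400))
```

The resulting list can be passed to subprocess.run(command, check=True) once the package is installed.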

🧪 Development

Setup Development Environment

git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e ".[dev]"

Run Tests

pytest tests/
pytest --cov=markitdown_chunker tests/

Code Formatting

black markitdown_chunker/
flake8 markitdown_chunker/
mypy markitdown_chunker/

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built on top of markitdown by Microsoft
Uses LangChain text splitters for intelligent chunking

📞 Support

🗺️ Roadmap

Support for more document formats
Advanced chunking strategies (semantic, sentence-based)
Integration with vector databases
Web UI for document processing
Cloud storage integration (S3, GCS, Azure)
Parallel batch processing
Custom output formats (CSV, Parquet, etc.)

Made with ❤️ by the MarkitDown Chunker community

Project details

Maintainers

naveenar

Release history

This version

0.1.0

Oct 10, 2025

Download files

Source Distribution

markitdown_chunker-0.1.0.tar.gz (42.8 kB)

Uploaded Oct 10, 2025 Source

Built Distribution

markitdown_chunker-0.1.0-py3-none-any.whl (36.1 kB)

Uploaded Oct 10, 2025 Python 3

File details

Details for the file markitdown_chunker-0.1.0.tar.gz.

File metadata

Download URL: markitdown_chunker-0.1.0.tar.gz
Upload date: Oct 10, 2025
Size: 42.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markitdown_chunker-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ab81de683514dd282c8264cf2a6b394ed8b3e102f7900da326f7378e892c2521`
MD5	`2f62fc2b3403ca4bf1a52425951f9c9a`
BLAKE2b-256	`6b382597def5d2f3728fdff57443196bc46c4d3796df9fa2b80c5019feafd2e4`

Provenance

The following attestation bundles were made for markitdown_chunker-0.1.0.tar.gz:

Publisher: publish.yml on Naveenkumarar/markitdown-chunker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: markitdown_chunker-0.1.0.tar.gz
- Subject digest: ab81de683514dd282c8264cf2a6b394ed8b3e102f7900da326f7378e892c2521
- Sigstore transparency entry: 598092413
- Sigstore integration time: Oct 10, 2025
Source repository:
- Permalink: Naveenkumarar/markitdown-chunker@5162fa6ec05d6829383785f2f3b7feb3bd5573c8
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Naveenkumarar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5162fa6ec05d6829383785f2f3b7feb3bd5573c8
- Trigger Event: release

File details

Details for the file markitdown_chunker-0.1.0-py3-none-any.whl.

File metadata

Download URL: markitdown_chunker-0.1.0-py3-none-any.whl
Upload date: Oct 10, 2025
Size: 36.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markitdown_chunker-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0b39bc2dfb2c398af9262f1e0382950812b399060b1c8b6d30df2a5e80760ee8`
MD5	`4d1ad673dc3d9cb73a00208140b53c41`
BLAKE2b-256	`e20528702b3ff63d596a6e1f5b86f5157cdb871b0ade1fd94d5c60eb6cd99a06`

Provenance

The following attestation bundles were made for markitdown_chunker-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Naveenkumarar/markitdown-chunker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: markitdown_chunker-0.1.0-py3-none-any.whl
- Subject digest: 0b39bc2dfb2c398af9262f1e0382950812b399060b1c8b6d30df2a5e80760ee8
- Sigstore transparency entry: 598092415
- Sigstore integration time: Oct 10, 2025
Source repository:
- Permalink: Naveenkumarar/markitdown-chunker@5162fa6ec05d6829383785f2f3b7feb3bd5573c8
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Naveenkumarar
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5162fa6ec05d6829383785f2f3b7feb3bd5573c8
- Trigger Event: release
