Universal document to Markdown converter with quality/fast dual-mode pipeline

Any2MD - Format Factory Dispatch Engine

Requires Python 3.10+. License: MIT.

Any2MD converts documents to Markdown with two processing modes:

  • Quality Mode: Uses specialized converters (pypandoc, mammoth, markdownify) for superior output
  • Fast Mode: Uses markitdown for quick "good enough" results

Features

  • Dual-Mode Pipeline: Quality vs Fast mode for optimal results
  • PDF Heavy Channel: MarkItDown (primary) + pypdfium2 (fallback)
  • Async Batch Processing: Concurrent file conversion with Semaphore control
  • Smart Retry: Tenacity-based retry for transient failures
  • Non-Destructive Output: Auto-incremented filenames, never overwrites
  • Organized Output: Folder batch outputs to {folder}_converted/
  • BAT Drag-and-Drop: Zero-config usage on Windows
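
The Semaphore-controlled batch processing listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual pipeline.py: convert_all and fake_convert are hypothetical names, and the choice to return errors instead of raising them is an assumption made so that one failed file does not stop the batch.

```python
import asyncio


async def convert_all(paths, convert_one, concurrency=4):
    """Run convert_one(path) for every path, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(path):
        # The semaphore caps how many conversions are in flight at once.
        async with semaphore:
            try:
                return path, await convert_one(path), None
            except Exception as exc:
                # Record the failure and keep going with the other files.
                return path, None, exc

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_convert(path):
    """Hypothetical stand-in for a real converter."""
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError(f"cannot convert {path}")
    return f"# {path}"


if __name__ == "__main__":
    batch = ["a.docx", "b.bad", "c.pdf"]
    for path, markdown, error in asyncio.run(convert_all(batch, fake_convert, 2)):
        print(path, "ok" if error is None else f"failed: {error}")
```

Each result is a (path, markdown, error) tuple, so the caller can report failures after the whole batch has finished.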

Quick Start

Installation

# From source (development)
git clone https://github.com/1StepMore/Any2MD.git
cd Any2MD
pip install -e .

# Or install from PyPI (when available)
pip install any2md

Usage

Drag-and-Drop (Windows)

1. Drag a file onto bat\run.bat
2. Done! The Markdown file appears next to the original

Drag-and-Drop Folder (Windows)

1. Drag a folder onto bat\run.bat
2. Done! Markdown files appear in {folder}_converted/
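
For illustration, the {folder}_converted/ naming and the never-overwrite behaviour could be derived along these lines. This is a sketch under assumptions: the README does not state where the output folder is created or which suffix scheme is used, so the sibling-directory placement and the _1, _2 suffixes are guesses, and both function names are hypothetical.

```python
from pathlib import Path


def batch_output_dir(folder: Path) -> Path:
    """Return a sibling directory named {folder}_converted (placement is assumed)."""
    return folder.with_name(folder.name + "_converted")


def non_clobbering_path(target: Path) -> Path:
    """Return target, or target with _1, _2, ... appended if it already exists."""
    if not target.exists():
        return target
    counter = 1
    while True:
        # report.md -> report_1.md -> report_2.md (suffix scheme is assumed)
        candidate = target.with_name(f"{target.stem}_{counter}{target.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1


if __name__ == "__main__":
    out_dir = batch_output_dir(Path("documents"))
    print(out_dir)
    print(non_clobbering_path(out_dir / "report.md"))
```

Note that batch_output_dir needs a path with a real final component; Path(".") has an empty name and would have to be resolved first.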

Command Line

# Single file conversion
python cli.py -i document.pdf

# With quality mode (better output)
python cli.py -i document.docx --mode quality

# Batch folder processing
python cli.py -i ./documents --concurrency 4

# Verbose logging
python cli.py -i document.pdf --verbose

# Full options
python cli.py --help

Python API

from any2md import convert_to_markdown

# Basic usage
md = convert_to_markdown("document.pdf")
print(md)

# With options
md = convert_to_markdown("document.docx", mode="quality")
md = convert_to_markdown("document.pdf", mode="fast", pdf_engine="heavy")

# Error handling
from any2md import (
    convert_to_markdown,
    FileTooLargeError,
    UnsupportedFormatError,
    ConversionError,
)

try:
    md = convert_to_markdown("document.pdf")
except FileTooLargeError:
    print("File too large (max 50MB)")
except UnsupportedFormatError:
    print("Format not supported")
except ConversionError as e:
    print(f"Conversion failed: {e}")

CLI Options

Option         Short  Description                                Default
--input        -i     Input file or folder                       Required
--output       -o     Output directory                           Same as input
--mode         -      quality or fast                            From config
--pdf-engine   -      light (markitdown) or heavy (MarkItDown)   From config
--concurrency  -      Max parallel conversions                   From config
--config       -      Config file path                           config.yaml
--verbose      -v     Enable debug logging                       false
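
As a rough picture of how these options fit together, here is a parser that mirrors the table. The architecture section names Typer as the real CLI framework; this sketch deliberately swaps in argparse from the standard library so it runs without extra installs. Using None to mean "fall back to config.yaml" is an assumption, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same option names as the table above."""
    parser = argparse.ArgumentParser(
        prog="cli.py", description="Convert documents to Markdown"
    )
    parser.add_argument("-i", "--input", required=True, help="Input file or folder")
    parser.add_argument("-o", "--output", default=None,
                        help="Output directory (default: same as input)")
    # None means "not given on the command line" (assumed: take the config value).
    parser.add_argument("--mode", choices=["quality", "fast"], default=None)
    parser.add_argument("--pdf-engine", choices=["light", "heavy"], default=None)
    parser.add_argument("--concurrency", type=int, default=None,
                        help="Max parallel conversions")
    parser.add_argument("--config", default="config.yaml", help="Config file path")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable debug logging")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-i", "./documents", "--concurrency", "4"])
    print(args)
```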

Configuration

Edit config.yaml:

# Output mode: quality (best) or fast (quick)
output_mode: fast

# PDF engine: light (markitdown) or heavy (MarkItDown -> pypdfium2)
pdf_engine: light

# Max file size (MB) - files larger are skipped
max_file_size: 50

# Async concurrency level
concurrency: 4

# Retry attempts for transient failures
retry_count: 3

Tip: You can also access the config programmatically: from wheels.config import get_config
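
Several CLI options default to "From config", which implies a merge of file values and command-line overrides. One way that merge might look is sketched below. It is not the project's wheels.config module: Settings and resolve_settings are hypothetical names, the defaults are copied from the config.yaml shown above, and the YAML parsing step is left out so the example stays standard-library only.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    output_mode: str = "fast"
    pdf_engine: str = "light"
    max_file_size: int = 50  # MB
    concurrency: int = 4
    retry_count: int = 3


def resolve_settings(file_values: dict, cli_overrides: dict) -> Settings:
    """Merge defaults < config file values < CLI overrides that are not None."""
    known = {f.name for f in fields(Settings)}
    merged = {k: v for k, v in file_values.items() if k in known}
    merged.update(
        {k: v for k, v in cli_overrides.items() if k in known and v is not None}
    )
    settings = Settings(**merged)
    if settings.output_mode not in ("quality", "fast"):
        raise ValueError(f"unknown output_mode: {settings.output_mode!r}")
    if settings.pdf_engine not in ("light", "heavy"):
        raise ValueError(f"unknown pdf_engine: {settings.pdf_engine!r}")
    if settings.concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return settings


if __name__ == "__main__":
    from_file = {"output_mode": "fast", "concurrency": 4}
    from_cli = {"output_mode": "quality", "concurrency": None}
    print(resolve_settings(from_file, from_cli))
```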

Architecture

Any2MD/
├── bat/
│   └── run.bat              # Windows drag-and-drop entry
├── wheels/
│   ├── converters/         # Format-specific converters
│   │   ├── converter_pandoc.py   # Quality: docx/pptx/xlsx
│   │   ├── converter_mammoth.py   # Quality: complex DOCX
│   │   ├── converter_html.py      # Quality: HTML
│   │   ├── converter_pdf.py       # Dynamic: PDF (light/heavy)
│   │   ├── converter_passthrough.py # .md/.txt passthrough
│   │   └── converter_markitdown.py # Fast: csv/json/xml/yaml/epub/zip
│   ├── dispatcher.py        # Format routing
│   ├── fast_lane.py        # markitdown wrapper
│   ├── cleaner.py           # Text post-processing
│   └── logger.py           # Logging configuration
├── cli.py                  # Typer CLI entry
├── pipeline.py              # Async batch orchestration
├── config.yaml             # Configuration
└── requirements.txt        # Dependencies

Supported Formats

Format                Quality Mode                              Fast Mode
.docx                 pypandoc → mammoth fallback               markitdown
.pptx / .xlsx         pypandoc                                  markitdown
.html / .htm          markdownify                               markitdown
.pdf                  MarkItDown (heavy) or markitdown (light)  markitdown
.md / .txt            passthrough                               passthrough
.csv / .json / .xml   -                                         markitdown
.yaml / .yml          -                                         markitdown
.epub                 -                                         markitdown
.zip                  -                                         markitdown
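
The routing in this table can be expressed as a small lookup. The sketch below is an illustration, not the real dispatcher.py: route is a hypothetical function that returns converter names as plain strings, in the order they would be tried. Two behaviours are assumptions: formats marked "-" in Quality Mode are routed to markitdown, and unknown extensions raise ValueError as a stand-in for UnsupportedFormatError.

```python
from pathlib import Path

# Quality-mode converter chains, copied from the table above.
QUALITY_ROUTES = {
    ".docx": ["pypandoc", "mammoth"],
    ".pptx": ["pypandoc"],
    ".xlsx": ["pypandoc"],
    ".html": ["markdownify"],
    ".htm": ["markdownify"],
}
PASSTHROUGH = {".md", ".txt"}
MARKITDOWN_ONLY = {".csv", ".json", ".xml", ".yaml", ".yml", ".epub", ".zip"}


def route(path, mode="fast", pdf_engine="light"):
    """Return the converter names to try, in order, for the given file."""
    ext = Path(path).suffix.lower()
    if ext in PASSTHROUGH:
        return ["passthrough"]
    if ext == ".pdf":
        if mode == "quality" and pdf_engine == "heavy":
            return ["MarkItDown", "pypdfium2"]
        return ["markitdown"]
    if ext in MARKITDOWN_ONLY:
        # Assumed: no quality converter exists, so both modes use markitdown.
        return ["markitdown"]
    if ext in QUALITY_ROUTES:
        return QUALITY_ROUTES[ext] if mode == "quality" else ["markitdown"]
    raise ValueError(f"unsupported format: {ext or path}")


if __name__ == "__main__":
    for name in ["report.docx", "slides.pptx", "scan.pdf", "notes.md"]:
        print(name, route(name, mode="quality", pdf_engine="heavy"))
```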

PDF Engine

Light Mode: Uses markitdown with zero external dependencies.

Heavy Mode: Uses Microsoft MarkItDown for table/layout detection and falls back to pypdfium2 for raw text extraction.
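
The primary-then-fallback behaviour of Heavy Mode follows a common pattern, sketched here with plain callables instead of the real libraries. first_successful and both extractor stubs are hypothetical, and treating empty output as a failure is an assumption rather than documented behaviour.

```python
def first_successful(path, extractors):
    """Try each (name, extract) pair in order; return the first non-empty result."""
    errors = []
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


def primary_stub(path):
    """Hypothetical primary extractor that fails on this input."""
    raise RuntimeError("layout detection failed")


def fallback_stub(path):
    """Hypothetical raw-text extractor."""
    return f"raw text of {path}"


if __name__ == "__main__":
    chain = [("primary", primary_stub), ("fallback", fallback_stub)]
    print(first_successful("scan.pdf", chain))
```

Collecting every extractor's error message makes the final exception more useful than reporting only the last failure.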

Install heavy dependencies:

# Ubuntu/Debian
sudo apt install tesseract-ocr poppler-utils

# Windows (requires admin)
winget install --id UB-Mannheim.TesseractOCR -e
winget install --id oschwartz10612.Poppler -e

Error Handling

  • File too large: Files larger than 50 MB are skipped and an error is logged
  • Unsupported format: Clear error message with supported formats
  • Missing dependencies: Install instructions in error message
  • Transient failures: Automatic retry (up to 3 attempts)
  • Batch continuation: Single file failure doesn't stop batch processing
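
The retry behaviour described above is Tenacity-based in the project. The sketch below hand-rolls an equivalent loop with the standard library so it runs without extra installs. TransientError and with_retry are hypothetical names, and the exponential backoff delays are an assumption, since the README only states the attempt limit.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on TransientError up to `attempts` total tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the caller log it and move on
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("resource busy")
        return "converted"

    print(with_retry(flaky, base_delay=0.01), "after", state["calls"], "calls")
```

Only TransientError is retried; any other exception propagates immediately, which keeps permanent failures such as an unsupported format from being retried pointlessly.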

Development

# Run CLI help
python cli.py --help

# Test with sample files
python cli.py -i sample.docx --mode quality

# Check environment
python check_env.py

License

MIT License - see LICENSE file for details

Credits

Built with open-source libraries, including those named above: pypandoc, mammoth, markdownify, markitdown, pypdfium2, Tenacity, and Typer.
