Universal document to Markdown converter with quality/fast dual-mode pipeline

Any2MD - Format Factory Dispatch Engine

Python 3.10+ | License: MIT

Any2MD converts documents to Markdown with two processing modes:

  • Quality Mode: Uses specialized converters (pypandoc, mammoth, markdownify) for superior output
  • Fast Mode: Uses markitdown for quick "good enough" results

Features

  • Dual-Mode Pipeline: Quality vs Fast mode for optimal results
  • PDF Heavy Channel: MarkItDown (primary) + pypdfium2 (fallback)
  • Async Batch Processing: Concurrent file conversion with Semaphore control
  • Smart Retry: Tenacity-based retry for transient failures
  • Non-Destructive Output: Auto-incremented filenames, never overwrites
  • Organized Output: Folder batch outputs to {folder}_converted/
  • BAT Drag-and-Drop: Zero-config usage on Windows
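
The Semaphore-controlled batch processing listed above can be sketched roughly as follows. This is a minimal illustration and not the code in pipeline.py: `convert_one` and `convert_batch` are hypothetical names, and the conversion itself is a stand-in that only builds a heading from the file name.

```python
import asyncio
from pathlib import Path


def convert_one(path: Path) -> str:
    """Hypothetical stand-in for a blocking single-file conversion."""
    return f"# {path.stem}\n"


async def convert_batch(paths: list[Path], concurrency: int = 4) -> dict:
    """Convert files concurrently, with at most `concurrency` running at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(path: Path) -> str:
        async with semaphore:
            # Run the blocking converter in a thread so the event loop stays free.
            return await asyncio.to_thread(convert_one, path)

    # return_exceptions=True keeps one failed file from cancelling the others;
    # a failure shows up as an exception object in that file's result slot.
    results = await asyncio.gather(
        *(worker(p) for p in paths), return_exceptions=True
    )
    return dict(zip(paths, results))


if __name__ == "__main__":
    print(asyncio.run(convert_batch([Path("a.docx"), Path("b.pdf")], concurrency=2)))
```

The semaphore only limits how many conversions run at the same time; all tasks are still created up front, which is usually fine for folder-sized batches.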

Quick Start

Installation

# From source (development)
git clone https://github.com/1StepMore/Any2MD.git
cd Any2MD
pip install -e .

# From PyPI
pip install Any2MD-1StepMore

Usage

Drag-and-Drop (Windows)

1. Drag a file onto bat\run.bat
2. Done! The Markdown file appears next to the original

Drag-and-Drop Folder (Windows)

1. Drag a folder onto bat\run.bat
2. Done! Markdown files appear in {folder}_converted/
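
The output rules described above (Markdown next to the original for a single file, a {folder}_converted/ directory for a folder, and no overwriting) could look roughly like this. The function names, the placement of {folder}_converted/ as a sibling of the input folder, and the `_1`, `_2` suffix scheme are all assumptions made for illustration; the naming Any2MD actually uses may differ.

```python
from pathlib import Path


def output_dir_for(input_path: Path) -> Path:
    """Folder input -> sibling '<folder>_converted'; file input -> the file's own directory."""
    if input_path.is_dir():
        return input_path.parent / f"{input_path.name}_converted"
    return input_path.parent


def non_destructive_path(directory: Path, stem: str, suffix: str = ".md") -> Path:
    """Return a path in `directory` that does not exist yet.

    Tries '<stem>.md' first, then '<stem>_1.md', '<stem>_2.md', and so on
    (assumed suffix scheme).
    """
    candidate = directory / f"{stem}{suffix}"
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

Checking `exists()` before writing is not race-free when several processes write to the same directory; for a single-process batch it is usually sufficient.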

Command Line

# Single file conversion
python cli.py -i document.pdf

# With quality mode (better output)
python cli.py -i document.docx --mode quality

# Batch folder processing
python cli.py -i ./documents --concurrency 4

# Verbose logging
python cli.py -i document.pdf --verbose

# Full options
python cli.py --help

Python API

from any2md import convert_to_markdown

# Basic usage
md = convert_to_markdown("document.pdf")
print(md)

# With options
md = convert_to_markdown("document.docx", mode="quality")
md = convert_to_markdown("document.pdf", mode="fast", pdf_engine="heavy")

# Error handling
from any2md import (
    convert_to_markdown,
    FileTooLargeError,
    UnsupportedFormatError,
    ConversionError,
)

try:
    md = convert_to_markdown("document.pdf")
except FileTooLargeError:
    print("File too large (max 50MB)")
except UnsupportedFormatError:
    print("Format not supported")
except ConversionError as e:
    print(f"Conversion failed: {e}")

CLI Options

| Option | Short | Description | Default |
| --- | --- | --- | --- |
| --input | -i | Input file or folder | Required |
| --output | -o | Output directory | Same as input |
| --mode | - | quality or fast | From config |
| --pdf-engine | - | light (markitdown) or heavy (MarkItDown with pypdfium2 fallback) | From config |
| --concurrency | - | Max parallel conversions | From config |
| --config | - | Config file path | config.yaml |
| --verbose | -v | Enable debug logging | false |

Configuration

Edit config.yaml:

# Output mode: quality (best) or fast (quick)
output_mode: fast

# PDF engine: light (markitdown) or heavy (MarkItDown -> pypdfium2)
pdf_engine: light

# Max file size (MB) - files larger are skipped
max_file_size: 50

# Async concurrency level
concurrency: 4

# Retry attempts for transient failures
retry_count: 3

Tip: You can also read the configuration programmatically with: from wheels.config import get_config
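
Since several CLI options default to "From config", one plausible way to combine the two sources is to layer them: built-in defaults, then config.yaml, then any CLI option that was actually given. The sketch below assumes that precedence; `Settings` and `resolve_settings` are hypothetical names, and a plain dict stands in for the parsed YAML file.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    """Defaults mirror the sample config.yaml above."""

    output_mode: str = "fast"
    pdf_engine: str = "light"
    max_file_size: int = 50  # MB
    concurrency: int = 4
    retry_count: int = 3

    def __post_init__(self) -> None:
        # Reject values the README does not list as valid.
        if self.output_mode not in {"quality", "fast"}:
            raise ValueError(f"output_mode must be 'quality' or 'fast', got {self.output_mode!r}")
        if self.pdf_engine not in {"light", "heavy"}:
            raise ValueError(f"pdf_engine must be 'light' or 'heavy', got {self.pdf_engine!r}")
        if self.concurrency < 1:
            raise ValueError("concurrency must be at least 1")


def resolve_settings(config: dict, cli_overrides: dict) -> Settings:
    """Layer values: defaults < config file < CLI options that were given (not None)."""
    known = {f.name for f in fields(Settings)}
    merged = {k: v for k, v in config.items() if k in known}
    merged.update(
        {k: v for k, v in cli_overrides.items() if k in known and v is not None}
    )
    return Settings(**merged)
```

For example, `resolve_settings({"output_mode": "fast"}, {"output_mode": "quality"})` yields a `Settings` object whose `output_mode` is "quality", while options left as `None` on the command line keep their config values.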

Architecture

Any2MD/
├── bat/
│   └── run.bat              # Windows drag-and-drop entry
├── wheels/
│   ├── converters/         # Format-specific converters
│   │   ├── converter_pandoc.py   # Quality: docx/pptx/xlsx
│   │   ├── converter_mammoth.py   # Quality: complex DOCX
│   │   ├── converter_html.py      # Quality: HTML
│   │   ├── converter_pdf.py       # Dynamic: PDF (light/heavy)
│   │   ├── converter_passthrough.py # .md/.txt passthrough
│   │   └── converter_markitdown.py # Fast: csv/json/xml/yaml/epub/zip
│   ├── dispatcher.py        # Format routing
│   ├── fast_lane.py        # markitdown wrapper
│   ├── cleaner.py           # Text post-processing
│   ├── logger.py            # Logging configuration
│   └── exceptions.py       # Exception hierarchy
├── any2md/                  # Public package wrapper
│   └── __init__.py          # Re-exports convert_to_markdown + exceptions
├── tests/                   # Test suite (83 tests)
│   ├── test_api.py
│   ├── test_exceptions.py
│   ├── test_config.py
│   ├── test_dispatcher.py
│   └── test_integration.py
├── .github/workflows/ci.yml  # GitHub Actions CI/CD
├── cli.py                  # Typer CLI entry
├── pipeline.py             # Async batch orchestration
├── pyproject.toml          # Package metadata (pip install -e .)
├── config.yaml             # Configuration
├── requirements.txt         # Dependencies
├── CHANGELOG.md             # Version history
└── CONTRIBUTING.md          # CI/CD release process

Supported Formats

| Format | Quality Mode | Fast Mode |
| --- | --- | --- |
| .docx | pypandoc → mammoth fallback | markitdown |
| .pptx / .xlsx | pypandoc | markitdown |
| .html / .htm | markdownify | markitdown |
| .pdf | MarkItDown (heavy) or markitdown (light) | markitdown |
| .md / .txt | passthrough | passthrough |
| .csv / .json / .xml | - | markitdown |
| .yaml / .yml | - | markitdown |
| .epub | - | markitdown |
| .zip | - | markitdown |
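
The routing in the table above can be read as a lookup from file extension and mode to an ordered list of converters to try. The sketch below is only an illustration of that idea, not the contents of dispatcher.py: the `route` function is hypothetical, the strings are labels taken from the table rather than real identifiers, and formats with no quality-mode converter are assumed to fall through to markitdown.

```python
from pathlib import Path

# Quality-mode converter chains, tried left to right.
QUALITY_ROUTES: dict[str, list[str]] = {
    ".docx": ["pypandoc", "mammoth"],
    ".pptx": ["pypandoc"],
    ".xlsx": ["pypandoc"],
    ".html": ["markdownify"],
    ".htm": ["markdownify"],
}
PASSTHROUGH = {".md", ".txt"}
FAST_ONLY = {".csv", ".json", ".xml", ".yaml", ".yml", ".epub", ".zip"}
SUPPORTED = set(QUALITY_ROUTES) | PASSTHROUGH | FAST_ONLY | {".pdf"}


def route(path: str, mode: str = "fast", pdf_engine: str = "light") -> list[str]:
    """Return the ordered converter labels to try for `path`."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        raise ValueError(f"Unsupported format: {ext!r}")
    if ext in PASSTHROUGH:
        return ["passthrough"]
    if ext == ".pdf" and mode == "quality":
        # Heavy engine: MarkItDown first, pypdfium2 as the fallback.
        return ["markitdown", "pypdfium2"] if pdf_engine == "heavy" else ["markitdown"]
    if mode == "quality" and ext in QUALITY_ROUTES:
        return list(QUALITY_ROUTES[ext])
    # Fast mode, or a format with no quality-mode converter (assumption).
    return ["markitdown"]
```

Keeping the routes in a plain dictionary means a new format is one more entry in the table rather than another branch in the function.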

PDF Engine

Light Mode: Uses markitdown with zero external dependencies.

Heavy Mode: Uses Microsoft MarkItDown for table/layout detection and falls back to pypdfium2 for raw text extraction if that fails.

Install heavy dependencies:

# Ubuntu/Debian
sudo apt install tesseract-ocr poppler-utils

# Windows (requires admin)
winget install --id UB-Mannheim.TesseractOCR -e
winget install --id oschwartz10612.Poppler -e
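
The primary-then-fallback behaviour of heavy mode can be sketched as a generic extractor chain. Nothing here calls MarkItDown or pypdfium2 directly: the extractors are passed in as plain callables, `extract_with_fallback` and `AllExtractorsFailed` are hypothetical names, and treating empty output as a failure is an assumption.

```python
from collections.abc import Callable, Sequence


class AllExtractorsFailed(Exception):
    """Raised when every extractor in the chain fails (hypothetical name)."""


def extract_with_fallback(
    path: str,
    extractors: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, extractor) in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:
            # Remember why this extractor failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise AllExtractorsFailed("; ".join(errors))
```

In practice the list would hold two callables, one wrapping the MarkItDown conversion and one wrapping raw text extraction with pypdfium2, in that order. Returning the name of the extractor that succeeded makes it easy to log when the fallback was used.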

Error Handling

  • File too large: Files over the max_file_size limit (50 MB in the config above) are skipped and an error is logged
  • Unsupported format: Clear error message with supported formats
  • Missing dependencies: Install instructions in error message
  • Transient failures: Automatic retry (up to 3 attempts)
  • Batch continuation: Single file failure doesn't stop batch processing
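
The automatic retry for transient failures can be illustrated with a hand-rolled loop. Any2MD itself uses Tenacity for this; the sketch below deliberately swaps in a plain standard-library loop so it runs without extra dependencies. `TransientError`, `with_retry`, and the exponential backoff delays are assumptions for illustration.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a briefly locked file)."""


def with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Only `TransientError` is retried; any other exception propagates immediately, which matches the idea that an unsupported format or an oversized file should fail fast rather than be retried. The `sleep` parameter exists so tests can replace the real delay.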

Development

# Run CLI help
python cli.py --help

# Test with sample files
python cli.py -i sample.docx --mode quality

# Check environment
python check_env.py

License

MIT License - see LICENSE file for details

Credits

Built with these excellent open-source libraries:
