Universal document to Markdown converter with quality/fast dual-mode pipeline

These details have not been verified by PyPI

Project description

Any2MD - Format Factory Dispatch Engine

Any2MD converts documents to Markdown with two processing modes:

Quality Mode: Uses specialized converters (pypandoc, mammoth, markdownify) for superior output
Fast Mode: Uses markitdown for quick "good enough" results

Features

Dual-Mode Pipeline: Quality mode for higher-fidelity output, Fast mode for speed
PDF Heavy Channel: MarkItDown (primary) + pypdfium2 (fallback)
Async Batch Processing: Concurrent file conversion with Semaphore control
Smart Retry: Tenacity-based retry for transient failures
Non-Destructive Output: Auto-incremented filenames, never overwrites
Organized Output: Folder batch outputs to {folder}_converted/
BAT Drag-and-Drop: Zero-config usage on Windows
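
The Semaphore-controlled batch processing listed above can be pictured with a minimal sketch. It is not the project's `pipeline.py`: `convert_one` and `convert_batch` are hypothetical names, and the conversion is simulated with a short sleep so the example stays self-contained.

```python
import asyncio
from pathlib import Path


async def convert_one(path: Path) -> str:
    """Hypothetical stand-in for a real conversion; it only simulates work."""
    await asyncio.sleep(0.01)
    return f"# {path.stem}\n"


async def convert_batch(paths: list, concurrency: int = 4) -> dict:
    """Convert many files, allowing at most `concurrency` conversions at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(path: Path) -> str:
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await convert_one(path)

    # return_exceptions=True records a failure as a result value instead of
    # aborting the whole batch, so the other files still get converted.
    results = await asyncio.gather(*(guarded(p) for p in paths), return_exceptions=True)
    return dict(zip(paths, results))
```

Calling `asyncio.run(convert_batch(paths, concurrency=4))` returns a mapping from each input path to either its Markdown text or the exception it raised.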

Quick Start

Installation

# From source (development)
git clone https://github.com/1StepMore/Any2MD.git
cd Any2MD
pip install -e .

# From PyPI
pip install Any2MD-1StepMore

Usage

Drag-and-Drop (Windows)

1. Drag a file onto bat\run.bat
2. Done! The Markdown file appears next to the original

Drag-and-Drop Folder (Windows)

1. Drag a folder onto bat\run.bat
2. Done! Markdown files appear in {folder}_converted/
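
How the output location and the never-overwrite behavior might fit together is sketched below. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that `{folder}_converted/` is created as a sibling of the input folder, and that colliding names get a `_1`, `_2`, ... suffix.

```python
from pathlib import Path


def converted_dir_for(folder: Path) -> Path:
    """Return the '{folder}_converted' directory next to the input folder (assumed layout)."""
    return folder.with_name(folder.name + "_converted")


def non_clobbering_path(target: Path) -> Path:
    """Return `target` if unused, else the first free 'name_1.md', 'name_2.md', ..."""
    if not target.exists():
        return target
    counter = 1
    while True:
        # The '_N' suffix scheme is an assumption made for this sketch.
        candidate = target.with_name(f"{target.stem}_{counter}{target.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1
```

With this scheme, converting `report.docx` twice into the same folder would leave `report.md` untouched and write the second result to `report_1.md`.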

Command Line

# Single file conversion
python cli.py -i document.pdf

# With quality mode (better output)
python cli.py -i document.docx --mode quality

# Batch folder processing
python cli.py -i ./documents --concurrency 4

# Verbose logging
python cli.py -i document.pdf --verbose

# Full options
python cli.py --help

Python API

from any2md import convert_to_markdown

# Basic usage
md = convert_to_markdown("document.pdf")
print(md)

# With options
md = convert_to_markdown("document.docx", mode="quality")
md = convert_to_markdown("document.pdf", mode="fast", pdf_engine="heavy")

# Error handling
from any2md import (
    convert_to_markdown,
    FileTooLargeError,
    UnsupportedFormatError,
    ConversionError,
)

try:
    md = convert_to_markdown("document.pdf")
except FileTooLargeError:
    print("File too large (max 50MB)")
except UnsupportedFormatError:
    print("Format not supported")
except ConversionError as e:
    print(f"Conversion failed: {e}")

CLI Options

Option	Short	Description	Default
`--input`	`-i`	Input file or folder	Required
`--output`	`-o`	Output directory	Same as input
`--mode`	-	`quality` or `fast`	From config
`--pdf-engine`	-	`light` (markitdown) or `heavy` (MarkItDown with pypdfium2 fallback)	From config
`--concurrency`	-	Max parallel conversions	From config
`--config`	-	Config file path	`config.yaml`
`--verbose`	`-v`	Enable debug logging	`false`
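
The "From config" defaults in the table suggest that a value given on the command line wins and anything left unset falls back to the config file. A rough sketch of that resolution follows. The real CLI is built with Typer; this version uses the standard library's `argparse` purely for illustration, and `CONFIG`, `build_parser`, and `resolve_options` are hypothetical names.

```python
import argparse

# Assumed config values, mirroring the sample config.yaml in this README.
CONFIG = {"mode": "fast", "pdf_engine": "light", "concurrency": 4}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli.py")
    parser.add_argument("-i", "--input", required=True, help="Input file or folder")
    parser.add_argument("-o", "--output", default=None, help="Output directory")
    # None means "not given on the command line", so the config value applies.
    parser.add_argument("--mode", choices=["quality", "fast"], default=None)
    parser.add_argument("--pdf-engine", choices=["light", "heavy"], default=None)
    parser.add_argument("--concurrency", type=int, default=None)
    parser.add_argument("--config", default="config.yaml")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


def resolve_options(argv: list, config: dict) -> dict:
    """Parse argv; options left unset fall back to the config values."""
    args = vars(build_parser().parse_args(argv))
    for key, fallback in config.items():
        if args.get(key) is None:
            args[key] = fallback
    return args
```

For example, `resolve_options(["-i", "document.pdf", "--mode", "quality"], CONFIG)` keeps `quality` from the command line and takes `pdf_engine` and `concurrency` from the config.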

Configuration

Edit config.yaml:

# Output mode: quality (best) or fast (quick)
output_mode: fast

# PDF engine: light (markitdown) or heavy (MarkItDown -> pypdfium2)
pdf_engine: light

# Max file size (MB) - files larger are skipped
max_file_size: 50

# Async concurrency level
concurrency: 4

# Retry attempts for transient failures
retry_count: 3

Tip: You can also access the configuration programmatically with `from wheels.config import get_config`.
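
To make the meaning of each key concrete, here is a small sketch of a validated settings object using the same keys and defaults as the sample file. It is not the project's `get_config`: the `Settings` class and `settings_from_mapping` are hypothetical, and the mapping is assumed to come from whatever YAML parser loads `config.yaml`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    output_mode: str = "fast"   # "quality" or "fast"
    pdf_engine: str = "light"   # "light" or "heavy"
    max_file_size: int = 50     # in MB; larger files are skipped
    concurrency: int = 4        # parallel conversions
    retry_count: int = 3        # attempts for transient failures

    def __post_init__(self) -> None:
        if self.output_mode not in {"quality", "fast"}:
            raise ValueError(f"output_mode must be 'quality' or 'fast', got {self.output_mode!r}")
        if self.pdf_engine not in {"light", "heavy"}:
            raise ValueError(f"pdf_engine must be 'light' or 'heavy', got {self.pdf_engine!r}")
        if self.max_file_size <= 0 or self.concurrency <= 0 or self.retry_count <= 0:
            raise ValueError("max_file_size, concurrency and retry_count must be positive")


def settings_from_mapping(raw: dict) -> Settings:
    """Build Settings from a parsed config mapping, ignoring unknown keys."""
    known = {f.name for f in fields(Settings)}
    return Settings(**{key: value for key, value in raw.items() if key in known})
```

Missing keys keep their defaults, so `settings_from_mapping({"output_mode": "quality"})` changes only the mode.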

Architecture

Any2MD/
├── bat/
│   └── run.bat              # Windows drag-and-drop entry
├── wheels/
│   ├── converters/         # Format-specific converters
│   │   ├── converter_pandoc.py   # Quality: docx/pptx/xlsx
│   │   ├── converter_mammoth.py   # Quality: complex DOCX
│   │   ├── converter_html.py      # Quality: HTML
│   │   ├── converter_pdf.py       # Dynamic: PDF (light/heavy)
│   │   ├── converter_passthrough.py # .md/.txt passthrough
│   │   └── converter_markitdown.py # Fast: csv/json/xml/yaml/epub/zip
│   ├── dispatcher.py        # Format routing
│   ├── fast_lane.py        # markitdown wrapper
│   ├── cleaner.py           # Text post-processing
│   ├── logger.py            # Logging configuration
│   └── exceptions.py       # Exception hierarchy
├── any2md/                  # Public package wrapper
│   └── __init__.py          # Re-exports convert_to_markdown + exceptions
├── tests/                   # Test suite (83 tests)
│   ├── test_api.py
│   ├── test_exceptions.py
│   ├── test_config.py
│   ├── test_dispatcher.py
│   └── test_integration.py
├── .github/workflows/ci.yml  # GitHub Actions CI/CD
├── cli.py                  # Typer CLI entry
├── pipeline.py             # Async batch orchestration
├── pyproject.toml          # Package metadata (pip install -e .)
├── config.yaml             # Configuration
├── requirements.txt         # Dependencies
├── CHANGELOG.md             # Version history
└── CONTRIBUTING.md          # CI/CD release process

Supported Formats

Format	Quality Mode	Fast Mode
.docx	pypandoc → mammoth fallback	markitdown
.pptx / .xlsx	pypandoc	markitdown
.html / .htm	markdownify	markitdown
.pdf	MarkItDown (heavy) or markitdown (light)	markitdown
.md / .txt	passthrough	passthrough
.csv / .json / .xml	-	markitdown
.yaml / .yml	-	markitdown
.epub	-	markitdown
.zip	-	markitdown
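
The routing in the table can be read as a lookup from file extension and mode to an ordered converter chain. The sketch below transcribes it. It is not the project's `dispatcher.py`: `route` is a hypothetical name, converters are represented by plain strings, and sending the formats marked "-" to markitdown in quality mode is an assumption.

```python
from pathlib import Path

# Quality-mode converter chains, transcribed from the table above.
QUALITY_ROUTES = {
    ".docx": ["pypandoc", "mammoth"],  # mammoth is the fallback
    ".pptx": ["pypandoc"],
    ".xlsx": ["pypandoc"],
    ".html": ["markdownify"],
    ".htm": ["markdownify"],
}
PASSTHROUGH = {".md", ".txt"}
# Formats with no quality-mode converter; assumed to use markitdown in both modes.
FAST_ONLY = {".csv", ".json", ".xml", ".yaml", ".yml", ".epub", ".zip"}


def route(path: str, mode: str = "fast", pdf_engine: str = "light") -> list:
    """Return the ordered list of converter names to try for `path`."""
    ext = Path(path).suffix.lower()
    if ext in PASSTHROUGH:
        return ["passthrough"]
    if ext == ".pdf":
        # Following the table: the heavy chain applies in quality mode only.
        if mode == "quality" and pdf_engine == "heavy":
            return ["markitdown", "pypdfium2"]
        return ["markitdown"]
    if ext in FAST_ONLY:
        return ["markitdown"]
    if ext in QUALITY_ROUTES:
        return list(QUALITY_ROUTES[ext]) if mode == "quality" else ["markitdown"]
    raise ValueError(f"Unsupported format: {ext or path}")
```

Keeping the routes in a dictionary means supporting a new extension is a one-line change rather than another branch.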

PDF Engine

Light Mode: Uses markitdown with zero external dependencies.

Heavy Mode: Uses Microsoft MarkItDown for table/layout detection and falls back to pypdfium2 for raw text extraction if that fails.
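
The primary-then-fallback behavior of heavy mode follows a common pattern: try each extractor in order and keep the first usable result. A minimal sketch of that pattern is below. The function names are hypothetical and the two extractors are stubs, not calls into MarkItDown or pypdfium2.

```python
from typing import Callable, Sequence


def first_successful(extractors: Sequence[Callable[[str], str]], path: str) -> str:
    """Try each extractor in order and return the first non-empty result."""
    errors = []
    for extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # a real implementation would catch narrower errors
            errors.append(f"{extract.__name__}: {exc}")
            continue
        if text.strip():
            return text
        errors.append(f"{extract.__name__}: empty output")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


def layout_extractor(path: str) -> str:
    """Stub for the primary, layout-aware extractor; it always fails here."""
    raise RuntimeError("layout detection failed")


def raw_text_extractor(path: str) -> str:
    """Stub for the raw-text fallback."""
    return "plain text from " + path
```

Treating empty output as a failure matters for scanned PDFs, where an extractor can succeed technically yet return no text.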

Install the system dependencies for heavy mode:

# Ubuntu/Debian
sudo apt install tesseract-ocr poppler-utils

# Windows (requires admin)
winget install --id UB-Mannheim.TesseractOCR -e
winget install --id oschwartz10612.Poppler -e

Error Handling

File too large: Files larger than 50 MB are skipped and an error is logged
Unsupported format: Clear error message listing the supported formats
Missing dependencies: The error message includes install instructions
Transient failures: Automatic retry (up to 3 attempts)
Batch continuation: Single file failure doesn't stop batch processing
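
The retry rule above (up to 3 attempts, transient failures only) can be sketched without any third-party code. The project itself uses Tenacity for this; the loop below is a plain-Python stand-in, and `TransientError` and `with_retries` are hypothetical names.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a briefly locked file."""


def with_retries(func, *args, attempts: int = 3, delay: float = 0.0):
    """Call func(*args), retrying on TransientError for up to `attempts` total tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the caller log it and move on
            time.sleep(delay)
```

Only `TransientError` is retried. Any other exception propagates immediately, which keeps permanent problems such as an unsupported format from being retried pointlessly.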

Development

# Run CLI help
python cli.py --help

# Test with sample files
python cli.py -i sample.docx --mode quality

# Check environment
python check_env.py

License

MIT License - see LICENSE file for details

Credits

Built with these excellent open-source libraries:

pypandoc - Pandoc wrapper
mammoth - DOCX to Markdown
markdownify - HTML to Markdown
markitdown - Microsoft format converter
pypdfium2 - PDF rendering
tenacity - Retry logic
Typer - CLI framework

Project details

Release history

0.3.5 (this version): May 11, 2026
0.3.0: May 7, 2026

Download files

Source Distribution

any2md_1stepmore-0.3.5.tar.gz (17.9 kB)

Uploaded May 11, 2026 Source

Built Distribution

any2md_1stepmore-0.3.5-py3-none-any.whl (21.3 kB)

Uploaded May 11, 2026 Python 3

File details

Details for the file any2md_1stepmore-0.3.5.tar.gz.

File metadata

Download URL: any2md_1stepmore-0.3.5.tar.gz
Upload date: May 11, 2026
Size: 17.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for any2md_1stepmore-0.3.5.tar.gz
Algorithm	Hash digest
SHA256	`7178f2559538632e3c2d70d0662f743dabdda20683c74fb4937e6aa06416050a`
MD5	`e222f1d3749927cc68c0712fe8ac72cc`
BLAKE2b-256	`7d2eb7db3880dfbe4ca126858e6ad24c215a26c662c3336fd972c878411108d2`
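
A downloaded file can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a generic sketch, not part of Any2MD, and `sha256_of` and `matches_published_hash` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value is the SHA256 entry in the table above.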

Provenance

The following attestation bundles were made for any2md_1stepmore-0.3.5.tar.gz:

Publisher: ci.yml on 1StepMore/Any2MD

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: any2md_1stepmore-0.3.5.tar.gz
- Subject digest: 7178f2559538632e3c2d70d0662f743dabdda20683c74fb4937e6aa06416050a
- Sigstore transparency entry: 1506669598
- Sigstore integration time: May 11, 2026
Source repository:
- Permalink: 1StepMore/Any2MD@3afbea8f79e4539dcd49af26835170271a784ae2
- Branch / Tag: refs/tags/v0.3.5
- Owner: https://github.com/1StepMore
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@3afbea8f79e4539dcd49af26835170271a784ae2
- Trigger Event: push

File details

Details for the file any2md_1stepmore-0.3.5-py3-none-any.whl.

File metadata

Download URL: any2md_1stepmore-0.3.5-py3-none-any.whl
Upload date: May 11, 2026
Size: 21.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for any2md_1stepmore-0.3.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`683ce3dfdb35c154138c9ef253341a14e19af9bae1dbb96a0696f4a8dfce2fca`
MD5	`2df15a1ff159bba3a346dfb4e5f1b524`
BLAKE2b-256	`4be2329506cdf7bcf56ef6bb73e56b75ac91beac7537bf1895327a9c794bc856`

Provenance

The following attestation bundles were made for any2md_1stepmore-0.3.5-py3-none-any.whl:

Publisher: ci.yml on 1StepMore/Any2MD

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: any2md_1stepmore-0.3.5-py3-none-any.whl
- Subject digest: 683ce3dfdb35c154138c9ef253341a14e19af9bae1dbb96a0696f4a8dfce2fca
- Sigstore transparency entry: 1506669671
- Sigstore integration time: May 11, 2026
Source repository:
- Permalink: 1StepMore/Any2MD@3afbea8f79e4539dcd49af26835170271a784ae2
- Branch / Tag: refs/tags/v0.3.5
- Owner: https://github.com/1StepMore
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@3afbea8f79e4539dcd49af26835170271a784ae2
- Trigger Event: push
