Universal document to Markdown converter with quality/fast dual-mode pipeline
Any2MD - Format Factory Dispatch Engine
Any2MD converts documents to Markdown with two processing modes:
- Quality Mode: Uses specialized converters (pypandoc, mammoth, markdownify) for superior output
- Fast Mode: Uses markitdown for quick "good enough" results
Features
- Dual-Mode Pipeline: Quality vs Fast mode for optimal results
- PDF Heavy Channel: MarkItDown (primary) + pypdfium2 (fallback)
- Async Batch Processing: Concurrent file conversion with Semaphore control
- Smart Retry: Tenacity-based retry for transient failures
- Non-Destructive Output: Auto-incremented filenames, never overwrites
- Organized Output: Folder batch outputs to {folder}_converted/
- BAT Drag-and-Drop: Zero-config usage on Windows
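The non-destructive and organized-output behaviour listed above could be implemented along these lines. This is a minimal sketch, not Any2MD's actual code: the helper names next_free_path and converted_dir_for, the _1, _2 suffix scheme, and placing the output folder beside the input folder are all assumptions made for illustration.

```python
from pathlib import Path


def next_free_path(target: Path) -> Path:
    """Return target if it is unused, else the first free numbered variant.

    Hypothetical naming scheme: report.md, report_1.md, report_2.md, ...
    """
    if not target.exists():
        return target
    counter = 1
    while True:
        candidate = target.with_name(f"{target.stem}_{counter}{target.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1


def converted_dir_for(folder: Path) -> Path:
    """Map an input folder to a sibling folder named {folder}_converted.

    Assumes the output folder sits next to the input folder.
    """
    return folder.with_name(folder.name + "_converted")
```

Because the existence check happens before any write, an existing Markdown file is never overwritten; a later run simply picks the next free number.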
Quick Start
Installation
# From source (development)
git clone https://github.com/1StepMore/Any2MD.git
cd Any2MD
pip install -e .
# From PyPI
pip install Any2MD-1StepMore
Usage
Drag-and-Drop (Windows)
1. Drag a file onto bat\run.bat
2. Done! The Markdown file appears next to the original
Drag-and-Drop Folder (Windows)
1. Drag a folder onto bat\run.bat
2. Done! Markdown files appear in {folder}_converted/
Command Line
# Single file conversion
python cli.py -i document.pdf
# With quality mode (better output)
python cli.py -i document.docx --mode quality
# Batch folder processing
python cli.py -i ./documents --concurrency 4
# Verbose logging
python cli.py -i document.pdf --verbose
# Full options
python cli.py --help
Python API
from any2md import convert_to_markdown
# Basic usage
md = convert_to_markdown("document.pdf")
print(md)
# With options
md = convert_to_markdown("document.docx", mode="quality")
md = convert_to_markdown("document.pdf", mode="fast", pdf_engine="heavy")
# Error handling
from any2md import (
    convert_to_markdown,
    FileTooLargeError,
    UnsupportedFormatError,
    ConversionError,
)
try:
    md = convert_to_markdown("document.pdf")
except FileTooLargeError:
    print("File too large (max 50MB)")
except UnsupportedFormatError:
    print("Format not supported")
except ConversionError as e:
    print(f"Conversion failed: {e}")
CLI Options
| Option | Short | Description | Default |
|---|---|---|---|
| --input | -i | Input file or folder | Required |
| --output | -o | Output directory | Same as input |
| --mode | - | quality or fast | From config |
| --pdf-engine | - | light (markitdown) or heavy (MarkItDown) | From config |
| --concurrency | - | Max parallel conversions | From config |
| --config | - | Config file path | config.yaml |
| --verbose | -v | Enable debug logging | false |
Configuration
Edit config.yaml:
# Output mode: quality (best) or fast (quick)
output_mode: fast
# PDF engine: light (markitdown) or heavy (MarkItDown -> pypdfium2)
pdf_engine: light
# Max file size (MB) - files larger are skipped
max_file_size: 50
# Async concurrency level
concurrency: 4
# Retry attempts for transient failures
retry_count: 3
Tip: You can also access the config programmatically:
from wheels.config import get_config
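To make the meaning of the config keys concrete, here is a small stand-in that mirrors the defaults shown in config.yaml and validates them. It is a sketch only and is not the wheels.config module: the Settings class, the exceeds_size_limit helper, and the choice of 1 MB = 1024 * 1024 bytes are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical in-memory mirror of the keys in config.yaml."""

    output_mode: str = "fast"   # "quality" or "fast"
    pdf_engine: str = "light"   # "light" or "heavy"
    max_file_size: int = 50     # in MB
    concurrency: int = 4        # max parallel conversions
    retry_count: int = 3        # attempts for transient failures

    def __post_init__(self) -> None:
        # Reject values the README does not list as valid.
        if self.output_mode not in {"quality", "fast"}:
            raise ValueError(f"unknown output_mode: {self.output_mode!r}")
        if self.pdf_engine not in {"light", "heavy"}:
            raise ValueError(f"unknown pdf_engine: {self.pdf_engine!r}")
        if self.concurrency < 1 or self.retry_count < 1 or self.max_file_size < 1:
            raise ValueError("numeric settings must be at least 1")


def exceeds_size_limit(size_bytes: int, settings: Settings) -> bool:
    """True if a file of size_bytes is over max_file_size (assumes MB = 1024 * 1024 bytes)."""
    return size_bytes > settings.max_file_size * 1024 * 1024
```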
Architecture
Any2MD/
├── bat/
│ └── run.bat # Windows drag-and-drop entry
├── wheels/
│ ├── converters/ # Format-specific converters
│ │ ├── converter_pandoc.py # Quality: docx/pptx/xlsx
│ │ ├── converter_mammoth.py # Quality: complex DOCX
│ │ ├── converter_html.py # Quality: HTML
│ │ ├── converter_pdf.py # Dynamic: PDF (light/heavy)
│ │ ├── converter_passthrough.py # .md/.txt passthrough
│ │ └── converter_markitdown.py # Fast: csv/json/xml/yaml/epub/zip
│ ├── dispatcher.py # Format routing
│ ├── fast_lane.py # markitdown wrapper
│ ├── cleaner.py # Text post-processing
│ ├── logger.py # Logging configuration
│ └── exceptions.py # Exception hierarchy
├── any2md/ # Public package wrapper
│ └── __init__.py # Re-exports convert_to_markdown + exceptions
├── tests/ # Test suite (83 tests)
│ ├── test_api.py
│ ├── test_exceptions.py
│ ├── test_config.py
│ ├── test_dispatcher.py
│ └── test_integration.py
├── .github/workflows/ci.yml # GitHub Actions CI/CD
├── cli.py # Typer CLI entry
├── pipeline.py # Async batch orchestration
├── pyproject.toml # Package metadata (pip install -e .)
├── config.yaml # Configuration
├── requirements.txt # Dependencies
├── CHANGELOG.md # Version history
└── CONTRIBUTING.md # CI/CD release process
Supported Formats
| Format | Quality Mode | Fast Mode |
|---|---|---|
| .docx | pypandoc → mammoth fallback | markitdown |
| .pptx / .xlsx | pypandoc | markitdown |
| .html / .htm | markdownify | markitdown |
| .pdf | MarkItDown (heavy) or markitdown (light) | markitdown |
| .md / .txt | passthrough | passthrough |
| .csv / .json / .xml | - | markitdown |
| .yaml / .yml | - | markitdown |
| .epub | - | markitdown |
| .zip | - | markitdown |
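The routing in the table above can be read as a lookup on file extension and mode. The sketch below restates the table in code; it is not the real wheels/dispatcher.py. The function name pick_converter and the returned labels are hypothetical, and sending formats marked "-" to markitdown in quality mode is an assumption, since the table does not say what happens in that case.

```python
from pathlib import Path

# Extensions with a dedicated quality-mode converter, per the table.
QUALITY_ROUTES = {
    ".docx": "pypandoc",  # with mammoth as the fallback
    ".pptx": "pypandoc",
    ".xlsx": "pypandoc",
    ".html": "markdownify",
    ".htm": "markdownify",
}
PASSTHROUGH = {".md", ".txt"}
MARKITDOWN_ONLY = {".csv", ".json", ".xml", ".yaml", ".yml", ".epub", ".zip"}


def pick_converter(path: str, mode: str = "fast", pdf_engine: str = "light") -> str:
    """Return a label for the converter that would handle path (illustrative only)."""
    ext = Path(path).suffix.lower()
    if ext in PASSTHROUGH:
        return "passthrough"
    if ext == ".pdf":
        if mode == "quality":
            return "pdf-heavy" if pdf_engine == "heavy" else "pdf-light"
        return "markitdown"
    if mode == "quality" and ext in QUALITY_ROUTES:
        return QUALITY_ROUTES[ext]
    if ext in QUALITY_ROUTES or ext in MARKITDOWN_ONLY:
        # Assumption: formats without a quality converter use markitdown in both modes.
        return "markitdown"
    raise ValueError(f"unsupported format: {ext or path}")
```

Keeping the routes in plain dictionaries means adding a format is a one-line change, and the unsupported case fails loudly rather than silently producing empty output.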
PDF Engine
Light Mode: Uses markitdown with zero external dependencies.
Heavy Mode: Uses Microsoft MarkItDown for table/layout detection, falls back to pypdfium2 for raw text extraction.
Install heavy dependencies:
# Ubuntu/Debian
sudo apt install tesseract-ocr poppler-utils
# Windows (requires admin)
winget install --id UB-Mannheim.TesseractOCR -e
winget install --id oschwartz10612.Poppler -e
Error Handling
- File too large: Files larger than 50 MB are skipped and an error is logged
- Unsupported format: Clear error message with supported formats
- Missing dependencies: Install instructions in error message
- Transient failures: Automatic retry (up to 3 attempts)
- Batch continuation: Single file failure doesn't stop batch processing
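The retry and batch-continuation behaviour above, together with semaphore-limited concurrency, can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the project's pipeline.py: it uses a plain retry loop instead of tenacity, treats OSError as the "transient" case, and the names convert_with_retry and convert_batch are hypothetical. The convert argument stands in for any async function that turns a path into Markdown.

```python
import asyncio


async def convert_with_retry(convert, path, attempts=3, delay=0.0):
    """Call convert(path), retrying on OSError up to attempts times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return await convert(path)
        except OSError:
            # Assumed transient: retry unless this was the last attempt.
            if attempt == attempts:
                raise
            await asyncio.sleep(delay)


async def convert_batch(convert, paths, concurrency=4, attempts=3):
    """Convert all paths with at most concurrency running at once.

    Returns a dict mapping each path to its Markdown text or to the
    exception that conversion raised, so one failure never stops the batch.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(path):
        async with semaphore:
            return await convert_with_retry(convert, path, attempts)

    results = await asyncio.gather(
        *(worker(p) for p in paths), return_exceptions=True
    )
    return dict(zip(paths, results))
```

Passing return_exceptions=True to asyncio.gather is what lets the remaining files finish after one fails; the caller can then log every entry whose value is an exception.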
Development
# Run CLI help
python cli.py --help
# Test with sample files
python cli.py -i sample.docx --mode quality
# Check environment
python check_env.py
License
MIT License - see LICENSE file for details
Credits
Built with these excellent open-source libraries:
- pypandoc - Pandoc wrapper
- mammoth - DOCX to Markdown
- markdownify - HTML to Markdown
- markitdown - Microsoft format converter
- pypdfium2 - PDF rendering
- tenacity - Retry logic
- Typer - CLI framework