MarkThat
A Python library for converting images and PDFs to Markdown or generating rich image descriptions using state-of-the-art multimodal LLMs.
Features
- Multiple Provider Support: OpenAI, Anthropic, Google Gemini, Mistral, and OpenRouter
- Dual Mode Operation: Convert to Markdown or generate detailed descriptions
- Advanced Figure Extraction: Automatically detect, extract, and process figures from PDFs
- Robust Retry Logic: Intelligent retry with fallback models and failure feedback
- Async Support: Concurrent processing for improved performance
- Clean Architecture: Type-safe, well-documented, and thoroughly tested
- Easy Integration: Simple API with comprehensive configuration options
Installation
Option 1: Install from PyPI
pip install markthat
Option 2: Development Installation
git clone https://github.com/Flopsky/markthat.git
cd markthat
pip install -e .
pre-commit install
Quick Start
Basic Usage
from markthat import MarkThat

# Initialize with your preferred model
converter = MarkThat(
    model="gemini-2.0-flash-001",
    provider="gemini",
    api_key="YOUR_API_KEY",
)

# Convert image to markdown (the result is a list with one string per page)
result = converter.convert("path/to/image.jpg")
print(result[0])

# Generate image description
description = converter.convert(
    "path/to/image.jpg",
    description_mode=True,
)
print(description[0])
Updated Examples from examples/basic_usage.py
import asyncio
import os

from dotenv import load_dotenv
from markthat import MarkThat

# Load GEMINI_API_KEY (and any other keys) from a local .env file
load_dotenv()


def test_markthat_with_figure_extraction():
    """Test MarkThat with advanced figure extraction capabilities."""
    try:
        client = MarkThat(
            provider="gemini",
            model="gemini-2.0-flash-001",
            api_key=os.getenv("GEMINI_API_KEY"),
            api_key_figure_detector=os.getenv("GEMINI_API_KEY"),
            api_key_figure_extractor=os.getenv("GEMINI_API_KEY"),
            api_key_figure_parser=os.getenv("GEMINI_API_KEY"),
        )
        result = asyncio.run(
            client.async_convert(
                "path/to/document.pdf",
                extract_figure=True,
                coordinate_model="gemini-2.0-flash-001",
                parsing_model="gemini-2.5-flash-lite",
            )
        )
        return result
    except Exception as e:
        print("Figure extraction failed:", e)
        return None


def test_markthat_without_figure_extraction():
    """Test standard MarkThat conversion without figure extraction."""
    try:
        client = MarkThat(
            provider="gemini",
            model="gemini-2.0-flash-001",
            api_key=os.getenv("GEMINI_API_KEY"),
        )
        result = asyncio.run(
            client.async_convert(
                "path/to/document.pdf",
                extract_figure=False,
            )
        )
        return result
    except Exception as e:
        print("Standard conversion failed:", e)
        return None


if __name__ == "__main__":
    # Test both approaches
    with_figures = test_markthat_with_figure_extraction()
    without_figures = test_markthat_without_figure_extraction()
    print("With figure extraction:", with_figures)
    print("Without figure extraction:", without_figures)
Gradio UI (Visual App)
Quickly try MarkThat in your browser.
pip install -r requirements.txt # ensures gradio is installed
python gradio_ui.py
Then open http://localhost:7861 in your browser.
- Supports multiple providers with per-step model overrides
- Lets you pass provider-specific API keys (auto-fills from env when available)
- Exports results as Markdown or JSON with detected figure paths
Advanced Configuration
Provider-Specific Setup
from markthat import MarkThat, RetryPolicy

# Custom retry policy
retry_policy = RetryPolicy(
    max_attempts=5,
    timeout_seconds=30,
    backoff_factor=1.5,
)

# Multi-provider setup with fallbacks
converter = MarkThat(
    model="gpt-4o",
    provider="openai",
    fallback_models=["claude-3-5-sonnet-20241022", "gemini-2.0-flash-001"],
    retry_policy=retry_policy,
    api_key="YOUR_OPENAI_KEY",
)
OpenRouter Integration
from markthat import MarkThat

# Access 300+ models through OpenRouter
converter = MarkThat(
    model="anthropic/claude-3.5-sonnet",
    provider="openrouter",
    api_key="YOUR_OPENROUTER_KEY",
)

# Or use model path auto-detection
converter = MarkThat(
    model="openai/gpt-4o",  # Automatically uses OpenRouter
    api_key="YOUR_OPENROUTER_KEY",
)
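The auto-detection above can be pictured with a small heuristic. The following is only a sketch of one plausible rule set, not MarkThat's actual logic: `guess_provider` and the prefix table are hypothetical names introduced for illustration, and the library's real detection rules may differ.

```python
from typing import Optional

# Hypothetical prefix table; the library's real detection rules may differ.
_PREFIX_TO_PROVIDER = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "gemini",
    "mistral-": "mistral",
}


def guess_provider(model: str, provider: Optional[str] = None) -> str:
    """Guess a provider name from a model identifier (illustrative only)."""
    if provider:  # an explicit provider always wins
        return provider
    if "/" in model:  # "vendor/model" paths follow the OpenRouter naming style
        return "openrouter"
    for prefix, name in _PREFIX_TO_PROVIDER.items():
        if model.startswith(prefix):
            return name
    raise ValueError(f"Cannot infer a provider for model {model!r}")


print(guess_provider("openai/gpt-4o"))         # openrouter
print(guess_provider("gemini-2.0-flash-001"))  # gemini
```

Passing `provider=` explicitly, as in the first OpenRouter example, sidesteps any guessing.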
Figure Extraction Pipeline
MarkThat includes a sophisticated figure extraction system for PDFs:
import asyncio

from markthat import MarkThat

converter = MarkThat(
    model="gemini-2.0-flash-001",
    api_key_figure_detector="DETECTOR_KEY",
    api_key_figure_extractor="EXTRACTOR_KEY",
    api_key_figure_parser="PARSER_KEY",
)


async def main():
    # await is only valid inside a coroutine
    return await converter.async_convert(
        "research_paper.pdf",
        extract_figure=True,
        figure_detector_model="gemini-2.0-flash",
        coordinate_model="gemini-2.0-flash-001",
        parsing_model="gemini-2.5-flash-lite",
    )


results = asyncio.run(main())
How Figure Extraction Works
- Detection: Analyzes document content to identify pages with figures
- Coordinate Mapping: Overlays coordinate grids and identifies figure boundaries
- Extraction: Crops figures using precise coordinate mapping
- Integration: Embeds figure paths into the final markdown output
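To make the coordinate-mapping and extraction steps more concrete, here is a minimal sketch of turning a model-reported bounding box into a pixel crop box. It assumes the box arrives as normalized `(x0, y0, x1, y1)` fractions of the page size; that format, the `to_pixel_box` helper, and the `padding` parameter are assumptions made for illustration and are not taken from MarkThat's source.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def to_pixel_box(
    norm_box: Tuple[float, float, float, float],
    width: int,
    height: int,
    padding: int = 0,
) -> Box:
    """Scale a normalized box to pixels, pad it, and clamp it to the page."""
    x0, y0, x1, y1 = norm_box
    left, right = sorted((x0, x1))  # tolerate swapped corners
    top, bottom = sorted((y0, y1))

    def clamp(value: int, upper: int) -> int:
        return max(0, min(upper, value))

    return (
        clamp(round(left * width) - padding, width),
        clamp(round(top * height) - padding, height),
        clamp(round(right * width) + padding, width),
        clamp(round(bottom * height) + padding, height),
    )


print(to_pixel_box((0.1, 0.25, 0.9, 0.6), width=1000, height=2000, padding=10))
# (90, 490, 910, 1210)
```

A tuple in this `(left, top, right, bottom)` order is what Pillow's `Image.crop` accepts, so a box like this could be handed to it directly if Pillow is used for the cropping step.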
Async Processing
For optimal performance with multi-page documents:
import asyncio

from markthat import MarkThat


async def process_document():
    converter = MarkThat(model="gemini-2.0-flash-001")
    # Process pages concurrently
    results = await converter.async_convert("large_document.pdf")
    for i, page_content in enumerate(results):
        print(f"Page {i+1}: {len(page_content)} characters")


asyncio.run(process_document())
Environment Variables
# Primary providers (used automatically if constructor api_key is not provided)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
export GEMINI_API_KEY="your_google_key"
export MISTRAL_API_KEY="your_mistral_key"
# Unified access via OpenRouter
export OPENROUTER_API_KEY="your_openrouter_key"
Note: For figure extraction you can pass separate keys via the constructor
parameters api_key_figure_detector, api_key_figure_extractor, and
api_key_figure_parser. If omitted, they default to the main api_key.
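The fallback order described in the note can be sketched as follows. This is an illustration of the documented behavior only: `resolve_api_key` and `resolve_figure_keys` are hypothetical helpers, not functions exported by MarkThat, and the library may resolve keys differently internally.

```python
import os
from typing import Dict, Mapping, Optional

# Environment variable names as listed above.
PROVIDER_ENV_VARS: Dict[str, str] = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def resolve_api_key(
    provider: str,
    explicit_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer the constructor key, else fall back to the provider's env var."""
    if explicit_key:
        return explicit_key
    source = os.environ if env is None else env
    return source.get(PROVIDER_ENV_VARS[provider])


def resolve_figure_keys(
    main_key: Optional[str],
    detector: Optional[str] = None,
    extractor: Optional[str] = None,
    parser: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Each figure-step key defaults to the main key when omitted."""
    return {
        "detector": detector or main_key,
        "extractor": extractor or main_key,
        "parser": parser or main_key,
    }


fake_env = {"GEMINI_API_KEY": "env-key"}  # stand-in for os.environ
main_key = resolve_api_key("gemini", env=fake_env)
print(resolve_figure_keys(main_key, parser="parser-key"))
# {'detector': 'env-key', 'extractor': 'env-key', 'parser': 'parser-key'}
```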
Testing
# Run the test suite
pytest
# Run with coverage
pytest --cov=markthat
# Run a specific test file
pytest tests/test_validation.py
Project Structure
markthat/
├── markthat/
│   ├── __init__.py              # Public API
│   ├── client.py                # Main MarkThat class
│   ├── providers.py             # LLM provider abstractions
│   ├── file_processor.py        # PDF/image loading
│   ├── image_processing.py      # Image manipulation
│   ├── figure_extraction.py     # Figure detection & extraction
│   ├── prompts/                 # Prompt templates & utilities
│   ├── utils/                   # Validation & helpers
│   ├── exceptions.py            # Custom exceptions
│   └── logging_config.py        # Logging setup
├── gradio_ui.py                 # Visual demo app
├── tests/                       # Test suite
├── examples/                    # Usage examples
├── pyproject.toml               # Project metadata
└── README.md                    # This file
Development
Code Quality
This project uses modern Python development practices:
- Type Hints: Full type annotations with mypy validation
- Code Formatting: Black for consistent code style
- Linting: Ruff for fast, comprehensive linting
- Import Sorting: isort for organized imports
- Pre-commit Hooks: Automated quality checks
Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes with proper tests
- Run quality checks: pre-commit run --all-files
- Submit a pull request
Development Setup
# Install development dependencies
pip install -e ".[dev]"  # quoted so shells like zsh do not expand the brackets
# Set up pre-commit hooks
pre-commit install
# Run quality checks
black .
ruff check .
isort .
mypy markthat
API Reference
MarkThat Class
class MarkThat:
    def __init__(
        self,
        model: str,
        *,
        provider: Optional[str] = None,
        fallback_models: Optional[Sequence[str]] = None,
        retry_policy: Optional[RetryPolicy] = None,
        api_key: Optional[str] = None,
        api_key_figure_detector: Optional[str] = None,
        api_key_figure_extractor: Optional[str] = None,
        api_key_figure_parser: Optional[str] = None,
        max_retry: int = 3,
    ) -> None: ...

    def convert(
        self,
        file_path: str,
        *,
        format_options: Optional[Dict[str, Any]] = None,
        additional_instructions: Optional[str] = None,
        description_mode: bool = False,
        extract_figure: bool = False,
        figure_detector_model: str = "gemini-2.0-flash",
        coordinate_model: str = "gemini-2.0-flash",
        parsing_model: str = "gemini-2.5-flash-lite",
        max_retry: Optional[int] = None,
        clean_output: bool = True,
    ) -> List[str]: ...

    async def async_convert(
        self,
        file_path: str,
        *,
        format_options: Optional[Dict[str, Any]] = None,
        additional_instructions: Optional[str] = None,
        description_mode: bool = False,
        extract_figure: bool = False,
        figure_detector_model: str = "gemini-2.0-flash",
        coordinate_model: str = "gemini-2.0-flash",
        parsing_model: str = "gemini-2.5-flash-lite",
        max_retry: Optional[int] = None,
        clean_output: bool = True,
    ) -> List[str]: ...
RetryPolicy Configuration
@dataclass
class RetryPolicy:
    max_attempts: int = 3
    timeout_seconds: int = 30
    backoff_factor: float = 1.0
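This README does not spell out how `backoff_factor` is applied between attempts. As a rough sketch, assuming a conventional exponential schedule (delay = base delay × factor^n), the waits could be computed as below. The `backoff_delays` helper and its `base_delay` parameter are hypothetical, and the `RetryPolicy` here is a local stand-in that only mirrors the fields shown above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetryPolicy:  # local stand-in mirroring the documented fields
    max_attempts: int = 3
    timeout_seconds: int = 30
    backoff_factor: float = 1.0


def backoff_delays(policy: RetryPolicy, base_delay: float = 1.0) -> List[float]:
    """Waits (in seconds) before attempts 2..max_attempts, assuming
    an exponential schedule; the library's real formula may differ."""
    return [
        base_delay * policy.backoff_factor ** n
        for n in range(policy.max_attempts - 1)
    ]


print(backoff_delays(RetryPolicy(max_attempts=5, backoff_factor=1.5)))
# [1.0, 1.5, 2.25, 3.375]
```

Under this assumption the default `backoff_factor=1.0` gives a constant wait between attempts, while values above 1.0 spread retries further apart.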
Supported Models
Direct Provider Access
- OpenAI: gpt-4o, gpt-4-turbo, gpt-4o-mini
- Anthropic: claude-3-5-sonnet-20241022, claude-3-opus, claude-3-haiku
- Google: gemini-2.0-flash-001, gemini-1.5-pro, gemini-1.5-flash
- Mistral: mistral-large-latest, mistral-medium, mistral-small
OpenRouter Models (300+)
- Meta: meta-llama/llama-3.2-90b-vision
- Qwen: qwen/qwen-2-vl-72b-instruct
- Many more: Access the full catalog at OpenRouter
Error Handling
MarkThat raises dedicated exceptions for provider setup and conversion failures:
from markthat import MarkThat
from markthat.exceptions import ProviderInitializationError, ConversionError

try:
    converter = MarkThat(model="invalid-model")
except ProviderInitializationError as e:
    print(f"Provider setup failed: {e}")
else:
    # Only attempt a conversion if the converter was created successfully
    try:
        result = converter.convert("image.jpg")
    except ConversionError as e:
        print(f"Conversion failed: {e}")
Performance Tips
- Use Async for Multiple Pages: async_convert() processes pages concurrently
- Configure Appropriate Timeouts: Balance speed vs. reliability
- Choose the Right Model: Faster models for simple tasks, powerful models for complex content
- Leverage Fallbacks: Set up model hierarchies for reliability
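When several documents need converting, the same asynchronous approach can be applied across files. The sketch below bounds concurrency with a semaphore to stay within provider rate limits. `convert_many` and `fake_convert` are hypothetical names; `fake_convert` stands in for a real call such as `converter.async_convert` so the example runs without an API key.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Sequence, Tuple


async def convert_many(
    paths: Sequence[str],
    convert_one: Callable[[str], Awaitable[List[str]]],
    max_concurrency: int = 3,
) -> Dict[str, List[str]]:
    """Convert several files concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> Tuple[str, List[str]]:
        async with semaphore:  # wait here if too many conversions are running
            return path, await convert_one(path)

    pairs = await asyncio.gather(*(guarded(p) for p in paths))
    return dict(pairs)


async def fake_convert(path: str) -> List[str]:
    """Stand-in for converter.async_convert; returns one fake page."""
    await asyncio.sleep(0.01)
    return [f"# Markdown for {path}"]


results = asyncio.run(convert_many(["a.pdf", "b.pdf"], fake_convert))
print(results)
```

With a real converter, `convert_one` could simply be `converter.async_convert`, since it takes a file path and returns a list of page strings.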
Roadmap
- Done: Multi-provider LLM support
- Done: PDF processing with figure extraction
- Done: Async processing capabilities
- Done: Comprehensive retry logic
- Done: Type-safe, clean architecture
- Planned: Additional file format support (TIFF, WEBP)
- Planned: Cost tracking and optimization
- Planned: Batch processing API
- Planned: Custom prompt template system
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with modern Python best practices
- Leverages state-of-the-art multimodal LLMs
- Inspired by the need for robust document processing tools
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See docs/ for Sphinx sources
MarkThat - Transform visual content into structured text with the power of AI