PDF and image to markdown converter using Mistral AI OCR

Project description

markit-mistral

A PDF and image to markdown converter that uses Mistral AI OCR and preserves mathematical equations as LaTeX.

Features

  • Convert PDF documents and images to clean markdown
  • OCR using Mistral AI for high-accuracy text extraction
  • Preserve mathematical equations in LaTeX format
  • Extract and manage images alongside markdown output
  • Support for complex documents with tables, figures, and formulas
  • Command-line interface similar to markitdown
  • Web interface for browser-based processing
  • Batch processing capabilities
  • Configurable output formats

Usage Options

Web Interface (Browser-based)

For a user-friendly, browser-based experience:

  1. Open docs/index.html in your browser
  2. Enter your Mistral API key
  3. Drag and drop files or click to upload
  4. Download the generated markdown and images

Features:

  • No installation required - runs entirely in your browser
  • Privacy-focused - files go directly from your browser to the Mistral API for OCR, with no intermediate server
  • Real-time progress tracking
  • Responsive design for mobile and desktop

See the Web Interface README for detailed instructions.

Command Line Interface

For automated workflows and integration:

Installation

Prerequisites

  • Python 3.10 or higher
  • Mistral AI API key

Install from PyPI

pip install markit-mistral

Install from Source

git clone https://github.com/yahya/markit-mistral.git
cd markit-mistral
pip install -e .

Development Installation

git clone https://github.com/yahya/markit-mistral.git
cd markit-mistral
pip install -e ".[dev]"

Quick Start

API Key Setup

Set your Mistral AI API key as an environment variable:

export MISTRAL_API_KEY="your-api-key-here"

Or pass it directly via command line:

markit-mistral document.pdf --api-key your-api-key-here
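
The two ways of supplying the key can be mirrored in a script that wraps the tool. The following is a minimal sketch, not the tool's own lookup logic: `resolve_api_key` is a hypothetical helper, and the precedence (explicit value first, then `MISTRAL_API_KEY`) is an assumption.

```python
import os


def resolve_api_key(cli_value=None, env=None):
    """Return an API key, preferring an explicit value over the environment.

    Hypothetical helper: the precedence shown here is an assumption,
    not documented behaviour of markit-mistral.
    """
    env = os.environ if env is None else env
    key = cli_value or env.get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError(
            "No Mistral API key found: set MISTRAL_API_KEY or pass --api-key"
        )
    return key
```

Failing early with a clear message avoids starting a long conversion that can only end in an authentication error.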

Basic Usage

Convert a PDF to markdown:

markit-mistral document.pdf

Convert with output file:

markit-mistral document.pdf -o output.md

Convert an image:

markit-mistral image.png -o result.md

Extract images alongside markdown:

markit-mistral document.pdf --extract-images -o output.md

Process from stdin:

cat document.pdf | markit-mistral > output.md

Command Line Options

usage: markit-mistral [-h] [-v] [-o OUTPUT] [--api-key API_KEY]
                      [--extract-images] [--base64-images] 
                      [--preserve-math] [--verbose] [--quiet]
                      [input]

PDF and image to markdown converter using Mistral AI OCR

positional arguments:
  input                 input file (PDF or image). If not provided, reads from stdin

options:
  -h, --help            show this help message and exit
  -v, --version         show version and exit
  -o OUTPUT, --output OUTPUT
                        output file path. If not provided, outputs to stdout
  --api-key API_KEY     Mistral API key (can also be set via MISTRAL_API_KEY environment variable)
  --extract-images      extract images to separate files alongside markdown output
  --base64-images       embed images as base64 in markdown instead of separate files
  --preserve-math       preserve mathematical equations in LaTeX format (default: True)
  --verbose             enable verbose output
  --quiet               suppress all output except errors
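
For automated workflows, the options above can be assembled programmatically. This is a rough sketch that uses only flags listed in the help text. `build_command` and `run_conversion` are hypothetical names, and it assumes the CLI exits with a non-zero status on failure.

```python
import subprocess


def build_command(input_path, output_path=None, extract_images=False,
                  base64_images=False, verbose=False):
    """Build an argv list for markit-mistral from the documented flags."""
    cmd = ["markit-mistral", str(input_path)]
    if output_path is not None:
        cmd += ["-o", str(output_path)]
    if extract_images:
        cmd.append("--extract-images")
    if base64_images:
        cmd.append("--base64-images")
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_conversion(input_path, **options):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(input_path, **options), check=True)
```

Passing the arguments as a list rather than a shell string keeps file names with spaces intact.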

Mathematical Equation Support

markit-mistral automatically detects and preserves mathematical equations in multiple formats:

  • Inline math: $E = mc^2$ or \(E = mc^2\)
  • Display math: $$\frac{-b \pm \sqrt{b^2-4ac}}{2a}$$ or \[\frac{-b \pm \sqrt{b^2-4ac}}{2a}\]
  • Complex equations: Multi-line equations, matrices, chemical formulas
  • Mixed content: Documents with both text and mathematical content
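
To check which equations survived a conversion, the four delimiter styles listed above can be located in the generated markdown with a regular expression. This is an illustrative sketch for post-processing output, not how markit-mistral detects math internally. `find_math` is a hypothetical helper, and a plain `$...$` pattern will also match currency amounts such as "$5 and $10".

```python
import re

# Alternatives are tried in order, so "$$...$$" is matched before "$...$".
MATH_PATTERN = re.compile(
    r"\$\$(?P<display_dollar>.+?)\$\$"
    r"|\\\[(?P<display_bracket>.+?)\\\]"
    r"|\\\((?P<inline_paren>.+?)\\\)"
    r"|\$(?P<inline_dollar>[^$\n]+)\$",
    re.DOTALL,
)


def find_math(markdown):
    """Return (kind, latex) pairs, where kind is 'inline' or 'display'."""
    spans = []
    for match in MATH_PATTERN.finditer(markdown):
        group_name = match.lastgroup  # the one alternative that matched
        kind = group_name.split("_")[0]
        spans.append((kind, match.group(group_name).strip()))
    return spans
```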

Image Management

When processing documents with images, markit-mistral can:

  • Extract images to separate files in the same directory
  • Generate relative links in markdown
  • Maintain original image quality and format
  • Support base64 embedding for standalone markdown files

Example output structure:

output.md
output_images/
├── image_001.png
├── image_002.jpg
└── figure_003.png
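
The layout above suggests that the image directory is named after the markdown file. A small sketch of that naming and of the relative links follows. The `<stem>_images` rule is inferred from the example only, and both helper names are hypothetical.

```python
from pathlib import Path


def image_dir_for(output_path):
    """Return the sibling directory '<stem>_images' for a markdown file.

    The naming rule is an assumption inferred from the example layout.
    """
    output_path = Path(output_path)
    return output_path.with_name(output_path.stem + "_images")


def relative_image_link(output_path, image_name, alt_text=""):
    """Build a markdown image link relative to the markdown file's folder."""
    image_path = image_dir_for(output_path) / image_name
    relative = image_path.relative_to(Path(output_path).parent)
    return f"![{alt_text}]({relative.as_posix()})"
```

Relative links keep the markdown file and its image directory portable when they are moved together.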

Python API

from markit_mistral import MarkItMistral

# Initialize converter
converter = MarkItMistral(api_key="your-api-key")

# Convert a file
markdown_content = converter.convert_file("document.pdf")

# Convert with custom options
markdown_content = converter.convert_file(
    "document.pdf",
    extract_images=True,
    preserve_math=True
)
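
The same call can be applied to a whole folder. The sketch below relies only on `convert_file` as shown above and assumes it returns the markdown as a string. `convert_directory` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def convert_directory(converter, source_dir, output_dir, pattern="*.pdf"):
    """Convert every matching file and write one .md file per input.

    `converter` is any object with a convert_file(path) method that
    returns markdown text (assumed from the example above).
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(source_dir).glob(pattern)):
        markdown = converter.convert_file(str(source))
        target = output_dir / (source.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

A call might look like `convert_directory(MarkItMistral(api_key="your-api-key"), "papers", "markdown")`.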

Configuration

You can configure markit-mistral through:

  1. Environment variables
  2. Command line arguments
  3. Configuration file (coming soon)

Environment Variables

  • MISTRAL_API_KEY: Your Mistral AI API key
  • MARKIT_MISTRAL_CONFIG: Path to configuration file

Supported File Formats

Input Formats

  • PDF documents
  • PNG images
  • JPEG images
  • TIFF images
  • BMP images
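
A script that feeds files to the converter can filter on these formats first. This is a minimal sketch. The mapping from file extensions to formats is an assumption, since the list above names formats rather than extensions, and `input_format` is a hypothetical helper.

```python
from pathlib import Path

# Assumed extension mapping for the formats listed above.
SUPPORTED_SUFFIXES = {
    ".pdf": "PDF",
    ".png": "PNG",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".tif": "TIFF",
    ".tiff": "TIFF",
    ".bmp": "BMP",
}


def input_format(path):
    """Return the format name for a supported file, or None otherwise."""
    return SUPPORTED_SUFFIXES.get(Path(path).suffix.lower())
```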

Output Formats

  • Markdown with LaTeX math
  • Markdown with extracted images
  • Markdown with base64 embedded images
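
Base64 embedding means each image becomes a data URI inside the markdown link, so the file needs no companion directory. The sketch below shows the general technique with the standard library. It is not taken from the markit-mistral source, and both function names are hypothetical.

```python
import base64
import mimetypes


def image_to_data_uri(image_bytes, filename):
    """Encode raw image bytes as a data URI, guessing the MIME type by name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def embed_image(image_bytes, filename, alt_text=""):
    """Return a markdown image tag with the image embedded inline."""
    return f"![{alt_text}]({image_to_data_uri(image_bytes, filename)})"
```

Base64 output is about a third larger than the raw bytes, so separate image files may be preferable for image-heavy documents.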

Performance

markit-mistral is optimized for:

  • Large document processing
  • Batch operations
  • Memory-efficient streaming
  • API rate limit handling
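
Callers who run many conversions may still want their own retry policy around API calls. A common pattern is exponential backoff, sketched here under stated assumptions: `call_with_backoff` is a hypothetical helper, it is not the library's internal rate limit handling, and the exception type that signals a rate limit is left to the caller.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(), retrying with delays of base_delay * 2**attempt seconds.

    The last failure is re-raised once max_attempts has been reached.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example, `call_with_backoff(lambda: converter.convert_file("document.pdf"))` would retry a failed conversion after 1, 2, 4, and 8 seconds.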

Contributing

We welcome contributions! Please see our contributing guidelines for details.

Development Setup

  1. Clone the repository
  2. Install development dependencies: pip install -e ".[dev]"
  3. Run tests: pytest
  4. Format code: black src tests
  5. Check types: mypy src

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on Mistral AI's OCR capabilities
  • Inspired by the markitdown project
  • Thanks to the open source community for various dependencies

Support

  • Report issues on GitHub Issues
  • Check the documentation for common problems
  • Join our community discussions

Roadmap

  • Plugin system for custom processors
  • Multiple output format support
  • Advanced table recognition
  • Batch processing UI
  • Cloud deployment options
  • Integration with popular document workflows

Note: This project is in active development. Some features may be experimental.

Download files

Source Distribution

markit_mistral-0.2.0.tar.gz (3.6 MB)

Uploaded Source

Built Distribution

markit_mistral-0.2.0-py3-none-any.whl (27.7 kB)

Uploaded Python 3

File details

Details for the file markit_mistral-0.2.0.tar.gz.

File metadata

  • Download URL: markit_mistral-0.2.0.tar.gz
  • Size: 3.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markit_mistral-0.2.0.tar.gz
Algorithm Hash digest
SHA256 dab48099b1e5679bc582e45271431193004e9b83fd98d458470e0296f2808de9
MD5 04cb1951cd3f1bf3683b57acbe3a8816
BLAKE2b-256 b4f6544da85d4561b505718ab8ea6892cd436ff19350c1bc2b14fb4867d7c66f
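
A downloaded archive can be checked against the SHA256 digest listed above using the standard library. This is a minimal sketch. The helper names are hypothetical, and the file is read in chunks so that large archives need not fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("markit_mistral-0.2.0.tar.gz", "dab48099b1e5679bc582e45271431193004e9b83fd98d458470e0296f2808de9")` should return True for an intact download.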

Provenance

The following attestation bundles were made for markit_mistral-0.2.0.tar.gz:

Publisher: publish.yml on neuromechanist/markit-mistral

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file markit_mistral-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: markit_mistral-0.2.0-py3-none-any.whl
  • Size: 27.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markit_mistral-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ec0bc88b315defede00e9035a0cf222e7803712894aac778b021af9928aee432
MD5 53d8f3c7701a6e283322e95d3d88a391
BLAKE2b-256 234a9edff81ff97da6d4b8112b9a6051b73087a2ccb8486c7a2256aaa9512be2

Provenance

The following attestation bundles were made for markit_mistral-0.2.0-py3-none-any.whl:

Publisher: publish.yml on neuromechanist/markit-mistral
