
A CLI tool to convert PDF files to Markdown using the Mistral AI OCR API.


Mistral PDF to Markdown Converter


A simple command-line tool that converts PDF and EPUB files to Markdown using the Mistral AI OCR API. It also extracts embedded images and saves them in a subdirectory next to the output Markdown file.

Installation

You can install the package directly from PyPI using pip:

pip install mistral-pdf-to-markdown

Global Installation (Recommended for CLI Usage)

If you want to use the pdf2md command from anywhere on your system without activating a specific virtual environment, the recommended way is to use pipx:

  1. Install pipx (if you don't have it already). Follow the official pipx installation guide. A common method is:

    python3 -m pip install --user pipx
    python3 -m pipx ensurepath
    

    (Restart your terminal after running ensurepath)

  2. Install the package using pipx:

    pipx install mistral-pdf-to-markdown
    

This installs the package in an isolated environment but makes the pdf2md command globally available.

Installation from Source

Alternatively, if you want to install from the source:

  1. Clone the repository:

    git clone https://github.com/arcangelo7/mistral-pdf-to-markdown.git
    cd mistral-pdf-to-markdown
    
  2. Install dependencies using Poetry:

    poetry install
    

Additional Requirements for EPUB Support

To convert EPUB files, you need to install pandoc. See the official installation guide for your operating system.
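
Because EPUB conversion depends on an external pandoc binary, it can help to confirm that pandoc is on the PATH before starting a batch. The sketch below is not part of the tool: `pandoc_available` and `needs_pandoc` are hypothetical helper names, and it assumes EPUB input is recognized by its .epub extension.

```python
import shutil
from pathlib import Path


def pandoc_available() -> bool:
    """Return True if a `pandoc` executable can be found on the PATH."""
    return shutil.which("pandoc") is not None


def needs_pandoc(path: str) -> bool:
    """Assumption: EPUB input is identified by its .epub extension."""
    return Path(path).suffix.lower() == ".epub"


if __name__ == "__main__":
    for name in ["book.epub", "report.pdf"]:
        if needs_pandoc(name) and not pandoc_available():
            print(f"{name}: install pandoc before converting")
        else:
            print(f"{name}: ready to convert")
```

A check like this fails fast with a clear message instead of partway through a long directory conversion.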

Usage

  1. Set your Mistral API Key: You can set your API key as an environment variable:

    export MISTRAL_API_KEY='your_api_key_here'
    

    Alternatively, you can create a .env file in the project root directory with the following content:

    MISTRAL_API_KEY=your_api_key_here
    

    You can also pass the API key directly using the --api-key option.

  2. Run the conversion:

    Convert a Single PDF or EPUB File

    The convert command processes a single PDF or EPUB file.

    poetry run pdf2md convert <path/to/your/document.pdf> [options]
    

    Or, if the pdf2md command is already on your PATH (installed with pip or pipx, or inside an activated Poetry virtual environment):

    pdf2md convert <path/to/your/document.pdf> [options]
    

    Options for Single File Conversion:

    • --output or -o: Specify the path for the output Markdown file. If not provided, it defaults to the same name as the input file but with a .md extension (e.g., document.md).
    • --api-key: Provide the Mistral API key directly.

    Convert Multiple PDF and EPUB Files from a Directory

    The convert-dir command processes all PDF and EPUB files in a specified directory.

    poetry run pdf2md convert-dir <path/to/directory/with/files> [options]
    

    Or, if the pdf2md command is already on your PATH (installed with pip or pipx, or inside an activated Poetry virtual environment):

    pdf2md convert-dir <path/to/directory/with/files> [options]
    

    Options for Directory Conversion:

    • --output-dir or -o: Specify the directory where output Markdown files will be saved. If not provided, it defaults to the same directory as the input files.
    • --api-key: Provide the Mistral API key directly.
    • --max-workers or -w: Maximum number of concurrent conversions (default: 2). Increase this value to convert more files in parallel.
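
The defaults described above can be pictured with a short Python sketch. It is not the tool's actual implementation: `default_output`, `plan_directory`, `convert_all`, and the `convert_one` stand-in are hypothetical names, and matching inputs by .pdf and .epub extension is an assumption. It shows how an output path falls back to the input name with a .md extension, and how a worker limit of 2 bounds concurrency.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, Optional

SUPPORTED = {".pdf", ".epub"}  # assumption: inputs are matched by extension


def default_output(input_path: Path, output_dir: Optional[Path] = None) -> Path:
    """Same name as the input with a .md extension, in output_dir if given."""
    target_dir = output_dir if output_dir is not None else input_path.parent
    return target_dir / (input_path.stem + ".md")


def plan_directory(files: Iterable[Path], output_dir: Optional[Path] = None):
    """Pair each supported input with its output path, in sorted order."""
    inputs = sorted(p for p in files if p.suffix.lower() in SUPPORTED)
    return [(p, default_output(p, output_dir)) for p in inputs]


def convert_all(plan, convert_one: Callable[[Path, Path], None], max_workers: int = 2):
    """Run convert_one over the plan with at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(convert_one, src, dst) for src, dst in plan]
        for future in futures:
            future.result()  # re-raise any conversion error


if __name__ == "__main__":
    files = [Path("docs/a.pdf"), Path("docs/b.epub"), Path("docs/notes.txt")]
    plan = plan_directory(files, Path("markdown_output"))
    # Stand-in for the real conversion, which would call the OCR API.
    convert_all(plan, lambda src, dst: print(f"{src} -> {dst}"), max_workers=2)
```

A thread pool suits this kind of work because each conversion mostly waits on a network call, so a higher worker count helps only as far as the API's rate limits allow.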

Image Handling:

The script will attempt to extract images embedded in the document.

  • Images are saved in a subdirectory named <output_filename_stem>_images (e.g., if the output is report.md, images will be in report_images/).
  • The generated Markdown file will contain relative links pointing to the images in this subdirectory.
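
The naming rule above can be expressed in a few lines of Python. This is a sketch of the convention as described, not the tool's code: `images_dir_for` and `image_link` are hypothetical helpers, and the image file name `img-0.jpeg` is only a placeholder.

```python
from pathlib import Path


def images_dir_for(output_md: Path) -> Path:
    """Images live beside the Markdown file, in <stem>_images/."""
    return output_md.parent / f"{output_md.stem}_images"


def image_link(output_md: Path, image_name: str) -> str:
    """Build a Markdown image link relative to the Markdown file."""
    relative = images_dir_for(output_md).relative_to(output_md.parent)
    return f"![{image_name}]({relative.as_posix()}/{image_name})"


if __name__ == "__main__":
    md = Path("output/report.md")
    print(images_dir_for(md))
    print(image_link(md, "img-0.jpeg"))
```

Because the links are relative, the Markdown file and its image directory can be moved together without breaking references.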

Examples:

# Convert a single PDF file (output: ./my_report.md)
poetry run pdf2md convert ./my_report.pdf

# Convert with custom output path
poetry run pdf2md convert ./my_report.pdf -o ./output/report.md

# Convert all files in a directory with 4 concurrent workers
poetry run pdf2md convert-dir ./documents/ -o ./markdown_output/ -w 4

An example output generated from example.pdf (included in the repository) can be found in example.md, with its corresponding images located in the example_images/ directory.

License

This project is licensed under the ISC License - see the LICENSE file for details.
