A CLI tool to convert PDF files to Markdown using the Mistral AI OCR API.

Mistral PDF to Markdown Converter

A simple command-line tool to convert PDF files into Markdown format using the Mistral AI OCR API. This tool also extracts embedded images and saves them in a subdirectory relative to the output markdown file.

Installation

You can install the package directly from PyPI using pip:

```
pip install mistral-pdf-to-markdown
```

Global Installation (Recommended for CLI Usage)

If you want to use the pdf2md command from anywhere in your system without activating a specific virtual environment, the recommended way is to use pipx:

Install pipx (if you don't have it already). Follow the official pipx installation guide. A common method is:
```
python3 -m pip install --user pipx
python3 -m pipx ensurepath
```
(Restart your terminal after running ensurepath so the updated PATH takes effect.)

Install the package using pipx:
```
pipx install mistral-pdf-to-markdown
```

This installs the package in an isolated environment but makes the pdf2md command globally available.

Installation from Source

Alternatively, if you want to install from the source:

Clone the repository:

```
git clone https://github.com/arcangelo7/mistral-pdf-to-markdown.git
cd mistral-pdf-to-markdown
```

Install dependencies using Poetry:
```
poetry install
```

Usage

Set your Mistral API key. You can set it as an environment variable:
```
export MISTRAL_API_KEY='your_api_key_here'
```
Alternatively, you can create a .env file in the project root directory with the following content:
```
MISTRAL_API_KEY=your_api_key_here
```
You can also pass the API key directly using the --api-key option.
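
The three ways of supplying the key can be pictured as a lookup chain. The sketch below is illustrative only: the function names `read_dotenv` and `resolve_api_key` are hypothetical, and the precedence order (command-line value, then environment variable, then .env file) is an assumption, not something this README states about the tool's internals.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_dotenv(path: Path) -> dict:
    """Parse simple KEY=value lines from a .env file (no interpolation)."""
    values = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    dotenv_path: Path = Path(".env"),
) -> Optional[str]:
    """Return the first key found; the order used here is an assumption."""
    env = os.environ if env is None else env
    if cli_value:                      # value passed via --api-key
        return cli_value
    if env.get("MISTRAL_API_KEY"):     # exported environment variable
        return env["MISTRAL_API_KEY"]
    return read_dotenv(dotenv_path).get("MISTRAL_API_KEY")  # .env fallback
```

If none of the three sources provides a key, the function returns None, which a caller could turn into a clear error message before contacting the API.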
Run the conversion:

Convert a Single PDF File

The convert command processes a single PDF file.
```
poetry run pdf2md convert <path/to/your/document.pdf> [options]
```
Or, if the pdf2md command is already on your PATH (installed with pip or pipx, or inside an activated virtual environment such as poetry shell):
```
pdf2md convert <path/to/your/document.pdf> [options]
```
Options for Single File Conversion:
- --output or -o: Specify the path for the output Markdown file. If not provided, it defaults to the same name as the input PDF but with a .md extension (e.g., document.md).
- --api-key: Provide the Mistral API key directly.
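
The default output name described above amounts to swapping the file extension. A minimal sketch, assuming the Markdown file is placed next to the input PDF (the README only states that the name matches); `default_output_path` is a hypothetical helper, not part of the package:

```python
from pathlib import Path
from typing import Optional


def default_output_path(pdf_path: str, output: Optional[str] = None) -> Path:
    """Use the explicit --output value if given, else replace .pdf with .md."""
    if output:
        return Path(output)
    return Path(pdf_path).with_suffix(".md")


print(default_output_path("docs/document.pdf").as_posix())  # prints docs/document.md
```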

Convert Multiple PDF Files from a Directory

The convert-dir command processes all PDF files in a specified directory.
```
poetry run pdf2md convert-dir <path/to/directory/with/pdfs> [options]
```
Or, if the pdf2md command is already on your PATH (installed with pip or pipx, or inside an activated virtual environment such as poetry shell):
```
pdf2md convert-dir <path/to/directory/with/pdfs> [options]
```
Options for Directory Conversion:
- --output-dir or -o: Specify the directory where output Markdown files will be saved. If not provided, it defaults to the same directory as the input PDFs.
- --api-key: Provide the Mistral API key directly.
- --max-workers or -w: Maximum number of concurrent conversions (default: 2). Increase this value to process more files in parallel.
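
To make the effect of --max-workers concrete, here is a sketch of bounded parallel conversion over a directory. It is not the tool's actual implementation: the use of a thread pool is an assumption, and `convert_directory` and the `convert_one` callback are hypothetical names standing in for whatever performs the real OCR call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Optional


def convert_directory(
    input_dir,
    convert_one: Callable[[Path, Path], None],
    output_dir: Optional[str] = None,
    max_workers: int = 2,
) -> Dict[str, str]:
    """Run convert_one(pdf, markdown_path) for every PDF, at most max_workers at a time."""
    input_dir = Path(input_dir)
    # Default to the input directory, mirroring the --output-dir default.
    out_dir = Path(output_dir) if output_dir else input_dir
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(
        p for p in input_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(convert_one, pdf, out_dir / f"{pdf.stem}.md"): pdf
            for pdf in pdfs
        }
        for future in as_completed(futures):
            pdf = futures[future]
            try:
                future.result()
                results[pdf.name] = "ok"
            except Exception as exc:  # one failed file should not stop the rest
                results[pdf.name] = f"failed: {exc}"
    return results
```

Threads suit this kind of workload because each conversion mostly waits on a network response. A higher worker count shortens the total run only as long as the remote API accepts that many simultaneous requests.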

Image Handling:

The script will attempt to extract images embedded in the PDF.

Images are saved in a subdirectory named <output_filename_stem>_images (e.g., if the output is report.md, images will be in report_images/).
The generated Markdown file will contain relative links pointing to the images in this subdirectory.
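
The naming rule can be expressed in a few lines. This is a sketch under assumptions: `images_dir_for` and `relative_image_link` are hypothetical helpers, and the image file name `img-0.jpeg` is made up for illustration; only the `<output_filename_stem>_images` convention comes from the description above.

```python
from pathlib import Path


def images_dir_for(output_md: Path) -> Path:
    """Directory for extracted images: <stem>_images next to the Markdown file."""
    return output_md.parent / f"{output_md.stem}_images"


def relative_image_link(output_md: Path, image_name: str) -> str:
    """Markdown image link relative to the Markdown file's own directory."""
    rel = Path(images_dir_for(output_md).name) / image_name
    return f"![{image_name}]({rel.as_posix()})"


md = Path("output/report.md")
print(images_dir_for(md).as_posix())          # prints output/report_images
print(relative_image_link(md, "img-0.jpeg"))  # prints ![img-0.jpeg](report_images/img-0.jpeg)
```

Because the links are relative, the Markdown file and its image directory can be moved together without breaking the references.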

Examples:

```
# Convert a single PDF file
poetry run pdf2md convert ./my_report.pdf -o ./output/report.md
```

This command will create:

- ./output/report.md (the markdown content)
- ./output/report_images/ (a directory containing extracted images)

```
# Convert all PDF files in a directory
poetry run pdf2md convert-dir ./pdf_documents/ -o ./markdown_output/ -w 4
```

This command will:

- Process all PDF files in the ./pdf_documents/ directory
- Save the resulting Markdown files in the ./markdown_output/ directory
- Process up to 4 files concurrently
- Create image directories for each output file as needed

An example output generated from example.pdf (included in the repository) can be found in example.md, with its corresponding images located in the example_images/ directory.

Development

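Use poetry shell to activate the virtual environment for development. On Poetry 2.x, poetry shell is provided by the separate poetry-plugin-shell plugin; the built-in alternative is poetry env activate, which prints the command that activates the environment.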

Run tests (if any) using:

```
poetry run pytest
```

License

This project is licensed under the ISC License - see the LICENSE file for details.

Release history

- 1.2.0: Oct 8, 2025
- 1.1.0: Apr 30, 2025
- 1.0.1: Apr 29, 2025