A CLI tool to convert PDF files to Markdown using the Mistral AI OCR API.

Mistral PDF to Markdown Converter

A simple command-line tool that converts PDF files to Markdown using the Mistral AI OCR API. It also extracts embedded images and saves them in a subdirectory next to the output Markdown file.

Installation

  1. Clone the repository:

    git clone https://github.com/arcangelo7/mistral-pdf-to-markdown.git
    cd mistral-pdf-to-markdown
    
  2. Install dependencies using Poetry:

    poetry install
    

Usage

  1. Set your Mistral API key: You can export it as an environment variable:

    export MISTRAL_API_KEY='your_api_key_here'
    

    Alternatively, you can create a .env file in the project root directory with the following content:

    MISTRAL_API_KEY=your_api_key_here
    

    You can also pass the API key directly using the --api-key option.

  2. Run the conversion: The main command is convert.

    poetry run pdf2md convert <path/to/your/document.pdf> [options]
    

    Or, if you have activated the virtual environment (poetry shell; on Poetry 2.x this command is provided by the poetry-plugin-shell plugin):

    pdf2md convert <path/to/your/document.pdf> [options]
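
The three ways of supplying the API key described in step 1 could be resolved along the following lines. This is only a sketch, not the tool's actual code: the function names `read_dotenv` and `resolve_api_key` are hypothetical, and the precedence (the --api-key value first, then the environment variable, then the .env file) is an assumption, since the description above does not state an order.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_dotenv(path: Path) -> dict:
    """Parse plain KEY=value lines from a .env file (no interpolation)."""
    values = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(cli_value: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None,
                    dotenv_path: Path = Path(".env")) -> str:
    """Pick the key from --api-key, then the environment, then .env.

    The ordering is an assumption made for this illustration.
    """
    env = os.environ if env is None else env
    key = (cli_value
           or env.get("MISTRAL_API_KEY")
           or read_dotenv(dotenv_path).get("MISTRAL_API_KEY"))
    if not key:
        raise ValueError(
            "No Mistral API key found; set MISTRAL_API_KEY or pass --api-key."
        )
    return key
```

Calling `resolve_api_key()` with no arguments reads the current environment and a .env file in the working directory, and raises `ValueError` if neither provides a key.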
    

Options:

  • --output or -o: Specify the path for the output Markdown file. If not provided, it defaults to the same name as the input PDF but with a .md extension (e.g., document.md).
  • --api-key: Provide the Mistral API key directly.
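
The default for --output can be illustrated with a few lines of `pathlib`. This is a sketch under assumptions: `default_output_path` is a hypothetical helper, and it places the default file next to the input PDF, which the option description does not state explicitly.

```python
from pathlib import Path
from typing import Optional


def default_output_path(pdf_path: str, output: Optional[str] = None) -> Path:
    """Return the explicit --output path, or the PDF path with a .md suffix."""
    if output:
        return Path(output)
    # document.pdf -> document.md, in the same directory (assumed).
    return Path(pdf_path).with_suffix(".md")
```

For example, `default_output_path("docs/document.pdf")` gives `docs/document.md`, while passing an explicit output path returns that path unchanged.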

Image Handling:

The script will attempt to extract images embedded in the PDF.

  • Images are saved in a subdirectory named <output_filename_stem>_images (e.g., if the output is report.md, images will be in report_images/).
  • The generated Markdown file will contain relative links pointing to the images in this subdirectory.
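
The naming rule for the image directory and the relative links can be sketched as follows. The helper names `image_dir_for` and `image_link` and the image file name used in the example are hypothetical; only the `<output_filename_stem>_images` convention comes from the description above.

```python
from pathlib import Path


def image_dir_for(output_md: Path) -> Path:
    """Directory for extracted images: <stem>_images next to the Markdown file."""
    return output_md.parent / f"{output_md.stem}_images"


def image_link(output_md: Path, image_name: str, alt_text: str = "") -> str:
    """Build a Markdown image link relative to the Markdown file's directory."""
    target = image_dir_for(output_md) / image_name
    relative = target.relative_to(output_md.parent)
    # Use forward slashes so the link is the same on every platform.
    return f"![{alt_text}]({relative.as_posix()})"
```

With an output file of `output/report.md`, `image_dir_for` yields `output/report_images`, and a link to a hypothetical `img-0.jpeg` is written as `report_images/img-0.jpeg`, so the Markdown file and its image directory can be moved together.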

Example:

poetry run pdf2md convert ./my_report.pdf -o ./output/report.md

This command will create:

  • ./output/report.md (the Markdown content)
  • ./output/report_images/ (a directory containing extracted images)

An example output generated from example.pdf (included in the repository) can be found in example.md, with its corresponding images located in the example_images/ directory.

Development

Use poetry shell to activate the virtual environment for development (on Poetry 2.x, this command is provided by the poetry-plugin-shell plugin).

Run tests (if any) using:

poetry run pytest

License

This project is licensed under the ISC License - see the LICENSE file for details.
