A CLI tool to convert PDF files to Markdown using the Mistral AI OCR API.
Mistral PDF to Markdown Converter
A simple command-line tool to convert PDF files into Markdown format using the Mistral AI OCR API. This tool also extracts embedded images and saves them in a subdirectory relative to the output markdown file.
Installation
- Clone the repository:
  git clone https://github.com/arcangelo7/mistral-pdf-to-markdown.git
  cd mistral-pdf-to-markdown
- Install dependencies using Poetry:
  poetry install
Usage
- Set your Mistral API key. You can set it as an environment variable:
  export MISTRAL_API_KEY='your_api_key_here'
  Alternatively, you can create a .env file in the project root directory with the following content:
  MISTRAL_API_KEY=your_api_key_here
  You can also pass the API key directly using the --api-key option.
- Run the conversion. The main command is convert:
  poetry run pdf2md convert <path/to/your/document.pdf> [options]
  Or, if you have activated the virtual environment (poetry shell):
  pdf2md convert <path/to/your/document.pdf> [options]
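The ways of supplying the API key described above can be pictured as a small lookup. The following is a minimal sketch, not the tool's actual code: the function name `resolve_api_key` is hypothetical, and the precedence shown (an explicit `--api-key` value first, then `MISTRAL_API_KEY` from the environment) is an assumption that the README does not confirm. A `.env` loader, if used, would be expected to populate the environment before this lookup runs.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str] = None,
                    environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key to use (hypothetical helper, for illustration only)."""
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if environ is None else environ
    # Assumed precedence: explicit --api-key value, then MISTRAL_API_KEY.
    key = cli_value or env.get("MISTRAL_API_KEY")
    if not key:
        raise ValueError("No API key found: set MISTRAL_API_KEY or pass --api-key")
    return key
```

Passing the environment as a parameter keeps the lookup easy to test without touching the real process environment.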
Options:
- --output or -o: Specify the path for the output Markdown file. If not provided, it defaults to the same name as the input PDF but with a .md extension (e.g., document.md).
- --api-key: Provide the Mistral API key directly.
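The default described for --output amounts to swapping the file extension. A minimal sketch of that rule follows, assuming the default file lands in the same directory as the input PDF (the README only says "the same name"); the function name `default_output_path` is hypothetical and not taken from the project.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def default_output_path(pdf_path: PathLike,
                        output: Optional[PathLike] = None) -> Path:
    """Pick the Markdown output path (hypothetical helper, for illustration)."""
    # An explicit --output value always wins.
    if output is not None:
        return Path(output)
    # Otherwise reuse the PDF's name, replacing its extension with .md.
    # Assumption: the file is placed next to the input PDF.
    return Path(pdf_path).with_suffix(".md")
```

For instance, `default_output_path("docs/document.pdf")` yields `docs/document.md`, while passing `output="out/report.md"` returns that path unchanged.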
Image Handling:
The script will attempt to extract images embedded in the PDF.
- Images are saved in a subdirectory named <output_filename_stem>_images (e.g., if the output is report.md, images will be in report_images/).
- The generated Markdown file will contain relative links pointing to the images in this subdirectory.
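The naming rule for the image directory and the relative links can be sketched as below. This is an illustration under assumptions rather than the project's implementation: the helper names `images_dir_for` and `relative_image_link` and the image file name `img-0.jpeg` are hypothetical, and the exact link text the tool writes is not specified in the README.

```python
from pathlib import Path, PurePosixPath
from typing import Union


def images_dir_for(output_md: Union[str, Path]) -> Path:
    """Directory for extracted images: <output_filename_stem>_images."""
    output_md = Path(output_md)
    # The directory sits next to the Markdown file and reuses its stem.
    return output_md.parent / f"{output_md.stem}_images"


def relative_image_link(output_md: Union[str, Path], image_name: str) -> str:
    """Build a Markdown image link relative to the Markdown file."""
    images_dir = images_dir_for(output_md)
    # Only the directory name is needed because the link is resolved
    # relative to the Markdown file; forward slashes keep it portable.
    target = PurePosixPath(images_dir.name) / image_name
    return f"![{image_name}]({target})"
```

With an output of `output/report.md`, the directory is `output/report_images` and a link to the hypothetical `img-0.jpeg` reads `![img-0.jpeg](report_images/img-0.jpeg)`.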
Example:
poetry run pdf2md convert ./my_report.pdf -o ./output/report.md
This command will create:
- ./output/report.md (the Markdown content)
- ./output/report_images/ (a directory containing extracted images)
An example output generated from example.pdf (included in the repository) can be found in example.md, with its corresponding images located in the example_images/ directory.
Development
Use poetry shell to activate the virtual environment for development.
Run tests (if any) using:
poetry run pytest
License
This project is licensed under the ISC License - see the LICENSE file for details.