PDF to Markdown

This project contains a command-line tool that converts PDF and Word documents to Markdown. It converts the documents to images and uses an LLM to turn those images into Markdown.
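
To make the image-then-LLM idea concrete, here is a minimal sketch of such a pipeline. It is not this project's actual code: `image_to_data_url`, `pages_to_markdown`, and the `describe_page` callable are hypothetical names, and the LLM call is injected as a plain function so that the example runs without any API key.

```python
import base64
from typing import Callable, Iterable


def image_to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URL, a common way to attach images to LLM requests."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def pages_to_markdown(
    page_images: Iterable[bytes],
    describe_page: Callable[[str], str],
) -> str:
    """Convert each page image to Markdown and join the pages with blank lines.

    describe_page is a hypothetical stand-in for the LLM call: it receives
    a data URL for one page and returns the Markdown for that page.
    """
    parts = [describe_page(image_to_data_url(img)).strip() for img in page_images]
    return "\n\n".join(parts) + "\n"
```

In a real setup, `describe_page` would wrap a request to the chosen engine; here any function with the same shape works, which keeps the page loop easy to test.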

Install

Execute these commands in the base directory of this project.

On Windows, download the poppler library (e.g. poppler-24.08.0), then run the following in PowerShell:

$env:PKG_CONFIG_PATH="<download_folder>\poppler-24.08.0\Library\lib\pkgconfig"
uv venv
.venv\Scripts\activate
pip install cmake
uv sync

On Linux or macOS:

# conda remove -n pdf_to_markdown --all
uv venv
source .venv/bin/activate
uv sync

# Linux
sudo apt update
sudo apt install g++ -y
sudo apt install pkg-config -y
sudo apt-get install poppler-utils libpoppler-cpp-dev
# End Linux

There is an installation script for Linux in this repository.
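
After installing, a quick check that the system tools are on the PATH can save a failed build. The sketch below is an illustration only and is not part of this repository; `REQUIRED_TOOLS` and `missing_tools` are hypothetical names. It looks for `pdftoppm` because that binary ships with poppler-utils.

```python
import shutil

# Tools installed by the Linux instructions; pdftoppm is part of poppler-utils.
REQUIRED_TOOLS = ("g++", "pkg-config", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the given list that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```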

Configuration

The application is configured using environment variables, which you can set in an .env file. Check the .env_local file for the names of the variables that you will need.

You will need an OpenAI API key to run the PDF conversion.

You will also need a Gemini API key.

So you will need two environment variables:

OPENAI_API_KEY
GEMINI_API_KEY
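
A small check like the one below can confirm that both keys are present before starting a conversion. It is a sketch under assumptions: the parser handles only simple KEY=VALUE lines, and `parse_env_file` and `missing_keys` are hypothetical helpers, not functions from this project.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in env."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    env = dict(os.environ)
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            env.update(parse_env_file(handle.read()))
    print("Missing keys:", missing_keys(env) or "none")
```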

Usage of the command line application

Example: how to convert multiple PDF files with the OpenAI engine:

python ./pdf_to_markdown_llm/main/cli.py convert-files -f ./pdfs/oecd/002b3a39-en.pdf -f ./pdfs/oecd/ee6587fd-en.pdf

Example: how to convert a Word file to markdown with the OpenAI engine:

python ./pdf_to_markdown_llm/main/cli.py convert-files -f "./docs/Explainability March 2025.docx"

Example: how to convert a Word file to HTML with the OpenAI engine:

python ./pdf_to_markdown_llm/main/cli.py convert-files -f "./docs/bk/Pour INSCRIPTION en ligne MARCORIGNAN .docx" -t html

Example: how to convert a single file with the Gemini model:

python ./pdf_to_markdown_llm/main/cli.py convert-files -f ./pdfs/oecd/002b3a39-en.pdf -e gemini

Example: how to convert all PDF files in a folder:

python ./pdf_to_markdown_llm/main/cli.py convert-in-dir --dirs ./pdfs/oecd
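
When driving the tool from a script, it can help to assemble the `convert-files` command programmatically. The sketch below uses only the flags shown in the examples above (`-f`, `-e`, `-t`); `build_convert_files_command` is a hypothetical helper, and the script path assumes you run it from the base directory of the project.

```python
import sys

# Path of the CLI script as used in the examples above.
CLI_PATH = "./pdf_to_markdown_llm/main/cli.py"


def build_convert_files_command(files, engine=None, target=None):
    """Build the argument list for one convert-files call.

    files  : paths to pass with repeated -f flags
    engine : optional value for -e (the examples use "gemini")
    target : optional value for -t (the examples use "html")
    """
    if not files:
        raise ValueError("at least one file is required")
    command = [sys.executable, CLI_PATH, "convert-files"]
    for path in files:
        command += ["-f", str(path)]
    if engine:
        command += ["-e", engine]
    if target:
        command += ["-t", target]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids quoting problems with file names that contain spaces, such as the Word documents in the examples.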

Publishing

uv build
uv publish

Note about upgrading libraries

uv lock --python 3.14 --upgrade-package pydantic-core --upgrade-package pydantic
