PDF to Markdown

This project contains a command line tool to convert PDF to markdown. It uses image conversion and an LLM to convert the images to markdown.

Install

Execute these commands in the base directory of this project.

On Windows, first download the poppler library (e.g. poppler-24.08.0), then run the following in PowerShell:

# Windows only (PowerShell): point pkg-config at the downloaded poppler files
$env:PKG_CONFIG_PATH="<download_folder>\poppler-24.08.0\Library\lib\pkgconfig"
# Optional: remove an existing environment before recreating it
# conda remove -n pdf_to_markdown --all
conda create -n pdf_to_markdown python=3.13
conda activate pdf_to_markdown
pip install poetry
# Windows only
pip install cmake
conda install poppler poppler-qt
# End Windows
# Linux only
sudo apt update
sudo apt install g++ -y
sudo apt install pkg-config -y
sudo apt-get install poppler-utils libpoppler-cpp-dev
# End Linux
# All platforms: install the project dependencies
poetry install

There is an installation script for Linux in this repository.
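Before running poetry install, it can help to confirm that the system tools installed above are on the PATH. The following is a minimal sketch, not part of this project: the helper name missing_tools and the tool list are assumptions for illustration. pdftoppm ships with poppler-utils on Linux, and the tool names may differ on Windows.

```python
import shutil

# Assumed tool names (not taken from this project's code):
# pdftoppm is part of poppler-utils, pkg-config is installed in the steps above.
REQUIRED_TOOLS = ["pdftoppm", "pkg-config"]


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

The which parameter exists only so the lookup can be swapped out in a test; in normal use the default shutil.which is enough.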

Configuration

The application is configured using environment variables, which you can set in an .env file. Check the .env_local file for the names of the variables that you will need.

You will need an OpenAI API key to run the PDF conversion with the OpenAI engine, and a Gemini API key to run it with Gemini.

This means you will need two environment variables:

OPENAI_API_KEY
GEMINI_API_KEY
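A quick way to check that both variables are visible to Python before starting a conversion is sketched below. It is an illustration only: the helper name missing_keys is hypothetical, and the sketch reads the process environment directly, so it does not load an .env file by itself.

```python
import os

# The two variable names listed above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Both API keys are set.")
```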

Usage of the command line application

Example: how to convert multiple PDF files with the OpenAI engine:

python ./pdf_to_markdown_llm/main/cli.py convert-files -f ./pdfs/oecd/002b3a39-en.pdf -f ./pdfs/oecd/ee6587fd-en.pdf

Example: how to convert a single file with the Gemini model:

python ./pdf_to_markdown_llm/main/cli.py convert-files -f ./pdfs/oecd/002b3a39-en.pdf -e gemini

Example: how to convert all PDF files in a folder:

python ./pdf_to_markdown_llm/main/cli.py convert-in-dir --dirs ./pdfs/oecd
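When the list of files is produced by another script, the convert-files command line can be assembled programmatically. The sketch below only builds and prints the argument list, using the flags from the examples above (-f for each file, -e for the engine). The helper name build_convert_command is hypothetical and not part of this project.

```python
import shlex

# Path to the CLI script, as used in the examples above.
CLI_PATH = "./pdf_to_markdown_llm/main/cli.py"


def build_convert_command(files, engine=None):
    """Build the argument list for `convert-files`, one -f flag per file."""
    if not files:
        raise ValueError("at least one PDF file is required")
    command = ["python", CLI_PATH, "convert-files"]
    for pdf in files:
        command += ["-f", str(pdf)]
    if engine is not None:
        # Only add -e when an engine is requested, e.g. "gemini".
        command += ["-e", engine]
    return command


if __name__ == "__main__":
    cmd = build_convert_command(["./pdfs/oecd/002b3a39-en.pdf"], engine="gemini")
    # Print the shell-quoted command instead of running it.
    print(shlex.join(cmd))
```

To run the command rather than print it, the returned list could be passed to subprocess.run(cmd, check=True) from the base directory of the project, assuming the environment and API keys are set up as described above.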
