PDF to Markdown
This project contains a command line tool to convert PDF to markdown. It uses image conversion and an LLM to convert the images to markdown.
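The two stages described above (page images in, markdown out) can be sketched as follows. This is only an illustration of the idea, not the project's actual code: `convert_pages` and `fake_transcribe` are hypothetical names, and the LLM call is replaced by a stand-in function so the example runs without any API key.

```python
from typing import Callable, Iterable


def convert_pages(page_images: Iterable[bytes],
                  transcribe: Callable[[bytes], str]) -> str:
    """Transcribe each page image and join the results into one markdown text."""
    parts = []
    for image in page_images:
        text = transcribe(image).strip()
        if text:  # skip pages for which nothing was transcribed
            parts.append(text)
    # Separate pages with a blank line so markdown blocks do not run together.
    return "\n\n".join(parts)


def fake_transcribe(image: bytes) -> str:
    """Hypothetical stand-in for the LLM call; a real engine would return
    the markdown transcription of the page image."""
    return f"# Page with {len(image)} bytes"


if __name__ == "__main__":
    print(convert_pages([b"abc", b"de"], fake_transcribe))
```

Passing the transcription step in as a function keeps the sketch independent of any particular engine.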
Install
Execute these commands in the base directory of this project.
On Windows, download the poppler library (e.g. poppler-24.08.0) and then run the following in PowerShell:
$env:PKG_CONFIG_PATH="<download_folder>\poppler-24.08.0\Library\lib\pkgconfig"
# conda remove -n pdf_to_markdown --all
conda create -n pdf_to_markdown python=3.13
conda activate pdf_to_markdown
pip install poetry
# Windows
pip install cmake
conda install poppler poppler-qt
# End Windows
# Linux
sudo apt update
sudo apt install g++ -y
sudo apt install pkg-config -y
sudo apt-get install poppler-utils libpoppler-cpp-dev
# End Linux
poetry install
There is an installation script for Linux in this repository.
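After installing, it may help to confirm that the external tools are reachable before running a conversion. The snippet below is a small sketch, not part of this project: `missing_tools` is a hypothetical helper, and it assumes that `pdftoppm` (shipped with poppler) and `pkg-config` are the executables worth checking.

```python
import shutil


def missing_tools(tools=("pdftoppm", "pkg-config")):
    """Return the names of the given executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```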
Configuration
The application is configured using environment variables, which you can set in an .env file. Check the .env_local file for the names of the variables that you will need.
You will need an OpenAI API key to run the PDF conversion.
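To illustrate how an .env file maps to environment variables, here is a minimal sketch that uses only the standard library. It is not necessarily how this project loads its configuration, and the variable name `OPENAI_API_KEY` in the example is an assumption; the actual names are listed in .env_local.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines, comments and malformed lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Load variables from an .env file without overriding existing ones."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


if __name__ == "__main__":
    # OPENAI_API_KEY is an assumed name used only for this example.
    print(parse_env_text('OPENAI_API_KEY="sk-example"'))
```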
Usage of the command line application
Example: how to convert multiple PDF files with the OpenAI engine:
python ./pdf_to_markdown_llm/main/cli.py convert-files -f ./pdfs/oecd/002b3a39-en.pdf -f ./pdfs/oecd/ee6587fd-en.pdf
Example: how to convert a single file with the Gemini model:
python ./pdf_to_markdown_llm/main/cli.py convert-files -f ./pdfs/oecd/002b3a39-en.pdf -e gemini
Example: how to convert all PDF files in a folder:
python ./pdf_to_markdown_llm/main/cli.py convert-in-dir --dirs ./pdfs/oecd
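If you want to drive the command line application from a script, the `convert-files` invocations above can be assembled programmatically. This is a sketch under the assumption that only the `-f` and `-e` options shown in the examples are used; `build_convert_files_command` is a hypothetical helper and not part of the package.

```python
import sys

# Path of the CLI script as used in the examples above.
CLI_PATH = "./pdf_to_markdown_llm/main/cli.py"


def build_convert_files_command(files, engine=None):
    """Build the argument list for a `convert-files` call."""
    if not files:
        raise ValueError("at least one PDF file is required")
    command = [sys.executable, CLI_PATH, "convert-files"]
    for pdf_file in files:
        command += ["-f", str(pdf_file)]  # one -f option per input file
    if engine is not None:
        command += ["-e", engine]  # e.g. "gemini", as in the example above
    return command


if __name__ == "__main__":
    command = build_convert_files_command(
        ["./pdfs/oecd/002b3a39-en.pdf"], engine="gemini"
    )
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the base directory of the project.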
Download files
File details
Details for the file pdf_to_markdown_llm-0.1.3.tar.gz.
File metadata
- Download URL: pdf_to_markdown_llm-0.1.3.tar.gz
- Upload date:
- Size: 6.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.0.1 CPython/3.13.1 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4d8b758d7df40543f1cf71ba0ab956e90533d0728b6e8c46b7c6b5f2855c5ea2 |
| MD5 | c030785f52cfa0b81a1c9231c11a00c8 |
| BLAKE2b-256 | 90e75af71c8013073bbd82f8d4e696003e7aa4d870fd20b09538ceaff83bfcf7 |
File details
Details for the file pdf_to_markdown_llm-0.1.3-py3-none-any.whl.
File metadata
- Download URL: pdf_to_markdown_llm-0.1.3-py3-none-any.whl
- Upload date:
- Size: 8.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.0.1 CPython/3.13.1 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 429080ec7ff02d8f6baf4c7a1dd128dc3ec8d15e7aaf58140f4053a061dd62a1 |
| MD5 | 7cecbd15efdbf9cbfd2b03c967b5fb4d |
| BLAKE2b-256 | af138c0dcd331d066c480c109fbef3d2d8e46c4950c5813a0a57780185466f5e |