A package to convert PDF files to Markdown using a local LLM.
pdf2md_llm
pdf2md_llm is a Python package that converts PDF files to Markdown using a local Large Language Model (LLM).
The package leverages the pdf2image library to convert PDF pages to images and a vision language model to generate Markdown text from these images.
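The page-rendering step can be sketched directly with pdf2image. This is a minimal illustration, not the package's internal code: the helper name `pdf_to_images` is hypothetical, its defaults mirror the CLI defaults documented below, and pdf2image needs Poppler installed on the system.

```python
from pathlib import Path


def pdf_to_images(pdf_path, output_folder="./out", dpi=200, fmt="jpeg", size=(700, None)):
    """Render each PDF page to an image file and return the file paths.

    Hypothetical helper for illustration; the package's own PdfToImg class
    may work differently.
    """
    # Imported lazily so this module loads even where pdf2image is absent.
    from pdf2image import convert_from_path

    Path(output_folder).mkdir(parents=True, exist_ok=True)
    # paths_only=True returns the written file paths instead of PIL images.
    return convert_from_path(
        pdf_path,
        dpi=dpi,
        fmt=fmt,
        size=size,
        output_folder=output_folder,
        paths_only=True,
    )
```

Each returned path can then be passed to the vision language model, one page at a time.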
Features
- Convert PDF files to images.
- Generate Markdown text from images using a local LLM.
- Keep your data private. No third-party file uploads.
Installation
You need a CUDA-compatible GPU to run local LLMs with vLLM.
You can use pip to install the package:
pip install pdf2md-llm
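To check the GPU requirement before loading a model, a quick heuristic is to ask `nvidia-smi` whether it can list any devices. The function name `nvidia_gpu_visible` is hypothetical, and a positive result only shows that an NVIDIA driver and GPU are present; it does not confirm that the installed CUDA version works with vLLM.

```python
import shutil
import subprocess


def nvidia_gpu_visible():
    """Return True if nvidia-smi exists and reports at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per GPU, each starting with "GPU".
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU detected" if nvidia_gpu_visible() else "No NVIDIA GPU detected")
```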
Usage
CLI
You can use the pdf2md_llm package via the command line interface (CLI).
To convert a PDF file to Markdown, run the following command:
pdf2md_llm <pdf_file> [options]
Options
- pdf_file: Path to the PDF file to convert.
- --model: Name of the model to use (default: Qwen/Qwen2.5-VL-3B-Instruct-AWQ).
- --dtype: Data type for the model weights and activations (default: None).
- --max_model_len: Max model context length (default: 7000).
- --prompt: Custom prompt for the LLM (default: None).
- --size: Image size as a tuple (default: (700, None)).
- --dpi: DPI of the images (default: 200).
- --fmt: Image format (default: jpeg).
- --output_folder: Folder to save the output Markdown file (default: ./out).
Example
pdf2md_llm example.pdf --model "Qwen/Qwen2.5-VL-3B-Instruct-AWQ" --output_folder "./output"
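To convert several PDFs, the CLI can be driven from a short script. The sketch below is an assumption-laden illustration: `build_command` and `convert_folder` are hypothetical helpers, and only flags listed under Options are passed through. `--size` is left out because its command-line syntax is not documented here.

```python
import subprocess
from pathlib import Path


def build_command(pdf_file, **options):
    """Build the pdf2md_llm argument list from keyword options."""
    cmd = ["pdf2md_llm", str(pdf_file)]
    for name, value in options.items():
        # Skip unset options so the CLI falls back to its own defaults.
        if value is not None:
            cmd += [f"--{name}", str(value)]
    return cmd


def convert_folder(folder, **options):
    """Run the CLI once per PDF in `folder`, in sorted order."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_command(pdf, **options), check=True)
```

For example, `convert_folder("./pdfs", model="Qwen/Qwen2.5-VL-3B-Instruct-AWQ", output_folder="./output")` runs one conversion per file. Each run is a separate process and loads the model again, so for large batches the Python API is likely the better fit.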
Model Support:
Currently the following Qwen2.5-VL models are supported:
- Qwen/Qwen2.5-VL-3B-Instruct
- Qwen/Qwen2.5-VL-3B-Instruct-AWQ
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct-AWQ
- Qwen/Qwen2.5-VL-72B-Instruct
- Qwen/Qwen2.5-VL-72B-Instruct-AWQ
If you want to use a different model, feel free to add a vLLM-compatible model to the factory function llm_model() in llm.py.
Python API
You can use the pdf2md_llm package via the Python API.
Basic usage:
from vllm import SamplingParams

from pdf2md_llm.llm import llm_model
from pdf2md_llm.pdf2img import PdfToImg

pdf2img = PdfToImg(size=(700, None), output_folder="./out")
img_files = pdf2img.convert("example.pdf")

llm = llm_model(
    model="Qwen/Qwen2.5-VL-3B-Instruct-AWQ",  # Name of the huggingface model
    dtype="half",  # Model data type
)

sampling_params = SamplingParams(
    temperature=0.1,
    min_p=0.1,
    max_tokens=8192,
    stop_token_ids=[],
)

# Append all pages to one output Markdown file
for img_file in img_files:
    markdown_text = llm.generate(
        img_file, sampling_params=sampling_params
    )  # convert image to Markdown with LLM

    with open("example.md", "a", encoding="utf-8") as myfile:
        myfile.write(markdown_text)
For a full example, see example_api.py.
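Because the basic usage opens example.md in append mode, running it twice adds the pages a second time. One way around this, sketched below, is to collect the per-page Markdown and write the file once. The helper `write_markdown` is hypothetical and assumes that `llm.generate` returns a string, as the basic usage suggests.

```python
from pathlib import Path


def write_markdown(pages, output_path, separator="\n\n"):
    """Join per-page Markdown strings and write them in a single pass."""
    # Trim stray whitespace so pages are separated by exactly one blank line.
    text = separator.join(page.strip() for page in pages)
    # write_text replaces any earlier output, so reruns do not duplicate pages.
    Path(output_path).write_text(text + "\n", encoding="utf-8")
    return text
```

With the objects from the basic usage, this would be called as `write_markdown((llm.generate(f, sampling_params=sampling_params) for f in img_files), "example.md")`.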
License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgements
- pdf2image for converting PDF files to images.
- Qwen2.5-VL as the vision language model.
- vLLM for efficient LLM inference.