A Python library and CLI tool that uses LLMs to enhance PDF files


pdfalive: A Python library and set of CLI tools to bring PDF files alive with the magic of LLMs.

Features:

  • Automatically generate a Table of Contents (as PDF bookmarks) for a PDF file using LLMs. Supports arbitrarily large files with intelligent batching.
  • Automatically detect whether OCR is needed to extract text from raster data. If it is, OCR is performed via the Tesseract OCR library.
  • Choose which LLM to use from any vendor. Local models are also supported via Ollama. Retry logic is included for handling rate limits.

Installation

The Tesseract library is required for OCR, which is used for PDFs whose text cannot be extracted directly. On macOS, you can install it via Homebrew:

brew install tesseract
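
To confirm that the Tesseract executable is on your PATH before relying on OCR, a quick check like the following may help. This is a minimal sketch using only the Python standard library. The helper name `tesseract_available` is hypothetical and not part of pdfalive, and the sketch assumes the executable is named `tesseract`.

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH.

    Assumption: the Tesseract binary is installed under the name "tesseract".
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found")
    else:
        print("tesseract not found; install it before using OCR features")
```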

You can then install the pdfalive package, for example via pip:

pip install pdfalive

Usage

To use the CLIs described below, you can install the Python package (pip install pdfalive), or run the CLI directly using uvx:

uvx pdfalive generate-toc input.pdf output.pdf

More detailed examples of the CLI sub-commands are provided below. You can also pass --help to the main command or to any sub-command to see the options it supports.

generate-toc

Automatically generate a clickable Table of Contents (using PDF bookmarks) for a PDF file. The tool extracts font and text features from the PDF and uses an LLM to identify chapter and section headings.

Basic usage:

pdfalive generate-toc input.pdf output.pdf

Choosing an LLM: By default, the latest OpenAI model is used, but you can use any LLM supported by LangChain:

pdfalive generate-toc --model-identifier 'claude-sonnet-4-5' input.pdf output.pdf

Set the appropriate API key for your provider (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY).
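
If you drive the CLI from a script, it can be useful to fail early when the key is missing. The following is a minimal sketch, not part of pdfalive: the helper name `require_api_key` is hypothetical, and which variable to check for a given model is left to the caller.

```python
import os
from typing import Mapping, Optional


def require_api_key(var_name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of the named environment variable.

    Raises RuntimeError if the variable is unset or empty.
    `environ` defaults to os.environ; passing a mapping makes the check testable.
    """
    env = os.environ if environ is None else environ
    value = env.get(var_name, "").strip()
    if not value:
        raise RuntimeError(f"{var_name} is not set; export it before running pdfalive")
    return value


if __name__ == "__main__":
    try:
        require_api_key("OPENAI_API_KEY")
        print("API key found")
    except RuntimeError as err:
        print(err)
```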

Scanned PDFs: OCR is enabled by default. If your PDF is a scanned document without extractable text, OCR will be performed automatically to extract text before TOC generation.

By default, the OCR text layer is included in the output, which makes it searchable. To generate a TOC without keeping the OCR text (preserving the original file size):

pdfalive generate-toc --no-ocr-output scanned.pdf output.pdf

To disable automatic OCR detection entirely:

pdfalive generate-toc --no-ocr input.pdf output.pdf

Other useful options:

  • --force - Overwrite existing TOC if the PDF already has bookmarks
  • --ocr-language - Set OCR language (default: eng). Use Tesseract language codes like deu, fra, etc.
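
When calling generate-toc from a script, the options above can be assembled into an argument list. The sketch below is an illustration only: `build_generate_toc_command` is a hypothetical helper, not part of pdfalive, and it assumes that --ocr-language takes its value as a separate argument in the same way --model-identifier does in the example above.

```python
import shlex
from typing import List, Optional


def build_generate_toc_command(
    input_pdf: str,
    output_pdf: str,
    model_identifier: Optional[str] = None,
    force: bool = False,
    ocr_language: Optional[str] = None,
    ocr: bool = True,
    ocr_output: bool = True,
) -> List[str]:
    """Build the argv list for `pdfalive generate-toc` from the documented flags."""
    cmd = ["pdfalive", "generate-toc"]
    if model_identifier:
        cmd += ["--model-identifier", model_identifier]
    if force:
        cmd.append("--force")
    if ocr_language:
        # Assumption: the language code is passed as a separate argument.
        cmd += ["--ocr-language", ocr_language]
    if not ocr:
        cmd.append("--no-ocr")
    if not ocr_output:
        cmd.append("--no-ocr-output")
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    cmd = build_generate_toc_command("input.pdf", "output.pdf", force=True, ocr_language="deu")
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell-quoting problems with file names that contain spaces.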

extract-text

Extract text from scanned PDFs using OCR and save to a new PDF with an embedded text layer:

pdfalive extract-text input.pdf output.pdf

This is useful when you want a searchable/selectable text layer without generating a TOC.
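
To add a text layer to many scanned files at once, the extract-text command can be repeated per file. The following is a minimal sketch under stated assumptions: `build_extract_text_commands` is a hypothetical helper, not part of pdfalive, and the directory names are placeholders. It only builds the command lists; running them is left to the caller.

```python
from pathlib import Path
from typing import List


def build_extract_text_commands(input_dir: str, output_dir: str) -> List[List[str]]:
    """Return one `pdfalive extract-text` argv list per *.pdf file in input_dir.

    Each output file keeps the input file's name and is placed in output_dir.
    """
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        target = Path(output_dir) / pdf.name
        commands.append(["pdfalive", "extract-text", str(pdf), str(target)])
    return commands


if __name__ == "__main__":
    # "scans" and "searchable" are placeholder directory names.
    for cmd in build_extract_text_commands("scans", "searchable"):
        # Pass `cmd` to subprocess.run(cmd, check=True) to execute it.
        print(cmd)
```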

Development

We use uv to manage the library. To install it locally, you can run, for example:

uv sync
uv pip install -e .

We use ruff for formatting and linting, mypy for static type checking, and pytest for running unit tests. We also use pre-commit to check changes before they are committed.
