Convert PDF files to the archival PDF/A format

pdftopdfa

I built pdftopdfa as a free and open-source alternative to Ghostscript-based PDF/A converters. Ghostscript uses a dual license (AGPL/commercial) that makes it difficult to use in commercial products without purchasing a license. pdftopdfa is licensed under the permissive MPL-2.0 and can be freely used in commercial projects. Instead of re-rendering via Ghostscript, it modifies the PDF structure directly using pikepdf (based on QPDF), preserving the original content, fonts, and layout.

This project was built with Claude Code by Anthropic.

Highlights

  • No Ghostscript required -- direct PDF manipulation via pikepdf/QPDF
  • PDF/A-2b, 2u, 3b, 3u -- supports modern PDF/A levels (ISO 19005-2 and 19005-3)
  • Automatic font embedding -- embeds missing fonts with metrically compatible replacements
  • Font subsetting -- reduces file size by removing unused glyphs
  • CJK support -- embeds Noto Sans CJK for Chinese, Japanese, and Korean text
  • ICC color profiles -- automatically embeds sRGB, CMYK, and grayscale profiles
  • Batch processing -- converts entire directories, optionally recursive
  • Integrated validation -- checks conformance via veraPDF
  • OCR support -- optional text recognition for scanned PDFs via Tesseract
  • Simple API -- usable as CLI tool or Python library

How It Works

pdftopdfa applies a multi-step conversion pipeline to make a PDF compliant with the PDF/A standard:

  1. Pre-check -- detects if the PDF is already a valid PDF/A file (skips conversion if the existing level meets or exceeds the target; see Usage Guide for details)
  2. OCR (optional) -- runs Tesseract via ocrmypdf on scanned pages without a text layer
  3. Font compliance -- analyzes all fonts, embeds missing ones, adds ToUnicode mappings, subsets embedded fonts, and fixes encoding issues
  4. Sanitization -- removes or fixes non-compliant elements (JavaScript, non-standard actions, transparency groups, annotations, optional content, etc.)
  5. Metadata -- synchronizes XMP metadata with the document info dictionary and sets the PDF/A conformance level
  6. Color profiles -- detects color spaces and embeds the required ICC profiles (sRGB, CMYK/FOGRA39, sGray)
  7. Save -- writes the output with the correct PDF version header

Installation

Prerequisites

  • Python 3.12, 3.13, or 3.14
  • macOS, Linux, or Windows

Install from PyPI

pip install pdftopdfa

Install from the repository

git clone https://github.com/iredpaul/pdftopdfa.git
cd pdftopdfa
pip install .

Optional: OCR support

pip install ".[ocr]"

OCR requires a Tesseract installation on the system. See docs/ocr.md for details on OCR usage and quality presets.

Quick Start

# Simple conversion (creates document_pdfa.pdf)
pdftopdfa document.pdf

# Specific PDF/A level
pdftopdfa -l 2b document.pdf

# With validation
pdftopdfa -v document.pdf

# Convert an entire directory
pdftopdfa -r ./documents/ ./output/

# OCR for scanned PDFs
pdftopdfa --ocr document.pdf

# Python API
from pathlib import Path
from pdftopdfa import convert_to_pdfa

result = convert_to_pdfa(
    input_path=Path("input.pdf"),
    output_path=Path("output.pdf"),
    level="2b",
)

See docs/usage.md for the full CLI reference, Python API documentation, and examples.

Limitations

  • No PDF/A-1 support -- only PDF/A-2 and PDF/A-3 levels are supported
  • Encrypted PDFs -- password-protected PDFs cannot be converted
  • Font replacement -- fonts without a suitable metrically compatible replacement produce a warning; the resulting file may not be fully compliant
  • Platform -- supported on macOS, Linux, and Windows
  • Python versions -- tested on Python 3.12, 3.13, and 3.14

Development

pip install -e ".[dev]"

Running Tests

pytest

The test suite contains 2400+ tests covering fonts, color profiles, metadata, sanitization, and end-to-end conversion.

Code Quality

ruff check src/   # Linting
ruff format src/  # Formatting

Documentation

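Additional documentation is available in the docs/ folder, including docs/usage.md (CLI reference, Python API, and examples) and docs/ocr.md (OCR usage and quality presets).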

Contributing

Contributions are welcome! Please open an issue to report bugs or suggest features, or submit a pull request.

Dependencies

Core:

  • pikepdf -- PDF manipulation (based on QPDF)
  • lxml -- XMP metadata processing
  • fonttools -- Font analysis, subsetting, and embedding
  • click -- CLI framework
  • colorama -- Colored terminal output
  • tqdm -- Progress bars

Optional:

  • ocrmypdf -- OCR for scanned PDFs (runs Tesseract)

Acknowledgments

This project bundles the following resources:

License

This project is licensed under the Mozilla Public License 2.0 or later (MPL-2.0+) -- see LICENSE for details.
