Convert PDF files to the archival PDF/A format
pdftopdfa
I built pdftopdfa as a free and open-source alternative to Ghostscript-based PDF/A converters. Ghostscript uses a dual license (AGPL/commercial) that makes it difficult to use in commercial products without purchasing a license. pdftopdfa is licensed under the permissive MPL-2.0 and can be freely used in commercial projects. Instead of re-rendering via Ghostscript, it modifies the PDF structure directly using pikepdf (based on QPDF), preserving the original content, fonts, and layout.
Highlights
- No Ghostscript required -- direct PDF manipulation via pikepdf/QPDF
- PDF/A-2b, 2u, 3b, 3u -- supports modern PDF/A levels (ISO 19005-2 and 19005-3)
- Automatic font embedding -- uses policy-approved Windows system fonts or bundled replacements
- Font subsetting -- reduces file size by removing unused glyphs
- CJK support -- embeds Noto Sans CJK for Chinese, Japanese, and Korean text
- ICC color profiles -- automatically embeds sRGB, CMYK, and grayscale profiles
- Batch processing -- converts entire directories, optionally recursive
- Integrated validation -- checks conformance via veraPDF
- OCR support -- optional text recognition for scanned PDFs via Tesseract
- Simple API -- usable as CLI tool or Python library
How It Works
pdftopdfa applies a multi-step conversion pipeline to make a PDF compliant with the PDF/A standard:
- Pre-check -- detects whether the PDF is already a valid PDF/A file (skips conversion if the existing level meets or exceeds the target; optionally skips any veraPDF-compliant PDF/A via --skip-any-pdfa; see the Usage Guide for details)
- OCR (optional) -- runs Tesseract via ocrmypdf on scanned pages without a text layer
- Font compliance -- analyzes all fonts, embeds missing ones, adds ToUnicode mappings, subsets embedded fonts, and fixes encoding issues
- Sanitization -- removes or fixes non-compliant elements (JavaScript, non-standard actions, transparency groups, annotations, optional content, etc.)
- Metadata -- synchronizes XMP metadata with the document info dictionary and sets the PDF/A conformance level
- Color profiles -- detects color spaces and embeds the required ICC profiles (sRGB, CMYK/FOGRA39, sGray)
- Save -- writes the output with the correct PDF version header
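The pre-check step can be illustrated with a small sketch. It reads the PDF/A identification (pdfaid:part and pdfaid:conformance) from an XMP packet and compares it with a target level. This is not pdftopdfa's actual implementation: the names claimed_level and meets_target are hypothetical, and the rule that a higher part number or a stricter conformance letter counts as "exceeds" is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Namespace of the PDF/A identification schema used in XMP metadata.
PDFAID = "{http://www.aiim.org/pdfa/ns/id/}"

# Assumption: conformance letters are ordered b < u < a by strictness.
RANK = {"B": 0, "U": 1, "A": 2}


def claimed_level(xmp_packet: str) -> Optional[Tuple[int, str]]:
    """Return the (part, conformance) pair an XMP packet claims, or None."""
    part = conformance = None
    for elem in ET.fromstring(xmp_packet).iter():
        # The properties may appear as attributes of rdf:Description ...
        part = part or elem.attrib.get(PDFAID + "part")
        conformance = conformance or elem.attrib.get(PDFAID + "conformance")
        # ... or as child elements.
        if elem.tag == PDFAID + "part":
            part = part or (elem.text or "").strip()
        elif elem.tag == PDFAID + "conformance":
            conformance = conformance or (elem.text or "").strip()
    if not part or not conformance or not part.isdigit():
        return None
    return int(part), conformance.upper()


def meets_target(claimed: Tuple[int, str], target: str) -> bool:
    """Check whether a claimed level meets or exceeds a target such as '2b'.

    Assumed rule (hypothetical): both the part number and the
    conformance rank must be at least as high as the target's.
    """
    target_part, target_conf = int(target[:-1]), target[-1].upper()
    part, conf = claimed
    return part >= target_part and RANK.get(conf, -1) >= RANK[target_conf]
```

A claimed level is only a claim. Confirming that the file really conforms still requires a validator such as veraPDF, which is why the pipeline treats validation as a separate concern.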
Installation
Prerequisites
- Python 3.12, 3.13, or 3.14
- macOS, Linux, or Windows
pip install pdftopdfa
Optional: OCR support
pip install "pdftopdfa[ocr]"
OCR requires a Tesseract installation on the system. See docs/ocr.md for details on OCR usage and quality presets.
Quick Start
# Simple conversion (creates document_pdfa.pdf)
pdftopdfa document.pdf
# Specific PDF/A level
pdftopdfa -l 2b document.pdf
# With validation
pdftopdfa -v document.pdf
# Skip any existing veraPDF-compliant PDF/A
pdftopdfa --skip-any-pdfa document.pdf
# Convert an entire directory
pdftopdfa -r ./documents/ ./output/
# OCR for scanned PDFs
pdftopdfa --ocr document.pdf
# Preserve known proprietary stamps as PDF Stamp annotations
pdftopdfa --preserve-stamps document.pdf
from pathlib import Path

from pdftopdfa import convert_to_pdfa

result = convert_to_pdfa(
    input_path=Path("input.pdf"),
    output_path=Path("output.pdf"),
    level="2b",
)
See docs/usage.md for the full CLI reference, Python API documentation, and examples.
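For library users who want batch behavior similar to the CLI's directory mode, the single-file call above can be wrapped in a loop. The following is a minimal sketch, not part of pdftopdfa: plan_outputs and convert_tree are hypothetical helpers, the converter is passed in as a parameter, and the sketch assumes that the converter raises an exception when a file cannot be converted. Only the keyword arguments input_path, output_path, and level are taken from the example above.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Iterator


def plan_outputs(
    src_dir: Path, dst_dir: Path, recursive: bool = False
) -> Iterator[tuple[Path, Path]]:
    """Yield (input, output) pairs that mirror src_dir's layout under dst_dir."""
    # Note: the pattern is matched case-sensitively on most Linux filesystems.
    pattern = "**/*.pdf" if recursive else "*.pdf"
    for pdf in sorted(src_dir.glob(pattern)):
        yield pdf, dst_dir / pdf.relative_to(src_dir)


def convert_tree(
    src_dir: Path,
    dst_dir: Path,
    convert: Callable[..., object],
    level: str = "2b",
    recursive: bool = False,
) -> dict[Path, Exception | None]:
    """Convert every PDF in src_dir; map each input to None or its error."""
    outcomes: dict[Path, Exception | None] = {}
    for src, dst in plan_outputs(src_dir, dst_dir, recursive):
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            convert(input_path=src, output_path=dst, level=level)
            outcomes[src] = None
        except Exception as exc:  # assumption: failures surface as exceptions
            # Record the error and continue so one bad file does not stop the batch.
            outcomes[src] = exc
    return outcomes


# Intended usage (requires pdftopdfa to be installed):
#   from pdftopdfa import convert_to_pdfa
#   outcomes = convert_tree(Path("documents"), Path("output"),
#                           convert_to_pdfa, recursive=True)
```

Passing the converter as an argument keeps the loop independent of the library, which also makes it easy to exercise with a stand-in function in tests.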
Limitations
- No PDF/A-1 support -- only PDF/A-2 and PDF/A-3 levels are supported
- Encrypted PDFs -- password-protected PDFs cannot be converted
- Font replacement -- fonts without a suitable metrically compatible replacement produce a warning; the resulting file may not be fully compliant
Font Sourcing
- On Windows, pdftopdfa may automatically embed a conservative fixed allowlist of local fonts from %WINDIR%\Fonts.
- A Windows system font is only used when the installed file lives under %WINDIR%\Fonts, its actual PostScript name is allowlisted, and its OpenType fsType permits outline embedding.
- On macOS and Linux, system fonts are never auto-embedded; bundled replacement fonts are used instead.
- fsType checks are a technical safeguard only and do not replace the font vendor's EULA or other license terms.
- For auditable deployments, keep the allowlist tied to reviewed target systems or golden images.
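The three conditions for using a Windows system font can be sketched as a pure function. This is an illustration under stated assumptions, not pdftopdfa's code: permits_outline_embedding and may_auto_embed are hypothetical names, the allowlist contents are supplied by the caller, and the fsType interpretation follows the OpenType OS/2 table (bit 1 marks restricted-license embedding, bit 9 allows bitmap embedding only).

```python
from pathlib import PureWindowsPath
from typing import AbstractSet

RESTRICTED = 0x0002   # "Restricted License embedding"
BITMAP_ONLY = 0x0200  # only bitmaps may be embedded, never outlines
USAGE_MASK = 0x000F   # bits 0-3 hold the usage permissions


def permits_outline_embedding(fs_type: int) -> bool:
    """Interpret an OS/2 fsType value conservatively."""
    usage = fs_type & USAGE_MASK
    # Older fonts may set several usage bits; the least restrictive one
    # then applies, so "restricted" only counts when it stands alone.
    restricted = usage == RESTRICTED
    return not restricted and not (fs_type & BITMAP_ONLY)


def may_auto_embed(
    font_path: str,
    postscript_name: str,
    fs_type: int,
    fonts_dir: str,
    allowlist: AbstractSet[str],
) -> bool:
    """Apply all three conditions: location, allowlisted name, fsType."""
    # Pure path comparison only; a real check should resolve the path
    # first so that ".." segments cannot escape the fonts directory.
    in_fonts_dir = PureWindowsPath(font_path).is_relative_to(
        PureWindowsPath(fonts_dir)
    )
    return (
        in_fonts_dir
        and postscript_name in allowlist
        and permits_outline_embedding(fs_type)
    )
```

With fonttools, the inputs could be read from an installed font via TTFont(path)["OS/2"].fsType and the name table's PostScript name (name ID 6). As noted above, passing such a check says nothing about whether the font's license permits embedding.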
Development
pip install -e ".[dev]"
Running Tests
pytest
The test suite contains 2600+ tests covering fonts, color profiles, metadata, sanitization, and end-to-end conversion.
Code Quality
ruff check src/ tests/ # Linting
ruff format src/ tests/ # Formatting
Documentation
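Additional documentation is available in the docs/ folder, including docs/usage.md (CLI reference, Python API, and examples) and docs/ocr.md (OCR usage and quality presets).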
Contributing
Contributions are welcome! Please open an issue to report bugs or suggest features, or submit a pull request.
Dependencies
Core:
- pikepdf -- PDF manipulation (based on QPDF)
- lxml -- XMP metadata processing
- fonttools -- Font analysis, subsetting, and embedding
- click -- CLI framework
- colorama -- Colored terminal output
- tqdm -- Progress bars
Optional:
- ocrmypdf -- OCR support (requires Tesseract)
- pypdfium2 -- PDF page rasterizer for OCR
- veraPDF -- ISO-compliant PDF/A validation
Acknowledgments
This project bundles the following resources:
- Liberation Fonts -- metrically compatible replacements for the PDF Standard 14 fonts (SIL OFL 1.1)
- Noto Sans CJK -- CJK font coverage (SIL OFL 1.1)
- Noto Sans Symbols 2 -- symbol font replacement (SIL OFL 1.1)
- STIX Two Math -- math font replacement (SIL OFL 1.1)
- sRGB2014.icc -- ICC sRGB profile (ICC)
- ISOcoated_v2_300_bas.icc -- ICC CMYK profile, FOGRA39 (zlib/libpng license)
- sGray -- compact grayscale ICC profile (CC0-1.0)
- Adobe cmap-resources -- CID-to-Unicode mapping data (BSD 3-Clause)
License
This project is licensed under the Mozilla Public License 2.0 or later (MPL-2.0+) -- see LICENSE for details.