Convert PDF files to the archival PDF/A format

Maintainers

ired

pdftopdfa

I built pdftopdfa as a free and open-source alternative to Ghostscript-based PDF/A converters. Ghostscript uses a dual license (AGPL/commercial) that makes it difficult to use in commercial products without purchasing a license. pdftopdfa is licensed under the permissive MPL-2.0 and can be freely used in commercial projects. Instead of re-rendering via Ghostscript, it modifies the PDF structure directly using pikepdf (based on QPDF), preserving the original content, fonts, and layout.

Highlights

No Ghostscript required -- direct PDF manipulation via pikepdf/QPDF
PDF/A-2b, 2u, 3b, 3u -- supports modern PDF/A levels (ISO 19005-2 and 19005-3)
Automatic font embedding -- uses policy-approved Windows system fonts or bundled replacements
Font subsetting -- reduces file size by removing unused glyphs
CJK support -- embeds Noto Sans CJK for Chinese, Japanese, and Korean text
ICC color profiles -- automatically embeds sRGB, CMYK, and grayscale profiles
Batch processing -- converts entire directories, optionally recursive
Integrated validation -- checks conformance via veraPDF
OCR support -- optional text recognition for scanned PDFs via Tesseract
Simple API -- usable as CLI tool or Python library

How It Works

pdftopdfa applies a multi-step conversion pipeline to make a PDF compliant with the PDF/A standard:

Pre-check -- detects whether the PDF is already a valid PDF/A file. Conversion is skipped if the existing level meets or exceeds the target; with --skip-any-pdfa, any veraPDF-compliant PDF/A is skipped (see the Usage Guide for details)
OCR (optional) -- runs Tesseract via ocrmypdf on scanned pages without a text layer
Font compliance -- analyzes all fonts, embeds missing ones, adds ToUnicode mappings, subsets embedded fonts, and fixes encoding issues
Sanitization -- removes or fixes non-compliant elements (JavaScript, non-standard actions, transparency groups, annotations, optional content, etc.)
Metadata -- synchronizes XMP metadata with the document info dictionary and sets the PDF/A conformance level
Color profiles -- detects color spaces and embeds the required ICC profiles (sRGB, CMYK/FOGRA39, sGray)
Save -- writes the output with the correct PDF version header

Installation

Prerequisites

Python 3.12, 3.13, or 3.14
macOS, Linux, or Windows

pip install pdftopdfa

Optional: OCR support

pip install "pdftopdfa[ocr]"

OCR requires a Tesseract installation on the system. See docs/ocr.md for details on OCR usage and quality presets.
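
Before enabling OCR, it can help to confirm that Tesseract is actually reachable. The following is a minimal sketch using only the standard library; the function name `tesseract_available` is hypothetical, the `which` parameter exists only to make the check testable, and pdftopdfa or ocrmypdf may locate Tesseract in a different way.

```python
import shutil


def tesseract_available(which=shutil.which) -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return which("tesseract") is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found; OCR can be attempted.")
    else:
        print("Tesseract not found; install it before using --ocr.")
```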

Quick Start

# Simple conversion (creates document_pdfa.pdf)
pdftopdfa document.pdf

# Specific PDF/A level
pdftopdfa -l 2b document.pdf

# With validation
pdftopdfa -v document.pdf

# Skip any existing veraPDF-compliant PDF/A
pdftopdfa --skip-any-pdfa document.pdf

# Convert an entire directory
pdftopdfa -r ./documents/ ./output/

# OCR for scanned PDFs
pdftopdfa --ocr document.pdf

# Python library usage
from pathlib import Path
from pdftopdfa import convert_to_pdfa

result = convert_to_pdfa(
    input_path=Path("input.pdf"),
    output_path=Path("output.pdf"),
    level="2b",
)

See docs/usage.md for the full CLI reference, Python API documentation, and examples.
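
For batch work from Python, the directory walk can be kept separate from the conversion call. The sketch below is one possible approach, not the library's own batch logic: `plan_batch` and `convert_all` are hypothetical helpers, and mirroring the input tree under the output directory with unchanged file names is an assumption. The only pdftopdfa call relied on is `convert_to_pdfa(input_path=..., output_path=..., level=...)` as shown in the Quick Start.

```python
from pathlib import Path


def plan_batch(input_dir: Path, output_dir: Path,
               recursive: bool = False) -> list[tuple[Path, Path]]:
    """Pair every PDF under input_dir with a mirrored path under output_dir."""
    pattern = "**/*.pdf" if recursive else "*.pdf"
    pairs = []
    for src in sorted(input_dir.glob(pattern)):
        relative = src.relative_to(input_dir)
        pairs.append((src, output_dir / relative))
    return pairs


def convert_all(pairs, convert, level: str = "2b") -> None:
    """Call convert(input_path=..., output_path=..., level=...) for each pair."""
    for src, dst in pairs:
        # Create the target folder first so nested outputs have a place to go.
        dst.parent.mkdir(parents=True, exist_ok=True)
        convert(input_path=src, output_path=dst, level=level)
```

Passing `convert_to_pdfa` as the `convert` argument, for example `convert_all(plan_batch(Path("documents"), Path("output"), recursive=True), convert_to_pdfa)`, would run the conversion for each planned pair. Keeping the converter as a parameter also makes the planning step easy to test without any PDFs.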

Limitations

No PDF/A-1 support -- only PDF/A-2 and PDF/A-3 levels are supported
Encrypted PDFs -- password-protected PDFs cannot be converted
Font replacement -- fonts without a suitable metrically compatible replacement produce a warning; the resulting file may not be fully compliant
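
Since encrypted PDFs cannot be converted, a batch job may want to filter them out early. The sketch below is a rough standard-library heuristic, not how pdftopdfa detects encryption: the helper `looks_encrypted` is hypothetical and only searches the raw bytes for an `/Encrypt` key, so it can misfire on unusual files and it also flags PDFs encrypted with an empty user password. Opening the file with a PDF library such as pikepdf gives a definitive answer.

```python
import re

# Match the /Encrypt key but not longer names such as /EncryptMetadata.
_ENCRYPT_KEY = re.compile(rb"/Encrypt\b")


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes contain an /Encrypt key."""
    return _ENCRYPT_KEY.search(pdf_bytes) is not None
```

A typical call would be `looks_encrypted(Path("input.pdf").read_bytes())`, with `Path` imported from `pathlib`.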

Font Sourcing

On Windows, pdftopdfa may automatically embed a conservative fixed allowlist of local fonts from %WINDIR%\Fonts.
A Windows system font is only used when the installed file lives under %WINDIR%\Fonts, its actual PostScript name is allowlisted, and its OpenType fsType permits outline embedding.
On macOS and Linux, system fonts are never auto-embedded; bundled replacement fonts are used instead.
fsType checks are a technical safeguard only and do not replace the font vendor's EULA or other license terms.
For auditable deployments, keep the allowlist tied to reviewed target systems or golden images.
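
The three conditions for using a Windows system font can be expressed as a single predicate. This is a sketch under stated assumptions, not pdftopdfa's implementation: the function names, the example allowlist entries, and the default fonts directory (`C:\Windows\Fonts` standing in for %WINDIR%\Fonts) are all hypothetical. The fsType bit values come from the OpenType OS/2 table; treating either blocking bit as a rejection is a deliberately conservative reading.

```python
from pathlib import PureWindowsPath

# OS/2 fsType bits from the OpenType specification.
RESTRICTED_LICENSE = 0x0002  # embedding not permitted without permission
BITMAP_ONLY = 0x0200         # only bitmaps may be embedded, not outlines

# Hypothetical allowlist entries; the real list ships with pdftopdfa.
EXAMPLE_ALLOWLIST = frozenset({"ArialMT", "TimesNewRomanPSMT"})


def fstype_permits_outlines(fs_type: int) -> bool:
    """Conservative reading: reject if either blocking bit is set."""
    return not (fs_type & (RESTRICTED_LICENSE | BITMAP_ONLY))


def may_embed(font_path: str, postscript_name: str, fs_type: int,
              fonts_dir: str = r"C:\Windows\Fonts",
              allowlist: frozenset[str] = EXAMPLE_ALLOWLIST) -> bool:
    """Apply the three conditions: location, allowlisted name, fsType."""
    under_fonts_dir = PureWindowsPath(font_path).is_relative_to(
        PureWindowsPath(fonts_dir)
    )
    return (under_fonts_dir
            and postscript_name in allowlist
            and fstype_permits_outlines(fs_type))
```

As noted above, passing this kind of check says nothing about the font's license terms; it only filters out fonts that are technically marked as non-embeddable.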

Development

pip install -e ".[dev]"

Running Tests

pytest

The test suite contains 2600+ tests covering fonts, color profiles, metadata, sanitization, and end-to-end conversion.

Code Quality

ruff check src/ tests/   # Linting
ruff format src/ tests/  # Formatting

Documentation

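Additional documentation is available in the docs/ folder, including docs/usage.md (CLI reference, Python API documentation, and examples) and docs/ocr.md (OCR usage and quality presets).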

Contributing

Contributions are welcome! Please open an issue to report bugs or suggest features, or submit a pull request.

Dependencies

Core:

pikepdf -- PDF manipulation (based on QPDF)
lxml -- XMP metadata processing
fonttools -- Font analysis, subsetting, and embedding
click -- CLI framework
colorama -- Colored terminal output
tqdm -- Progress bars

Optional:

ocrmypdf -- OCR support (requires Tesseract)
pypdfium2 -- PDF page rasterizer for OCR
OpenCV -- improved OCR preprocessing (deskewing, denoising)
veraPDF -- ISO-compliant PDF/A validation
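
To see which optional Python packages are present in an environment, a quick probe with `importlib` is usually enough. The sketch below is illustrative only: `missing_modules` is a hypothetical helper, and the import names (`ocrmypdf`, `pypdfium2`, `cv2`) are the ones these packages normally install under. veraPDF is a separate tool rather than a Python module, so it is not covered here.

```python
from importlib.util import find_spec

# Display name -> import name for the optional Python packages listed above.
OPTIONAL_MODULES = {
    "ocrmypdf": "ocrmypdf",
    "pypdfium2": "pypdfium2",
    "OpenCV": "cv2",
}


def missing_modules(modules: dict[str, str] = OPTIONAL_MODULES) -> list[str]:
    """Return display names of packages whose module cannot be found."""
    return [name for name, module in modules.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing optional packages:", ", ".join(missing) or "none")
```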

Acknowledgments

This project bundles the following resources:

Liberation Fonts -- metrically compatible replacements for the PDF Standard 14 fonts (SIL OFL 1.1)
Noto Sans CJK -- CJK font coverage (SIL OFL 1.1)
Noto Sans Symbols 2 -- symbol font replacement (SIL OFL 1.1)
STIX Two Math -- math font replacement (SIL OFL 1.1)
sRGB2014.icc -- ICC sRGB profile (ICC)
ISOcoated_v2_300_bas.icc -- ICC CMYK profile, FOGRA39 (zlib/libpng license)
sGray -- compact grayscale ICC profile (CC0-1.0)
Adobe cmap-resources -- CID-to-Unicode mapping data (BSD 3-Clause)

License

This project is licensed under the Mozilla Public License 2.0 or later (MPL-2.0+) -- see LICENSE for details.

Release history

0.4.3 -- Jun 1, 2026
0.4.2 -- Jun 1, 2026
0.4.1 -- Apr 30, 2026
0.4.0 -- Apr 22, 2026
0.3.15 -- Apr 22, 2026
0.3.14 -- Apr 21, 2026
0.3.13 -- Apr 21, 2026
0.3.12 -- Apr 16, 2026
0.3.11 -- Apr 16, 2026
0.3.10 -- Apr 16, 2026
0.3.9 -- Apr 15, 2026
0.3.8 -- Apr 15, 2026
0.3.7 -- Apr 14, 2026
0.3.6 -- Apr 10, 2026
0.3.5 -- Apr 10, 2026 (this version)
0.3.4 -- Apr 9, 2026
0.3.3 -- Apr 9, 2026
0.3.2 -- Apr 9, 2026
0.3.1 -- Apr 9, 2026
0.3.0 -- Apr 9, 2026
0.2.24 -- Apr 7, 2026
0.2.23 -- Apr 7, 2026
0.2.22 -- Apr 2, 2026
0.2.21 -- Apr 2, 2026
0.2.20 -- Apr 1, 2026
0.2.19 -- Apr 1, 2026
0.2.18 -- Apr 1, 2026
0.2.17 -- Apr 1, 2026
0.2.16 -- Mar 29, 2026
0.2.15 -- Mar 26, 2026
0.2.14 -- Mar 26, 2026
0.2.13 -- Mar 26, 2026
0.2.12 -- Mar 26, 2026
0.2.11 -- Mar 26, 2026
0.2.10 -- Mar 24, 2026
0.2.9 -- Mar 24, 2026
0.2.8 -- Mar 20, 2026
0.2.7 -- Mar 18, 2026
0.2.6 -- Mar 13, 2026
0.2.5 -- Mar 13, 2026
0.2.4 -- Mar 13, 2026
0.2.3 -- Mar 12, 2026
0.2.2 -- Mar 11, 2026
0.2.1 -- Feb 24, 2026
0.2.0 -- Feb 23, 2026
0.1.4 -- Feb 18, 2026
0.1.3 -- Feb 17, 2026
0.1.2 -- Feb 17, 2026
0.1.1 -- Feb 17, 2026
0.1.0 -- Feb 16, 2026

Download files

Source Distribution

pdftopdfa-0.3.5.tar.gz (20.9 MB) -- uploaded Apr 10, 2026, Source

Built Distribution

pdftopdfa-0.3.5-py3-none-any.whl (20.7 MB) -- uploaded Apr 10, 2026, Python 3

File details

Details for the file pdftopdfa-0.3.5.tar.gz.

File metadata

Download URL: pdftopdfa-0.3.5.tar.gz
Upload date: Apr 10, 2026
Size: 20.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdftopdfa-0.3.5.tar.gz
Algorithm	Hash digest
SHA256	`338be03a1fbef0a5a469cff580cbcecc7a3e998c074167c6050a3397b92f3bef`
MD5	`1a28bb40ad440edfac4de06b692725c1`
BLAKE2b-256	`62d2266275b3343bbf66252fb377bb86b1c67e59eba799ca57aff55f765da569`
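
A downloaded archive can be checked against the SHA256 digest listed above with the standard library alone. The sketch below is one straightforward way to do it; the helper names `sha256_of` and `matches` are hypothetical, and the file path in any real call is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for pdftopdfa-0.3.5.tar.gz.
EXPECTED_SHA256 = "338be03a1fbef0a5a469cff580cbcecc7a3e998c074167c6050a3397b92f3bef"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.lower()
```

For example, `matches(Path("pdftopdfa-0.3.5.tar.gz"))` should return True for an intact download of the source distribution. The same pattern works for the wheel by passing its own digest as `expected_hex`.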

Provenance

The following attestation bundles were made for pdftopdfa-0.3.5.tar.gz:

Publisher: publish.yml on iRedPaul/pdftopdfa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdftopdfa-0.3.5.tar.gz
- Subject digest: 338be03a1fbef0a5a469cff580cbcecc7a3e998c074167c6050a3397b92f3bef
- Sigstore transparency entry: 1270669750
- Sigstore integration time: Apr 10, 2026
Source repository:
- Permalink: iRedPaul/pdftopdfa@ab6559f415a3b5099a9b86dba8e4617d9e3e16f6
- Branch / Tag: refs/tags/v0.3.5
- Owner: https://github.com/iRedPaul
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ab6559f415a3b5099a9b86dba8e4617d9e3e16f6
- Trigger Event: release

File details

Details for the file pdftopdfa-0.3.5-py3-none-any.whl.

File metadata

Download URL: pdftopdfa-0.3.5-py3-none-any.whl
Upload date: Apr 10, 2026
Size: 20.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdftopdfa-0.3.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3bf1c10274a02a7f2a9b4482a96b785e50cb36d3da45227c9bc667aa2b1b0b93`
MD5	`0e4d95401ae313fcd5fc9ca03c3a4e7c`
BLAKE2b-256	`e1bf4bd83b93e1cff80889ac70bde27f06b30ae70dc89341b70ff1bb782b4525`

Provenance

The following attestation bundles were made for pdftopdfa-0.3.5-py3-none-any.whl:

Publisher: publish.yml on iRedPaul/pdftopdfa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdftopdfa-0.3.5-py3-none-any.whl
- Subject digest: 3bf1c10274a02a7f2a9b4482a96b785e50cb36d3da45227c9bc667aa2b1b0b93
- Sigstore transparency entry: 1270669779
- Sigstore integration time: Apr 10, 2026
Source repository:
- Permalink: iRedPaul/pdftopdfa@ab6559f415a3b5099a9b86dba8e4617d9e3e16f6
- Branch / Tag: refs/tags/v0.3.5
- Owner: https://github.com/iRedPaul
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ab6559f415a3b5099a9b86dba8e4617d9e3e16f6
- Trigger Event: release
