pdfa-parser

Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF: zero-config, batteries included.

pdfa-parser is a Python library (Python ≥ 3.10) that wraps GhostScript for PDF → PDF/A conversion and VeraPDF for conformance validation. All external tools are downloaded automatically on first use — just pip install and go.

from pdfa_parser import create_parser

parser = create_parser()
parser.convert("input.pdf", "output.pdf")

result = parser.validate("output.pdf")
print(result.compliant)  # True

Features

  • PDF → PDF/A conversion via GhostScript (levels 1, 2, 3).
  • PDF/A validation via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
  • Zero config — GhostScript, Java (JRE), and VeraPDF are resolved automatically (system PATH → apt-get → binary download).
  • Works in Docker: pip install pdfa-parser in a bare python:3.x-slim image is all you need.
  • Sync & async — every public method has an a_ async counterpart.
  • Factory function (create_parser()) for instant quick start.
  • Adapter pattern — swap GhostScript / VeraPDF for any CLI tool by implementing IBaseAdapter.
  • CLI: pdfa-parser input.pdf output.pdf or python -m pdfa_parser.
  • Typed — ships with py.typed marker and full type annotations.

Installation

pip install pdfa-parser

That's it. No system packages to install, no manual binary setup.

Development install (with test dependencies):

pip install -e ".[dev]"

Quick start

Python API

from pdfa_parser import create_parser

# Create a parser (GhostScript + VeraPDF are auto-resolved)
parser = create_parser()

# Convert a PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")

# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant)   # True / False
print(result.profile)     # "PDF/A-2B validation profile"

# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant

Tip: PdfaParser is a convenience alias for PdfParser — both work:

from pdfa_parser import PdfaParser          # alias
from pdfa_parser import PdfParser           # canonical name
from pdfa_parser import create_parser       # recommended factory

Conversion only (no VeraPDF)

parser = create_parser(with_verapdf=False)
parser.convert("input.pdf", "output.pdf")

Async API

Every method has an a_ prefixed async twin:

import asyncio
from pdfa_parser import create_parser

async def main():
    parser = create_parser()
    await parser.a_convert("input.pdf", "output.pdf")
    result = await parser.a_validate("output.pdf")
    print(result.compliant)

asyncio.run(main())
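
Because the async twins are coroutines, several conversions can be scheduled together. The sketch below assumes only the documented a_convert(input, output) call; convert_many and max_parallel are hypothetical names for illustration and are not part of pdfa-parser. Whether concurrent GhostScript runs are faster on a given machine is not something this README states, so treat the limit as a tuning knob.

```python
import asyncio


async def convert_many(parser, jobs, max_parallel=2):
    """Run parser.a_convert for each (input, output) pair.

    `parser` is any object exposing the documented a_convert(input, output)
    coroutine. At most `max_parallel` conversions run at the same time.
    Returns one entry per job, in job order: either the awaited result or
    the exception that the conversion raised.
    """
    gate = asyncio.Semaphore(max_parallel)

    async def run_one(src, dst):
        # The semaphore caps how many conversions are in flight.
        async with gate:
            return await parser.a_convert(src, dst)

    tasks = [run_one(src, dst) for src, dst in jobs]
    # return_exceptions=True keeps one failed file from cancelling the rest.
    return await asyncio.gather(*tasks, return_exceptions=True)
```

With a real parser this could be called as asyncio.run(convert_many(create_parser(), [("a.pdf", "a_pdfa.pdf"), ("b.pdf", "b_pdfa.pdf")])).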

CLI

# Basic conversion
pdfa-parser input.pdf output.pdf

# With validation
pdfa-parser input.pdf output.pdf --validate

# PDF/A level 1, flavour 1b
pdfa-parser input.pdf output.pdf --level 1 --validate --flavour 1b

# Also works as a module
python -m pdfa_parser input.pdf output.pdf --validate
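
When the CLI has to be driven from another Python process, the command line can be assembled from the flags shown above. This is a sketch: build_cli_command and run_cli are hypothetical helper names, only the --level, --validate and --flavour flags from this section are used, and the README does not document exit codes, so the caller has to inspect the returned process object.

```python
import subprocess


def build_cli_command(src, dst, *, level=None, validate=False, flavour=None):
    """Assemble a pdfa-parser command line from the flags shown above."""
    cmd = ["pdfa-parser", str(src), str(dst)]
    if level is not None:
        cmd += ["--level", str(level)]
    if validate:
        cmd.append("--validate")
    if flavour is not None:
        cmd += ["--flavour", str(flavour)]
    return cmd


def run_cli(src, dst, **options):
    """Run the CLI and return the CompletedProcess without raising on failure."""
    return subprocess.run(
        build_cli_command(src, dst, **options),
        capture_output=True,  # keep stdout/stderr for the caller to inspect
        text=True,
        check=False,
    )
```

For example, build_cli_command("input.pdf", "output.pdf", level=1, validate=True, flavour="1b") reproduces the third command above as an argument list.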

How dependency resolution works

On first use, the library checks for each tool in this order:

| Tool | 1. System PATH | 2. Package manager | 3. Download |
|---|---|---|---|
| GhostScript | gs / gswin64c | apt-get install ghostscript | GitHub archive (fallback) |
| Java (JRE) | java | - | Adoptium Temurin 21 |
| VeraPDF | - | - | Maven Central JAR |

  • Binaries are stored in ~/.local/share/pdfa-parser/bin/ (or src/bin/ during development).
  • The JRE and VeraPDF JAR are downloaded once and reused across runs.
  • You can force a specific binary by setting the adapter path manually (see Advanced usage).
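
The ordered lookup described above can be pictured as a chain of strategies that is walked until one yields a path. The following is a minimal sketch of that idea using only the standard library; it is not the library's implementation, and Strategy, from_path and resolve are hypothetical names. Only the PATH step is spelled out, since the package-manager and download steps depend on details the README does not give.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional

# A strategy returns the binary's path, or None if it cannot provide one.
Strategy = Callable[[], Optional[Path]]


def from_path(*names: str) -> Strategy:
    """Step 1: look for any of the given executable names on the system PATH."""
    def strategy() -> Optional[Path]:
        for name in names:
            found = shutil.which(name)
            if found:
                return Path(found)
        return None
    return strategy


def resolve(strategies: Iterable[Strategy]) -> Path:
    """Try each strategy in order and return the first path produced."""
    for strategy in strategies:
        path = strategy()
        if path is not None:
            return path  # later (more expensive) strategies are never called
    raise FileNotFoundError("no strategy could locate the binary")
```

A GhostScript chain would then read resolve([from_path("gs", "gswin64c"), install_step, download_step]), where the last two are placeholders for the package-manager and download strategies.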

Public API reference

Top-level imports

from pdfa_parser import (
    create_parser,      # Factory — recommended entry point
    PdfParser,          # Core class (canonical name)
    PdfaParser,         # Alias for PdfParser
    ValidationResult,   # Dataclass returned by validate()
    DependencyManager,  # Manual dependency orchestration
    # For custom adapters:
    IBaseAdapter,
    BinaryExecuter,
    GhostScriptAdapter,
    VeraPDFAdapter,
)

create_parser(**kwargs) → PdfParser

| Parameter | Type | Default | Description |
|---|---|---|---|
| pdfa_level | int | 2 | PDF/A conformance level (1, 2, 3) |
| with_verapdf | bool | True | Attach VeraPDF for validation |
| extra_gs_args | tuple[str, ...] | () | Extra flags for every GhostScript call |

PdfParser methods

| Method | Returns | Description |
|---|---|---|
| convert(input, output) | Path | Convert PDF to PDF/A |
| validate(file, *, flavour) | ValidationResult | Check PDF/A compliance via VeraPDF |
| convert_and_validate(…) | ValidationResult | Convert then validate in one call |
| a_convert(…) | Path | Async convert |
| a_validate(…) | ValidationResult | Async validate |
| a_convert_and_validate(…) | ValidationResult | Async convert + validate |

All path parameters accept both str and pathlib.Path.

ValidationResult

| Field | Type | Description |
|---|---|---|
| compliant | bool | True if the PDF satisfies the profile |
| profile | str | Profile name (e.g. "PDF/A-2B …") |
| details | str | Raw XML snippet for debugging |
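
The three fields are enough to build a short report line for logs. The sketch below relies only on the attribute names listed above; ResultStandIn is a stand-in used to keep the snippet runnable without the package (it is not the library's ValidationResult), and describe is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class ResultStandIn:
    """Stand-in carrying the three documented fields; not the library's class."""
    compliant: bool
    profile: str
    details: str


def describe(result, max_details=200):
    """Return a one-line status, plus a details excerpt when validation failed."""
    status = "PASS" if result.compliant else "FAIL"
    line = f"{status} [{result.profile}]"
    if not result.compliant and result.details:
        # The details field is raw XML, so only a bounded excerpt is appended.
        line += "\n" + result.details[:max_details]
    return line
```

Any object with compliant, profile and details attributes works here, so the value returned by parser.validate(...) can be passed in directly.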

Advanced usage

Custom adapters

from pdfa_parser import IBaseAdapter, BinaryExecuter, PdfParser
from pathlib import Path

class MyGSAdapter(IBaseAdapter):
    def get_binary_path(self) -> Path:
        return Path("/opt/gs-10/bin/gs")

parser = PdfParser(
    gs_executer=BinaryExecuter(MyGSAdapter()),
    pdfa_level=3,
    extra_gs_args=("-dQUIET",),
)
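
The adapter above hard-codes its path. A get_binary_path implementation could instead delegate to a lookup such as the one sketched here, which prefers an environment variable and falls back to the system PATH. This uses the standard library only; locate_gs and the PDFA_GS_BINARY variable name are hypothetical and are not defined by pdfa-parser.

```python
import os
import shutil
from pathlib import Path


def locate_gs(env_var="PDFA_GS_BINARY", names=("gs", "gswin64c")):
    """Return a GhostScript path from env_var if set, else the first name on PATH.

    PDFA_GS_BINARY is an illustrative variable name, not one the library reads.
    """
    override = os.environ.get(env_var)
    if override:
        return Path(override)
    for name in names:
        found = shutil.which(name)
        if found:
            return Path(found)
    raise FileNotFoundError(f"GhostScript not found; set {env_var} or install it")
```

Inside a custom adapter, get_binary_path would then simply return locate_gs().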

Manual dependency management

from pdfa_parser import DependencyManager

m = DependencyManager()

# Check availability without downloading
print(m.ghostscript.is_available())  # True / False
print(m.verapdf.is_available())

# Force download / resolution
gs_path = m.ensure_ghostscript()
verapdf_path = m.ensure_verapdf()

Project structure

pdfa-parser/
├── src/pdfa_parser/
│   ├── __init__.py             # Public API, create_parser(), PdfaParser alias
│   ├── __main__.py             # python -m pdfa_parser
│   ├── main.py                 # CLI entry-point
│   ├── pdf_parser.py           # PdfParser – convert / validate
│   ├── settings.py             # Lazy binary-path resolution
│   ├── data/
│   │   ├── PDFA_def.ps         # Bundled PostScript for PDF/A OutputIntent
│   │   └── srgb.icc            # Bundled sRGB ICC profile
│   ├── dependencies/
│   │   ├── _base.py            # Dependency / ResolutionStrategy ABCs
│   │   ├── _ghostscript.py     # GhostScript strategies
│   │   ├── _jre.py             # JRE (Adoptium) strategies
│   │   ├── _verapdf.py         # VeraPDF (Maven JAR) strategies
│   │   └── _manager.py         # DependencyManager orchestrator
│   ├── interfaces/
│   │   ├── base_adapter.py     # IBaseAdapter (ABC)
│   │   └── binary_executer.py  # BinaryExecuter (facade)
│   └── implementations/
│       ├── ghostscript_adapter.py
│       └── verapdf_adapter.py
├── tests/
│   ├── conftest.py             # Fixtures, skip markers, PDF generation
│   ├── test_unit.py            # Unit tests (no binaries needed)
│   ├── test_integration.py     # Integration tests (real binaries)
│   ├── test_sample_files.py    # Tests for bundled sample PDFs
│   ├── test_dependencies.py    # Dependency resolution tests
│   └── files/
│       ├── sample_pdf.pdf      # Regular PDF sample
│       └── sample_pdfa.pdf     # PDF/A sample
├── pyproject.toml
├── LICENSE
└── README.md

Testing

# Everything (integration tests auto-skip if binaries are missing)
pytest -v

# Unit tests only (no binaries required)
pytest tests/test_unit.py -v

# Integration + sample file tests
pytest tests/test_integration.py tests/test_sample_files.py -v

Test suites

| Suite | Tests | Requires binaries | What it covers |
|---|---|---|---|
| test_unit.py | 26 | No | Helpers, XML parsing, arg building, mocked convert/validate, async, factory |
| test_dependencies.py | 38 | No | Dependency resolution strategies, DependencyManager, backward-compat shim |
| test_integration.py | 20 | Yes | Real GS conversion, VeraPDF validation, round-trip, async, multiple PDF types |
| test_sample_files.py | 7 | Yes | Bundled sample PDFs: conversion, validation, round-trip (sync + async) |

Integration tests generate PDFs using reportlab (portrait, landscape, coloured shapes, multi-page, text-heavy) and run them through the full GhostScript → VeraPDF pipeline. Tests are auto-skipped when binaries are not available.

Docker smoke test

# Build the wheel
uv build --wheel

# Run in a clean Python container
docker run --rm \
  -v $PWD/dist:/dist \
  -v $PWD/tests:/tests \
  python:3.13-slim \
  bash -c "pip install /dist/*.whl && python /tests/docker_smoke_test.py"

Requirements

| Requirement | Version | Notes |
|---|---|---|
| Python | ≥ 3.10 | No runtime dependencies beyond the standard library |
| GhostScript | any | Taken from the system PATH if present, otherwise auto-installed (apt-get, then GitHub archive fallback) |
| Java (JRE) | ≥ 11 | Auto-downloaded from Adoptium if missing |
| VeraPDF | 1.26.5 | Auto-downloaded from Maven Central |

License

GPLv3+
