Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF.

Project description

pdfa-parser

Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF: zero-config, batteries included.

pdfa-parser is a Python library (Python ≥ 3.10) that wraps GhostScript for PDF → PDF/A conversion and VeraPDF for conformance validation. All external tools are resolved automatically on first use (found on the system PATH, installed, or downloaded), so pip install is all you need.

from pdfa_parser import create_parser

parser = create_parser()
parser.convert("input.pdf", "output.pdf")

result = parser.validate("output.pdf")
print(result.compliant)  # True

Features

  • PDF → PDF/A conversion via GhostScript (levels 1, 2, 3).
  • PDF/A validation via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
  • Zero config — GhostScript, Java (JRE), and VeraPDF are resolved automatically (system PATH → apt-get → binary download).
  • Works in Docker: pip install pdfa-parser in a bare python:3.x-slim image is all you need.
  • Sync & async — every public method has an a_ async counterpart.
  • Factory function (create_parser()) for instant quick start.
  • Adapter pattern — swap GhostScript / VeraPDF for any CLI tool by implementing IBaseAdapter.
  • CLI: pdfa-parser input.pdf output.pdf or python -m pdfa_parser.
  • Typed — ships with py.typed marker and full type annotations.

Installation

pip install pdfa-parser

That's it. No system packages to install, no manual binary setup.

Development install (with test dependencies):

pip install -e ".[dev]"

Quick start

Python API

from pdfa_parser import create_parser

# Create a parser (GhostScript + VeraPDF are auto-resolved)
parser = create_parser()

# Convert a PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")

# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant)   # True / False
print(result.profile)     # "PDF/A-2B validation profile"

# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant
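
The one-shot call above extends naturally to a whole folder. The following is a minimal sketch, not part of pdfa-parser: convert_directory is a hypothetical helper, and it assumes only that the parser exposes convert_and_validate(input, output) and that the result has a compliant field, as documented here. How conversion failures surface (exceptions or otherwise) is not covered by this README, so the sketch does not handle them.

```python
from pathlib import Path


def convert_directory(parser, src_dir, dst_dir):
    """Convert every *.pdf in src_dir and report compliance per file.

    `parser` is any object exposing convert_and_validate(input, output),
    for example the one returned by create_parser().
    Hypothetical helper, not part of pdfa-parser.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    report = {}
    # sorted() keeps the processing order stable across runs
    for src in sorted(src_dir.glob("*.pdf")):
        result = parser.convert_and_validate(src, dst_dir / src.name)
        report[src.name] = result.compliant
    return report
```

Calling convert_directory(create_parser(), "incoming", "archive") would return a dict mapping each file name to True or False.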

Tip: PdfaParser is a convenience alias for PdfParser — both work:

from pdfa_parser import PdfaParser          # alias
from pdfa_parser import PdfParser           # canonical name
from pdfa_parser import create_parser       # recommended factory

Conversion only (no VeraPDF)

parser = create_parser(with_verapdf=False)
parser.convert("input.pdf", "output.pdf")

Async API

Every method has an async twin prefixed with a_:

import asyncio
from pdfa_parser import create_parser

async def main():
    parser = create_parser()
    await parser.a_convert("input.pdf", "output.pdf")
    result = await parser.a_validate("output.pdf")
    print(result.compliant)

asyncio.run(main())
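
The async twins make it possible to overlap several conversions. The sketch below is an illustration under assumptions: convert_many and its limit parameter are hypothetical, and whether a single parser instance is safe to use for concurrent a_convert calls is not stated in this README, so verify that before relying on it.

```python
import asyncio


async def convert_many(parser, pairs, limit=2):
    """Run parser.a_convert for each (input, output) pair, at most `limit` at a time.

    Hypothetical helper, not part of pdfa-parser.
    """
    semaphore = asyncio.Semaphore(limit)

    async def convert_one(src, dst):
        # The semaphore caps how many GhostScript processes run at once
        async with semaphore:
            return await parser.a_convert(src, dst)

    # gather() returns results in the same order as `pairs`
    return await asyncio.gather(*(convert_one(src, dst) for src, dst in pairs))
```

Usage would look like asyncio.run(convert_many(create_parser(), [("a.pdf", "a_pdfa.pdf"), ("b.pdf", "b_pdfa.pdf")])).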

CLI

# Basic conversion
pdfa-parser input.pdf output.pdf

# With validation
pdfa-parser input.pdf output.pdf --validate

# PDF/A level 1, flavour 1b
pdfa-parser input.pdf output.pdf --level 1 --validate --flavour 1b

# Also works as a module
python -m pdfa_parser input.pdf output.pdf --validate
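
When the CLI is driven from another Python process, the argument list can be built programmatically. This is a sketch using only the flags shown above (--level, --validate, --flavour); build_cli_args and run_cli are hypothetical names, and the CLI's exit codes are not documented here, so the caller is left to inspect returncode.

```python
import subprocess


def build_cli_args(src, dst, *, level=None, validate=False, flavour=None):
    """Build the pdfa-parser command line from the flags shown in this README."""
    args = ["pdfa-parser", str(src), str(dst)]
    if level is not None:
        args += ["--level", str(level)]
    if validate:
        args.append("--validate")
    if flavour is not None:
        args += ["--flavour", str(flavour)]
    return args


def run_cli(src, dst, **options):
    """Run the CLI and return the CompletedProcess without raising on failure."""
    # check=False: exit code meanings are an unknown here, so nothing is assumed
    return subprocess.run(
        build_cli_args(src, dst, **options),
        capture_output=True,
        text=True,
        check=False,
    )
```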

How dependency resolution works

On first use, the library checks for each tool in this order:

Tool          1. System PATH   2. Package manager            3. Download
GhostScript   gs / gswin64c    apt-get install ghostscript   GitHub archive (fallback)
Java (JRE)    java             -                             Adoptium Temurin 21
VeraPDF       -                -                             Maven Central JAR

  • Binaries are stored in ~/.local/share/pdfa-parser/bin/ (or src/bin/ during development).
  • The JRE and VeraPDF JAR are downloaded once and reused across runs.
  • You can force a specific binary by setting the adapter path manually (see Advanced usage).
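
The ordered lookup described above can be pictured as a chain of strategies where the first one that yields a path wins. The code below is a conceptual sketch built on the standard library only; the names Strategy, from_path, and resolve are hypothetical and do not mirror the library's internal classes.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional

# A strategy returns the binary's path, or None if it cannot provide one
Strategy = Callable[[], Optional[Path]]


def from_path(*names: str) -> Strategy:
    """Strategy that looks for any of the given executable names on the system PATH."""
    def strategy() -> Optional[Path]:
        for name in names:
            found = shutil.which(name)
            if found:
                return Path(found)
        return None
    return strategy


def resolve(strategies: Iterable[Strategy]) -> Path:
    """Try each strategy in order and return the first path found."""
    for strategy in strategies:
        path = strategy()
        if path is not None:
            return path  # later (more expensive) strategies are never run
    raise FileNotFoundError("no strategy could locate the binary")
```

For GhostScript, the chain would start with from_path("gs", "gswin64c"), followed by install and download strategies that are omitted here.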

Public API reference

Top-level imports

from pdfa_parser import (
    create_parser,      # Factory — recommended entry point
    PdfParser,          # Core class (canonical name)
    PdfaParser,         # Alias for PdfParser
    ValidationResult,   # Dataclass returned by validate()
    DependencyManager,  # Manual dependency orchestration
    # For custom adapters:
    IBaseAdapter,
    BinaryExecuter,
    GhostScriptAdapter,
    VeraPDFAdapter,
)

create_parser(**kwargs) → PdfParser

Parameter       Type              Default   Description
pdfa_level      int               2         PDF/A conformance level (1, 2, 3)
with_verapdf    bool              True      Attach VeraPDF for validation
extra_gs_args   tuple[str, ...]   ()        Extra flags for every GhostScript call

PdfParser methods

Method                       Returns            Description
convert(input, output)       Path               Convert PDF to PDF/A
validate(file, *, flavour)   ValidationResult   Check PDF/A compliance via VeraPDF
convert_and_validate(…)      ValidationResult   Convert then validate in one call
a_convert(…)                 Path               Async convert
a_validate(…)                ValidationResult   Async validate
a_convert_and_validate(…)    ValidationResult   Async convert + validate

All path parameters accept both str and pathlib.Path.

ValidationResult

Field       Type   Description
compliant   bool   True if the PDF satisfies the profile
profile     str    Profile name (e.g. "PDF/A-2B …")
details     str    Raw XML snippet for debugging
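
The three fields are enough to build a short report for logs. A minimal sketch, assuming only the fields listed above; describe and max_details are hypothetical names, and the sketch treats details as an opaque string rather than parsing the XML.

```python
def describe(result, max_details=500):
    """Return a short, log-friendly summary of a validation result.

    `result` is any object with compliant, profile, and details attributes.
    Hypothetical helper, not part of pdfa-parser.
    """
    status = "compliant" if result.compliant else "NOT compliant"
    lines = [f"{status} ({result.profile})"]
    # Only include the raw XML excerpt when there is something to debug
    if not result.compliant and result.details:
        lines.append(result.details[:max_details])
    return "\n".join(lines)
```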

Advanced usage

Custom adapters

from pdfa_parser import IBaseAdapter, BinaryExecuter, PdfParser
from pathlib import Path

class MyGSAdapter(IBaseAdapter):
    def get_binary_path(self) -> Path:
        return Path("/opt/gs-10/bin/gs")

parser = PdfParser(
    gs_executer=BinaryExecuter(MyGSAdapter()),
    pdfa_level=3,
    extra_gs_args=("-dQUIET",),
)
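
Instead of hard-coding the path, an adapter's get_binary_path() could delegate to a small lookup function. The sketch below assumes a hypothetical PDFA_GS_BINARY environment variable (pdfa-parser itself is not known to read it) and falls back to the gs / gswin64c names on the system PATH; locate_gs is likewise a hypothetical name.

```python
import os
import shutil
from pathlib import Path


def locate_gs(env_var="PDFA_GS_BINARY"):
    """Return a GhostScript path from an environment override or the system PATH.

    PDFA_GS_BINARY is a made-up variable name used for illustration only.
    """
    override = os.environ.get(env_var)
    if override:
        path = Path(override)
        if not path.is_file():
            raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
        return path

    # Same executable names as in the dependency resolution table
    for name in ("gs", "gswin64c"):
        found = shutil.which(name)
        if found:
            return Path(found)
    raise FileNotFoundError("GhostScript not found on PATH")
```

With this in place, MyGSAdapter.get_binary_path() could simply return locate_gs().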

Manual dependency management

from pdfa_parser import DependencyManager

m = DependencyManager()

# Check availability without downloading
print(m.ghostscript.is_available())  # True / False
print(m.verapdf.is_available())

# Force download / resolution
gs_path = m.ensure_ghostscript()
verapdf_path = m.ensure_verapdf()

Project structure

pdfa-parser/
├── src/pdfa_parser/
│   ├── __init__.py             # Public API, create_parser(), PdfaParser alias
│   ├── __main__.py             # python -m pdfa_parser
│   ├── main.py                 # CLI entry-point
│   ├── pdf_parser.py           # PdfParser – convert / validate
│   ├── settings.py             # Lazy binary-path resolution
│   ├── data/
│   │   ├── PDFA_def.ps         # Bundled PostScript for PDF/A OutputIntent
│   │   └── srgb.icc            # Bundled sRGB ICC profile
│   ├── dependencies/
│   │   ├── _base.py            # Dependency / ResolutionStrategy ABCs
│   │   ├── _ghostscript.py     # GhostScript strategies
│   │   ├── _jre.py             # JRE (Adoptium) strategies
│   │   ├── _verapdf.py         # VeraPDF (Maven JAR) strategies
│   │   └── _manager.py         # DependencyManager orchestrator
│   ├── interfaces/
│   │   ├── base_adapter.py     # IBaseAdapter (ABC)
│   │   └── binary_executer.py  # BinaryExecuter (facade)
│   └── implementations/
│       ├── ghostscript_adapter.py
│       └── verapdf_adapter.py
├── tests/
│   ├── conftest.py             # Fixtures, skip markers, PDF generation
│   ├── test_unit.py            # Unit tests (no binaries needed)
│   ├── test_integration.py     # Integration tests (real binaries)
│   ├── test_sample_files.py    # Tests for bundled sample PDFs
│   ├── test_dependencies.py    # Dependency resolution tests
│   └── files/
│       ├── sample_pdf.pdf      # Regular PDF sample
│       └── sample_pdfa.pdf     # PDF/A sample
├── pyproject.toml
├── LICENSE
└── README.md

Testing

# Everything (integration tests auto-skip if binaries are missing)
pytest -v

# Unit tests only (no binaries required)
pytest tests/test_unit.py -v

# Integration + sample file tests
pytest tests/test_integration.py tests/test_sample_files.py -v

Test suites

Suite                  Tests   Requires binaries   What it covers
test_unit.py           26      No                  Helpers, XML parsing, arg building, mocked convert/validate, async, factory
test_dependencies.py   38      No                  Dependency resolution strategies, DependencyManager, backward-compat shim
test_integration.py    20      Yes                 Real GS conversion, VeraPDF validation, round-trip, async, multiple PDF types
test_sample_files.py   7       Yes                 Bundled sample PDFs: conversion, validation, round-trip (sync + async)

Integration tests generate PDFs using reportlab (portrait, landscape, coloured shapes, multi-page, text-heavy) and run them through the full GhostScript → VeraPDF pipeline. Tests are auto-skipped when binaries are not available.

Docker smoke test

# Build the wheel
uv build --wheel

# Run in a clean Python container
docker run --rm \
  -v "$PWD/dist:/dist" \
  -v "$PWD/tests:/tests" \
  python:3.13-slim \
  bash -c "pip install /dist/*.whl && python /tests/docker_smoke_test.py"

Requirements

Requirement   Version   Notes
Python        ≥ 3.10    No runtime dependencies beyond the standard library
GhostScript   any       Found on the system PATH, or auto-installed via apt-get or download
Java (JRE)    ≥ 11      Auto-downloaded from Adoptium if missing
VeraPDF       1.26.5    Auto-downloaded from Maven Central

License

GPLv3+
