Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF.

pdfa-parser

Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF — zero-config, batteries included.

pdfa-parser is a Python library (Python ≥ 3.10) that wraps GhostScript for PDF → PDF/A conversion and VeraPDF for conformance validation. All external tools are downloaded automatically on first use — just pip install and go.

from pdfa_parser import create_parser

parser = create_parser()
parser.convert("input.pdf", "output.pdf")

result = parser.validate("output.pdf")
print(result.compliant)  # True

Features

PDF → PDF/A conversion via GhostScript (levels 1, 2, 3).
PDF/A validation via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
Zero config — GhostScript, Java (JRE), and VeraPDF are resolved automatically (system PATH → apt-get → binary download).
Works in Docker — pip install pdfa-parser in a bare python:3.x-slim image is all you need.
Sync & async — every public method has an a_ async counterpart.
Factory function (create_parser()) for instant quick start.
Adapter pattern — swap GhostScript / VeraPDF for any CLI tool by implementing IBaseAdapter.
CLI — pdfa-parser input.pdf output.pdf or python -m pdfa_parser.
Typed — ships with py.typed marker and full type annotations.

Installation

pip install pdfa-parser

That's it. You do not need to install system packages or set up binaries by hand; the external tools are resolved automatically on first use.

Development install (with test dependencies):
pip install -e ".[dev]"

Quick start

Python API

from pdfa_parser import create_parser

# Create a parser (GhostScript + VeraPDF are auto-resolved)
parser = create_parser()

# Convert a PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")

# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant)   # True / False
print(result.profile)     # "PDF/A-2B validation profile"

# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant
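
The one-shot call extends naturally to a whole folder. The helper below is a minimal sketch, not part of pdfa-parser: `convert_folder` is a hypothetical name, and it only assumes that `parser` exposes `convert_and_validate(input, output)` and that the result has a `compliant` field, as shown above.

```python
from pathlib import Path


def convert_folder(parser, src_dir, dst_dir):
    """Convert every *.pdf in src_dir and report compliance per file.

    `parser` is any object exposing convert_and_validate(input, output),
    for example the object returned by create_parser().
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    report = {}
    for pdf in sorted(src_dir.glob("*.pdf")):
        # Output keeps the input file name, placed in the destination folder
        result = parser.convert_and_validate(pdf, dst_dir / pdf.name)
        report[pdf.name] = result.compliant
    return report
```

Typical use would be `convert_folder(create_parser(), "incoming", "archive")`, which returns a dict mapping each file name to `True` or `False`.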

Tip: PdfaParser is a convenience alias for PdfParser — both work:

from pdfa_parser import PdfaParser          # alias
from pdfa_parser import PdfParser           # canonical name
from pdfa_parser import create_parser       # recommended factory

Conversion only (no VeraPDF)

parser = create_parser(with_verapdf=False)
parser.convert("input.pdf", "output.pdf")

Async API

Every method has an a_ prefixed async twin:

import asyncio
from pdfa_parser import create_parser

async def main():
    parser = create_parser()
    await parser.a_convert("input.pdf", "output.pdf")
    result = await parser.a_validate("output.pdf")
    print(result.compliant)

asyncio.run(main())
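
The async twins make it possible to convert several files concurrently. The following is a sketch under stated assumptions: `convert_many` and `max_concurrent` are names invented for this example, and the only library call it relies on is `a_convert(input, output)` from the snippet above. A semaphore caps how many GhostScript processes run at once.

```python
import asyncio


async def convert_many(parser, jobs, max_concurrent=4):
    """Run parser.a_convert for each (input, output) pair, a few at a time.

    Results come back in the same order as `jobs`.
    """
    limit = asyncio.Semaphore(max_concurrent)

    async def convert_one(src, dst):
        async with limit:  # wait here if too many conversions are running
            return await parser.a_convert(src, dst)

    return await asyncio.gather(*(convert_one(src, dst) for src, dst in jobs))
```

Inside a coroutine it would be called as `await convert_many(create_parser(), [("a.pdf", "a_pdfa.pdf"), ("b.pdf", "b_pdfa.pdf")])`.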

CLI

# Basic conversion
pdfa-parser input.pdf output.pdf

# With validation
pdfa-parser input.pdf output.pdf --validate

# PDF/A level 1, flavour 1b
pdfa-parser input.pdf output.pdf --level 1 --validate --flavour 1b

# Also works as a module
python -m pdfa_parser input.pdf output.pdf --validate
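
When the CLI has to be driven from another Python process, the argument list can be assembled programmatically. This sketch uses only the flags shown above (`--level`, `--validate`, `--flavour`); the helper names are hypothetical, and `run_cli` assumes the CLI exits with a non-zero status on failure, which this README does not state.

```python
import subprocess
import sys


def build_cli_args(input_pdf, output_pdf, level=None, validate=False, flavour=None):
    """Build an argv list for `python -m pdfa_parser` from the documented flags."""
    args = [sys.executable, "-m", "pdfa_parser", str(input_pdf), str(output_pdf)]
    if level is not None:
        args += ["--level", str(level)]
    if validate:
        args.append("--validate")
    if flavour is not None:
        args += ["--flavour", flavour]
    return args


def run_cli(*args, **kwargs):
    """Run the CLI; raises CalledProcessError if the exit status is non-zero."""
    return subprocess.run(build_cli_args(*args, **kwargs), check=True)
```

For example, `build_cli_args("input.pdf", "output.pdf", level=1, validate=True, flavour="1b")` produces the same invocation as the level 1 example above, using the current interpreter.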

How dependency resolution works

On first use, the library checks for each tool in this order:

Tool	1. System PATH	2. Package manager	3. Download
GhostScript	`gs` / `gswin64c`	`apt-get install ghostscript`	GitHub archive (fallback)
Java (JRE)	`java`	—	Adoptium Temurin 21
VeraPDF	—	—	Maven Central JAR

Binaries are stored in ~/.local/share/pdfa-parser/bin/ (or src/bin/ during development).
The JRE and VeraPDF JAR are downloaded once and reused across runs.
You can force a specific binary by setting the adapter path manually (see Advanced usage).
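
The ordered lookup described above can be pictured as a chain of strategies where the first one to return a path wins. This is an illustrative sketch only, not the code in `dependencies/`: `Strategy`, `from_path`, and `resolve` are names made up for this example, and only the PATH step is implemented (with the standard library's `shutil.which`).

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional

# A strategy returns the binary's path, or None if it cannot provide one
Strategy = Callable[[], Optional[Path]]


def from_path(*names: str) -> Strategy:
    """Step 1: look for any of the given executable names on the system PATH."""
    def strategy() -> Optional[Path]:
        for name in names:
            found = shutil.which(name)
            if found:
                return Path(found)
        return None
    return strategy


def resolve(strategies: Iterable[Strategy]) -> Path:
    """Try each strategy in order and return the first path found."""
    for strategy in strategies:
        path = strategy()
        if path is not None:
            return path
    raise FileNotFoundError("no strategy could locate the tool")
```

A GhostScript lookup in this style might read `resolve([from_path("gs", "gswin64c"), install_with_apt, download_archive])`, where the last two callables are placeholders for the package-manager and download steps.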

Public API reference

Top-level imports

from pdfa_parser import (
    create_parser,      # Factory — recommended entry point
    PdfParser,          # Core class (canonical name)
    PdfaParser,         # Alias for PdfParser
    ValidationResult,   # Dataclass returned by validate()
    DependencyManager,  # Manual dependency orchestration
    # For custom adapters:
    IBaseAdapter,
    BinaryExecuter,
    GhostScriptAdapter,
    VeraPDFAdapter,
)

`create_parser(**kwargs) → PdfParser`

Parameter	Type	Default	Description
`pdfa_level`	`int`	`2`	PDF/A conformance level (1, 2, 3)
`with_verapdf`	`bool`	`True`	Attach VeraPDF for validation
`extra_gs_args`	`tuple[str, ...]`	`()`	Extra flags for every GhostScript call
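
Because conversion takes a numeric `pdfa_level` while validation takes a flavour string such as `"2b"`, it is easy to validate against a different PDF/A part than the one converted to. The check below is a small sketch; `flavour_matches_level` is a hypothetical helper, not a library function, and it only relies on the fact that a flavour's leading digit is the PDF/A part.

```python
def flavour_matches_level(flavour: str, pdfa_level: int) -> bool:
    """True if a flavour such as "2b" belongs to the given PDF/A level."""
    # The first character of a flavour is the PDF/A part number (1, 2, 3, ...)
    return flavour[:1].isdigit() and int(flavour[0]) == pdfa_level
```

A caller could assert `flavour_matches_level("1b", 1)` before running a level 1 conversion followed by a `1b` validation.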

`PdfParser` methods

Method	Returns	Description
`convert(input, output)`	`Path`	Convert PDF to PDF/A
`validate(file, *, flavour)`	`ValidationResult`	Check PDF/A compliance via VeraPDF
`convert_and_validate(…)`	`ValidationResult`	Convert then validate in one call
`a_convert(…)`	`Path`	Async convert
`a_validate(…)`	`ValidationResult`	Async validate
`a_convert_and_validate(…)`	`ValidationResult`	Async convert + validate

All path parameters accept both str and pathlib.Path.

`ValidationResult`

Field	Type	Description
`compliant`	`bool`	`True` if the PDF satisfies the profile
`profile`	`str`	Profile name (e.g. `"PDF/A-2B …"`)
`details`	`str`	Raw XML snippet for debugging
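
To show how the three fields are typically consumed, here is a sketch that formats a result for logging. `ValidationResultSketch` is a stand-in defined for this example with the documented fields only; the library's own `ValidationResult` may carry more, and `describe` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResultSketch:
    """Stand-in with the three documented fields (not the library's class)."""
    compliant: bool
    profile: str
    details: str


def describe(result, max_details=200):
    """One-line summary; include a slice of the raw XML only on failure."""
    status = "PASS" if result.compliant else "FAIL"
    line = f"{status} [{result.profile}]"
    if not result.compliant and result.details:
        line += f": {result.details[:max_details]}"
    return line
```

Since `describe` only reads `compliant`, `profile`, and `details`, it should work unchanged on the object returned by `validate()`.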

Advanced usage

Custom adapters

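from pathlib import Path

from pdfa_parser import BinaryExecuter, IBaseAdapter, PdfParser


class MyGSAdapter(IBaseAdapter):
    """Adapter that points at a specific GhostScript binary."""

    def get_binary_path(self) -> Path:
        return Path("/opt/gs-10/bin/gs")


# Wrap the adapter in a BinaryExecuter and pass it to PdfParser directly
parser = PdfParser(
    gs_executer=BinaryExecuter(MyGSAdapter()),
    pdfa_level=3,                # target PDF/A-3
    extra_gs_args=("-dQUIET",),  # appended to every GhostScript call
)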
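
A hard-coded path is not always practical. One option, sketched here, is to let an environment variable override it. Note the assumptions: `binary_path_from_env` is a hypothetical helper and `PDFA_GS_BINARY` is a variable name invented for this example; pdfa-parser does not document any environment variables.

```python
import os
from pathlib import Path


def binary_path_from_env(var_name, default, environ=os.environ):
    """Return the path named by an environment variable, else `default`.

    `var_name` (for example "PDFA_GS_BINARY") is chosen by the caller;
    it is not a setting that pdfa-parser itself reads.
    """
    value = environ.get(var_name)
    return Path(value) if value else Path(default)
```

A custom adapter's `get_binary_path` could then simply `return binary_path_from_env("PDFA_GS_BINARY", "/opt/gs-10/bin/gs")`.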

Manual dependency management

from pdfa_parser import DependencyManager

m = DependencyManager()

# Check availability without downloading
print(m.ghostscript.is_available())  # True / False
print(m.verapdf.is_available())

# Force download / resolution
gs_path = m.ensure_ghostscript()
verapdf_path = m.ensure_verapdf()
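
The availability checks are handy as a preflight step, for instance before a batch job that must not trigger downloads. This sketch only touches the attributes shown above (`ghostscript.is_available()` and `verapdf.is_available()`); the two function names are hypothetical, and the JRE is left out because the README does not show how it is exposed.

```python
def availability_report(manager):
    """Map each tool name to the result of its is_available() check."""
    return {
        "ghostscript": manager.ghostscript.is_available(),
        "verapdf": manager.verapdf.is_available(),
    }


def missing_tools(manager):
    """Names of the tools that are not available yet."""
    return [name for name, ok in availability_report(manager).items() if not ok]
```

With a real manager, `missing_tools(DependencyManager())` would return an empty list once both tools are in place.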

Project structure

pdfa-parser/
├── src/pdfa_parser/
│   ├── __init__.py             # Public API, create_parser(), PdfaParser alias
│   ├── __main__.py             # python -m pdfa_parser
│   ├── main.py                 # CLI entry-point
│   ├── pdf_parser.py           # PdfParser – convert / validate
│   ├── settings.py             # Lazy binary-path resolution
│   ├── data/
│   │   ├── PDFA_def.ps         # Bundled PostScript for PDF/A OutputIntent
│   │   └── srgb.icc            # Bundled sRGB ICC profile
│   ├── dependencies/
│   │   ├── _base.py            # Dependency / ResolutionStrategy ABCs
│   │   ├── _ghostscript.py     # GhostScript strategies
│   │   ├── _jre.py             # JRE (Adoptium) strategies
│   │   ├── _verapdf.py         # VeraPDF (Maven JAR) strategies
│   │   └── _manager.py         # DependencyManager orchestrator
│   ├── interfaces/
│   │   ├── base_adapter.py     # IBaseAdapter (ABC)
│   │   └── binary_executer.py  # BinaryExecuter (facade)
│   └── implementations/
│       ├── ghostscript_adapter.py
│       └── verapdf_adapter.py
├── tests/
│   ├── conftest.py             # Fixtures, skip markers, PDF generation
│   ├── test_unit.py            # Unit tests (no binaries needed)
│   ├── test_integration.py     # Integration tests (real binaries)
│   ├── test_sample_files.py    # Tests for bundled sample PDFs
│   ├── test_dependencies.py    # Dependency resolution tests
│   └── files/
│       ├── sample_pdf.pdf      # Regular PDF sample
│       └── sample_pdfa.pdf     # PDF/A sample
├── pyproject.toml
├── LICENSE
└── README.md

Testing

# Everything (integration tests auto-skip if binaries are missing)
pytest -v

# Unit tests only (no binaries required)
pytest tests/test_unit.py -v

# Integration + sample file tests
pytest tests/test_integration.py tests/test_sample_files.py -v

Test suites

Suite	Tests	Requires binaries	What it covers
`test_unit.py`	26	No	Helpers, XML parsing, arg building, mocked convert/validate, async, factory
`test_dependencies.py`	38	No	Dependency resolution strategies, DependencyManager, backward-compat shim
`test_integration.py`	20	Yes	Real GS conversion, VeraPDF validation, round-trip, async, multiple PDF types
`test_sample_files.py`	7	Yes	Bundled sample PDFs: conversion, validation, round-trip (sync + async)

Integration tests generate PDFs using reportlab (portrait, landscape, coloured shapes, multi-page, text-heavy) and run them through the full GhostScript → VeraPDF pipeline. Tests are auto-skipped when binaries are not available.
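
Auto-skipping usually comes down to a cheap availability check evaluated at collection time. The helper below is a simplified sketch, not the logic in `conftest.py`: it only looks on the system PATH with `shutil.which`, whereas the library can also find binaries in its own download directory. `missing_binaries` is a hypothetical name.

```python
import shutil


def missing_binaries(names, which=shutil.which):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]
```

In a test module this could feed a marker such as `pytest.mark.skipif(bool(missing_binaries(["gs", "java"])), reason="external binaries not available")`.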

Docker smoke test

# Build the wheel
uv build --wheel

# Run in a clean Python container
docker run --rm \
  -v "$PWD/dist:/dist" \
  -v "$PWD/tests:/tests" \
  python:3.13-slim \
  bash -c "pip install /dist/*.whl && python /tests/docker_smoke_test.py"

Requirements

Requirement	Version	Notes
Python	≥ 3.10	No runtime dependencies beyond the standard library
GhostScript	any	Resolved from system PATH, installed via `apt-get`, or downloaded as a fallback
Java (JRE)	≥ 11	Auto-downloaded from Adoptium if missing
VeraPDF	1.26.5	Auto-downloaded from Maven Central

License

GPLv3+

Release history

Version	Released	Notes
1.1.1	Apr 14, 2026	-
1.1.0	Apr 14, 2026	-
1.0.1	Apr 14, 2026	This version
1.0.0b1	Apr 14, 2026	Pre-release
0.1.0	Apr 14, 2026	-

Download files

Source Distribution

pdfa_parser-1.0.1.tar.gz (34.9 kB)

Uploaded Apr 14, 2026 Source

Built Distribution

pdfa_parser-1.0.1-py3-none-any.whl (41.5 kB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file pdfa_parser-1.0.1.tar.gz.

File metadata

Download URL: pdfa_parser-1.0.1.tar.gz
Upload date: Apr 14, 2026
Size: 34.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.1 (`publish`) on Ubuntu 24.04 (noble)

File hashes

Hashes for pdfa_parser-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`c696c49ec431cbbd5f0bb0e6db27bca1adfcedf6651108c0ca4bfdc47f8c6731`
MD5	`94a2fddc244a9f4d7b4e1a8936a068b0`
BLAKE2b-256	`8adc39ee7f9d8d081ff4fc8aae111052df9167a54a3fe691a62390e76d1a2274`

File details

Details for the file pdfa_parser-1.0.1-py3-none-any.whl.

File metadata

Download URL: pdfa_parser-1.0.1-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 41.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.1 (`publish`) on Ubuntu 24.04 (noble)

File hashes

Hashes for pdfa_parser-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c657b72c30f253d92897f9e3f4f7457001ca9bf3dfd7bd3ba146a1c633b11a4`
MD5	`4bf17e2f1ac37a73c4459c95f8a90891`
BLAKE2b-256	`5672edd491602131d90f4e605e35d48c81ae386b07451ecebdfe052d1e859abd`
