pdfa-parser

Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF.

pdfa-parser is a lightweight Python library (Python ≥ 3.14) that wraps GhostScript for PDF → PDF/A conversion and VeraPDF for conformance validation. Both synchronous and asynchronous APIs are provided out of the box.


Features

  • PDF → PDF/A conversion via GhostScript (levels 1, 2, 3).
  • PDF/A validation via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
  • Sync & async – every public method has an async counterpart prefixed with a_.
  • Factory function (create_parser()) for zero-config quick start.
  • Adapter pattern – swap GhostScript/VeraPDF for any CLI tool by implementing IBaseAdapter.
  • CLI – python -m pdfa_parser input.pdf output.pdf.

Project structure

pdfa-parser/
├── scripts/
│   └── setup_binaries.py          # Download & install GhostScript + VeraPDF
├── src/
│   ├── bin/
│   │   ├── ghostscript/            # GhostScript binary (gswin64c.exe / gs)
│   │   └── verapdf/                # VeraPDF CLI (verapdf.bat / verapdf)
│   └── pdfa_parser/
│       ├── __init__.py             # Public API + create_parser()
│       ├── __main__.py             # python -m pdfa_parser
│       ├── main.py                 # CLI entry-point
│       ├── pdf_parser.py           # PdfParser – convert / validate
│       ├── settings.py             # Binary path resolution
│       ├── interfaces/
│       │   ├── base_adapter.py     # IBaseAdapter (ABC)
│       │   └── binary_executer.py  # BinaryExecuter (facade)
│       └── implementations/
│           ├── ghostscript_adapter.py
│           └── verapdf_adapter.py
├── tests/
│   ├── conftest.py                 # Fixtures: PDF generation, skip markers
│   ├── test_unit.py                # 26 unit tests (no binaries needed)
│   └── test_integration.py         # 20 integration tests (need binaries)
├── pyproject.toml
└── README.md

Requirements

Dependency     Required for  Notes
Python ≥ 3.14  Everything    Uses match/case, type | union, etc.
GhostScript    Conversion    gswin64c.exe (Win) or gs (Unix)
VeraPDF        Validation    Requires Java ≥ 11 on PATH
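To see which of these tools are already reachable before installing anything, a small preflight check can help. The following is a minimal sketch using only the standard library; the helper name find_missing_tools is hypothetical and not part of pdfa-parser. It only looks on PATH (not in src/bin/) and does not verify the Java version.

```python
import shutil
import sys


def find_missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # GhostScript ships as gswin64c.exe on Windows and gs on Unix;
    # VeraPDF additionally needs a Java runtime on PATH.
    gs_name = "gswin64c" if sys.platform == "win32" else "gs"
    missing = find_missing_tools([gs_name, "java"])
    print("missing:", ", ".join(missing) if missing else "none")
```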

Installation

1. Clone & create virtual environment

git clone <repo-url> pdfa-parser
cd pdfa-parser
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux / macOS
source .venv/bin/activate

2. Install the package (editable + dev dependencies)

# Using uv (recommended)
uv pip install -e ".[dev]"

# Or plain pip
pip install -e ".[dev]"

3. Install binaries

The automated setup script downloads GhostScript and VeraPDF into src/bin/:

python scripts/setup_binaries.py          # both
python scripts/setup_binaries.py --gs      # GhostScript only
python scripts/setup_binaries.py --verapdf # VeraPDF only

Prerequisites for the script:

Binary       Windows requirement  Unix requirement
GhostScript  7-Zip on PATH (7z)   tar (pre-installed)
VeraPDF      Java ≥ 11 on PATH    Java ≥ 11 on PATH

Tip: You can also install GhostScript / VeraPDF manually and copy (or symlink) the executables into src/bin/ghostscript/ and src/bin/verapdf/.
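The manual route from the tip can be scripted. This is a rough sketch, not something the project ships: install_binary is a hypothetical helper that symlinks an existing executable into a target directory and falls back to copying when symlinks are not permitted. On Windows the GhostScript executable typically depends on a DLL next to it, so a single-file copy may not be enough there.

```python
import shutil
from pathlib import Path


def install_binary(system_binary, target_dir):
    """Symlink `system_binary` into `target_dir`, copying if symlinks fail.

    Returns the path of the link (or copy) that was created.
    """
    source = Path(system_binary).absolute()
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    if target.exists() or target.is_symlink():
        target.unlink()  # replace a stale link or copy
    try:
        target.symlink_to(source)
    except OSError:
        # e.g. Windows without symlink privileges: fall back to a copy
        shutil.copy2(source, target)
    return target
```

A call might look like install_binary("/usr/bin/gs", "src/bin/ghostscript"), where /usr/bin/gs stands in for wherever GhostScript is installed on your system.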


Quick start

Python API

from pdfa_parser import create_parser

# Create a parser with the default adapters (GhostScript + VeraPDF)
parser = create_parser()
# For conversion only, skip VeraPDF instead:
# parser = create_parser(with_verapdf=False)

# Convert PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")

# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant)   # True / False
print(result.profile)     # "PDF/A-2B validation profile"

# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant
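For more than one file, the one-shot call can be wrapped in a loop. The sketch below assumes only what the quick start shows: a parser with convert_and_validate(input, output) whose result exposes .compliant. The function name convert_directory and the returned dictionary are illustrative, not part of the library.

```python
from pathlib import Path


def convert_directory(parser, src_dir, dst_dir):
    """Convert every *.pdf in src_dir and report compliance per file.

    `parser` is any object exposing convert_and_validate(input, output)
    whose result has a `.compliant` attribute.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    report = {}
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        # Paths are passed as strings, matching the quick-start calls.
        result = parser.convert_and_validate(str(pdf), str(dst / pdf.name))
        report[pdf.name] = bool(result.compliant)
    return report
```

With the package installed, convert_directory(create_parser(), "incoming", "archive") would return a mapping from file name to True or False; the directory names here are placeholders.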

Async API

Every public method has an async twin with an a_ prefix:

import asyncio
from pdfa_parser import create_parser

async def main():
    parser = create_parser()
    await parser.a_convert("input.pdf", "output.pdf")
    result = await parser.a_validate("output.pdf")
    print(result.compliant)

asyncio.run(main())
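The async API lends itself to converting several files concurrently. The following is a sketch under the assumption that a_convert(input, output) can safely be awaited for several files at once, which this README does not state explicitly. convert_many and max_concurrent are hypothetical names; the semaphore simply caps how many conversions are in flight.

```python
import asyncio


async def convert_many(parser, jobs, max_concurrent=4):
    """Run parser.a_convert for each (input, output) pair in `jobs`.

    Returns one entry per job, in order: None on success, or the
    exception raised by that conversion.
    """
    limit = asyncio.Semaphore(max_concurrent)

    async def run_one(src, dst):
        async with limit:
            await parser.a_convert(src, dst)

    return await asyncio.gather(
        *(run_one(src, dst) for src, dst in jobs),
        return_exceptions=True,
    )
```

Collecting exceptions instead of raising on the first failure means one bad PDF does not cancel the rest of the batch; drop return_exceptions=True if fail-fast behaviour is preferred.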

CLI

# Basic conversion
python -m pdfa_parser input.pdf output.pdf

# With validation
python -m pdfa_parser input.pdf output.pdf --validate

# PDF/A level 1, flavour 1b
python -m pdfa_parser input.pdf output.pdf --level 1 --validate --flavour 1b
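The same commands can be launched from a Python script, for example in a build step. This sketch only assembles the flags shown above (--level, --validate, --flavour); the helper names build_cli_command and run_cli are hypothetical, and the CLI's exit-code convention is not documented here, so the caller has to inspect the result.

```python
import subprocess
import sys


def build_cli_command(input_pdf, output_pdf, level=None, validate=False, flavour=None):
    """Build the argv list for `python -m pdfa_parser`."""
    cmd = [sys.executable, "-m", "pdfa_parser", str(input_pdf), str(output_pdf)]
    if level is not None:
        cmd += ["--level", str(level)]
    if validate:
        cmd.append("--validate")
    if flavour is not None:
        cmd += ["--flavour", flavour]
    return cmd


def run_cli(*args, **kwargs):
    """Run the CLI and return the CompletedProcess with captured output."""
    return subprocess.run(
        build_cli_command(*args, **kwargs), capture_output=True, text=True
    )
```

Using sys.executable keeps the subprocess inside the same virtual environment as the calling script.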

Advanced usage

Custom adapters

from pdfa_parser import IBaseAdapter, BinaryExecuter, PdfParser
from pathlib import Path

class MyGSAdapter(IBaseAdapter):
    def get_binary_path(self) -> Path:
        return Path("/usr/local/bin/gs")

parser = PdfParser(
    gs_executer=BinaryExecuter(MyGSAdapter()),
    pdfa_level=3,
    extra_gs_args=("-dQUIET",),
)
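An adapter does not have to hard-code its path. As a sketch, the variant below resolves GhostScript from an environment variable and then from PATH. GS_BINARY and EnvGSAdapter are hypothetical names, and the sketch assumes get_binary_path() is the only method an adapter must implement, as in the example above. The import fallback exists only so the snippet runs without the package installed.

```python
import os
import shutil
from pathlib import Path

try:
    from pdfa_parser import IBaseAdapter
except ImportError:  # lets the sketch run without the package installed
    IBaseAdapter = object


class EnvGSAdapter(IBaseAdapter):
    """Resolve GhostScript from $GS_BINARY (hypothetical), then from PATH."""

    def get_binary_path(self) -> Path:
        override = os.environ.get("GS_BINARY")
        if override:
            return Path(override)
        # Unix name first, then the Windows console executable.
        found = shutil.which("gs") or shutil.which("gswin64c")
        if found is None:
            raise FileNotFoundError(
                "GhostScript not found; set GS_BINARY or add gs to PATH"
            )
        return Path(found)
```

It would be passed to PdfParser the same way as MyGSAdapter, wrapped in BinaryExecuter.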

Custom PDF/A level & extra GhostScript flags

parser = create_parser(pdfa_level=3, extra_gs_args=("-dQUIET", "-r300"))

Testing

# Unit tests (no binaries required) – always runnable
pytest tests/test_unit.py -v

# Integration tests (require GhostScript + VeraPDF in src/bin/)
pytest tests/test_integration.py -v

# Everything
pytest -v

Test coverage summary

Suite                Tests  Requires binaries  What it covers
test_unit.py         26     No                 Helpers, XML parsing, arg building, mocked convert/validate, async, factory
test_integration.py  20     Yes                Real conversion (portrait, landscape, color, multi-page, text-heavy), VeraPDF validation, round-trip, async, PDF/A-1b

Integration tests generate various PDF types using reportlab (portrait, landscape, coloured shapes, multi-page, text-heavy) and run them through the full GhostScript → VeraPDF pipeline. Tests are auto-skipped when binaries are missing.


License

See LICENSE.
