
Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF.


pdfa-parser


pdfa-parser is a lightweight Python library (Python ≥ 3.14) that wraps GhostScript for PDF → PDF/A conversion and VeraPDF for conformance validation. Both synchronous and asynchronous APIs are provided out of the box.


Features

  • PDF → PDF/A conversion via GhostScript (levels 1, 2, 3).
  • PDF/A validation via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
  • Sync & async – every public method has an async counterpart with an a_ prefix.
  • Factory function (create_parser()) for zero-config quick start.
  • Adapter pattern – swap GhostScript/VeraPDF for any CLI tool by implementing IBaseAdapter.
  • CLI – python -m pdfa_parser input.pdf output.pdf.

Project structure

pdfa-parser/
├── scripts/
│   └── setup_binaries.py          # Download & install GhostScript + VeraPDF
├── src/
│   ├── bin/
│   │   ├── ghostscript/            # GhostScript binary (gswin64c.exe / gs)
│   │   └── verapdf/                # VeraPDF CLI (verapdf.bat / verapdf)
│   └── pdfa_parser/
│       ├── __init__.py             # Public API + create_parser()
│       ├── __main__.py             # python -m pdfa_parser
│       ├── main.py                 # CLI entry-point
│       ├── pdf_parser.py           # PdfParser – convert / validate
│       ├── settings.py             # Binary path resolution
│       ├── interfaces/
│       │   ├── base_adapter.py     # IBaseAdapter (ABC)
│       │   └── binary_executer.py  # BinaryExecuter (facade)
│       └── implementations/
│           ├── ghostscript_adapter.py
│           └── verapdf_adapter.py
├── tests/
│   ├── conftest.py                 # Fixtures: PDF generation, skip markers
│   ├── test_unit.py                # 26 unit tests (no binaries needed)
│   └── test_integration.py         # 20 integration tests (need binaries)
├── pyproject.toml
└── README.md

Requirements

Dependency     Required for   Notes
Python ≥ 3.14  Everything     Uses match/case, type | union, etc.
GhostScript    Conversion     gswin64c.exe (Win) or gs (Unix)
VeraPDF        Validation     Requires Java ≥ 11 on PATH

Installation

1. Clone & create virtual environment

git clone <repo-url> pdfa-parser
cd pdfa-parser
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux / macOS
source .venv/bin/activate

2. Install the package (editable + dev dependencies)

# Using uv (recommended)
uv pip install -e ".[dev]"

# Or plain pip
pip install -e ".[dev]"

3. Install binaries

The automated setup script downloads GhostScript and VeraPDF into src/bin/:

python scripts/setup_binaries.py          # both
python scripts/setup_binaries.py --gs      # GhostScript only
python scripts/setup_binaries.py --verapdf # VeraPDF only

Prerequisites for the script:

Binary       Windows requirement  Unix requirement
GhostScript  7-Zip on PATH (7z)   tar (pre-installed)
VeraPDF      Java ≥ 11 on PATH    Java ≥ 11 on PATH

Tip: You can also install GhostScript / VeraPDF manually and copy (or symlink) the executables into src/bin/ghostscript/ and src/bin/verapdf/.
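
The symlink route from the tip can be scripted. This is a sketch under assumptions: link_binary is a hypothetical helper, the installed path in the usage note is an example, and it assumes the package finds the executable by its usual file name inside the target directory.

```python
from pathlib import Path


def link_binary(installed: Path, target_dir: Path) -> Path:
    """Symlink an installed executable into target_dir and return the link."""
    target_dir.mkdir(parents=True, exist_ok=True)
    link = target_dir / installed.name
    # Replace a stale link or file left over from an earlier setup.
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(installed.resolve())
    return link
```

For example, link_binary(Path("/usr/bin/gs"), Path("src/bin/ghostscript")) would expose a system GhostScript to the package. On Windows, creating symlinks may need elevated rights, so copying the file is the simpler option there.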


Quick start

Python API

from pdfa_parser import create_parser

# Create a parser with default adapters
parser = create_parser()                    # GhostScript + VeraPDF
parser = create_parser(with_verapdf=False)  # GhostScript only

# Convert PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")

# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant)   # True / False
print(result.profile)     # "PDF/A-2B validation profile"

# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant
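
Converting a whole folder is a small loop around convert(). The sketch below assumes only the convert(input, output) call shown above; convert_directory and the _pdfa suffix are illustrative choices, not part of the library.

```python
from pathlib import Path


def convert_directory(parser, source_dir: Path, output_dir: Path) -> list[Path]:
    """Convert every *.pdf in source_dir and return the output paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(source_dir.glob("*.pdf")):
        target = output_dir / f"{pdf.stem}_pdfa.pdf"
        # `parser` is expected to be the object returned by create_parser().
        parser.convert(str(pdf), str(target))
        written.append(target)
    return written
```

Passing the parser in as an argument keeps the helper easy to test with a stand-in object when the binaries are not installed.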

Async API

Every method has an async twin with an a_ prefix:

import asyncio
from pdfa_parser import create_parser

async def main():
    parser = create_parser()
    await parser.a_convert("input.pdf", "output.pdf")
    result = await parser.a_validate("output.pdf")
    print(result.compliant)

asyncio.run(main())
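
The async API makes it possible to run several conversions at once. The sketch below assumes only the a_convert(input, output) coroutine shown above; convert_many and the max_concurrent limit are hypothetical, added to avoid starting one GhostScript process per file all at the same time.

```python
import asyncio


async def convert_many(parser, jobs, max_concurrent: int = 4):
    """Run parser.a_convert for each (input, output) pair.

    At most max_concurrent conversions run at the same time.
    Returns the output paths in the order the jobs were given.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def convert_one(src, dst):
        async with semaphore:
            await parser.a_convert(src, dst)
        return dst

    return await asyncio.gather(*(convert_one(src, dst) for src, dst in jobs))
```

A call such as asyncio.run(convert_many(create_parser(), [("a.pdf", "a_pdfa.pdf"), ("b.pdf", "b_pdfa.pdf")])) would convert both files concurrently.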

CLI

# Basic conversion
python -m pdfa_parser input.pdf output.pdf

# With validation
python -m pdfa_parser input.pdf output.pdf --validate

# PDF/A level 1, flavour 1b
python -m pdfa_parser input.pdf output.pdf --level 1 --validate --flavour 1b
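
The same commands can be launched from another Python program. This sketch uses only the flags shown above (--level, --validate, --flavour); build_cli_command and run_cli are hypothetical helpers, and the meaning of the process exit code is not documented here, so the caller should inspect it rather than assume.

```python
import subprocess
import sys


def build_cli_command(src, dst, level=None, validate=False, flavour=None):
    """Build the argument list for `python -m pdfa_parser`."""
    # sys.executable keeps the call inside the active virtual environment.
    command = [sys.executable, "-m", "pdfa_parser", str(src), str(dst)]
    if level is not None:
        command += ["--level", str(level)]
    if validate:
        command.append("--validate")
    if flavour is not None:
        command += ["--flavour", flavour]
    return command


def run_cli(src, dst, **options):
    """Run the CLI and return the CompletedProcess for inspection."""
    return subprocess.run(
        build_cli_command(src, dst, **options),
        capture_output=True,
        text=True,
        check=False,
    )
```

Building the command as a list, rather than a shell string, avoids quoting problems with file names that contain spaces.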

Advanced usage

Custom adapters

from pdfa_parser import IBaseAdapter, BinaryExecuter, PdfParser
from pathlib import Path

class MyGSAdapter(IBaseAdapter):
    def get_binary_path(self) -> Path:
        return Path("/usr/local/bin/gs")

parser = PdfParser(
    gs_executer=BinaryExecuter(MyGSAdapter()),
    pdfa_level=3,
    extra_gs_args=("-dQUIET",),
)

Custom PDF/A level & extra GhostScript flags

parser = create_parser(pdfa_level=3, extra_gs_args=("-dQUIET", "-r300"))
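
To see where pdfa_level and extra_gs_args plausibly end up, here is a sketch of a typical GhostScript PDF/A argument list. It is not taken from the library's source: build_gs_args is a hypothetical function, the flag selection is an assumption based on common GhostScript usage, and a real conversion usually also passes a PDFA definition file, which is omitted here.

```python
def build_gs_args(src, dst, pdfa_level=2, extra_gs_args=()):
    """Return GhostScript arguments for a PDF -> PDF/A conversion (sketch)."""
    if pdfa_level not in (1, 2, 3):
        raise ValueError(f"unsupported PDF/A level: {pdfa_level}")
    return [
        f"-dPDFA={pdfa_level}",           # requested PDF/A level
        "-dBATCH",                        # exit when processing finishes
        "-dNOPAUSE",                      # do not prompt between pages
        "-sDEVICE=pdfwrite",              # write a PDF file
        "-sColorConversionStrategy=RGB",  # convert colours to one colour space
        "-dPDFACompatibilityPolicy=1",    # drop features PDF/A does not allow
        *extra_gs_args,                   # caller-supplied flags such as -dQUIET
        f"-sOutputFile={dst}",
        str(src),
    ]
```

With extra_gs_args=("-dQUIET", "-r300") the two extra flags appear just before the output file, so they apply to the same run as the defaults.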

Testing

# Unit tests (no binaries required) – always runnable
pytest tests/test_unit.py -v

# Integration tests (require GhostScript + VeraPDF in src/bin/)
pytest tests/test_integration.py -v

# Everything
pytest -v

Test coverage summary

Suite                Tests  Requires binaries  What it covers
test_unit.py         26     No                 Helpers, XML parsing, arg building, mocked convert/validate, async, factory
test_integration.py  20     Yes                Real conversion (portrait, landscape, color, multi-page, text-heavy), VeraPDF validation, round-trip, async, PDF/A-1b

Integration tests generate various PDF types using reportlab (portrait, landscape, coloured shapes, multi-page, text-heavy) and run them through the full GhostScript → VeraPDF pipeline. Tests are auto-skipped when binaries are missing.


License

See LICENSE.
