pdfa-parser
Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF — zero-config, batteries included.
pdfa-parser is a Python library (Python ≥ 3.10) that wraps GhostScript for
PDF → PDF/A conversion and VeraPDF for conformance validation. External tools
that are not already on the system are fetched automatically on first use, so `pip install` is the only setup step.
from pdfa_parser import create_parser
parser = create_parser()
parser.convert("input.pdf", "output.pdf")
result = parser.validate("output.pdf")
print(result.compliant) # True
Features
- PDF → PDF/A conversion via GhostScript (levels 1, 2, 3).
- PDF/A validation via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
- Zero config: GhostScript, Java (JRE), and VeraPDF are resolved automatically (system PATH → `apt-get` → binary download).
- Works in Docker: `pip install pdfa-parser` in a bare `python:3.x-slim` image is all you need.
- Sync & async: every public method has an `a_`-prefixed async counterpart.
- Factory function (`create_parser()`) for an instant quick start.
- Adapter pattern: swap GhostScript / VeraPDF for any CLI tool by implementing `IBaseAdapter`.
- CLI: `pdfa-parser input.pdf output.pdf` or `python -m pdfa_parser`.
- Typed: ships with a `py.typed` marker and full type annotations.
Installation
pip install pdfa-parser
That's it. No system packages to install, no manual binary setup.
Development install (with test dependencies):
pip install -e ".[dev]"
Quick start
Python API
from pdfa_parser import create_parser
# Create a parser (GhostScript + VeraPDF are auto-resolved)
parser = create_parser()
# Convert a PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")
# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant) # True / False
print(result.profile) # "PDF/A-2B validation profile"
# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant
Tip: `PdfaParser` is a convenience alias for `PdfParser`; both work:
from pdfa_parser import PdfaParser # alias
from pdfa_parser import PdfParser # canonical name
from pdfa_parser import create_parser # recommended factory
Conversion only (no VeraPDF)
parser = create_parser(with_verapdf=False)
parser.convert("input.pdf", "output.pdf")
Async API
Every public method has an `a_`-prefixed async twin:
import asyncio
from pdfa_parser import create_parser
async def main():
parser = create_parser()
await parser.a_convert("input.pdf", "output.pdf")
result = await parser.a_validate("output.pdf")
print(result.compliant)
asyncio.run(main())
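Because the `a_` methods are coroutines, several files can be converted concurrently. The following is a minimal sketch, not part of the library: `convert_many` and `max_concurrent` are hypothetical names, and it assumes only that the parser exposes `a_convert(input, output)` as shown above. Whether parallel GhostScript runs suit your workload is also an assumption worth checking.

```python
import asyncio


async def convert_many(parser, jobs, max_concurrent=4):
    """Run parser.a_convert for each (input, output) pair, a few at a time.

    `parser` is any object with an `a_convert(input, output)` coroutine method;
    `convert_many` and `max_concurrent` are illustrative names, not library API.
    """
    # The semaphore caps how many conversions are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def convert_one(src, dst):
        async with semaphore:
            return await parser.a_convert(src, dst)

    # gather() returns results in the same order as `jobs`.
    return await asyncio.gather(*(convert_one(src, dst) for src, dst in jobs))
```

With a real parser this might be called as `asyncio.run(convert_many(create_parser(), [("a.pdf", "a_pdfa.pdf"), ("b.pdf", "b_pdfa.pdf")]))`.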
CLI
# Basic conversion
pdfa-parser input.pdf output.pdf
# With validation
pdfa-parser input.pdf output.pdf --validate
# PDF/A level 1, flavour 1b
pdfa-parser input.pdf output.pdf --level 1 --validate --flavour 1b
# Also works as a module
python -m pdfa_parser input.pdf output.pdf --validate
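When the CLI has to be driven from a script, the same flags can be assembled programmatically. This is a sketch under assumptions: `build_cli_args` and `run_cli` are hypothetical helpers, and they use only the `--level`, `--validate`, and `--flavour` flags that appear in the examples above.

```python
import subprocess
import sys


def build_cli_args(src, dst, level=None, validate=False, flavour=None):
    """Build an argv list for `python -m pdfa_parser` (hypothetical helper)."""
    args = [sys.executable, "-m", "pdfa_parser", str(src), str(dst)]
    if level is not None:
        args += ["--level", str(level)]
    if validate:
        args.append("--validate")
    if flavour is not None:
        args += ["--flavour", flavour]
    return args


def run_cli(src, dst, **options):
    """Run the CLI; subprocess raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_cli_args(src, dst, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.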
How dependency resolution works
On first use, the library checks for each tool in this order:
| Tool | 1. System PATH | 2. Package manager | 3. Download |
|---|---|---|---|
| GhostScript | `gs` / `gswin64c` | `apt-get install ghostscript` | GitHub archive (fallback) |
| Java (JRE) | `java` | - | Adoptium Temurin 21 |
| VeraPDF | - | - | Maven Central JAR |
- Binaries are stored in `~/.local/share/pdfa-parser/bin/` (or `src/bin/` during development).
- The JRE and VeraPDF JAR are downloaded once and reused across runs.
- You can force a specific binary by setting the adapter path manually (see Advanced usage).
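The lookup order above can be pictured as a chain of strategies tried in turn. The sketch below is illustrative only and is not the library's internal code: `resolve_binary`, `from_path`, and the strategy callables are hypothetical names; only `shutil.which` is a real standard-library call.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional

# A strategy returns the binary's path, or None if it cannot provide one.
Strategy = Callable[[], Optional[Path]]


def from_path(*names: str) -> Strategy:
    """Strategy 1: look for any of `names` on the system PATH."""
    def strategy() -> Optional[Path]:
        for name in names:
            found = shutil.which(name)
            if found:
                return Path(found)
        return None
    return strategy


def resolve_binary(strategies: Iterable[Strategy]) -> Path:
    """Return the first path any strategy yields; later ones are not run."""
    for strategy in strategies:
        path = strategy()
        if path is not None:
            return path
    raise FileNotFoundError("no strategy could locate the binary")
```

A GhostScript lookup would then read `resolve_binary([from_path("gs", "gswin64c"), install_with_apt, download_archive])`, where the last two callables are placeholders for the package-manager and download steps.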
Public API reference
Top-level imports
from pdfa_parser import (
create_parser, # Factory — recommended entry point
PdfParser, # Core class (canonical name)
PdfaParser, # Alias for PdfParser
ValidationResult, # Dataclass returned by validate()
DependencyManager, # Manual dependency orchestration
# For custom adapters:
IBaseAdapter,
BinaryExecuter,
GhostScriptAdapter,
VeraPDFAdapter,
)
create_parser(**kwargs) → PdfParser
| Parameter | Type | Default | Description |
|---|---|---|---|
| `pdfa_level` | `int` | `2` | PDF/A conformance level (1, 2, 3) |
| `with_verapdf` | `bool` | `True` | Attach VeraPDF for validation |
| `extra_gs_args` | `tuple[str, ...]` | `()` | Extra flags for every GhostScript call |
PdfParser methods
| Method | Returns | Description |
|---|---|---|
| `convert(input, output)` | `Path` | Convert PDF to PDF/A |
| `validate(file, *, flavour)` | `ValidationResult` | Check PDF/A compliance via VeraPDF |
| `convert_and_validate(…)` | `ValidationResult` | Convert then validate in one call |
| `a_convert(…)` | `Path` | Async convert |
| `a_validate(…)` | `ValidationResult` | Async validate |
| `a_convert_and_validate(…)` | `ValidationResult` | Async convert + validate |
All path parameters accept both str and pathlib.Path.
ValidationResult
| Field | Type | Description |
|---|---|---|
| `compliant` | `bool` | `True` if the PDF satisfies the profile |
| `profile` | `str` | Profile name (e.g. "PDF/A-2B …") |
| `details` | `str` | Raw XML snippet for debugging |
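For unit tests that should not invoke VeraPDF, a stand-in with the same three fields can be handy. The dataclass below is a hypothetical mirror of the fields listed above, not the library's own definition; `FakeValidationResult` and `summarize` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeValidationResult:
    """Test double with the three documented fields (not the real class)."""
    compliant: bool
    profile: str
    details: str = ""


def summarize(result) -> str:
    """Format any object with `compliant` and `profile` as a one-line report."""
    status = "PASS" if result.compliant else "FAIL"
    return f"{status}: {result.profile}"
```

Because `summarize` reads only attributes, it should work equally well with a real `ValidationResult` returned by `validate()`.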
Advanced usage
Custom adapters
from pdfa_parser import IBaseAdapter, BinaryExecuter, PdfParser
from pathlib import Path
class MyGSAdapter(IBaseAdapter):
def get_binary_path(self) -> Path:
return Path("/opt/gs-10/bin/gs")
parser = PdfParser(
gs_executer=BinaryExecuter(MyGSAdapter()),
pdfa_level=3,
extra_gs_args=("-dQUIET",),
)
Manual dependency management
from pdfa_parser import DependencyManager
m = DependencyManager()
# Check availability without downloading
print(m.ghostscript.is_available()) # True / False
print(m.verapdf.is_available())
# Force download / resolution
gs_path = m.ensure_ghostscript()
verapdf_path = m.ensure_verapdf()
Project structure
pdfa-parser/
├── src/pdfa_parser/
│ ├── __init__.py # Public API, create_parser(), PdfaParser alias
│ ├── __main__.py # python -m pdfa_parser
│ ├── main.py # CLI entry-point
│ ├── pdf_parser.py # PdfParser – convert / validate
│ ├── settings.py # Lazy binary-path resolution
│ ├── data/
│ │ ├── PDFA_def.ps # Bundled PostScript for PDF/A OutputIntent
│ │ └── srgb.icc # Bundled sRGB ICC profile
│ ├── dependencies/
│ │ ├── _base.py # Dependency / ResolutionStrategy ABCs
│ │ ├── _ghostscript.py # GhostScript strategies
│ │ ├── _jre.py # JRE (Adoptium) strategies
│ │ ├── _verapdf.py # VeraPDF (Maven JAR) strategies
│ │ └── _manager.py # DependencyManager orchestrator
│ ├── interfaces/
│ │ ├── base_adapter.py # IBaseAdapter (ABC)
│ │ └── binary_executer.py # BinaryExecuter (facade)
│ └── implementations/
│ ├── ghostscript_adapter.py
│ └── verapdf_adapter.py
├── tests/
│ ├── conftest.py # Fixtures, skip markers, PDF generation
│ ├── test_unit.py # Unit tests (no binaries needed)
│ ├── test_integration.py # Integration tests (real binaries)
│ ├── test_sample_files.py # Tests for bundled sample PDFs
│ ├── test_dependencies.py # Dependency resolution tests
│ └── files/
│ ├── sample_pdf.pdf # Regular PDF sample
│ └── sample_pdfa.pdf # PDF/A sample
├── pyproject.toml
├── LICENSE
└── README.md
Testing
# Everything (integration tests auto-skip if binaries are missing)
pytest -v
# Unit tests only (no binaries required)
pytest tests/test_unit.py -v
# Integration + sample file tests
pytest tests/test_integration.py tests/test_sample_files.py -v
Test suites
| Suite | Tests | Requires binaries | What it covers |
|---|---|---|---|
| `test_unit.py` | 26 | No | Helpers, XML parsing, arg building, mocked convert/validate, async, factory |
| `test_dependencies.py` | 38 | No | Dependency resolution strategies, DependencyManager, backward-compat shim |
| `test_integration.py` | 20 | Yes | Real GS conversion, VeraPDF validation, round-trip, async, multiple PDF types |
| `test_sample_files.py` | 7 | Yes | Bundled sample PDFs: conversion, validation, round-trip (sync + async) |
Integration tests generate PDFs using reportlab (portrait, landscape, coloured shapes, multi-page, text-heavy) and run them through the full GhostScript → VeraPDF pipeline. Tests are auto-skipped when binaries are not available.
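The auto-skip behaviour can be approximated with a PATH check. This is a sketch using the standard library's `unittest`, not the project's actual `conftest.py`; `requires_binary` is a hypothetical name, and relying on a PATH lookup alone is an assumption (the project may also check its own download directory).

```python
import shutil
import unittest


def requires_binary(*names):
    """Skip the decorated test unless one of `names` is found on PATH."""
    available = any(shutil.which(name) for name in names)
    return unittest.skipUnless(available, f"none of {names} found on PATH")


class ConversionSmokeTest(unittest.TestCase):
    @requires_binary("gs", "gswin64c")
    def test_ghostscript_is_callable(self):
        # Runs only when a GhostScript executable is present.
        self.assertIsNotNone(shutil.which("gs") or shutil.which("gswin64c"))
```

The same idea carries over to pytest with a `skipif` marker built from the same `shutil.which` check.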
Docker smoke test
# Build the wheel
uv build --wheel
# Run in a clean Python container
docker run --rm \
  -v "$PWD/dist:/dist" \
  -v "$PWD/tests:/tests" \
  python:3.13-slim \
  bash -c "pip install /dist/*.whl && python /tests/docker_smoke_test.py"
Requirements
| Requirement | Version | Notes |
|---|---|---|
| Python | ≥ 3.10 | No runtime dependencies beyond the standard library |
| GhostScript | any | Taken from the system PATH if present; otherwise installed via apt-get or downloaded |
| Java (JRE) | ≥ 11 | Auto-downloaded from Adoptium if missing |
| VeraPDF | 1.26.5 | Auto-downloaded from Maven Central |
License