Convert PDFs to PDF/A using GhostScript and validate compliance with VeraPDF.
pdfa-parser
pdfa-parser is a lightweight Python library (Python ≥ 3.14) that wraps GhostScript for PDF → PDF/A conversion and VeraPDF for conformance validation. Both synchronous and asynchronous APIs are provided out of the box.
Features
- PDF → PDF/A conversion via GhostScript (levels 1, 2, 3).
- PDF/A validation via VeraPDF (flavours 1a/1b, 2a/2b, 3a/3b, …).
- Sync & async – every public method has an a_-prefixed async counterpart.
- Factory function (create_parser()) for zero-config quick start.
- Adapter pattern – swap GhostScript/VeraPDF for any CLI tool by implementing IBaseAdapter.
- CLI – python -m pdfa_parser input.pdf output.pdf.
Project structure
pdfa-parser/
├── scripts/
│ └── setup_binaries.py # Download & install GhostScript + VeraPDF
├── src/
│ ├── bin/
│ │ ├── ghostscript/ # GhostScript binary (gswin64c.exe / gs)
│ │ └── verapdf/ # VeraPDF CLI (verapdf.bat / verapdf)
│ └── pdfa_parser/
│ ├── __init__.py # Public API + create_parser()
│ ├── __main__.py # python -m pdfa_parser
│ ├── main.py # CLI entry-point
│ ├── pdf_parser.py # PdfParser – convert / validate
│ ├── settings.py # Binary path resolution
│ ├── interfaces/
│ │ ├── base_adapter.py # IBaseAdapter (ABC)
│ │ └── binary_executer.py # BinaryExecuter (facade)
│ └── implementations/
│ ├── ghostscript_adapter.py
│ └── verapdf_adapter.py
├── tests/
│ ├── conftest.py # Fixtures: PDF generation, skip markers
│ ├── test_unit.py # 26 unit tests (no binaries needed)
│ └── test_integration.py # 20 integration tests (need binaries)
├── pyproject.toml
└── README.md
Requirements
| Dependency | Required for | Notes |
|---|---|---|
| Python ≥ 3.14 | Everything | Uses match/case, type \| union syntax, etc. |
| GhostScript | Conversion | gswin64c.exe (Win) or gs (Unix) |
| VeraPDF | Validation | Requires Java ≥ 11 on PATH |
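A quick way to check the table above on a given machine is sketched below. It is not part of pdfa-parser; `check_prerequisites` is a hypothetical helper that uses only the standard library, and it checks that a `java` executable is on PATH without verifying its version.

```python
import shutil
import sys

MIN_PYTHON = (3, 14)  # minimum version stated in the requirements table


def check_prerequisites() -> dict[str, bool]:
    """Report which prerequisites look satisfied on this machine."""
    return {
        "python": sys.version_info >= MIN_PYTHON,
        # VeraPDF needs a Java runtime; this only checks presence, not version.
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```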
Installation
1. Clone & create virtual environment
git clone <repo-url> pdfa-parser
cd pdfa-parser
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux / macOS
source .venv/bin/activate
2. Install the package (editable + dev dependencies)
# Using uv (recommended)
uv pip install -e ".[dev]"
# Or plain pip
pip install -e ".[dev]"
3. Install binaries
The automated setup script downloads GhostScript and VeraPDF into src/bin/:
python scripts/setup_binaries.py # both
python scripts/setup_binaries.py --gs # GhostScript only
python scripts/setup_binaries.py --verapdf # VeraPDF only
Prerequisites for the script:
| Binary | Windows requirement | Unix requirement |
|---|---|---|
| GhostScript | 7-Zip on PATH (7z) | tar (pre-installed) |
| VeraPDF | Java ≥ 11 on PATH | Java ≥ 11 on PATH |
Tip: You can also install GhostScript / VeraPDF manually and copy (or symlink) the executables into src/bin/ghostscript/ and src/bin/verapdf/.
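After a manual install it can help to confirm that the executables ended up where the project expects them. The sketch below is not the library's own path resolution (that lives in settings.py); `expected_binaries` and `missing_binaries` are hypothetical helpers, and they assume each executable sits directly inside its folder under src/bin/ with the file names listed in the project structure.

```python
import platform
from pathlib import Path
from typing import Optional


def expected_binaries(bin_root: Path, system: Optional[str] = None) -> dict[str, Path]:
    """Map each tool to the path where its executable is assumed to live."""
    windows = (system or platform.system()) == "Windows"
    return {
        "ghostscript": bin_root / "ghostscript" / ("gswin64c.exe" if windows else "gs"),
        "verapdf": bin_root / "verapdf" / ("verapdf.bat" if windows else "verapdf"),
    }


def missing_binaries(bin_root: Path, system: Optional[str] = None) -> list[str]:
    """Return the names of tools whose executable file is not present."""
    found = expected_binaries(bin_root, system)
    return [name for name, path in found.items() if not path.is_file()]


if __name__ == "__main__":
    print(missing_binaries(Path("src/bin")) or "all binaries found")
```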
Quick start
Python API
from pdfa_parser import create_parser
# Create a parser with default adapters
parser = create_parser() # GhostScript + VeraPDF
parser = create_parser(with_verapdf=False) # GhostScript only
# Convert PDF to PDF/A-2
parser.convert("input.pdf", "output_pdfa.pdf")
# Validate a file
result = parser.validate("output_pdfa.pdf", flavour="2b")
print(result.compliant) # True / False
print(result.profile) # "PDF/A-2B validation profile"
# One-shot: convert then validate
result = parser.convert_and_validate("input.pdf", "output_pdfa.pdf")
assert result.compliant
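The one-shot call extends naturally to a whole folder. The following is a minimal sketch rather than a library feature: `convert_folder` is a hypothetical helper, and it relies only on the `convert_and_validate()` method and the `compliant` attribute shown above.

```python
from pathlib import Path


def convert_folder(parser, src_dir: Path, dst_dir: Path) -> dict[str, bool]:
    """Convert every *.pdf in src_dir and record whether the output is compliant."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    report = {}
    for pdf in sorted(src_dir.glob("*.pdf")):
        target = dst_dir / pdf.name
        # convert_and_validate() and .compliant are the calls shown in the quick start.
        result = parser.convert_and_validate(str(pdf), str(target))
        report[pdf.name] = bool(result.compliant)
    return report
```

With a real parser this would be called as `convert_folder(create_parser(), Path("in"), Path("out"))`; the returned dictionary maps each file name to its compliance flag.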
Async API
Every method has an async twin with an a_ prefix:
import asyncio
from pdfa_parser import create_parser
async def main():
    parser = create_parser()
    await parser.a_convert("input.pdf", "output.pdf")
    result = await parser.a_validate("output.pdf")
    print(result.compliant)

asyncio.run(main())
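The async twins become useful when several files are converted at once. The sketch below assumes only the `a_convert()` method shown above; `convert_many` and its `limit` parameter are hypothetical, and whether concurrent GhostScript processes suit a given workload is something to verify rather than assume.

```python
import asyncio


async def convert_many(parser, jobs: list[tuple[str, str]], limit: int = 4) -> list[str]:
    """Run a_convert() for several (input, output) pairs, at most `limit` at a time."""
    gate = asyncio.Semaphore(limit)

    async def one(src: str, dst: str) -> str:
        async with gate:  # cap the number of conversions in flight
            await parser.a_convert(src, dst)
        return dst

    # gather() preserves the order of the submitted jobs in its result list.
    return await asyncio.gather(*(one(src, dst) for src, dst in jobs))
```

A call might look like `asyncio.run(convert_many(create_parser(), [("a.pdf", "a_pdfa.pdf"), ("b.pdf", "b_pdfa.pdf")]))`.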
CLI
# Basic conversion
python -m pdfa_parser input.pdf output.pdf
# With validation
python -m pdfa_parser input.pdf output.pdf --validate
# PDF/A level 1, flavour 1b
python -m pdfa_parser input.pdf output.pdf --level 1 --validate --flavour 1b
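The same commands can be assembled from Python, for example when driving the CLI from another script. This is a sketch: `build_cli_command` and `run_cli` are hypothetical helpers, only the `--level`, `--validate` and `--flavour` flags shown above are used, and the meaning of the CLI's exit code is not documented here, so it is returned untouched.

```python
import subprocess
import sys


def build_cli_command(src, dst, level=None, validate=False, flavour=None) -> list[str]:
    """Build the argument list for `python -m pdfa_parser ...`."""
    cmd = [sys.executable, "-m", "pdfa_parser", str(src), str(dst)]
    if level is not None:
        cmd += ["--level", str(level)]
    if validate:
        cmd.append("--validate")
    if flavour is not None:
        cmd += ["--flavour", flavour]
    return cmd


def run_cli(*args, **kwargs) -> int:
    """Run the CLI and return its raw exit code."""
    return subprocess.run(build_cli_command(*args, **kwargs), check=False).returncode
```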
Advanced usage
Custom adapters
from pdfa_parser import IBaseAdapter, BinaryExecuter, PdfParser
from pathlib import Path
class MyGSAdapter(IBaseAdapter):
    def get_binary_path(self) -> Path:
        return Path("/usr/local/bin/gs")

parser = PdfParser(
    gs_executer=BinaryExecuter(MyGSAdapter()),
    pdfa_level=3,
    extra_gs_args=("-dQUIET",),
)
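To make the adapter idea concrete, here is a standalone sketch of how an adapter and an executer could fit together. It does not reproduce the library's classes: `BaseAdapterSketch`, `ExecuterSketch` and the `run` method are hypothetical stand-ins, and only `get_binary_path()` is taken from the example above.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from pathlib import Path


class BaseAdapterSketch(ABC):
    """Stand-in for IBaseAdapter: it only knows where its binary lives."""

    @abstractmethod
    def get_binary_path(self) -> Path: ...


class ExecuterSketch:
    """Stand-in for BinaryExecuter: runs the adapter's binary with arguments."""

    def __init__(self, adapter: BaseAdapterSketch) -> None:
        self.adapter = adapter

    def run(self, *args: str) -> str:  # `run` is a hypothetical method name
        completed = subprocess.run(
            [str(self.adapter.get_binary_path()), *args],
            capture_output=True, text=True, check=True,
        )
        return completed.stdout


class PythonAdapter(BaseAdapterSketch):
    """Demo adapter that points at the current Python interpreter."""

    def get_binary_path(self) -> Path:
        return Path(sys.executable)


if __name__ == "__main__":
    print(ExecuterSketch(PythonAdapter()).run("-c", "print('ok')"))
```

Because the executer depends only on `get_binary_path()`, swapping in another CLI tool means writing one small adapter class.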
Custom PDF/A level & extra GhostScript flags
parser = create_parser(pdfa_level=3, extra_gs_args=("-dQUIET", "-r300"))
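For orientation, the sketch below shows how a wrapper might turn a PDF/A level and extra flags into a GhostScript argument list. It is an assumption about the general shape, not the library's actual argument builder: `build_gs_args` is hypothetical, the switches are standard GhostScript options, and a real PDF/A conversion typically also needs a PDFA_def.ps definition file and an ICC profile, which are left out here.

```python
def build_gs_args(
    src: str,
    dst: str,
    pdfa_level: int = 2,
    extra_gs_args: tuple[str, ...] = (),
) -> list[str]:
    """Assemble GhostScript arguments for a PDF/A conversion (illustrative only)."""
    if pdfa_level not in (1, 2, 3):
        raise ValueError("pdfa_level must be 1, 2 or 3")
    return [
        f"-dPDFA={pdfa_level}",           # requested PDF/A level
        "-dBATCH",                        # exit after processing
        "-dNOPAUSE",                      # do not prompt between pages
        "-sDEVICE=pdfwrite",              # write a PDF file
        "-sColorConversionStrategy=RGB",  # convert colours to one colour space
        "-dPDFACompatibilityPolicy=1",    # drop features that violate PDF/A
        *extra_gs_args,                   # caller-supplied flags such as -dQUIET
        f"-sOutputFile={dst}",
        src,
    ]
```

In this sketch the extra flags are placed before the output and input paths, so options such as -r300 apply to the file being processed.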
Testing
# Unit tests (no binaries required) – always runnable
pytest tests/test_unit.py -v
# Integration tests (require GhostScript + VeraPDF in src/bin/)
pytest tests/test_integration.py -v
# Everything
pytest -v
Test coverage summary
| Suite | Tests | Requires binaries | What it covers |
|---|---|---|---|
| test_unit.py | 26 | No | Helpers, XML parsing, arg building, mocked convert/validate, async, factory |
| test_integration.py | 20 | Yes | Real conversion (portrait, landscape, color, multi-page, text-heavy), VeraPDF validation, round-trip, async, PDF/A-1b |
Integration tests generate various PDF types using reportlab (portrait, landscape, coloured shapes, multi-page, text-heavy) and run them through the full GhostScript → VeraPDF pipeline. Tests are auto-skipped when binaries are missing.
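The auto-skip behaviour can be illustrated with the standard library alone. The project's own skip markers live in conftest.py and use pytest; the sketch below swaps in `unittest.skipUnless` as the stdlib analogue, and it looks for GhostScript on PATH rather than in src/bin/, which is a simplifying assumption.

```python
import shutil
import unittest

# True when a GhostScript executable can be found on PATH (assumed lookup).
HAVE_GS = shutil.which("gs") is not None or shutil.which("gswin64c") is not None


@unittest.skipUnless(HAVE_GS, "GhostScript binary not found")
class ConversionSmokeTest(unittest.TestCase):
    """Runs only when GhostScript is available; otherwise reported as skipped."""

    def test_binary_is_resolvable(self):
        self.assertTrue(HAVE_GS)


if __name__ == "__main__":
    unittest.main()
```

In pytest the equivalent would be a `pytest.mark.skipif` marker with the same condition inverted.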
License
See LICENSE.