Python library to extract and fill PDF forms using pypdf
privacyforms-pdf
Python library for extracting and filling PDF forms using pypdf.
Features
- Extract form data from PDF files using pure Python (no external tools required; pypdf does the PDF parsing)
- Fill PDF forms programmatically
- Extract field geometry (position and size) information
- Command-line interface with multiple commands
- Full type hints and comprehensive test coverage (99%)
- Support for all form field types (text, date, checkbox, radio button groups, etc.)
Requirements
- Python 3.14+
- pypdf >= 5.0
Installation
# Clone the repository
git clone <repo-url>
cd privacyforms-pdf
# Install with uv
uv sync
Quick Start
Check CLI is ready
pdf-forms check
Command Line Usage
# Check if a PDF contains a form
pdf-forms info form.pdf
# List all form fields
pdf-forms list-fields form.pdf
# Get a specific field value
pdf-forms get-value form.pdf "Field Name"
# Extract form data to JSON
pdf-forms extract form.pdf -o output.json
# Extract form data to stdout
pdf-forms extract form.pdf
# Fill a form from JSON (validates before filling)
pdf-forms fill-form form.pdf data.json -o filled.pdf
# Fill a form without validation
pdf-forms fill-form form.pdf data.json -o filled.pdf --no-validate
# Fill a form in-place (modifies original)
pdf-forms fill-form form.pdf data.json
# Fill with strict mode (the JSON must provide a value for every form field)
pdf-forms fill-form form.pdf data.json -o filled.pdf --strict
JSON Format
The fill-form command accepts a flat JSON object: each key is a field name and each value is the value to write into that field.
{
  "Candidate Name": "John Smith",
  "Position": "Software Engineer",
  "Start date": "2025-06-01",
  "Full time": true,
  "Diploma or GED": "Yes"
}
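The data file can also be produced from Python. The following is a minimal sketch using only the standard library; the helper name to_fill_json is hypothetical and not part of privacyforms-pdf. It shows that a Python True becomes a JSON true, which is the form used for the checkbox field in the example above.

```python
from __future__ import annotations

import json

# Field names are the keys; strings and booleans are the values.
fill_data = {
    "Candidate Name": "John Smith",
    "Position": "Software Engineer",
    "Start date": "2025-06-01",
    "Full time": True,  # serialized as JSON true
    "Diploma or GED": "Yes",
}


def to_fill_json(data: dict[str, str | bool]) -> str:
    """Serialize a flat field-name -> value mapping as a JSON object.

    Hypothetical helper for illustration; not part of the library.
    """
    return json.dumps(data, indent=2, ensure_ascii=False)


text = to_fill_json(fill_data)
print(text)

# To use the result with the CLI, save it to a file first, for example:
# pathlib.Path("data.json").write_text(text, encoding="utf-8")
```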
Python API
from privacyforms_pdf import PDFFormExtractor
# Initialize the extractor
extractor = PDFFormExtractor()
# Extract form data
form_data = extractor.extract("form.pdf")
# Access form information
print(f"PDF Version: {form_data.pdf_version}")
print(f"Has Form: {form_data.has_form}")
print(f"Total Fields: {len(form_data.fields)}")
# Iterate over fields
for field in form_data.fields:
    print(f"{field.name}: {field.value}")
# Get specific field value
value = extractor.get_field_value("form.pdf", "Field Name")
# Check if PDF has a form
has_form = extractor.has_form("form.pdf")
# Export to JSON file
extractor.extract_to_json("form.pdf", "output.json")
# Fill a form using simple key:value format
form_data = {
    "Candidate Name": "John Smith",
    "Position": "Software Engineer",
    "Full time": True,
    "Start date": "2025-06-01"
}
extractor.fill_form("form.pdf", form_data, "filled.pdf")
# Or fill from a JSON file
extractor.fill_form_from_json("form.pdf", "data.json", "filled.pdf")
# Validate data before filling (returns list of errors)
errors = extractor.validate_form_data("form.pdf", form_data)
if errors:
    print("Validation errors:", errors)
API Reference
PDFFormExtractor
The main class for extracting and filling PDF form data.
Constructor
extractor = PDFFormExtractor(
    timeout_seconds=30.0,   # float, default 30.0
    extract_geometry=True,  # bool, default True
)
- timeout_seconds: Timeout for operations (kept for API compatibility).
- extract_geometry: Whether to extract field geometry information.
Methods
- has_form(pdf_path: str | Path) -> bool: Check if a PDF contains a form.
- extract(pdf_path: str | Path) -> PDFFormData: Extract form data from a PDF.
- extract_to_json(pdf_path: str | Path, output_path: str | Path) -> None: Export form data to a JSON file.
- list_fields(pdf_path: str | Path) -> list[PDFField]: List all form fields in a PDF.
- get_field_value(pdf_path: str | Path, field_name: str) -> str | bool | None: Get the value of a specific form field.
- get_field_by_id(pdf_path: str | Path, field_id: str) -> PDFField | None: Get a form field by its ID.
- get_field_by_name(pdf_path: str | Path, field_name: str) -> PDFField | None: Get a form field by its name.
- validate_form_data(pdf_path: str | Path, form_data: dict, *, strict: bool = False, allow_extra_fields: bool = False) -> list[str]: Validate form data (simple key:value format).
- fill_form(pdf_path: str | Path, form_data: dict, output_path: str | Path | None = None, *, validate: bool = True) -> Path: Fill a PDF form with data.
- fill_form_from_json(pdf_path: str | Path, json_path: str | Path, output_path: str | Path | None = None, *, validate: bool = True) -> Path: Fill a PDF form with data from a JSON file.
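To make the strict and allow_extra_fields flags of validate_form_data more concrete, here is a rough sketch of how such a check could behave on plain Python data. It is not the library's implementation: the function validate_sketch is hypothetical, the meaning of strict is taken from the CLI description ("requires all form fields"), and the meaning of allow_extra_fields is an assumption based on its name.

```python
from __future__ import annotations


def validate_sketch(
    field_names: set[str],
    form_data: dict[str, object],
    *,
    strict: bool = False,
    allow_extra_fields: bool = False,
) -> list[str]:
    """Return a list of error messages; an empty list means the data passed.

    Hypothetical stand-in for illustration only.
    """
    errors: list[str] = []
    # Assumed behavior: keys that match no form field are rejected
    # unless extra fields are explicitly allowed.
    if not allow_extra_fields:
        for name in sorted(set(form_data) - field_names):
            errors.append(f"Unknown field: {name!r}")
    # Assumed behavior: strict mode requires a value for every form field.
    if strict:
        for name in sorted(field_names - set(form_data)):
            errors.append(f"Missing value for field: {name!r}")
    return errors


fields = {"Candidate Name", "Position", "Full time"}
data = {"Candidate Name": "John Smith", "Nickname": "JS"}

lenient = validate_sketch(fields, data)
strict = validate_sketch(fields, data, strict=True, allow_extra_fields=True)
print(lenient)
print(strict)
```

With the real library, the list returned by extractor.validate_form_data(...) can be checked the same way: an empty list means there is nothing to report.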
Data Classes
PDFFormData
Represents extracted PDF form data.
- source: Path: Path to the source PDF file.
- pdf_version: str: Version of the PDF.
- has_form: bool: Whether the PDF contains a form.
- fields: list[PDFField]: List of form fields.
- raw_data: dict[str, Any]: The raw data from pypdf.
PDFField
Represents a single form field.
- name: str: The name of the field.
- id: str: The unique identifier of the field.
- field_type: str: The type of the form field (e.g., 'textfield', 'checkbox').
- value: str | bool: The current value of the field.
- pages: list[int]: List of pages where this field appears.
- locked: bool: Whether the field is locked.
- geometry: FieldGeometry | None: Optional geometry information (position and size).
- format: str | None: Date format for datefield types.
- options: list[str]: Available options for radiobuttongroup, combobox, listbox types.
FieldGeometry
Represents the geometry (position and size) of a form field.
- page: int: 1-based page number where field appears.
- rect: tuple[float, float, float, float]: Bounding box as (x1, y1, x2, y2) in PDF points.
- x: float: Left coordinate.
- y: float: Bottom coordinate (PDF coordinate system).
- width: float: Field width in points.
- height: float: Field height in points.
- units: str: Unit of measurement (always "pt" for points).
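The relationship between rect and the derived x, y, width and height values can be sketched as follows. GeometrySketch is a hypothetical stand-in, not the library's FieldGeometry class; it only illustrates how the derived values follow from the bounding box. Taking min and abs is a defensive assumption, since a PDF rectangle may list its two corners in either order.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GeometrySketch:
    """Hypothetical stand-in mirroring the documented FieldGeometry fields."""

    page: int
    rect: tuple[float, float, float, float]  # (x1, y1, x2, y2) in points
    units: str = "pt"

    @property
    def x(self) -> float:
        # Left edge of the bounding box.
        return min(self.rect[0], self.rect[2])

    @property
    def y(self) -> float:
        # Bottom edge (PDF origin is at the bottom-left of the page).
        return min(self.rect[1], self.rect[3])

    @property
    def width(self) -> float:
        return abs(self.rect[2] - self.rect[0])

    @property
    def height(self) -> float:
        return abs(self.rect[3] - self.rect[1])


g = GeometrySketch(page=1, rect=(53.0, 1077.0, 414.0, 1104.0))
print(g.x, g.y, g.width, g.height)
```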
JSON Export Format
When using pdf-forms extract or extract_to_json(), the output JSON has the following structure:
{
  "source": "path/to/form.pdf",
  "pdf_version": "1.7",
  "has_form": true,
  "fields": [
    {
      "name": "Field Name",
      "id": "1",
      "field_type": "textfield",
      "value": "Field Value",
      "pages": [1],
      "locked": false,
      "geometry": {
        "page": 1,
        "rect": [53.0, 1077.0, 414.0, 1104.0],
        "x": 53.0,
        "y": 1077.0,
        "width": 361.0,
        "height": 27.0,
        "units": "pt"
      },
      "format": null,
      "options": []
    }
  ]
}
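Because the export is plain JSON, it can be processed with the standard library alone. The sketch below collapses an export with the structure shown above into a flat name-to-value mapping, the same shape as the simple key:value format described for fill-form. The helper export_to_fill_data is hypothetical, and whether every exported value can be fed back into fill-form unchanged is an assumption that is not verified here.

```python
from __future__ import annotations

import json

# A shortened export with the structure documented above.
export_text = """
{
  "source": "path/to/form.pdf",
  "pdf_version": "1.7",
  "has_form": true,
  "fields": [
    {
      "name": "Field Name",
      "id": "1",
      "field_type": "textfield",
      "value": "Field Value",
      "pages": [1],
      "locked": false,
      "geometry": null,
      "format": null,
      "options": []
    }
  ]
}
"""


def export_to_fill_data(export: dict) -> dict:
    """Map each exported field name to its current value.

    Hypothetical helper for illustration; not part of the library.
    """
    return {field["name"]: field["value"] for field in export.get("fields", [])}


export = json.loads(export_text)
fill_data = export_to_fill_data(export)
print(fill_data)
```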
Field Types:
- textfield: Text input fields
- datefield: Date input fields (may include format attribute)
- checkbox: Boolean/checkbox fields (value is true or false)
- radiobuttongroup: Radio button groups (may include options array)
- combobox: Dropdown/combo boxes (may include options array)
- listbox: List selection boxes (may include options array)
- signature: Signature fields
Geometry:
The geometry object contains the field's position and size in PDF points (1/72 inch):
- rect: Array of [x0, y0, x1, y1] coordinates
- x, y: Bottom-left corner position
- width, height: Field dimensions
- Note: PDF coordinates have origin (0,0) at bottom-left of the page
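Many image and GUI toolkits place the origin at the top-left instead, so a conversion is often needed. The following is a sketch under stated assumptions: the helper names are hypothetical, and the page height is passed in as page_height_pt because the export format shown above does not include it. The value 1200.0 is made up for the example.

```python
from __future__ import annotations

POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def pt_to_inch(points: float) -> float:
    """Convert PDF points to inches (1 pt = 1/72 inch)."""
    return points / POINTS_PER_INCH


def pt_to_mm(points: float) -> float:
    """Convert PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


def to_top_left(geometry: dict, page_height_pt: float) -> dict:
    """Re-express a bottom-left-origin box relative to the top-left corner.

    page_height_pt is an assumed input that the caller must supply.
    """
    top = page_height_pt - (geometry["y"] + geometry["height"])
    return {
        "left": geometry["x"],
        "top": top,
        "width": geometry["width"],
        "height": geometry["height"],
    }


geometry = {"x": 53.0, "y": 1077.0, "width": 361.0, "height": 27.0}
box = to_top_left(geometry, page_height_pt=1200.0)  # hypothetical page height
print(box)
print(round(pt_to_inch(box["width"]), 3), "in")
```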
Exceptions
- PDFFormError: Base exception for PDF form related errors.
- PDFFormNotFoundError: Raised when the PDF does not contain any forms.
- FormValidationError: Raised when form data validation fails.
- FieldNotFoundError: Raised when a field is not found in the form.
Note: For backwards compatibility, the following aliases are still available but deprecated:
- PDFCPUError (alias for PDFFormError)
- PDFCPUNotFoundError (alias for PDFFormError)
- PDFCPUExecutionError (alias for PDFFormError)
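The sketch below illustrates how a base exception and a deprecated alias typically interact in except clauses. The classes are local stand-ins defined only so the example runs without the package installed; in real code they would be imported from privacyforms_pdf. It assumes that the specific exceptions subclass PDFFormError, as "base exception" suggests, and that an alias is simply another name for the same class.

```python
from __future__ import annotations


# Local stand-ins for illustration; not the library's own definitions.
class PDFFormError(Exception):
    """Stand-in for the documented base exception."""


class PDFFormNotFoundError(PDFFormError):
    """Stand-in: the PDF does not contain any forms."""


class FieldNotFoundError(PDFFormError):
    """Stand-in: a field is not found in the form."""


# A deprecated alias is assumed to be a second name for the same class.
PDFCPUError = PDFFormError


def describe(exc: Exception) -> str:
    """Show which except clause handles a given exception."""
    try:
        raise exc
    except PDFFormNotFoundError:
        # Most specific handler first.
        return "no form"
    except PDFFormError:
        # Catches every other form-related error, including code that
        # still raises or catches the deprecated alias names.
        return "form error"


print(describe(PDFFormNotFoundError("form.pdf")))
print(describe(FieldNotFoundError("Field Name")))
print(PDFCPUError is PDFFormError)
```

Under these assumptions, existing handlers written against the old names keep working, while new code can catch PDFFormError directly.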
Development
Running Tests
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov
# Run linting
uv run ruff check .
# Run type checking
uv run ty check
Project Structure
privacyforms-pdf/
├── privacyforms_pdf/ # Main package
│ ├── __init__.py # Package exports
│ ├── extractor.py # PDFFormExtractor implementation
│ └── cli.py # Command-line interface
├── tests/ # Test suite
│ ├── test_extractor.py # Tests for extractor
│ └── test_cli.py # Tests for CLI
├── pyproject.toml # Project configuration
└── README.md # This file
License
Copyright 2026 Andreas Jung (info@zopyx.com)