Skip to main content

Read entered values out of XFA / LiveCycle 'dynamic' PDF forms (IRCC and other government forms) that standard PDF field extraction misses.

Project description

xfa-extract

PyPI version Python versions License: MIT

Read the entered values out of XFA / LiveCycle "dynamic" PDF forms — the ones where pypdf.get_fields() comes back empty even though the form is clearly filled in.

If you've ever hit this:

I filled out a government PDF (an IRCC immigration form, a tax form, …), but when I run PdfReader(...).get_fields() or pdftk dump_data_fields, the values are blank or missing. The form template text extracts fine, but none of the answers show up. Or the PDF just shows "Please wait… If this message is not eventually replaced…".

…then your PDF is an XFA form, and this tool reads it.

Why standard extraction misses the data

A normal interactive PDF (an AcroForm) stores each field's value in its /V entry — pypdf, pdftk, pdfminer all read those fine.

An XFA form (Adobe LiveCycle / "dynamic" PDF — what most government and immigration forms are) does not keep the entered data in /V. It keeps it in an XML packet inside the AcroForm dictionary under the /XFA key, in a sub-packet called datasets. So get_fields() and text extraction look blank even on a fully completed form. xfa-extract detects XFA, pulls the datasets packet, parses it, and gives you the field → value map.

Read-only. It never writes to or mutates your PDF.

Install

pip install xfa-extract              # core (pypdf + lxml)
pip install "xfa-extract[robust]"    # + pikepdf fallback for unusual PDFs

Use it — command line

xfa-extract FORM.pdf                 # human-readable tree + flat "path: value" table
xfa-extract FORM.pdf --json          # machine-readable JSON (for scripts / LLMs)
xfa-extract FORM.pdf --flatten       # just the path: value table

Every run also writes the raw datasets XML to --raw-out (default ./xfa_datasets.xml) for auditing.

$ xfa-extract application.pdf --flatten
form1.PersonalInfo.Surname           Smith
form1.PersonalInfo.GivenName         Jane
form1.Dependents.Dependent[0].Name   Alex
form1.Dependents.Dependent[1].Name   Sam

Repeating sections (multiple dependents, applicants, addresses) are indexed (Dependent[0], Dependent[1], …), never collapsed.

Use it — as a library

from xfa_extract import locate_datasets, parse_datasets

kind, datasets, _packets, _engine = locate_datasets("FORM.pdf")
if kind == "xfa" and datasets:
    tree, flat = parse_datasets(datasets)
    print(flat["form1.PersonalInfo.Surname"])   # -> "Smith"

Understand the form's schema, not just its values

XFA forms also carry a template packet — the form's intelligence. parse_template() turns it into a per-field schema: field kind, the human caption, a dropdown/radio's valid values (export code ↔ display label), the expected format, and whether the field runs scripts:

from xfa_extract import parse_template, schema_for

schema = parse_template("FORM.pdf")
f = schema_for(schema, "form1.PersonalInfo.Country")
f.kind        # "choice"
f.caption     # "Country of birth or territory"
f.choices     # [("1", "Canada"), ("2", "Other")]  — datasets stores the export code
f.picture     # e.g. "date{YYYY-MM-DD}" on date fields
f.scripted    # True if the field has calculate/validate/event scripts

This is what lets a filler (see xfa-fill) accept "Canada" and write the "1" the form actually stores.

Exit codes (the CLI tells you which case you're in)

code meaning what to do
0 XFA data extracted (≥1 non-empty value) use the values
2 not XFA — AcroForm-only or no form use get_fields(); the tool prints those values for you as a convenience
3 XFA but no datasets, or the form is empty/unfilled report "unfilled / no entered data"
4 parse failure the raw XML is still written to --raw-out for inspection

Tested against real forms

Validated on synthetic fixtures (in CI) and 14 real-world XFA forms — IRCC IMM5257 / 1295 / 1344 / 5710 / 5669, a DHL waybill, an Indian MCA MGT-7, an Ontario lease, a US DOL form, French CERFA — plus a real filled Canadian Proof-of-Citizenship application. Repeating sections, namespaces, Adobe's quirky tag serialization, and base64-image-bearing datasets all handled. See skill/REFERENCE.md for the deep dive.

What it does not do

  • Fill / write values into XFA forms — that's the job of the companion package xfa-fill, which uses this package's template schema and read-back verification.
  • Flatten / render XFA to static pages — different operation.
  • OCR — these are digital forms, not scans.

Use it with Claude / Claude Code

This repo also ships an Agent Skill so Claude Code automatically reaches for it when a fillable PDF's values come back blank. Point Claude at skill/.

License

MIT © Ryan Kashtan. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xfa_extract-0.2.0.tar.gz (17.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

xfa_extract-0.2.0-py3-none-any.whl (14.4 kB view details)

Uploaded Python 3

File details

Details for the file xfa_extract-0.2.0.tar.gz.

File metadata

  • Download URL: xfa_extract-0.2.0.tar.gz
  • Upload date:
  • Size: 17.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xfa_extract-0.2.0.tar.gz
Algorithm Hash digest
SHA256 2e424c3e23b47447304d0f538b762b01830d50dd35ac1837a05a5791e0fefc39
MD5 786216380dd8a68a7a216ad572585655
BLAKE2b-256 85aaa5e85eb70d05cf73a0b38332b57f00eb272c08d811845933180ce8091a3c

See more details on using hashes here.

Provenance

The following attestation bundles were made for xfa_extract-0.2.0.tar.gz:

Publisher: release.yml on ryanjkashtan/xfa-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file xfa_extract-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: xfa_extract-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 14.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xfa_extract-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e23c9cecf75b8fd4f57b18bf919ea43a7f482bdbfce97ee0cc36fb0489161deb
MD5 1c19b0caca91be9dc6c5ed4ae2be27ec
BLAKE2b-256 4a816a119be3cfd1b22526edd46a5ee4c6ca2f1f61b1097440809bab9f777f50

See more details on using hashes here.

Provenance

The following attestation bundles were made for xfa_extract-0.2.0-py3-none-any.whl:

Publisher: release.yml on ryanjkashtan/xfa-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page