Read entered values out of XFA / LiveCycle 'dynamic' PDF forms (IRCC and other government forms) that standard PDF field extraction misses.

xfa-extract

Read the entered values out of XFA / LiveCycle "dynamic" PDF forms: the ones where pypdf's PdfReader.get_fields() comes back empty even though the form is clearly filled in.

If you've ever hit this:

I filled out a government PDF (an IRCC immigration form, a tax form, …), but when I run PdfReader(...).get_fields() or pdftk dump_data_fields, the values are blank or missing. The form template text extracts fine, but none of the answers show up. Or the PDF just shows "Please wait… If this message is not eventually replaced…".

…then your PDF is an XFA form, and this tool reads it.

Why standard extraction misses the data

A normal interactive PDF (an AcroForm) stores each field's value in its /V entry — pypdf, pdftk, pdfminer all read those fine.

An XFA form (Adobe LiveCycle / "dynamic" PDF, the format many government and immigration forms use) does not keep the entered data in /V. It keeps it as XML stored under the /XFA key of the AcroForm dictionary, in a packet called datasets. So get_fields() and text extraction come back blank even on a fully completed form. xfa-extract detects XFA, pulls the datasets packet, parses it, and gives you the field → value map.
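
To make the /XFA lookup concrete, here is a minimal sketch of finding the packets with pypdf. It is an illustration, not xfa-extract's own code: the helper names pair_packets and read_xfa_packets are hypothetical, and the sketch assumes /XFA is either a single stream or an array that alternates packet names and streams, as the PDF specification describes.

```python
from typing import Any, Dict


def pair_packets(xfa_entry: Any) -> Dict[str, bytes]:
    """Turn the value of /XFA into {packet_name: raw_bytes}.

    Hypothetical helper. /XFA is either one stream holding the whole XDP
    document, or an array laid out as [name, stream, name, stream, ...].
    """
    if not isinstance(xfa_entry, (list, tuple)):
        # Single-stream form: everything lives in one XML document.
        return {"xdp": xfa_entry.get_object().get_data()}

    packets: Dict[str, bytes] = {}
    # Even positions are packet names, odd positions are the streams.
    for name, stream in zip(xfa_entry[0::2], xfa_entry[1::2]):
        # Array items may be indirect references, so resolve them first.
        packets[str(name)] = stream.get_object().get_data()
    return packets


def read_xfa_packets(pdf_path: str) -> Dict[str, bytes]:
    """Return the XFA packets of a PDF, or {} when there is no /XFA entry."""
    from pypdf import PdfReader  # imported lazily; only needed for real PDFs

    catalog = PdfReader(pdf_path).trailer["/Root"]
    if "/AcroForm" not in catalog:
        return {}  # no interactive form at all
    acroform = catalog["/AcroForm"]
    if "/XFA" not in acroform:
        return {}  # plain AcroForm: values live in each field's /V
    return pair_packets(acroform["/XFA"])
```

Calling read_xfa_packets("FORM.pdf") on an XFA form would typically return keys such as template and datasets; the datasets bytes are the XML that holds the entered values.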

Read-only. It never writes to or mutates your PDF.

Install

pip install xfa-extract              # core (pypdf + lxml)
pip install "xfa-extract[robust]"    # + pikepdf fallback for unusual PDFs

Use it — command line

xfa-extract FORM.pdf                 # human-readable tree + flat "path: value" table
xfa-extract FORM.pdf --json          # machine-readable JSON (for scripts / LLMs)
xfa-extract FORM.pdf --flatten       # just the path: value table

Every run also writes the raw datasets XML to the path given by --raw-out (default ./xfa_datasets.xml) for auditing.

$ xfa-extract application.pdf --flatten
form1.PersonalInfo.Surname           Smith
form1.PersonalInfo.GivenName         Jane
form1.Dependents.Dependent[0].Name   Alex
form1.Dependents.Dependent[1].Name   Sam

Repeating sections (multiple dependents, applicants, addresses) are indexed (Dependent[0], Dependent[1], …), never collapsed.
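
The indexing rule can be sketched with the standard library alone. This is an assumption about how such a flattener might work, not the tool's actual parser: the flatten function and the SAMPLE document are made up for illustration, and this version only adds an index when a sibling name occurs more than once.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict

# Hypothetical datasets packet, shaped like the example output above.
SAMPLE = """<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
  <xfa:data>
    <form1>
      <PersonalInfo><Surname>Smith</Surname><GivenName>Jane</GivenName></PersonalInfo>
      <Dependents>
        <Dependent><Name>Alex</Name></Dependent>
        <Dependent><Name>Sam</Name></Dependent>
      </Dependents>
    </form1>
  </xfa:data>
</xfa:datasets>"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def flatten(element: ET.Element, prefix: str = "") -> Dict[str, str]:
    """Map every leaf under `element` to a dotted path, indexing repeats."""
    flat: Dict[str, str] = {}
    children = list(element)
    totals = Counter(local_name(child.tag) for child in children)
    seen: Counter = Counter()
    for child in children:
        name = local_name(child.tag)
        label = name
        if totals[name] > 1:
            # Repeated sibling: keep each occurrence under its own index.
            label = f"{name}[{seen[name]}]"
            seen[name] += 1
        path = f"{prefix}.{label}" if prefix else label
        if len(child):
            flat.update(flatten(child, path))  # container: recurse
        else:
            flat[path] = (child.text or "").strip()  # leaf: record the value
    return flat


root = ET.fromstring(SAMPLE)
data = root.find("{http://www.xfa.org/schema/xfa-data/1.0/}data")
flat = flatten(data)
for path, value in flat.items():
    print(f"{path}   {value}")
```

Because each repeated element gets its own index, the two Name values land under separate keys instead of the second overwriting the first.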

Use it — as a library

from xfa_extract import locate_datasets, parse_datasets

kind, datasets, _packets, _engine = locate_datasets("FORM.pdf")
if kind == "xfa" and datasets:
    tree, flat = parse_datasets(datasets)
    print(flat["form1.PersonalInfo.Surname"])   # -> "Smith"
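
Once you have the flat map, repeating sections are often easier to work with as lists of rows. The sketch below is a hypothetical post-processing helper, not part of xfa_extract; it assumes the flat keys follow the path[index].field shape shown in the CLI output above and groups on the first indexed segment of each path.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches "<head>[<index>].<tail>", stopping at the first indexed segment.
_INDEXED = re.compile(r"^(?P<head>.*?)\[(?P<index>\d+)\]\.(?P<tail>.+)$")


def group_repeating(flat: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Collect indexed paths into ordered lists of {field: value} rows."""
    groups: Dict[str, Dict[int, Dict[str, str]]] = defaultdict(dict)
    for path, value in flat.items():
        match = _INDEXED.match(path)
        if match is None:
            continue  # not part of a repeating section
        row = groups[match["head"]].setdefault(int(match["index"]), {})
        row[match["tail"]] = value
    # Emit rows in index order so list position matches the form's order.
    return {head: [rows[i] for i in sorted(rows)] for head, rows in groups.items()}


# Example input in the same shape as the flat map printed by the CLI.
example_flat = {
    "form1.PersonalInfo.Surname": "Smith",
    "form1.Dependents.Dependent[0].Name": "Alex",
    "form1.Dependents.Dependent[1].Name": "Sam",
}
dependents = group_repeating(example_flat)["form1.Dependents.Dependent"]
print([row["Name"] for row in dependents])
```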

Exit codes (the CLI tells you which case you're in)

| code | meaning | what to do |
| --- | --- | --- |
| 0 | XFA data extracted (≥1 non-empty value) | use the values |
| 2 | not XFA: AcroForm-only or no form | use get_fields(); the tool prints those values for you as a convenience |
| 3 | XFA but no datasets, or the form is empty/unfilled | report "unfilled / no entered data" |
| 4 | parse failure | the raw XML is still written to --raw-out for inspection |
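
A script that wraps the CLI can branch on these codes. The following is a minimal sketch under stated assumptions: the status labels and the function names classify and run_xfa_extract are invented for this example, and because the shape of the --json output is not documented here, stdout is returned unparsed.

```python
import subprocess
from typing import Tuple

# Exit codes as listed in the table above; the labels are hypothetical.
EXIT_MEANINGS = {
    0: "extracted",      # XFA data found, at least one non-empty value
    2: "not_xfa",        # AcroForm-only or no form at all
    3: "empty",          # XFA without datasets, or an unfilled form
    4: "parse_failure",  # raw XML is still written to --raw-out
}


def classify(returncode: int) -> str:
    """Translate an xfa-extract exit code into a short status label."""
    return EXIT_MEANINGS.get(returncode, "unknown")


def run_xfa_extract(pdf_path: str) -> Tuple[str, str]:
    """Run the CLI in JSON mode and return (status_label, raw_stdout)."""
    proc = subprocess.run(
        ["xfa-extract", pdf_path, "--json"],
        capture_output=True,
        text=True,
    )
    return classify(proc.returncode), proc.stdout
```

A caller would check the label first, for example treating "empty" as "unfilled / no entered data" and falling back to get_fields() on "not_xfa".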

Tested against real forms

Validated on synthetic fixtures (in CI) and 14 real-world XFA forms — IRCC IMM5257 / 1295 / 1344 / 5710 / 5669, a DHL waybill, an Indian MCA MGT-7, an Ontario lease, a US DOL form, French CERFA — plus a real filled Canadian Proof-of-Citizenship application. Repeating sections, namespaces, Adobe's quirky tag serialization, and base64-image-bearing datasets all handled. See docs/xfa-internals.md for the deep dive.

What it does not do

  • Fill / write values into XFA forms — fragile, and the failure mode on a legal document is bad. (Use a dedicated form-filling tool.)
  • Flatten / render XFA to static pages — different operation.
  • OCR — these are digital forms, not scans.

Use it with Claude / Claude Code

This repo also ships an Agent Skill so Claude Code automatically reaches for it when a fillable PDF's values come back blank. Point Claude at skill/.

License

MIT © Ryan Kashtan. See LICENSE.
