Read entered values out of XFA / LiveCycle 'dynamic' PDF forms (IRCC and other government forms) that standard PDF field extraction misses.

Maintainers

ryanjkashtan

Project description

xfa-extract

Read the entered values out of XFA / LiveCycle "dynamic" PDF forms — the ones where pypdf.get_fields() comes back empty even though the form is clearly filled in.

If you've ever hit this:

I filled out a government PDF (an IRCC immigration form, a tax form, …), but when I run PdfReader(...).get_fields() or pdftk dump_data_fields, the values are blank or missing. The form template text extracts fine, but none of the answers show up. Or the PDF just shows "Please wait… If this message is not eventually replaced…".

…then your PDF is an XFA form, and this tool reads it.

Why standard extraction misses the data

A normal interactive PDF (an AcroForm) stores each field's value in its /V entry — pypdf, pdftk, pdfminer all read those fine.

An XFA form (Adobe LiveCycle / "dynamic" PDF, the format many government and immigration forms use) does not keep the entered data in /V. It keeps it in an XML packet named datasets, stored under the /XFA key of the AcroForm dictionary. So get_fields() and text extraction look blank even on a fully completed form. xfa-extract detects XFA, pulls the datasets packet, parses it, and gives you the field → value map.
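
The lookup described above can be sketched with pypdf directly. This is a minimal illustration, not xfa-extract's own implementation: pick_datasets and read_xfa_datasets are hypothetical names, and the sketch assumes the common layout where /XFA is an array of alternating packet names and streams. A single-stream /XFA is deliberately left unhandled.

```python
from typing import Optional, Sequence


def pick_datasets(xfa_entries: Sequence[object]) -> Optional[bytes]:
    """Return the 'datasets' payload from a resolved /XFA array.

    The array alternates packet names and packet data:
    [name, data, name, data, ...]. Returns None if no such packet exists.
    """
    names = xfa_entries[0::2]
    payloads = xfa_entries[1::2]
    for name, payload in zip(names, payloads):
        if name == "datasets":
            return bytes(payload)
    return None


def read_xfa_datasets(path: str) -> Optional[bytes]:
    """Hypothetical helper: walk /Root -> /AcroForm -> /XFA with pypdf."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(path)
    root = reader.trailer["/Root"].get_object()
    if "/AcroForm" not in root:
        return None  # no interactive form at all
    acroform = root["/AcroForm"].get_object()
    if "/XFA" not in acroform:
        return None  # plain AcroForm: values live in each field's /V
    xfa = acroform["/XFA"].get_object()
    if not isinstance(xfa, list):
        return None  # single-stream /XFA is not handled in this sketch

    resolved = []
    for item in xfa:
        obj = item.get_object()
        if hasattr(obj, "get_data"):      # stream object: packet payload
            resolved.append(obj.get_data())
        elif isinstance(obj, bytes):      # byte-string packet name
            resolved.append(obj.decode("latin-1"))
        else:                             # text-string packet name
            resolved.append(str(obj))
    return pick_datasets(resolved)
```

Keeping the pairing logic in pick_datasets separate from the pypdf traversal makes it possible to exercise it on plain lists without a PDF on hand.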

Read-only. It never writes to or mutates your PDF.

Install

pip install xfa-extract              # core (pypdf + lxml)
pip install "xfa-extract[robust]"    # + pikepdf fallback for unusual PDFs

Use it — command line

xfa-extract FORM.pdf                 # human-readable tree + flat "path: value" table
xfa-extract FORM.pdf --json          # machine-readable JSON (for scripts / LLMs)
xfa-extract FORM.pdf --flatten       # just the path: value table

Every run also writes the raw datasets XML to --raw-out (default ./xfa_datasets.xml) for auditing.

$ xfa-extract application.pdf --flatten
form1.PersonalInfo.Surname           Smith
form1.PersonalInfo.GivenName         Jane
form1.Dependents.Dependent[0].Name   Alex
form1.Dependents.Dependent[1].Name   Sam

Repeating sections (multiple dependents, applicants, addresses) are indexed (Dependent[0], Dependent[1], …), never collapsed.
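
To illustrate what indexing repeated sections means, here is a small standard-library sketch that flattens an XML tree into the same kind of "path: value" map. It is an assumption-laden illustration rather than the package's parser: flatten and local_name are hypothetical names, the sample XML is invented to mirror the output above, and an index is added only when a tag repeats among its siblings.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict

# Invented sample shaped like the CLI output shown above.
SAMPLE = (
    "<form1>"
    "<PersonalInfo><Surname>Smith</Surname><GivenName>Jane</GivenName></PersonalInfo>"
    "<Dependents>"
    "<Dependent><Name>Alex</Name></Dependent>"
    "<Dependent><Name>Sam</Name></Dependent>"
    "</Dependents>"
    "</form1>"
)


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem: ET.Element, prefix: str) -> Dict[str, str]:
    """Map dotted paths to leaf text, indexing tags that repeat among siblings."""
    flat: Dict[str, str] = {}
    children = list(elem)
    totals = Counter(local_name(c.tag) for c in children)
    seen: Counter = Counter()
    for child in children:
        name = local_name(child.tag)
        step = name
        if totals[name] > 1:
            # Repeated sibling: keep every occurrence under its own index.
            step = f"{name}[{seen[name]}]"
            seen[name] += 1
        path = f"{prefix}.{step}"
        if len(child):
            flat.update(flatten(child, path))
        else:
            flat[path] = (child.text or "").strip()
    return flat


root = ET.fromstring(SAMPLE)
flat = flatten(root, local_name(root.tag))
for path, value in flat.items():
    print(f"{path}  {value}")
```

Without the index, both Dependent entries would map to the same key and the second would silently overwrite the first.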

Use it — as a library

from xfa_extract import locate_datasets, parse_datasets

kind, datasets, _packets, _engine = locate_datasets("FORM.pdf")
if kind == "xfa" and datasets:
    tree, flat = parse_datasets(datasets)
    print(flat["form1.PersonalInfo.Surname"])   # -> "Smith"

Understand the form's schema, not just its values

XFA forms also carry a template packet — the form's intelligence. parse_template() turns it into a per-field schema: field kind, the human caption, a dropdown/radio's valid values (export code ↔ display label), the expected format, and whether the field runs scripts:

from xfa_extract import parse_template, schema_for

schema = parse_template("FORM.pdf")
f = schema_for(schema, "form1.PersonalInfo.Country")
f.kind        # "choice"
f.caption     # "Country of birth or territory"
f.choices     # [("1", "Canada"), ("2", "Other")]  — datasets stores the export code
f.picture     # e.g. "date{YYYY-MM-DD}" on date fields
f.scripted    # True if the field has calculate/validate/event scripts

This is what lets a filler (see xfa-fill) accept "Canada" and write the "1" the form actually stores.
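
The label-to-code translation can be sketched as follows. This is a hypothetical helper, not part of xfa-extract or xfa-fill; it only assumes that choices has the shape shown above, a list of (export code, display label) pairs.

```python
from typing import Sequence, Tuple

Choice = Tuple[str, str]  # (export code, display label)


def to_export_code(choices: Sequence[Choice], value: str) -> str:
    """Accept an export code or a display label and return the export code."""
    # An exact export code wins, so "1" stays "1".
    for code, _label in choices:
        if value == code:
            return code
    # Otherwise match the display label, ignoring case and outer whitespace.
    wanted = value.strip().casefold()
    for code, label in choices:
        if wanted == label.strip().casefold():
            return code
    valid = ", ".join(label for _code, label in choices)
    raise ValueError(f"{value!r} is not a valid choice; expected one of: {valid}")


choices = [("1", "Canada"), ("2", "Other")]
print(to_export_code(choices, "Canada"))  # prints 1
```

Raising on an unknown value, rather than writing it through, avoids storing something the form's dropdown cannot display.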

Exit codes (the CLI tells you which case you're in)

code	meaning	what to do
`0`	XFA data extracted (≥1 non-empty value)	use the values
`2`	not XFA — AcroForm-only or no form	use `get_fields()`; the tool prints those values for you as a convenience
`3`	XFA but no `datasets`, or the form is empty/unfilled	report "unfilled / no entered data"
`4`	parse failure	the raw XML is still written to `--raw-out` for inspection
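
A script that calls the CLI might branch on these codes roughly as below. This is a sketch under stated assumptions: the outcome labels and the function names classify and run_extract are invented for illustration, and the JSON output is returned as raw text because its structure is not described here.

```python
import subprocess
from typing import Tuple

# Exit codes from the table above, mapped to illustrative labels.
OUTCOMES = {
    0: "xfa-data",       # XFA data extracted
    2: "not-xfa",        # AcroForm-only or no form
    3: "unfilled",       # XFA but no datasets, or nothing entered
    4: "parse-failure",  # raw XML still written to --raw-out
}


def classify(returncode: int) -> str:
    """Translate an xfa-extract exit code into a label."""
    return OUTCOMES.get(returncode, "unknown")


def run_extract(pdf_path: str) -> Tuple[str, str]:
    """Run `xfa-extract PDF --json` and return (label, raw stdout)."""
    proc = subprocess.run(
        ["xfa-extract", pdf_path, "--json"],
        capture_output=True,
        text=True,
        check=False,  # non-zero codes carry meaning here, so do not raise
    )
    return classify(proc.returncode), proc.stdout
```

Passing check=False matters because codes 2 and 3 describe the PDF rather than a crash.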

Tested against real forms

Validated on synthetic fixtures (in CI) and 14 real-world XFA forms, including IRCC IMM5257 / 1295 / 1344 / 5710 / 5669, a DHL waybill, an Indian MCA MGT-7, an Ontario lease, a US DOL form, and French CERFA, plus a real filled Canadian Proof-of-Citizenship application. Repeating sections, namespaces, Adobe's quirky tag serialization, and datasets carrying base64 images are all handled. See skill/REFERENCE.md for the deep dive.

What it does not do

Fill / write values into XFA forms — that's the job of the companion package xfa-fill, which uses this package's template schema and read-back verification.
Flatten / render XFA to static pages — different operation.
OCR — these are digital forms, not scans.

Use it with Claude / Claude Code

This repo also ships an Agent Skill so Claude Code automatically reaches for it when a fillable PDF's values come back blank. Point Claude at skill/.

License

MIT © Ryan Kashtan. See LICENSE.

Release history

0.2.0 (this version): Jul 3, 2026

0.1.0: Jun 22, 2026

Download files

Source Distribution

xfa_extract-0.2.0.tar.gz (17.6 kB)

Uploaded Jul 3, 2026 Source

Built Distribution

xfa_extract-0.2.0-py3-none-any.whl (14.4 kB)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file xfa_extract-0.2.0.tar.gz.

File metadata

Download URL: xfa_extract-0.2.0.tar.gz
Upload date: Jul 3, 2026
Size: 17.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xfa_extract-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`2e424c3e23b47447304d0f538b762b01830d50dd35ac1837a05a5791e0fefc39`
MD5	`786216380dd8a68a7a216ad572585655`
BLAKE2b-256	`85aaa5e85eb70d05cf73a0b38332b57f00eb272c08d811845933180ce8091a3c`
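
A downloaded copy can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names are hypothetical, and the file path passed to verify_download is whatever location the archive was saved to.

```python
import hashlib
from typing import Iterable, Iterator

# SHA256 of xfa_extract-0.2.0.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = (
    "2e424c3e23b47447304d0f538b762b01830d50dd35ac1837a05a5791e0fefc39"
)


def sha256_hex(chunks: Iterable[bytes]) -> str:
    """Hash an iterable of byte chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_chunks(path: str, size: int = 1 << 16) -> Iterator[bytes]:
    """Yield a file's contents in fixed-size chunks."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(size)
            if not chunk:
                return
            yield chunk


def verify_download(path: str, expected_hex: str) -> bool:
    """True if the file's SHA256 matches the expected hex digest."""
    return sha256_hex(file_chunks(path)) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of file size, which is the same code path one would use for larger archives.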

Provenance

The following attestation bundles were made for xfa_extract-0.2.0.tar.gz:

Publisher: release.yml on ryanjkashtan/xfa-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: xfa_extract-0.2.0.tar.gz
- Subject digest: 2e424c3e23b47447304d0f538b762b01830d50dd35ac1837a05a5791e0fefc39
- Sigstore transparency entry: 2051844388
- Sigstore integration time: Jul 3, 2026
Source repository:
- Permalink: ryanjkashtan/xfa-extract@a433f1b61ff2ef5a92791235a99fad3743ab18e8
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ryanjkashtan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a433f1b61ff2ef5a92791235a99fad3743ab18e8
- Trigger Event: release

File details

Details for the file xfa_extract-0.2.0-py3-none-any.whl.

File metadata

Download URL: xfa_extract-0.2.0-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 14.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xfa_extract-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e23c9cecf75b8fd4f57b18bf919ea43a7f482bdbfce97ee0cc36fb0489161deb`
MD5	`1c19b0caca91be9dc6c5ed4ae2be27ec`
BLAKE2b-256	`4a816a119be3cfd1b22526edd46a5ee4c6ca2f1f61b1097440809bab9f777f50`

Provenance

The following attestation bundles were made for xfa_extract-0.2.0-py3-none-any.whl:

Publisher: release.yml on ryanjkashtan/xfa-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: xfa_extract-0.2.0-py3-none-any.whl
- Subject digest: e23c9cecf75b8fd4f57b18bf919ea43a7f482bdbfce97ee0cc36fb0489161deb
- Sigstore transparency entry: 2051844562
- Sigstore integration time: Jul 3, 2026
Source repository:
- Permalink: ryanjkashtan/xfa-extract@a433f1b61ff2ef5a92791235a99fad3743ab18e8
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ryanjkashtan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a433f1b61ff2ef5a92791235a99fad3743ab18e8
- Trigger Event: release
