Read entered values out of XFA / LiveCycle 'dynamic' PDF forms (IRCC and other government forms) that standard PDF field extraction misses.
xfa-extract
Read the entered values out of XFA / LiveCycle "dynamic" PDF forms: the ones where
pypdf's PdfReader.get_fields() comes back empty even though the form is clearly filled in.
If you've ever hit this:
I filled out a government PDF (an IRCC immigration form, a tax form, …), but when I run
PdfReader(...).get_fields() or pdftk dump_data_fields, the values are blank or missing. The form template text extracts fine, but none of the answers show up. Or the PDF just shows "Please wait… If this message is not eventually replaced…".
…then your PDF is an XFA form, and this tool reads it.
Why standard extraction misses the data
A normal interactive PDF (an AcroForm) stores each field's value in its /V entry —
pypdf, pdftk, pdfminer all read those fine.
An XFA form (Adobe LiveCycle / "dynamic" PDF — what most government and immigration forms
are) does not keep the entered data in /V. It keeps it in an XML packet inside the
AcroForm dictionary under the /XFA key, in a sub-packet called datasets. So
get_fields() and text extraction look blank even on a fully completed form. xfa-extract
detects XFA, pulls the datasets packet, parses it, and gives you the field → value map.
Read-only. It never writes to or mutates your PDF.
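As a rough illustration of that layout: the /XFA entry is commonly an array that alternates packet names and stream objects. The sketch below models that array as a plain Python list of names and already-decoded bytes (an assumption, so that it runs without any PDF library), picks out the datasets packet, and parses it with the standard library. The helper `find_packet` and the sample XML are hypothetical and are not part of xfa-extract's API.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XFA datasets packet.
XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/"

# Hypothetical stand-in for a decoded /XFA array:
# name, stream bytes, name, stream bytes, ...
xfa_array = [
    "template", b"<template/>",
    "datasets", (
        b'<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">'
        b"<xfa:data><form1><PersonalInfo><Surname>Smith</Surname>"
        b"</PersonalInfo></form1></xfa:data></xfa:datasets>"
    ),
]


def find_packet(array, name):
    """Return the payload that follows `name` in a name/stream alternating list."""
    for packet_name, payload in zip(array[0::2], array[1::2]):
        if packet_name == name:
            return payload
    return None


datasets_xml = find_packet(xfa_array, "datasets")
root = ET.fromstring(datasets_xml)

# The entered values live under <xfa:data>, not in any field's /V entry.
data = root.find(f"{{{XFA_DATA_NS}}}data")
surname = data.find("form1/PersonalInfo/Surname").text
print(surname)
```

In a real PDF the payloads are compressed stream objects that have to be decoded first, which is the part a PDF library handles.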
Install
pip install xfa-extract # core (pypdf + lxml)
pip install "xfa-extract[robust]" # + pikepdf fallback for unusual PDFs
Use it — command line
xfa-extract FORM.pdf # human-readable tree + flat "path: value" table
xfa-extract FORM.pdf --json # machine-readable JSON (for scripts / LLMs)
xfa-extract FORM.pdf --flatten # just the path: value table
Every run also writes the raw datasets XML to --raw-out (default ./xfa_datasets.xml)
for auditing.
$ xfa-extract application.pdf --flatten
form1.PersonalInfo.Surname Smith
form1.PersonalInfo.GivenName Jane
form1.Dependents.Dependent[0].Name Alex
form1.Dependents.Dependent[1].Name Sam
Repeating sections (multiple dependents, applicants, addresses) are indexed
(Dependent[0], Dependent[1], …), never collapsed.
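One way to picture the indexing is a small recursive walk over the XML that numbers siblings sharing a tag name. This is only a sketch under stated assumptions: `flatten` and `local_name` are hypothetical helpers, the sample XML is made up, and the rule used here (index only when a name repeats among siblings) may differ from what xfa-extract does internally.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def flatten(element, prefix=""):
    """Yield (path, value) pairs for non-empty leaf elements under `element`."""
    children = list(element)
    totals = Counter(local_name(child.tag) for child in children)
    seen = Counter()
    for child in children:
        name = local_name(child.tag)
        # Assumption: only siblings that share a name get an index.
        if totals[name] > 1:
            label = f"{name}[{seen[name]}]"
            seen[name] += 1
        else:
            label = name
        path = f"{prefix}.{label}" if prefix else label
        if len(child) == 0:
            text = (child.text or "").strip()
            if text:
                yield path, text
        else:
            yield from flatten(child, path)


sample = """
<data>
  <form1>
    <Dependents>
      <Dependent><Name>Alex</Name></Dependent>
      <Dependent><Name>Sam</Name></Dependent>
    </Dependents>
  </form1>
</data>
"""

flat = dict(flatten(ET.fromstring(sample)))
for path, value in flat.items():
    print(path, value)
```

Keeping the index in the path means two dependents never overwrite each other under a single key.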
Use it — as a library
from xfa_extract import locate_datasets, parse_datasets
kind, datasets, _packets, _engine = locate_datasets("FORM.pdf")
if kind == "xfa" and datasets:
    tree, flat = parse_datasets(datasets)
    print(flat["form1.PersonalInfo.Surname"])  # -> "Smith"
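Once you have the flat map, repeating sections can be regrouped with a few lines of standard-library code. The sketch below assumes `flat` behaves like a plain dict of path strings to values, in the shape shown in the CLI output. The `group_repeats` helper and the sample data are hypothetical and are not part of the package.

```python
import re
from collections import defaultdict

# Hypothetical flat map in the "path -> value" shape (assumed to be a plain dict).
flat = {
    "form1.PersonalInfo.Surname": "Smith",
    "form1.Dependents.Dependent[0].Name": "Alex",
    "form1.Dependents.Dependent[1].Name": "Sam",
}


def group_repeats(flat, section):
    """Collect 'section[i].field' entries into a list of dicts ordered by i."""
    pattern = re.compile(re.escape(section) + r"\[(\d+)\]\.(.+)$")
    rows = defaultdict(dict)
    for path, value in flat.items():
        match = pattern.match(path)
        if match:
            rows[int(match.group(1))][match.group(2)] = value
    return [rows[i] for i in sorted(rows)]


dependents = group_repeats(flat, "form1.Dependents.Dependent")
print(dependents)
```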
Exit codes (the CLI tells you which case you're in)
| code | meaning | what to do |
|---|---|---|
| 0 | XFA data extracted (≥1 non-empty value) | use the values |
| 2 | not XFA: AcroForm-only or no form | use get_fields(); the tool prints those values for you as a convenience |
| 3 | XFA but no datasets, or the form is empty/unfilled | report "unfilled / no entered data" |
| 4 | parse failure | the raw XML is still written to --raw-out for inspection |
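A script can branch on these codes directly. The following is a minimal sketch, assuming the `xfa-extract` command is on PATH. The wrapper names `describe_exit` and `run_xfa_extract` are hypothetical, and the JSON output is returned as raw text because its exact shape is not described here.

```python
import subprocess

# Exit codes as listed in the table above.
ACTIONS = {
    0: "use the extracted values",
    2: "not XFA: fall back to get_fields()",
    3: "report unfilled / no entered data",
    4: "parse failure: inspect the raw XML written to --raw-out",
}


def describe_exit(code):
    """Map an xfa-extract exit code to a suggested next step."""
    return ACTIONS.get(code, f"unexpected exit code {code}")


def run_xfa_extract(pdf_path):
    """Run the CLI with --json and return (next_step, stdout_text)."""
    result = subprocess.run(
        ["xfa-extract", pdf_path, "--json"],
        capture_output=True,
        text=True,
    )
    return describe_exit(result.returncode), result.stdout


print(describe_exit(3))
```

Treating any code outside the table as unexpected keeps a caller from silently using output from a run that did not finish normally.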
Tested against real forms
Validated on synthetic fixtures (in CI) and 14 real-world XFA forms — IRCC IMM5257 /
1295 / 1344 / 5710 / 5669, a DHL waybill, an Indian MCA MGT-7, an Ontario lease, a US DOL
form, French CERFA — plus a real filled Canadian Proof-of-Citizenship application. Repeating
sections, namespaces, Adobe's quirky tag serialization, and base64-image-bearing datasets are all
handled. See docs/xfa-internals.md for the deep dive.
What it does not do
- Fill / write values into XFA forms — fragile, and the failure mode on a legal document is bad. (Use a dedicated form-filling tool.)
- Flatten / render XFA to static pages — different operation.
- OCR — these are digital forms, not scans.
Use it with Claude / Claude Code
This repo also ships an Agent Skill so Claude Code automatically reaches
for it when a fillable PDF's values come back blank. Point Claude at skill/.
License
MIT © Ryan Kashtan. See LICENSE.