Sanitize user-uploaded PDFs by removing JavaScript, OpenAction, Launch and other active content. Drop-in Python library for web applications that accept PDF uploads.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Kovetz

These details have not been verified by PyPI

Project links

Homepage

Project description

pdf-defang

Strip JavaScript, OpenAction, Launch actions and other active content from PDFs. Lightweight Python library on top of pikepdf. MIT licensed.

📚 Full documentation | 📦 PyPI | 🛠️ Built by kovetz.co.il

Why?

PDFs can carry executable content: JavaScript that runs when the file opens, auto-actions that fire on every page navigation, "Launch" actions that try to open other programs, embedded files that drop malware. If you process user-uploaded PDFs in your app, you should strip this content before serving them back.

The Python ecosystem has parsers (pikepdf, pypdf, PyMuPDF) and a heavy container-based tool (Dangerzone), but no clean drop-in library that says "give me this PDF without active content." This is that library.

Install

pip install pdf-defang

Requires Python 3.9+ and pikepdf 8+.

Quick start

Python API

from pdf_defang import sanitize, scan

# Clean a file in place
sanitize("uploaded.pdf")

# Get a detailed report of what was removed
report = sanitize("uploaded.pdf", return_report=True)
print(report.javascript_in_names)        # 2
print(report.open_action_removed)        # True
print(report.annotation_action_types)    # ['Launch']
print(report.dangerous_uris_removed)     # 1
print(report.as_dict())                  # JSON-serialisable

# Inspect a file WITHOUT modifying it
report = scan("suspicious.pdf")
print(report.risk_level)                 # 'high' / 'medium' / 'low' / 'none'
print(report.has_javascript)             # True

Async API (FastAPI / aiohttp / asyncio)

from pdf_defang import sanitize_async, scan_async

async def handle_upload(path):
    report = await sanitize_async(path, return_report=True)
    return report.as_dict()
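When a request carries several files, the async entry point can be combined with a concurrency limit so a burst of uploads does not saturate the worker. The following is a minimal sketch, not part of the library: `sanitize_many` and `max_concurrent` are hypothetical names, and the coroutine is passed in as a parameter (in real use, `sanitize_async`) so the sketch can also be exercised with a stub.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def sanitize_many(
    paths: Iterable[str],
    sanitize_coro: Callable[..., Awaitable[object]],
    max_concurrent: int = 4,  # hypothetical default, tune for your workload
) -> dict:
    """Sanitize several files concurrently, at most max_concurrent at a time."""
    limiter = asyncio.Semaphore(max_concurrent)

    async def one(path: str):
        # The semaphore caps how many sanitizations run at the same time.
        async with limiter:
            return path, await sanitize_coro(path, return_report=True)

    results = await asyncio.gather(*(one(p) for p in paths))
    return dict(results)  # maps each path to its report
```

With the real library this would be called as `reports = await sanitize_many(paths, sanitize_async)`.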

In-memory API (S3, Lambda, no disk)

from pdf_defang import sanitize_bytes

raw_pdf: bytes = ...   # from S3, HTTP, anywhere
cleaned: bytes = sanitize_bytes(raw_pdf)
# No disk involved
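In an in-memory pipeline it can help to reject obviously wrong input before it reaches the sanitizer. The sketch below is an illustration under assumptions: `clean_upload`, `MAX_UPLOAD_BYTES` and the strict header check are hypothetical and not part of pdf-defang; the sanitizer is passed in (in real use, `sanitize_bytes`).

```python
from typing import Callable

PDF_MAGIC = b"%PDF-"                 # PDF files begin with this header
MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # hypothetical limit, adjust as needed


def clean_upload(raw: bytes, sanitize_bytes_fn: Callable[[bytes], bytes]) -> bytes:
    """Validate an uploaded blob, then return the sanitized bytes."""
    if len(raw) > MAX_UPLOAD_BYTES:
        raise ValueError("upload too large")
    # Strict check: some viewers tolerate junk before the header, this does not.
    if not raw.startswith(PDF_MAGIC):
        raise ValueError("not a PDF")
    return sanitize_bytes_fn(raw)
```

With the real library this would be called as `cleaned = clean_upload(raw_pdf, sanitize_bytes)`.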

Encrypted PDFs (encryption preserved on output)

sanitize("encrypted.pdf", password="hunter2")
# Still encrypted with the same password, JavaScript removed.

Two levels: strict (default) vs balanced

# Public uploads: kill everything active (safest)
sanitize("untrusted.pdf")                            # level="strict"

# Trusted internal forms that need Submit / Calculate buttons:
sanitize("expense_form.pdf", level="balanced")

Both levels strip pure attack vectors (/Launch, /GoToR, document JavaScript, dangerous URI schemes, etc.). The balanced level keeps what interactive forms need and strict removes: /SubmitForm, /ResetForm and form JS actions, annotation /AA and /JS triggers, the AcroForm /CO calculation order, and embedded files (used by PDF portfolios). The default is strict.

Command line

# Clean a single file (strict by default)
pdf-defang clean uploaded.pdf

# Clean many at once
pdf-defang clean *.pdf

# Keep form interactivity working
pdf-defang clean --level balanced internal_form.pdf

# Inspect without changes
pdf-defang scan suspicious.pdf

# Get JSON output for piping into your logging stack
pdf-defang scan suspicious.pdf --json | jq .risk_level
pdf-defang clean *.pdf --json > sanitization-log.json

Exit codes follow shell conventions:

Code	`clean`	`scan`
0	All files were already clean	No active content found
1	At least one file had something stripped	Active content detected
2	At least one file could not be opened	File could not be scanned
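The exit codes in the table can be consumed from Python when the CLI is driven from a script. This is a minimal sketch: `describe_scan_exit` and `scan_with_cli` are hypothetical helper names, and the messages simply restate the `scan` column above.

```python
import subprocess

# Meanings copied from the `scan` column of the exit-code table.
SCAN_EXIT_MEANINGS = {
    0: "no active content found",
    1: "active content detected",
    2: "file could not be scanned",
}


def describe_scan_exit(code: int) -> str:
    """Translate a `pdf-defang scan` exit code into a short message."""
    return SCAN_EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def scan_with_cli(path: str) -> str:
    """Run `pdf-defang scan` on one file and describe its exit status."""
    completed = subprocess.run(
        ["pdf-defang", "scan", path], capture_output=True, text=True
    )
    return describe_scan_exit(completed.returncode)
```

Because exit code 1 means "active content detected" rather than a crash, callers should avoid `check=True` here, which would turn that normal result into an exception.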

Use cases

Web app that accepts PDF uploads

import logging

from pdf_defang import sanitize

logger = logging.getLogger(__name__)

def handle_upload(uploaded_file_path: str) -> str:
    report = sanitize(uploaded_file_path, return_report=True)
    if report.error:
        raise ValueError(f"Could not process PDF: {report.error}")
    # Log what was removed for your audit trail
    logger.info("Sanitized %s: %s", uploaded_file_path, report.as_dict())
    return uploaded_file_path  # safe to serve back to other users now

Suspicious file investigation

from pdf_defang import scan

report = scan("phishing_attachment.pdf")
# quarantine() and notify_security_team() stand in for your own handlers
if report.risk_level == "high":
    quarantine(report)
elif report.risk_level == "medium":
    notify_security_team(report)

Compliance pipeline (PDF/A clean output)

# -print0 / -0 keep file names with spaces or newlines intact
find /var/incoming -name '*.pdf' -print0 | xargs -0 pdf-defang clean --json >> audit.jsonl

What gets removed

Item	Where	What it does
`/JavaScript` in `/Names`	Document root	Document-level JavaScript that runs on open
`/EmbeddedFiles`	Document root	Files hidden inside the PDF (potential malware)
`/OpenAction`	Document root	Action automatically executed when PDF opens
`/AA`	Document root	"Additional Actions" - auto-execute on navigation
`/XFA`	`/AcroForm`	Legacy XML forms - well-known attack surface
`/CO`	`/AcroForm`	Form field Calculation Order
`/AA`	Each page	Page-level auto-execute actions
Dangerous `/A`	Each annotation	JavaScript, Launch, ImportData, SubmitForm, ResetForm, Rendition, GoToR, GoToE, Movie, Sound actions
`/AA`	Each annotation	Per-annotation auto-actions
`/JS`	Each annotation	JavaScript attached directly to an annotation
Unsafe `/URI`	Each annotation	URI actions with dangerous schemes (`javascript:`, `file:`, `data:`, `vbscript:`, UNC paths). Standard hyperlinks (`http`, `https`, `mailto`, `tel`, `ftp`, etc.) are preserved.
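The URI rule in the last row can be illustrated with a small classifier. This is a sketch of the idea only, not the library's actual implementation: `is_dangerous_uri` and `DANGEROUS_SCHEMES` are hypothetical names, and the scheme list is copied from the table above.

```python
from urllib.parse import urlsplit

# Schemes the table above lists as dangerous.
DANGEROUS_SCHEMES = {"javascript", "file", "data", "vbscript"}


def is_dangerous_uri(uri: str) -> bool:
    """Return True if a link target should be stripped under the rule above."""
    candidate = uri.strip()
    # UNC paths such as \\host\share carry no scheme, so test them first.
    if candidate.startswith("\\\\"):
        return True
    # Scheme matching is case-insensitive: "JavaScript:" equals "javascript:".
    scheme = urlsplit(candidate).scheme.lower()
    return scheme in DANGEROUS_SCHEMES
```

A deny-list like this keeps every scheme it does not name, which matches the table's statement that standard hyperlinks are preserved. A stricter policy could invert it into an allow-list.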

What is preserved

Sanitization is non-destructive to visible content:

All text, images and layout
Standard form fields (filled values stay intact)
Bookmarks, table of contents, page labels
Document metadata (Author, Title, Subject, Keywords)
Standard link annotations to mailto: / http(s): URLs
Document structure, page count, page order

Why not Dangerzone / iText / commercial SDKs?

Tool	Why this might not fit you
Dangerzone	Excellent for sensitive analyst workflows, but runs a full Docker container per file. Minutes per PDF, not milliseconds.
iText / Apryse	Powerful, but commercial licenses start at thousands of USD/year.
pikepdf directly	Brilliant library, but it's a parser, not a sanitizer. You'd write the same `_strip_document_level()` code we wrote here. That's exactly what we extracted.

pdf-defang is for the case where you want a small, free, drop-in function to ship in your existing Python app. No subprocesses, no Docker, no per-seat license.

Performance

Measured on a Windows 11 laptop, Python 3.13, on the fixture PDFs:

Operation	Median time
`scan_bytes()` on a clean PDF (in memory)	~0.3 ms
`sanitize_bytes()` on a malicious PDF (in memory)	~0.6 ms
`sanitize()` on a clean PDF (with disk I/O)	~8 ms
`sanitize()` kitchen-sink PDF (with disk I/O)	~8 ms

At these timings, sanitization is orders of magnitude faster than container-based tools like Dangerzone, which take seconds to minutes per file.

To benchmark on your hardware:

python -m pytest tests/test_performance.py -v -s
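For a quick measurement on your own files, outside the test suite, a median timing can be taken with the standard library. This is a minimal sketch and not the project's benchmark code: `median_ms` and `repeats` are hypothetical names.

```python
import statistics
import time
from typing import Callable


def median_ms(operation: Callable[[], object], repeats: int = 50) -> float:
    """Run operation repeatedly and return the median wall time in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive to one-off stalls than the mean.
    return statistics.median(samples)
```

With the real library, an in-memory scan could be timed as `median_ms(lambda: scan_bytes(raw_pdf))`, where `raw_pdf` holds the bytes of one of your own files.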

Caveats

Sanitization modifies the input file in place. If you need the original preserved for audit, copy it first.
Encrypted PDFs require the password= argument. Wrong-password attempts return an error report (not an exception).
Malformed PDFs may not open at all - we surface the underlying pikepdf error in the report. The original file is not touched on failure.
This is defense in depth, not a replacement for layered controls. Don't rely on a sanitizer alone for high-risk attachment workflows: also validate uploaders, sandbox processing, and scan with AV.
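The first two caveats can be handled together: copy the original aside before sanitizing, then check the report for an error instead of expecting an exception. The following is a sketch under assumptions: `sanitize_keeping_original` and `archive_dir` are hypothetical names, and the sanitizer can be injected so the sketch runs with a stub; by default it imports the real `sanitize`.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def sanitize_keeping_original(
    src: str,
    archive_dir: str,
    sanitize_fn: Optional[Callable] = None,
):
    """Copy src into archive_dir, sanitize src in place, return (copy, report)."""
    if sanitize_fn is None:
        from pdf_defang import sanitize as sanitize_fn  # the real library call

    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    original_copy = archive / Path(src).name
    shutil.copy2(src, original_copy)  # preserve the untouched upload for audit

    report = sanitize_fn(src, return_report=True)
    # Failures such as a wrong password arrive as report.error, not as a raise.
    if report.error:
        raise ValueError(f"Could not process PDF: {report.error}")
    return original_copy, report
```

The archive directory then holds unsanitized files, so it should not be reachable from anything that serves content back to users.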

Origin story

This library was originally written for kovetz.co.il (Hebrew PDF tools, www.kovetz.co.il) in May 2026, during an APT scanning campaign by an Iranian-attributed threat actor sweeping endpoints for upload vectors. We needed to make sure that any PDF leaving our service was free of executable payloads, even if an attacker successfully uploaded a poisoned file.

We initially wrote 67 lines of pikepdf code, tested it on the kovetz.co.il fleet (thousands of files/day), then realised there's no clean equivalent in the OSS Python ecosystem. So we extracted it here for everyone else who needs the same thing.

Contributing

Issues and PRs welcome at github.com/kovetz-PDF/pdf-defang.

If you've found a PDF in the wild that contains active content we don't strip, please open an issue with the file (or a minimal reproducer) attached.

Development setup

git clone https://github.com/kovetz-PDF/pdf-defang.git
cd pdf-defang
python -m pip install -e ".[test]"
python -m pytest

tests/conftest.py auto-generates the test fixture PDFs on first run.

License

MIT - free for any use, including commercial.

Built and maintained by kovetz.co.il. Contact: contact@kovetz.co.il

Release history

Version	Released
0.1.2 (this version)	May 21, 2026
0.1.1	May 21, 2026
0.1.0	May 18, 2026

Download files

Source Distribution

pdf_defang-0.1.2.tar.gz (30.4 kB)

Uploaded May 21, 2026 Source

Built Distribution

pdf_defang-0.1.2-py3-none-any.whl (20.5 kB)

Uploaded May 21, 2026 Python 3

File details

Details for the file pdf_defang-0.1.2.tar.gz.

File metadata

Download URL: pdf_defang-0.1.2.tar.gz
Upload date: May 21, 2026
Size: 30.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf_defang-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`25edf1dff315e0eb1dcb70f8292423949f671c2cdd2a9a470168338965ca2147`
MD5	`2f9702304f56fae4656311ca2b53e857`
BLAKE2b-256	`11c9bf2c82305fff7aebb5a97a48c426b88c45fa49e96b85bf3440c29b33fc56`
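A downloaded archive can be checked against the SHA256 digest listed above before installing it. This is a minimal standard-library sketch; `sha256_matches` is a hypothetical helper name.

```python
import hashlib
import hmac


def sha256_matches(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_hex.strip().lower())
```

For the source distribution this would be called with the path to `pdf_defang-0.1.2.tar.gz` and the SHA256 value from the table.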

Provenance

The following attestation bundles were made for pdf_defang-0.1.2.tar.gz:

Publisher: publish.yml on kovetz-PDF/pdf-defang

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf_defang-0.1.2.tar.gz
- Subject digest: 25edf1dff315e0eb1dcb70f8292423949f671c2cdd2a9a470168338965ca2147
- Sigstore transparency entry: 1591111952
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: kovetz-PDF/pdf-defang@ed5208327d212c6b00135553766f02b3dbf7c137
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/kovetz-PDF
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ed5208327d212c6b00135553766f02b3dbf7c137
- Trigger Event: push

File details

Details for the file pdf_defang-0.1.2-py3-none-any.whl.

File metadata

Download URL: pdf_defang-0.1.2-py3-none-any.whl
Upload date: May 21, 2026
Size: 20.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf_defang-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d898947f516f77a0274b36fb73e6de6a2da0d0923c4184fe53a28529b5923996`
MD5	`3c9b2f6eea697488603eb2040d155b7c`
BLAKE2b-256	`9e10c198536e2cfeda53dcc25484a445cdea6b3f8510609a2d7779d6999e18bb`

Provenance

The following attestation bundles were made for pdf_defang-0.1.2-py3-none-any.whl:

Publisher: publish.yml on kovetz-PDF/pdf-defang

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf_defang-0.1.2-py3-none-any.whl
- Subject digest: d898947f516f77a0274b36fb73e6de6a2da0d0923c4184fe53a28529b5923996
- Sigstore transparency entry: 1591111987
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: kovetz-PDF/pdf-defang@ed5208327d212c6b00135553766f02b3dbf7c137
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/kovetz-PDF
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ed5208327d212c6b00135553766f02b3dbf7c137
- Trigger Event: push
