Python SDK for Expunct privacy APIs — PII redaction plus beta document-intelligence workflows for enabled tenants.

Expunct Python SDK

Privacy infrastructure for modern applications. Detect and redact PII, secrets, and sensitive data before it reaches AI, logs, or external APIs.

Requires Python 3.10+. License: MIT.

Installation

pip install expunct

Get your API key at expunct.ai — free tier includes 1M tokens/month, no credit card required.

Quick Start

from expunct import Expunct

client = Expunct(api_key="your-api-key")
redacted = client.sanitize_text("Alice Johnson's email is alice@example.com and SSN is 219-09-9999.")
print(redacted)
# Output: PERSON_1's email is EMAIL_ADDRESS_1 and SSN is US_SSN_1.
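
The sample output suggests that each redacted value is replaced by a placeholder of the form TYPE_N (for example PERSON_1 or US_SSN_1). The following is a minimal sketch, assuming that placeholder shape holds in general, of how a caller might count the placeholders in a redacted string. The `count_placeholders` helper and the regular expression are illustrative and not part of the SDK.

```python
import re
from collections import Counter

# Assumed placeholder shape, inferred from the sample outputs: TYPE_N,
# where TYPE is an upper-case entity type and N is a running index.
PLACEHOLDER = re.compile(r"\b([A-Z]+(?:_[A-Z]+)*)_(\d+)\b")


def count_placeholders(redacted: str) -> Counter:
    """Count distinct placeholders per entity type in redacted text."""
    distinct = {
        (match.group(1), int(match.group(2)))
        for match in PLACEHOLDER.finditer(redacted)
    }
    return Counter(entity_type for entity_type, _ in distinct)


sample = "PERSON_1's email is EMAIL_ADDRESS_1 and SSN is US_SSN_1."
counts = count_placeholders(sample)
print(dict(counts))
```

The pattern also matches unrelated tokens such as HTTP_200, so a stricter version would probably check the type against the entity types listed under Detected Entity Types.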

Usage

Text redaction (sync)

from expunct import Expunct

client = Expunct(api_key="your-api-key")

redacted = client.sanitize_text("Call Bob at 415-555-0100 or bob@example.com")
print(redacted)
# Call PERSON_1 at PHONE_NUMBER_1 or EMAIL_ADDRESS_1

Text redaction (async)

import asyncio
from expunct import AsyncExpunct

async def main():
    async with AsyncExpunct(api_key="your-api-key") as client:
        redacted = await client.sanitize_text("Call Bob at 415-555-0100 or bob@example.com")
        print(redacted)

asyncio.run(main())

File redaction (PDF, DOCX, images, audio)

from expunct import Expunct

client = Expunct(api_key="your-api-key")

# Pass a file path — returns redacted bytes
redacted_bytes = client.sanitize_file("contract.pdf")

# Save directly to disk
client.sanitize_file("contract.pdf", dest="contract_redacted.pdf")

# Pass a file-like object
with open("invoice.docx", "rb") as f:
    redacted_bytes = client.sanitize_file(f)

URI redaction (cloud storage)

Submit a file hosted in cloud storage (S3, GCS, Azure Blob) for redaction. The optional output_uri controls where the redacted file is written; if it is omitted, the result is available via jobs.download().

from expunct import Expunct

client = Expunct(api_key="your-api-key")

job = client.sanitize_uri(
    "s3://my-bucket/reports/q1.pdf",
    output_uri="s3://my-bucket/reports/q1_redacted.pdf",
)
print(job.status)           # "completed"
print(job.findings_count)   # number of PII items found
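
The example writes the result next to the input with a _redacted suffix. As a sketch of how that naming could be derived rather than typed by hand, the hypothetical `redacted_uri` helper below (not part of the SDK) inserts a suffix before the file extension using only the standard library:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def redacted_uri(input_uri: str, suffix: str = "_redacted") -> str:
    """Insert `suffix` before the file extension of a storage URI."""
    parts = urlsplit(input_uri)
    stem, ext = posixpath.splitext(parts.path)
    # Only the path changes; scheme, bucket/host, and any query string are kept.
    return urlunsplit(parts._replace(path=f"{stem}{suffix}{ext}"))


print(redacted_uri("s3://my-bucket/reports/q1.pdf"))
# s3://my-bucket/reports/q1_redacted.pdf
```

The result could then be passed as output_uri to sanitize_uri.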

Batch URI redaction

Enqueue multiple files in one call via the lower-level redact.batch() method, then poll the batch status:

from expunct import Expunct

client = Expunct(api_key="your-api-key")

batch = client.redact.batch(
    input_uris=[
        "s3://my-bucket/docs/file1.pdf",
        "s3://my-bucket/docs/file2.pdf",
    ],
    language="en",
)
print(batch.id, batch.total_jobs)

# Poll progress
status = client.batch.get(batch.id)
print(status.completed_jobs, status.failed_jobs)
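
The snippet above checks the batch status once. A polling loop might look like the sketch below. It assumes that a batch is finished when completed_jobs plus failed_jobs reaches total_jobs, which the README does not state explicitly. `poll_until` and `batch_finished` are hypothetical helpers, and stand-in objects replace the SDK response so the sketch runs without network access.

```python
import time
from types import SimpleNamespace


def batch_finished(status) -> bool:
    # Assumption: every job is terminal once completed + failed reaches the total.
    return status.completed_jobs + status.failed_jobs >= status.total_jobs


def poll_until(fetch, is_done, *, interval=2.0, timeout=300.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until is_done(result) is true or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch()
        if is_done(status):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)


# Stand-in snapshots in place of successive client.batch.get(batch.id) results.
snapshots = iter([
    SimpleNamespace(total_jobs=2, completed_jobs=0, failed_jobs=0),
    SimpleNamespace(total_jobs=2, completed_jobs=1, failed_jobs=0),
    SimpleNamespace(total_jobs=2, completed_jobs=1, failed_jobs=1),
])
final = poll_until(lambda: next(snapshots), batch_finished, sleep=lambda _: None)
print(final.completed_jobs, final.failed_jobs)  # 1 1
```

With a real client, the fetch argument would be something like `lambda: client.batch.get(batch.id)`.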

Environment variable

Set EXPUNCT_API_KEY to avoid hard-coding the key in code:

import os
from expunct import Expunct

client = Expunct(api_key=os.environ["EXPUNCT_API_KEY"])
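
Indexing os.environ directly raises a bare KeyError when the variable is missing. A slightly friendlier variant, sketched here with a hypothetical `load_api_key` helper that is not part of the SDK, fails with an explicit message instead:

```python
import os


def load_api_key(env=os.environ, name="EXPUNCT_API_KEY") -> str:
    """Return the API key from the environment or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return key


# A plain dict stands in for the real environment in this demonstration.
print(load_api_key({"EXPUNCT_API_KEY": "demo-key"}))  # demo-key
```

The returned value can be passed as api_key when constructing the client.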

Custom policy

Policies let you control which entity types are detected, the redaction method, confidence thresholds, and more. Create a policy once and reference it by ID on every job.

from expunct import Expunct, PolicyCreate

client = Expunct(api_key="your-api-key")

# Create a policy that only redacts PII and uses pseudonymization
policy = client.policies.create(PolicyCreate(
    name="pii-only-pseudonymize",
    pii_categories=["PII"],
    redaction_method="pseudonymization",
    confidence_threshold=0.7,
))

# Use the policy when uploading a file
job = client.redact.file("report.pdf", policy_id=policy.id)
completed = client.wait_for_job(job.id)
redacted_bytes = client.jobs.download(completed.id)

Inspecting findings

Every completed job exposes the PII entities that were found:

from expunct import Expunct

client = Expunct(api_key="your-api-key")

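# Submit the file and wait for the job, so the job ID is known
# (listing the most recent job could return someone else's job)
job = client.redact.file("form.pdf")
completed = client.wait_for_job(job.id)

# Fetch job detail to inspect findings
detail = client.jobs.get(completed.id)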

for finding in detail.findings:
    print(finding.entity_type, finding.confidence, finding.entity_value)
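
Printing entity_value writes the raw sensitive value to stdout. Where only an overview is needed, a summary by entity type avoids that. The sketch below uses a hypothetical `summarize_findings` helper and stand-in objects that carry only the entity_type and confidence attributes used in the loop above:

```python
from collections import Counter
from types import SimpleNamespace


def summarize_findings(findings, min_confidence=0.0):
    """Count findings per entity type, skipping those below min_confidence."""
    return Counter(
        finding.entity_type
        for finding in findings
        if finding.confidence >= min_confidence
    )


# Stand-ins for detail.findings so the sketch runs without the SDK.
findings = [
    SimpleNamespace(entity_type="PERSON", confidence=0.95),
    SimpleNamespace(entity_type="US_SSN", confidence=0.99),
    SimpleNamespace(entity_type="PERSON", confidence=0.40),
]
summary = summarize_findings(findings, min_confidence=0.6)
print(dict(summary))  # {'PERSON': 1, 'US_SSN': 1}
```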

Error handling

from expunct import Expunct, AuthenticationError, RateLimitError, PollingTimeoutError

client = Expunct(api_key="your-api-key")

try:
    redacted = client.sanitize_text("Alice, SSN 219-09-9999")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError as e:
    print(f"Rate limited — retry after {e.retry_after}s")
except PollingTimeoutError as e:
    print(f"Job {e.job_id} timed out after {e.timeout}s")
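
RateLimitError is raised only after the client's own automatic retries are exhausted, so any further retry is an application-level decision. The following is a minimal sketch of such a wrapper, assuming retry_after is a number of seconds as the message above implies. `call_with_retry` is a hypothetical helper, and `FakeRateLimit` stands in for RateLimitError so the block runs without the SDK.

```python
import time


def call_with_retry(fn, *, retry_on, max_attempts=3, default_delay=1.0,
                    sleep=time.sleep):
    """Call fn(), retrying when it raises one of the `retry_on` exceptions.

    Waits for the exception's retry_after value when present,
    otherwise for default_delay seconds.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(getattr(exc, "retry_after", None) or default_delay)


class FakeRateLimit(Exception):
    """Stand-in for RateLimitError (assumed to expose retry_after)."""

    def __init__(self, retry_after):
        super().__init__("rate limited")
        self.retry_after = retry_after


attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeRateLimit(retry_after=0.5)
    return "ok"


waits = []  # records requested delays instead of actually sleeping
result = call_with_retry(flaky, retry_on=FakeRateLimit, sleep=waits.append)
print(result, waits)  # ok [0.5, 0.5]
```

With the SDK, this could wrap a call such as `lambda: client.sanitize_text(text)` with `retry_on=RateLimitError`.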

Context manager (sync)

from expunct import Expunct

with Expunct(api_key="your-api-key") as client:
    redacted = client.sanitize_text("John Smith, DOB 01/01/1980")

Client reference

Expunct / AsyncExpunct

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | required | Your Expunct API key |
| base_url | str | https://api.expunct.ai | Override for self-hosted or staging |
| tenant_id | str \| None | None | Multi-tenant isolation header |
| timeout | float | 30.0 | Per-request timeout in seconds |
| max_retries | int | 3 | Automatic retries on transient errors |

Convenience methods

| Method | Returns | Description |
| --- | --- | --- |
| `sanitize_text(text, *, language)` | str | Redact text in one call (upload → poll → decode) |
| `sanitize_file(file, *, language, dest)` | bytes | Upload a file, poll, return redacted bytes |
| `sanitize_uri(input_uri, *, language, output_uri)` | JobDetailResponse | Submit a URI, poll, return completed job |
| `wait_for_job(job_id, *, interval, timeout)` | JobDetailResponse | Poll a job until it completes or times out |

Resource methods

client.redact

| Method | Returns | Description |
| --- | --- | --- |
| `redact.file(file, *, config, language, policy_id)` | JobResponse | Upload a file and enqueue a redaction job |
| `redact.uri(input_uri, *, output_uri, config, language, metadata)` | JobResponse | Submit a cloud URI for redaction |
| `redact.batch(input_uris, *, config, language, metadata)` | BatchJobResponse | Submit multiple URIs as a batch |

client.jobs

| Method | Returns | Description |
| --- | --- | --- |
| `jobs.list(*, page, page_size, status)` | JobListResponse | List jobs with optional status filter |
| `jobs.get(job_id)` | JobDetailResponse | Get job detail including findings |
| `jobs.report(job_id)` | dict | Get full structured report for a job |
| `jobs.download(job_id, *, dest)` | bytes | Download redacted output; optionally save to dest |
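
jobs.list is paginated through page and page_size. As a rough sketch of walking every page, the hypothetical `iter_jobs` helper below assumes that pages start at 1 and that a page shorter than page_size marks the end; neither is confirmed by this README beyond the page=1 example. A fake listing function stands in for client.jobs.list.

```python
from types import SimpleNamespace


def iter_jobs(list_page, page_size=50):
    """Yield jobs page by page until a short or empty page is returned."""
    page = 1
    while True:
        response = list_page(page=page, page_size=page_size)
        yield from response.jobs
        if len(response.jobs) < page_size:
            return
        page += 1


# Fake backend with five jobs, mimicking a response that has a .jobs list.
ALL_JOBS = [f"job-{i}" for i in range(5)]


def fake_list(*, page, page_size):
    start = (page - 1) * page_size
    return SimpleNamespace(jobs=ALL_JOBS[start:start + page_size])


collected = list(iter_jobs(fake_list, page_size=2))
print(collected)  # ['job-0', 'job-1', 'job-2', 'job-3', 'job-4']
```

With a real client, the call would be along the lines of `iter_jobs(client.jobs.list)`.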

client.policies

| Method | Returns | Description |
| --- | --- | --- |
| `policies.list()` | list[PolicyResponse] | List all policies |
| `policies.create(policy)` | PolicyResponse | Create a new policy |
| `policies.get(policy_id)` | PolicyResponse | Fetch a policy by ID |
| `policies.update(policy_id, policy)` | PolicyResponse | Update a policy |
| `policies.delete(policy_id)` | None | Delete a policy |

client.batch

| Method | Returns | Description |
| --- | --- | --- |
| `batch.get(batch_id)` | BatchJobResponse | Get status of a batch job |

client.api_keys

| Method | Returns | Description |
| --- | --- | --- |
| `api_keys.list()` | list[ApiKeyResponse] | List API keys for your account |
| `api_keys.create(key)` | ApiKeyCreateResponse | Create a new API key |
| `api_keys.revoke(key_id)` | dict | Revoke an API key |

client.audit

| Method | Returns | Description |
| --- | --- | --- |
| `audit.list(*, page, page_size, event_type)` | AuditListResponse | List audit log entries |

Detected Entity Types

Expunct detects the following entity types by default (all categories enabled):

PII (Personally Identifiable Information)

| Type | Example |
| --- | --- |
| PERSON | John Smith |
| EMAIL_ADDRESS | john@example.com |
| PHONE_NUMBER | 415-555-0100 |
| LOCATION | San Francisco, CA |
| DATE_TIME | January 1, 1990 |
| NRP | American, French (nationalities, religions, political groups) |
| ORGANIZATION | Acme Corp |
| URL | https://example.com |
| IP_ADDRESS | 192.168.1.1 |
| US_DRIVER_LICENSE | D1234567 |
| US_PASSPORT | 123456789 |
| US_ITIN | 900-70-0000 |

PCI (Payment Card Industry)

| Type | Example |
| --- | --- |
| CREDIT_CARD | 4111 1111 1111 1111 |
| US_BANK_NUMBER | 123456789 |
| IBAN_CODE | GB29NWBK60161331926819 |
| CRYPTO | 1BoatSLRHtKNngkdXEeobR76b53LETtpyT |
| CVV | 123 |
| EXPIRY_DATE | 12/26 |
| CARD_HOLDER_NAME | J. Smith |
| PIN_NUMBER | 1234 |
| ACCOUNT_NUMBER | 000123456789 |

PHI (Protected Health Information)

| Type | Example |
| --- | --- |
| US_SSN | 219-09-9999 |
| MEDICAL_LICENSE | A1234567 |

You can restrict detection to specific types using a RedactConfig or by setting pii_types on a policy:

from expunct import Expunct, RedactConfig

client = Expunct(api_key="your-api-key")

config = RedactConfig(
    pii_types=["PERSON", "EMAIL_ADDRESS", "US_SSN"],
    redaction_method="blur",
    confidence_threshold=0.6,
)
job = client.redact.file("document.pdf", config=config.model_dump())
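
A misspelled entity type in pii_types is easy to miss. The sketch below checks a requested list against the types documented in the tables above before a config is built. It assumes those tables are the complete set, which may not hold for every deployment, and `unknown_types` is a hypothetical helper rather than an SDK function.

```python
# Entity types copied from the PII, PCI, and PHI tables in this README.
KNOWN_TYPES = {
    "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION", "DATE_TIME", "NRP",
    "ORGANIZATION", "URL", "IP_ADDRESS", "US_DRIVER_LICENSE", "US_PASSPORT",
    "US_ITIN",
    "CREDIT_CARD", "US_BANK_NUMBER", "IBAN_CODE", "CRYPTO", "CVV",
    "EXPIRY_DATE", "CARD_HOLDER_NAME", "PIN_NUMBER", "ACCOUNT_NUMBER",
    "US_SSN", "MEDICAL_LICENSE",
}


def unknown_types(requested):
    """Return the requested entity types that are not in the documented set."""
    return sorted(set(requested) - KNOWN_TYPES)


print(unknown_types(["PERSON", "EMAIL", "US_SSN"]))  # ['EMAIL']
```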

Exceptions

| Exception | Raised when |
| --- | --- |
| AuthenticationError | API key is invalid or expired (401/403) |
| NotFoundError | Job or resource not found (404) |
| ValidationError | Request payload is invalid (422) |
| RateLimitError | Rate limit exceeded after retries (429) |
| PollingTimeoutError | wait_for_job exceeded the timeout |
| ApiError | Base class for all SDK errors |

Document Intelligence

Parse and extract structured data from PDFs and DOCX files.

Document Intelligence is currently in beta. parse, extract, and the safe_parse workflow are only available for enabled tenants on supported paid plans, and requests return 403 until the backend feature flags are turned on.

Parse a document

from expunct import Expunct

client = Expunct(api_key="your-api-key")

# Submit for parsing
job = client.documents.parse("contract.pdf", language="en")

# Poll until complete
completed = client.wait_for_document_job(job.id)

# Inspect produced artifacts (canonical_document, markdown_render, chunks_v1)
for artifact in completed.artifacts:
    print(artifact.artifact_kind, artifact.id)

# Fetch artifact metadata, then retrieve its JSON payload
canonical = next(
    artifact for artifact in completed.artifacts if artifact.artifact_kind == "canonical_document"
)
metadata = client.documents.get_artifact(canonical.id)
content = client.documents.get_artifact_content(metadata.id)
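
Calling next() on a generator without a default raises StopIteration when no artifact of the requested kind exists. A small lookup helper, sketched below under the assumption that artifacts expose artifact_kind and id as in the example, returns None instead. `find_artifact` is hypothetical, and stand-in objects replace the SDK response.

```python
from types import SimpleNamespace


def find_artifact(artifacts, kind):
    """Return the first artifact of the given kind, or None if absent."""
    return next((a for a in artifacts if a.artifact_kind == kind), None)


# Stand-ins for completed.artifacts so the sketch runs without the SDK.
artifacts = [
    SimpleNamespace(artifact_kind="markdown_render", id="art-1"),
    SimpleNamespace(artifact_kind="canonical_document", id="art-2"),
]
canonical = find_artifact(artifacts, "canonical_document")
print(canonical.id)  # art-2
print(find_artifact(artifacts, "chunks_v1"))  # None
```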

Extract structured fields

Provide a JSON Schema to extract specific fields from a document. You can pass an existing parse artifact ID to avoid re-parsing:

from expunct import Expunct

client = Expunct(api_key="your-api-key")

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "vendor_name": {"type": "string"},
    },
}

# Extract from a file directly
job = client.documents.extract(file="invoice.pdf", schema=schema)
completed = client.wait_for_document_job(job.id)

result = next(
    artifact for artifact in completed.artifacts if artifact.artifact_kind == "extraction_result"
)
data = client.documents.get_artifact_content(result.id)
print(data)
# {"invoice_number": "INV-1042", "total_amount": 3150.00, "vendor_name": "Acme Corp"}
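
Before using extracted data downstream, it may be worth checking it against the schema that was submitted. The sketch below is a deliberately small check for flat schemas like the one above, not a JSON Schema validator: it handles only a few primitive types and, as an assumption stricter than JSON Schema itself, reports every listed property that is missing. `schema_mismatches` is a hypothetical helper.

```python
# Mapping from JSON Schema primitive type names to Python types.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def schema_mismatches(data, schema):
    """List missing or wrongly typed top-level fields for a flat schema."""
    problems = []
    for field, spec in schema.get("properties", {}).items():
        if field not in data:
            problems.append(f"{field}: missing")
            continue
        type_name = spec.get("type")
        expected = JSON_TYPES.get(type_name)
        value = data[field]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and type_name in ("number", "integer")
        if expected and (bool_as_number or not isinstance(value, expected)):
            problems.append(
                f"{field}: expected {type_name}, got {type(value).__name__}"
            )
    return problems


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "vendor_name": {"type": "string"},
    },
}
good = {"invoice_number": "INV-1042", "total_amount": 3150.00, "vendor_name": "Acme Corp"}
bad = {"invoice_number": 1042, "total_amount": "3150"}
print(schema_mismatches(good, schema))  # []
print(schema_mismatches(bad, schema))
```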

Safe-parse (parse + PII redaction in one step)

from expunct import Expunct

client = Expunct(api_key="your-api-key")

# Parse and sanitize in a single workflow
job = client.documents.safe_parse("patient_notes.pdf", language="en")
completed = client.wait_for_document_job(job.id)

# Artifacts include the PII-sanitized canonical document, markdown, and chunks
for artifact in completed.artifacts:
    print(artifact.artifact_kind, artifact.id)

Async document intelligence

import asyncio
from expunct import AsyncExpunct

async def main():
    async with AsyncExpunct(api_key="your-api-key") as client:
        job = await client.documents.parse("report.pdf")
        completed = await client.wait_for_document_job(job.id)
        canonical = next(
            artifact
            for artifact in completed.artifacts
            if artifact.artifact_kind == "canonical_document"
        )
        content = await client.documents.get_artifact_content(canonical.id)
        print(content)

asyncio.run(main())
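
With the async client, several documents could be processed concurrently. The sketch below shows one way to cap the number of in-flight calls with asyncio.Semaphore, which may help stay under rate limits. `gather_bounded` is a hypothetical helper, and `fake_parse` stands in for a parse-and-wait coroutine so the block runs without the SDK.

```python
import asyncio


async def gather_bounded(coro_fns, limit=4):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(fn):
        async with semaphore:
            return await fn()

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run(fn) for fn in coro_fns))


async def fake_parse(name):
    # Stand-in for: parse the file, then wait for the document job.
    await asyncio.sleep(0)
    return f"{name}: parsed"


async def demo():
    names = ["a.pdf", "b.pdf", "c.pdf"]
    return await gather_bounded([lambda n=n: fake_parse(n) for n in names], limit=2)


results = asyncio.run(demo())
print(results)  # ['a.pdf: parsed', 'b.pdf: parsed', 'c.pdf: parsed']
```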

client.documents reference

| Method | Returns | Description |
| --- | --- | --- |
| `documents.parse(file, *, config, language)` | DocumentJobResponse | Submit a PDF/DOCX for parsing |
| `documents.extract(*, file, parse_artifact_id, schema, template_id, config, language)` | DocumentJobResponse | Extract fields from a file or parse artifact |
| `documents.safe_parse(file, *, config, policy_id, language)` | DocumentJobResponse | Parse + redact PII in one step |
| `documents.get_job(job_id)` | DocumentJobDetailResponse | Poll a document job |
| `documents.get_artifact(artifact_id)` | ArtifactResponse | Retrieve artifact metadata |
| `documents.get_artifact_content(artifact_id)` | dict | Retrieve artifact content as JSON |
| `wait_for_document_job(job_id, *, interval, timeout)` | DocumentJobDetailResponse | Poll until complete or timeout |

License

MIT
