Quarantine your imports — configurable content classification pipeline

Poveglia

Quarantine your imports.

A Python library that provides a configurable pipeline of content classifiers for scanning uploaded files. Virus scanning, explicit content detection, CSAM reporting, zip bomb detection, AI-generated image detection, and more — all through a single async API.

For how the system is built internally, see ARCHITECTURE.md; for the design rationale and tradeoffs, see THEORY.md.

Quick Start

pip install poveglia

import asyncio
from poveglia import classify, Status

result = asyncio.run(classify({
    "url": "s3://my-bucket/uploads/photo.jpg",
    "classifiers": ["virus", "explicit", "csam", "policy"],
    "classifier_config": {
        "explicit": {"api_callable": my_vision_api, "threshold": 0.7},
        "csam": {"api_callable": my_csam_api, "callback": my_csam_reporter},
        "policy": {"max_size_bytes": 50_000_000, "forbidden_mimetypes": ["video/*"]},
    },
    "metadata": {"user_id": "u_123", "upload_id": "up_456"},
}))

if result.status == Status.FORBID:
    reject_upload(result)
elif result.status == Status.REVIEW:
    queue_for_human_review(result)

Or use the sync wrapper:

from poveglia import classify_sync

result = classify_sync({...})

How It Works

Poveglia runs classifiers in series, in the order you specify. Each classifier returns one of four statuses:

| Status | Meaning | Pipeline behavior |
| --- | --- | --- |
| allow | Content passes | Continue to next classifier |
| review | Uncertain; flag for human review | Continue to next classifier |
| forbid | Content fails | Stop pipeline |
| mandatory_action | Content fails, action required | Execute callback, then stop pipeline |

The result includes a top-level status (the worst across all classifiers), per-classifier details, any actions taken, and your metadata passed through untouched.
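
As a rough illustration of the gating rules above, the loop below runs stand-in classifiers in order, stops on forbid or mandatory_action, and reports the worst status seen. It does not use Poveglia itself: Status here is a local enum whose numeric ordering is an assumption, and run_pipeline, always, and DEMO are hypothetical names for this sketch. Callback execution for mandatory_action is omitted.

```python
import asyncio
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that max() picks the "worst" status (ordering is an assumption).
    ALLOW = 0
    REVIEW = 1
    FORBID = 2
    MANDATORY_ACTION = 3


async def run_pipeline(classifiers, scoring_mode=False):
    """Run (name, async_fn) pairs in order; return (worst_status, per_classifier)."""
    per_classifier = {}
    for name, fn in classifiers:
        status = await fn()
        per_classifier[name] = status
        stops = status in (Status.FORBID, Status.MANDATORY_ACTION)
        if stops and not scoring_mode:
            break  # short-circuit: later classifiers never run
    worst = max(per_classifier.values(), default=Status.ALLOW)
    return worst, per_classifier


def always(status):
    """Build a stand-in classifier that always returns the given status."""
    async def fn():
        return status
    return fn


DEMO = [
    ("virus", always(Status.ALLOW)),
    ("explicit", always(Status.REVIEW)),
    ("policy", always(Status.FORBID)),
    ("generated", always(Status.ALLOW)),
]
```

With DEMO, the default run stops after policy, so generated never runs. Passing scoring_mode=True runs all four, and the worst status is FORBID in both cases.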

Scoring Mode

If you want all classifiers to run regardless of failures (for ranking rather than gating):

result = await classify({
    ...
    "scoring_mode": True,
})
# result.status is still the worst, but nothing was short-circuited

Bundled Classifiers

Detection

| Name | What it detects | Optional deps |
| --- | --- | --- |
| virus | Malware via ClamAV | poveglia[clamav] |
| zip_bomb | Zip bombs (compression ratio, nesting depth) | none |
| explicit | Nudity, gore, violence, suggestive content | poveglia[vision] |
| csam | CSAM (returns mandatory_action on high-confidence hits) | poveglia[vision] |
| generated | AI-generated imagery | poveglia[vision] |
| identifiable | Identifiable people (faces) | poveglia[vision] |
| policy | File size, MIME type (extension-based) | none |
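
To make the zip_bomb row concrete, here is one way the two signals it names, compression ratio and nesting depth, can be measured with the standard-library zipfile module. This is a sketch and not Poveglia's implementation; inspect_zip, make_zip, and the limits are hypothetical. The sizes come from the archive headers, which a hostile file can forge, so a production check would also need to bound actual decompression.

```python
import io
import zipfile


def inspect_zip(data, max_depth=3, max_member_bytes=10_000_000, depth=1):
    """Return (worst compression ratio, deepest nesting level) for a zip in memory."""
    worst_ratio = 0.0
    deepest = depth
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            # Ratio of declared uncompressed size to stored size, per member.
            if info.compress_size:
                worst_ratio = max(worst_ratio, info.file_size / info.compress_size)
            # Recurse into nested archives, but only small ones and only so deep.
            nested = info.filename.lower().endswith(".zip")
            if nested and depth < max_depth and info.file_size <= max_member_bytes:
                try:
                    ratio, level = inspect_zip(
                        zf.read(info), max_depth, max_member_bytes, depth + 1
                    )
                except zipfile.BadZipFile:
                    continue  # a ".zip" name that is not actually a zip
                worst_ratio = max(worst_ratio, ratio)
                deepest = max(deepest, level)
    return worst_ratio, deepest


def make_zip(members, compression=zipfile.ZIP_DEFLATED):
    """Build a zip archive in memory from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression) as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()
```

A caller would compare the returned ratio and depth against thresholds of its own choosing; a megabyte of zeros, for instance, deflates to a ratio in the hundreds or more.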

Actions

These run in the pipeline like any classifier, but are also available as standalone API calls:

| Name | What it does | Standalone API |
| --- | --- | --- |
| reporting | Submits reports when classifier scores exceed thresholds | poveglia.reporting.submit() |
| legal_hold | Places objects on legal hold in storage | poveglia.legal_hold.apply() |
| metadata | Writes classification metadata to object store | poveglia.metadata.upload() |

The Input Control Structure

{
    # Required
    "url": "s3://bucket/uploads/file.jpg",
    "classifiers": ["virus", "zip_bomb", "explicit", "csam",
                     "identifiable", "reporting", "metadata"],

    # Per-classifier configuration
    "classifier_config": {
        "explicit": {
            "api_callable": my_vision_api,  # async callable
            "threshold": 0.7,               # forbid above this
            "review_threshold": 0.4,        # review above this
        },
        "csam": {
            "api_callable": my_csam_api,
            "callback": my_csam_handler,    # fires on mandatory_action
            "threshold": 0.8,
        },
        "reporting": {
            "triggers": {"csam": 0.8, "explicit": 0.95},
            "handler": my_report_handler,
        },
        "policy": {
            "max_size_bytes": 52428800,
            "forbidden_mimetypes": ["video/*"],
            "allowed_mimetypes": ["image/*"],
        },
        "metadata": {
            "backend": my_metadata_writer,
        },
    },

    # Skip downloading — use a local copy instead
    "local_path": "/tmp/staged/file.jpg",

    # Cap bytes pulled from a remote URL (DoS guard); omit or None for no cap.
    # Exceeding it raises ContentTooLargeError, recorded in result.errors.
    "max_download_bytes": 52428800,

    # Run all classifiers, never short-circuit
    "scoring_mode": False,

    # Where transformation classifiers write output. Exposed to classifiers as
    # content.output_url; a transforming classifier writes there and returns it
    # as ClassifierResult.transformed_url (surfaced on result.transformed_url).
    "output_url": "s3://bucket/transformed/file.jpg",

    # Passed through untouched to the result
    "metadata": {"user_id": "u_123", "upload_id": "up_456"},
}

The classifiers list controls both which classifiers run and in what order. Order matters — classifiers can share results through the blackboard (see below).
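
The policy entries in the structure above use wildcard patterns such as video/* and image/*. The sketch below shows one plausible reading of how such patterns could be matched against an extension-based MIME guess, using only the standard library. The precedence rules (forbidden wins, a non-empty allow-list must match, unknown types pass only without an allow-list) are assumptions, and mime_allowed is a hypothetical helper, not part of Poveglia.

```python
import mimetypes
from fnmatch import fnmatch


def mime_allowed(filename, forbidden=(), allowed=()):
    """Check a filename's guessed MIME type against wildcard patterns."""
    guessed, _ = mimetypes.guess_type(filename)  # extension-based guess
    if guessed is None:
        # Unknown type: pass only when no allow-list is configured (assumption).
        return not allowed
    if any(fnmatch(guessed, pattern) for pattern in forbidden):
        return False
    if allowed:
        return any(fnmatch(guessed, pattern) for pattern in allowed)
    return True
```

Because the guess relies on the file extension alone, a renamed file defeats it; content sniffing would be a separate check.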

The Result Object

result.status             # Status.ALLOW / REVIEW / FORBID / MANDATORY_ACTION
result.is_clean           # True only if status == ALLOW AND errors is empty
result.classifiers        # {"virus": ClassifierResult(...), "explicit": ClassifierResult(...)}
result.actions_taken      # [ActionRecord(classifier="reporting", action="callback", result={...})]
result.errors             # [ErrorRecord(classifier="generated", error="ServiceUnavailable", ...)]
result.transformed_url    # "s3://..." if a transformation classifier produced output
result.metadata           # {"user_id": "u_123"} — your passthrough data

Important: result.status alone is not a "safe to ship" signal. Classifier exceptions are recorded in result.errors and do not escalate the aggregate status: a run in which every classifier threw an exception still yields Status.ALLOW, with result.errors populated. Use result.is_clean as the binary pass/fail predicate, or check result.errors explicitly alongside result.status.
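
The difference matters most when a classifier fails silently. The sketch below uses a stand-in Result dataclass (not Poveglia's real result type) to show a three-way decision that treats "allow with errors" as something to hold rather than accept. The decide function and its return labels are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ALLOW = "allow"
    REVIEW = "review"
    FORBID = "forbid"
    MANDATORY_ACTION = "mandatory_action"


@dataclass
class Result:
    """Stand-in for the result object: only the fields this sketch needs."""
    status: Status
    errors: list = field(default_factory=list)

    @property
    def is_clean(self):
        # Mirrors the documented rule: ALLOW and no recorded errors.
        return self.status is Status.ALLOW and not self.errors


def decide(result):
    """Map a result to 'reject', 'hold', or 'accept'."""
    if result.status in (Status.FORBID, Status.MANDATORY_ACTION):
        return "reject"
    if not result.is_clean:
        return "hold"  # REVIEW, or ALLOW with classifier errors: retry or review
    return "accept"
```

Here an ALLOW result carrying a "ServiceUnavailable" error maps to "hold", which is the case a status-only check would wrongly accept.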

Content Access

Poveglia accesses files through a lazy content resolver. Some classifiers need only the URL (to pass to external APIs); others need the raw bytes or a local file path.

The resolver downloads only when needed, and caches the result — so if three classifiers call .bytes(), the file is downloaded once.
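
The download-once behavior can be pictured with a small asyncio sketch. LazyContent below is a hypothetical stand-in, not the real ContentResolver: it defers the fetch until the first .bytes() call and shares a single task among all callers, including concurrent ones.

```python
import asyncio


class LazyContent:
    """Fetch bytes on first use and reuse the same result afterwards."""

    def __init__(self, fetch):
        self._fetch = fetch    # async callable that returns bytes
        self._task = None      # created on the first bytes() call
        self.downloads = 0     # counter for demonstration only

    async def bytes(self):
        if self._task is None:
            # First caller starts the download; later callers await the same task.
            self._task = asyncio.create_task(self._download())
        return await self._task

    async def _download(self):
        self.downloads += 1
        return await self._fetch()


async def demo():
    async def fetch():
        await asyncio.sleep(0)  # stands in for network I/O
        return b"payload"

    content = LazyContent(fetch)
    results = await asyncio.gather(content.bytes(), content.bytes(), content.bytes())
    return results, content.downloads
```

Running demo() yields three identical payloads from a single fetch. Holding the task rather than a plain cached value is what keeps concurrent callers from racing into separate downloads.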

To avoid the download entirely, provide a local_path in the control structure pointing to a locally-staged copy.

Memory footprint

ContentResolver.bytes() holds the full content in memory for the resolver's lifetime. For small uploads (images, documents) this is fine and avoids redundant I/O. For large files (video, archives, disk images) prefer local_path() in your classifier — it materializes a temp file once and hands out paths instead of keeping bytes resident. Classifiers that shell out to external binaries (ClamAV, ffmpeg, etc.) should always use local_path() regardless of size.

The Blackboard

Classifiers can share intermediate results through a shared context dict, avoiding redundant API calls.

For example, if explicit calls a vision API that also returns face detection data, identifiable can reuse it instead of making a second call:

# explicit classifier writes to the blackboard:
context["explicit.faces"] = [{"confidence": 0.85}, ...]

# identifiable classifier checks the blackboard first:
faces = context.get("explicit.faces")
if faces is not None:
    ...  # reuse the cached data; no API call needed

Keys follow the convention <classifier_name>.<key>. Classifiers must always work standalone if the blackboard is empty — the optimization is never a hard dependency.
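
A minimal sketch of the "optimization, never a hard dependency" rule: the helper below reads explicit.faces when present and otherwise makes its own call. find_faces, fake_face_api, and the identifiable.faces key are hypothetical names chosen to follow the convention above.

```python
import asyncio


async def find_faces(context, detect_faces):
    """Return face data, preferring what an earlier classifier left on the blackboard."""
    faces = context.get("explicit.faces")
    if faces is None:
        # Blackboard miss: fall back to our own (possibly expensive) API call.
        faces = await detect_faces()
        context["identifiable.faces"] = faces  # publish under our own prefix
    return faces


api_calls = []  # records fallback calls so the behavior is observable


async def fake_face_api():
    """Stand-in for a real face-detection API."""
    api_calls.append(1)
    return [{"confidence": 0.91}]
```

With a populated context the fake API is never called; with an empty dict it is called once and the result is published for any later classifier.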

Writing Custom Classifiers

from poveglia import Classifier, ClassifierResult, Status

class MyClassifier(Classifier):
    name = "my_check"

    async def classify(self, content, config, context):
        data = await content.bytes()

        if looks_bad(data):
            return ClassifierResult(
                status=Status.FORBID,
                detail={"reason": "failed my_check"},
            )

        return ClassifierResult(
            status=Status.ALLOW,
            detail={"clean": True},
        )

Register it as an entry point in your package's pyproject.toml:

[project.entry-points."poveglia.classifiers"]
my_check = "my_package.classifiers:MyClassifier"

Then reference it by name: "classifiers": ["virus", "my_check", "policy"].

CSAM Handling

The CSAM classifier returns mandatory_action on high-confidence hits. This means:

  1. The pipeline short-circuits (no further classifiers run)
  2. The callback you provided in classifier_config.csam.callback fires automatically
  3. The callback result is recorded in result.actions_taken

If no callback is configured, the classifier falls back to forbid — the content is still rejected, but no automatic reporting occurs. A warning is emitted on the poveglia.classifiers.csam logger whenever this fallback fires; route that logger at WARNING or above to your alerting channel.

For deployments where missing the callback is a compliance violation (not merely a dev-mode inconvenience), set require_callback: True in the csam config. With that flag on, a high-confidence detection without a callback raises — the misconfiguration lands in result.errors instead of silently rejecting the content.
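
The three outcomes described above (callback present, callback missing, callback missing with require_callback) can be summarized in a small decision function. This is a simplified stand-in, not the classifier's code: resolve_hit, MissingCallbackError, and AlertHandler are hypothetical, and only the logger name comes from the text above. The handler shows one way to route that logger's warnings to an alerting channel.

```python
import logging

logger = logging.getLogger("poveglia.classifiers.csam")


class MissingCallbackError(RuntimeError):
    """Hypothetical error for a high-confidence hit with no callback configured."""


def resolve_hit(score, threshold=0.8, callback=None, require_callback=False):
    """Map a score to a status name (simplified stand-in logic)."""
    if score < threshold:
        return "allow"  # simplification: a real classifier may return review here
    if callback is not None:
        return "mandatory_action"  # the pipeline would fire the callback, then stop
    if require_callback:
        raise MissingCallbackError("csam hit with no callback configured")
    logger.warning("csam hit with no callback; falling back to forbid")
    return "forbid"


class AlertHandler(logging.Handler):
    """Collects WARNING+ records; a real handler would page or post somewhere."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.alerts = []

    def emit(self, record):
        self.alerts.append(record.getMessage())


alerts = AlertHandler()
logger.addHandler(alerts)
```

Attaching the handler to the named logger rather than the root logger keeps the alert channel limited to this one fallback path.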

Poveglia ships a reporting utility (poveglia.reporting.submit()) and a legal hold utility (poveglia.legal_hold.apply()) that you can wire up as callbacks. You are responsible for configuring and using these — Poveglia provides the tools, not the compliance.

Error Handling

If a classifier raises an exception, the pipeline catches it and continues. The error is recorded in result.errors, but it doesn't stop other classifiers from running and doesn't affect the top-level status.

A failed mandatory callback (e.g., a CSAM report that couldn't be submitted) is recorded in result.actions_taken with error detail — surface this loudly so you can retry.

Principle: fail open in the pipeline, fail loud in the results.

The one exception is configuration errors. An unknown classifier name in classifiers is not caught — classify() / classify_sync() raises KeyError before any classifier runs (and before any download), so a typo'd name fails fast rather than silently producing an incomplete result. This is deliberate: a missing classifier is a programming error, not a content verdict.
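
Both halves of this policy fit in a few lines. The sketch below is an illustration under assumed names (run_fail_open, REGISTRY, and the error-record shape are hypothetical): names are resolved up front so a typo raises KeyError before anything runs, while exceptions raised by classifiers are recorded and the loop continues.

```python
import asyncio


async def run_fail_open(names, registry):
    """Run named async classifiers in order, recording exceptions instead of raising."""
    # Resolve every name first: a typo raises KeyError before any classifier runs.
    resolved = [(name, registry[name]) for name in names]
    statuses, errors = {}, []
    for name, classifier in resolved:
        try:
            statuses[name] = await classifier()
        except Exception as exc:  # fail open: record the failure and keep going
            errors.append({"classifier": name, "error": type(exc).__name__})
    return statuses, errors


calls = []  # records which stand-in classifiers actually ran


async def ok():
    calls.append("ok")
    return "allow"


async def broken():
    calls.append("broken")
    raise TimeoutError("vision API timed out")


REGISTRY = {"virus": ok, "generated": broken}
```

Running ["generated", "virus"] records one TimeoutError and still returns the virus status, whereas ["virus", "virsu"] raises KeyError with calls left empty.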

Installation

# Core + all classifiers (light deps only)
pip install poveglia

# With vision classifier dependencies
pip install 'poveglia[vision]'

# With ClamAV support
pip install 'poveglia[clamav]'

# With object storage support (metadata, legal_hold)
pip install 'poveglia[storage]'

# Everything
pip install 'poveglia[all]'

Requirements

  • Python 3.11+
  • A running ClamAV daemon (for the virus classifier)
  • Vision/CSAM API credentials (for explicit, csam, generated, identifiable)

Development

# Editable install with the dev toolchain
pip install -e '.[dev]'

# Run the test suite (the "integration" marker is reserved for real-service
# tests; none exist yet, so this currently runs everything)
pytest -m "not integration"

# Lint and type-check — the same gates CI enforces
ruff check poveglia tests
mypy poveglia

CI runs lint, type-check, and tests on Python 3.11, 3.12, and 3.13 for every push and pull request; a pip-audit dependency scan runs report-only.

Releasing

Releases publish to PyPI via GitHub Actions OIDC trusted publishing — no API token is stored anywhere. Publishing a GitHub Release triggers .github/workflows/publish.yml, which builds the sdist + wheel and uploads them with attestations.

One-time setup (PyPI side): add a Trusted Publisher for project poveglia → owner Xof, repo poveglia, workflow publish.yml, environment pypi.

License

MIT
