Quarantine your imports — configurable content classification pipeline
Poveglia
Quarantine your imports.
A Python library that provides a configurable pipeline of content classifiers for scanning uploaded files. Virus scanning, explicit content detection, CSAM reporting, zip bomb detection, AI-generated image detection, and more — all through a single async API.
For how the system is built internally, see ARCHITECTURE.md; for the design rationale and tradeoffs, see THEORY.md.
Quick Start
pip install poveglia
import asyncio
from poveglia import classify, Status

result = asyncio.run(classify({
    "url": "s3://my-bucket/uploads/photo.jpg",
    "classifiers": ["virus", "explicit", "csam", "policy"],
    "classifier_config": {
        "explicit": {"api_callable": my_vision_api, "threshold": 0.7},
        "csam": {"api_callable": my_csam_api, "callback": my_csam_reporter},
        "policy": {"max_size_bytes": 50_000_000, "forbidden_mimetypes": ["video/*"]},
    },
    "metadata": {"user_id": "u_123", "upload_id": "up_456"},
}))

if result.status == Status.FORBID:
    reject_upload(result)
elif result.status == Status.REVIEW:
    queue_for_human_review(result)
Or use the sync wrapper:
from poveglia import classify_sync
result = classify_sync({...})
How It Works
Poveglia runs classifiers in series, in the order you specify. Each classifier returns one of four statuses:
| Status | Meaning | Pipeline behavior |
|---|---|---|
| allow | Content passes | Continue to next classifier |
| review | Uncertain; flag for human review | Continue to next classifier |
| forbid | Content fails | Stop pipeline |
| mandatory_action | Content fails, action required | Execute callback, then stop pipeline |
The result includes a top-level status (the worst across all classifiers), per-classifier details, any actions taken, and your metadata passed through untouched.
Scoring Mode
If you want all classifiers to run regardless of failures (for ranking rather than gating):
result = await classify({
    ...
    "scoring_mode": True,
})
# result.status is still the worst, but nothing was short-circuited
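The series-with-short-circuit behavior and the effect of scoring mode can be sketched as below. This is a minimal illustration under stated assumptions, not Poveglia's actual implementation: the severity ordering of `Status`, the `run_pipeline` helper, and the toy classifiers are all invented for the example.

```python
import asyncio
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical severity ordering: a higher value means a worse verdict.
    ALLOW = 0
    REVIEW = 1
    FORBID = 2
    MANDATORY_ACTION = 3


# Statuses that stop the pipeline unless scoring mode is on.
STOPPING = {Status.FORBID, Status.MANDATORY_ACTION}


async def run_pipeline(classifiers, scoring_mode=False):
    """Run (name, coroutine function) pairs in order; return (worst, per-classifier)."""
    results = {}
    for name, classifier in classifiers:
        status = await classifier()
        results[name] = status
        if status in STOPPING and not scoring_mode:
            break  # short-circuit: later classifiers never run
    worst = max(results.values(), default=Status.ALLOW)
    return worst, results


async def allow():
    return Status.ALLOW


async def review():
    return Status.REVIEW


async def forbid():
    return Status.FORBID


steps = [("virus", allow), ("explicit", forbid), ("policy", review)]
gated = asyncio.run(run_pipeline(steps))
scored = asyncio.run(run_pipeline(steps, scoring_mode=True))
```

In the gated run, `policy` never executes because `explicit` returned a stopping status; in the scored run all three execute, and the aggregate status is the same.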
Bundled Classifiers
Detection
| Name | What it detects | Optional deps |
|---|---|---|
| virus | Malware via ClamAV | poveglia[clamav] |
| zip_bomb | Zip bombs (compression ratio, nesting depth) | none |
| explicit | Nudity, gore, violence, suggestive content | poveglia[vision] |
| csam | CSAM; returns mandatory_action on high-confidence hits | poveglia[vision] |
| generated | AI-generated imagery | poveglia[vision] |
| identifiable | Identifiable people (faces) | poveglia[vision] |
| policy | File size, MIME type (extension-based) | none |
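To make "extension-based" concrete, here is a rough sketch of what a size and MIME policy check could look like using only the standard library. The `check_policy` function, its violation labels, and the choice to flag unknown extensions are assumptions for illustration; the bundled policy classifier may behave differently. Because the type is guessed from the file name alone, a renamed file defeats this kind of check.

```python
import fnmatch
import mimetypes


def check_policy(filename, size_bytes, max_size_bytes=None,
                 forbidden_mimetypes=(), allowed_mimetypes=()):
    """Return a list of policy violations; an empty list means the file passes."""
    violations = []
    if max_size_bytes is not None and size_bytes > max_size_bytes:
        violations.append("too_large")

    # Extension-based guess: the file's bytes are never inspected.
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        violations.append("unknown_type")  # assumption: unknown types are flagged
        return violations

    # Patterns such as "video/*" are matched as shell-style wildcards.
    if any(fnmatch.fnmatch(mime, pattern) for pattern in forbidden_mimetypes):
        violations.append("forbidden_type")
    if allowed_mimetypes and not any(
        fnmatch.fnmatch(mime, pattern) for pattern in allowed_mimetypes
    ):
        violations.append("not_allowed_type")
    return violations
```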
Actions
These run in the pipeline like any classifier, but are also available as standalone API calls:
| Name | What it does | Standalone API |
|---|---|---|
| reporting | Submits reports when classifier scores exceed thresholds | poveglia.reporting.submit() |
| legal_hold | Places objects on legal hold in storage | poveglia.legal_hold.apply() |
| metadata | Writes classification metadata to object store | poveglia.metadata.upload() |
The Input Control Structure
{
    # Required
    "url": "s3://bucket/uploads/file.jpg",
    "classifiers": ["virus", "zip_bomb", "explicit", "csam",
                    "identifiable", "reporting", "metadata"],

    # Per-classifier configuration
    "classifier_config": {
        "explicit": {
            "api_callable": my_vision_api,   # async callable
            "threshold": 0.7,                # forbid above this
            "review_threshold": 0.4,         # review above this
        },
        "csam": {
            "api_callable": my_csam_api,
            "callback": my_csam_handler,     # fires on mandatory_action
            "threshold": 0.8,
        },
        "reporting": {
            "triggers": {"csam": 0.8, "explicit": 0.95},
            "handler": my_report_handler,
        },
        "policy": {
            "max_size_bytes": 52428800,
            "forbidden_mimetypes": ["video/*"],
            "allowed_mimetypes": ["image/*"],
        },
        "metadata": {
            "backend": my_metadata_writer,
        },
    },

    # Skip downloading: use a local copy instead
    "local_path": "/tmp/staged/file.jpg",

    # Cap bytes pulled from a remote URL (DoS guard); omit or None for no cap.
    # Exceeding it raises ContentTooLargeError, recorded in result.errors.
    "max_download_bytes": 52428800,

    # Run all classifiers, never short-circuit
    "scoring_mode": False,

    # Where transformation classifiers write output. Exposed to classifiers as
    # content.output_url; a transforming classifier writes there and returns it
    # as ClassifierResult.transformed_url (surfaced on result.transformed_url).
    "output_url": "s3://bucket/transformed/file.jpg",

    # Passed through untouched to the result
    "metadata": {"user_id": "u_123", "upload_id": "up_456"},
}
The classifiers list controls both which classifiers run and in what order. Order matters — classifiers can share results through the blackboard (see below).
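The `threshold` and `review_threshold` settings in the structure above suggest a two-level mapping from score to status. A minimal sketch of that mapping follows; `status_for_score` is a hypothetical helper, and whether the real comparison is strict or inclusive at the boundary is an assumption here (strict is used).

```python
def status_for_score(score, threshold, review_threshold=None):
    """Map a classifier score to a status name using two cut-offs."""
    if score > threshold:
        return "forbid"   # above the hard limit
    if review_threshold is not None and score > review_threshold:
        return "review"   # uncertain band between the two cut-offs
    return "allow"
```

With `threshold=0.7` and `review_threshold=0.4`, a score of 0.5 falls in the review band, while 0.9 is forbidden and 0.1 is allowed.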
The Result Object
result.status # Status.ALLOW / REVIEW / FORBID / MANDATORY_ACTION
result.is_clean # True only if status == ALLOW AND errors is empty
result.classifiers # {"virus": ClassifierResult(...), "explicit": ClassifierResult(...)}
result.actions_taken # [ActionRecord(classifier="reporting", action="callback", result={...})]
result.errors # [ErrorRecord(classifier="generated", error="ServiceUnavailable", ...)]
result.transformed_url # "s3://..." if a transformation classifier produced output
result.metadata # {"user_id": "u_123"} — your passthrough data
Important: result.status alone is not a "safe to ship" signal. Classifier exceptions are recorded in result.errors and do not raise the aggregate status, so a run where every classifier raised yields Status.ALLOW with populated errors. Use result.is_clean as the binary pass/fail predicate, or check result.errors explicitly alongside result.status.
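One way to act on this warning is to route every result through a single decision function that looks at both the status and the errors. The sketch below uses a stand-in `FakeResult` class and plain strings rather than the real result object and `Status` enum, and the "retry" outcome for incomplete runs is an assumed policy, not something the library prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    # Stand-in for the real result object; only the fields used here.
    status: str
    errors: list = field(default_factory=list)

    @property
    def is_clean(self):
        # Mirrors the documented rule: ALLOW and no errors.
        return self.status == "allow" and not self.errors


def decide(result):
    """Turn a result into one of: 'accept', 'review', 'reject', 'retry'."""
    if result.status in ("forbid", "mandatory_action"):
        return "reject"
    if result.errors:
        return "retry"    # a classifier failed, so the verdict is incomplete
    if result.status == "review":
        return "review"
    return "accept"
```

The key property is that an `allow` status with a non-empty error list never reaches "accept".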
Content Access
Poveglia accesses files through a lazy content resolver. Some classifiers need only the URL (to pass to external APIs); others need the raw bytes or a local file path.
The resolver downloads only when needed, and caches the result — so if three classifiers call .bytes(), the file is downloaded once.
To avoid the download entirely, provide a local_path in the control structure pointing to a locally-staged copy.
Memory footprint
ContentResolver.bytes() holds the full content in memory for the resolver's lifetime. For small uploads (images, documents) this is fine and avoids redundant I/O. For large files (video, archives, disk images) prefer local_path() in your classifier — it materializes a temp file once and hands out paths instead of keeping bytes resident. Classifiers that shell out to external binaries (ClamAV, ffmpeg, etc.) should always use local_path() regardless of size.
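The download-once behavior can be illustrated with a small lazy, caching wrapper. This is a sketch of the general technique only: `LazyContent`, its `fetch_count` attribute, and `fake_download` are hypothetical names, and the real ContentResolver may be structured differently.

```python
import asyncio


class LazyContent:
    """Fetch bytes on first use and serve every later call from the cache."""

    def __init__(self, fetch):
        self._fetch = fetch            # async callable returning bytes
        self._data = None
        self._lock = asyncio.Lock()
        self.fetch_count = 0           # exposed only to make the demo observable

    async def bytes(self):
        async with self._lock:         # guards against concurrent first calls
            if self._data is None:
                self._data = await self._fetch()
                self.fetch_count += 1
            return self._data


async def fake_download():
    await asyncio.sleep(0)             # stands in for network I/O
    return b"file contents"


async def demo():
    content = LazyContent(fake_download)
    # Three classifiers asking for the bytes at the same time.
    await asyncio.gather(content.bytes(), content.bytes(), content.bytes())
    return content.fetch_count


downloads = asyncio.run(demo())
```

The cached bytes stay resident for as long as the wrapper lives, which is the memory trade-off described above.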
The Blackboard
Classifiers can share intermediate results through a shared context dict, avoiding redundant API calls.
For example, if explicit calls a vision API that also returns face detection data, identifiable can reuse it instead of making a second call:
# explicit classifier writes to the blackboard:
context["explicit.faces"] = [{"confidence": 0.85}, ...]

# identifiable classifier checks the blackboard first:
faces = context.get("explicit.faces")
if faces is not None:
    ...  # reuse the shared data; no API call needed
Keys follow the convention <classifier_name>.<key>. Classifiers must always work standalone if the blackboard is empty — the optimization is never a hard dependency.
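A runnable sketch of the pattern, including the standalone fallback, might look like this. The `fake_vision_api` function, its response shape, and the simplified classifier signatures are assumptions made for the example; only the `explicit.faces` key comes from the text above.

```python
import asyncio

api_calls = 0


async def fake_vision_api(url):
    # Hypothetical API that returns an explicit score and face data together.
    global api_calls
    api_calls += 1
    return {"explicit_score": 0.1, "faces": [{"confidence": 0.85}]}


async def explicit(url, context):
    response = await fake_vision_api(url)
    context["explicit.faces"] = response["faces"]    # publish for later classifiers
    return response["explicit_score"]


async def identifiable(url, context):
    faces = context.get("explicit.faces")
    if faces is None:                                 # standalone path: call the API
        faces = (await fake_vision_api(url))["faces"]
    return len(faces)


async def demo():
    context = {}
    await explicit("s3://bucket/file.jpg", context)
    shared = await identifiable("s3://bucket/file.jpg", context)   # reuses data
    standalone = await identifiable("s3://bucket/file.jpg", {})    # empty blackboard
    return shared, standalone


shared, standalone = asyncio.run(demo())
```

Both calls to `identifiable` return the same answer; the only difference is that the shared run saves one API call.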
Writing Custom Classifiers
from poveglia import Classifier, ClassifierResult, Status


class MyClassifier(Classifier):
    name = "my_check"

    async def classify(self, content, config, context):
        data = await content.bytes()
        if looks_bad(data):
            return ClassifierResult(
                status=Status.FORBID,
                detail={"reason": "failed my_check"},
            )
        return ClassifierResult(
            status=Status.ALLOW,
            detail={"clean": True},
        )
Register it as an entry point in your package's pyproject.toml:
[project.entry-points."poveglia.classifiers"]
my_check = "my_package.classifiers:MyClassifier"
Then reference it by name: "classifiers": ["virus", "my_check", "policy"].
CSAM Handling
The CSAM classifier returns mandatory_action on high-confidence hits. This means:
- The pipeline short-circuits (no further classifiers run)
- The callback you provided in classifier_config.csam.callback fires automatically
- The callback result is recorded in result.actions_taken
If no callback is configured, the classifier falls back to forbid — the content is still rejected, but no automatic reporting occurs. A warning is emitted on the poveglia.classifiers.csam logger whenever this fallback fires; route that logger at WARNING or above to your alerting channel.
For deployments where missing the callback is a compliance violation (not merely a dev-mode inconvenience), set require_callback: True in the csam config. With that flag on, a high-confidence detection without a callback raises — the misconfiguration lands in result.errors instead of silently rejecting the content.
Poveglia ships a reporting utility (poveglia.reporting.submit()) and a legal hold utility (poveglia.legal_hold.apply()) that you can wire up as callbacks. You are responsible for configuring and using these — Poveglia provides the tools, not the compliance.
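The callback, fallback, and require_callback rules in this section amount to a small decision table, sketched below together with a logging handler that captures the fallback warning. The `resolve_csam_hit` function, the `RuntimeError` exception type, and the `AlertHandler` class are assumptions for illustration; only the logger name and the require_callback flag come from the text above.

```python
import logging

logger = logging.getLogger("poveglia.classifiers.csam")


def resolve_csam_hit(callback=None, require_callback=False):
    """Decide the status for a high-confidence hit given the configuration."""
    if callback is not None:
        return "mandatory_action"
    if require_callback:
        # Misconfiguration is surfaced as an error instead of a quiet reject.
        raise RuntimeError("csam: high-confidence hit but no callback configured")
    logger.warning("csam: no callback configured; falling back to forbid")
    return "forbid"


class AlertHandler(logging.Handler):
    """Stand-in for a real alerting integration; it only collects messages."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.alerts = []

    def emit(self, record):
        self.alerts.append(record.getMessage())


alerts = AlertHandler()
logger.addHandler(alerts)
status = resolve_csam_hit()   # no callback, flag off: falls back and warns
```

In a real deployment the handler would forward to a pager or incident channel rather than a list.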
Error Handling
If a classifier raises an exception, the pipeline catches it and continues. The error is recorded in result.errors, but it doesn't stop other classifiers from running and doesn't affect the top-level status.
A failed mandatory callback (e.g., a CSAM report that couldn't be submitted) is recorded in result.actions_taken with error detail — surface this loudly so you can retry.
Principle: fail open in the pipeline, fail loud in the results.
The one exception is configuration errors. An unknown classifier name in classifiers is not caught — classify() / classify_sync() raises KeyError before any classifier runs (and before any download), so a typo'd name fails fast rather than silently producing an incomplete result. This is deliberate: a missing classifier is a programming error, not a content verdict.
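Both halves of this policy, fail fast on unknown names and fail open on classifier exceptions, fit in a short sketch. The `run_all` helper, the dictionary registry, and the error record shape are hypothetical and stand in for the library's internals.

```python
import asyncio


async def run_all(names, registry):
    """Validate names up front, then run each classifier and collect errors."""
    # A typo'd name fails here with KeyError, before any classifier runs.
    selected = [(name, registry[name]) for name in names]

    statuses, errors = {}, []
    for name, classifier in selected:
        try:
            statuses[name] = await classifier()
        except Exception as exc:
            # Fail open: record the failure and move on to the next classifier.
            errors.append({"classifier": name, "error": type(exc).__name__})
    return statuses, errors


async def ok():
    return "allow"


async def broken():
    raise ConnectionError("service unavailable")


registry = {"virus": ok, "generated": broken}
statuses, errors = asyncio.run(run_all(["generated", "virus"], registry))
```

Here `generated` fails but `virus` still runs, so the caller sees one status and one error record, which is why the errors list has to be checked alongside the status.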
Installation
# Core + all classifiers (light deps only)
pip install poveglia
# With vision classifier dependencies (quoted so shells such as zsh don't expand the brackets)
pip install 'poveglia[vision]'
# With ClamAV support
pip install 'poveglia[clamav]'
# With object storage support (metadata, legal_hold)
pip install 'poveglia[storage]'
# Everything
pip install 'poveglia[all]'
Requirements
- Python 3.11+
- A running ClamAV daemon (for the virus classifier)
- Vision/CSAM API credentials (for explicit, csam, generated, identifiable)
Development
# Editable install with the dev toolchain
pip install -e '.[dev]'
# Run the test suite (the "integration" marker is reserved for real-service
# tests; none exist yet, so this currently runs everything)
pytest -m "not integration"
# Lint and type-check — the same gates CI enforces
ruff check poveglia tests
mypy poveglia
CI runs lint, type-check, and tests on Python 3.11, 3.12, and 3.13 for every push and pull request; a pip-audit dependency scan runs report-only.
Releasing
Releases publish to PyPI via GitHub Actions OIDC trusted publishing — no API token is stored anywhere. Publishing a GitHub Release triggers .github/workflows/publish.yml, which builds the sdist + wheel and uploads them with attestations.
One-time setup (PyPI side): add a Trusted Publisher for project poveglia → owner Xof, repo poveglia, workflow publish.yml, environment pypi.
License
MIT