Hugging Face Vision
A generic filter that uses Hugging Face Transformers for vision (object detection, image classification, and embedding extraction) across video streams and OpenFilter pipelines. The filter uses one backend per Hugging Face API: each `detection_type` maps to a specific processor + model API. Each API supports all models on the Hugging Face Hub that are compatible with it; any model loadable by the same classes will work without code changes.
Supported Hugging Face APIs
We support the following Hugging Face APIs. Each API corresponds to one `detection_type` and accepts any Hub model that works with that API (the examples below are commonly used or tested).
| HF API (processor + model) | detection_type | Example model IDs |
|---|---|---|
| `AutoImageProcessor` + `AutoModelForImageClassification` | `image-classification` | `google/vit-base-patch16-224`, `facebook/convnext-tiny-224` |
| `AutoImageProcessor` + `AutoModelForObjectDetection` | `closed-vocabulary` | `PekingU/rtdetr_r50vd`, `facebook/detr-resnet-50` |
| `OwlViTProcessor` + `OwlViTForObjectDetection` | `open-vocabulary` | `google/owlvit-base-patch32` |
| `AutoProcessor` + `AutoModelForZeroShotObjectDetection` | `open-vocabulary-grounding` | `openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det` |
| `AutoModel` / any `AutoModelFor*` / timm (hook-based) | `embedding` | `facebook/dinov2-small`, `google/vit-base-patch16-224`, `convnext_tiny.dinov3_lvd1689m` (timm) |
Full list and config examples: docs/supported-models.md.
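As a concrete illustration of API compatibility, any checkpoint that loads with an API's processor and model classes can be swapped in directly; a minimal sketch using the Transformers auto classes:

```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection

# Swapping PekingU/rtdetr_r50vd for facebook/detr-resnet-50 (or any other
# checkpoint these auto classes can load) requires no code changes.
model_id = "PekingU/rtdetr_r50vd"
processor = AutoImageProcessor.from_pretrained(model_id, revision="main")
model = AutoModelForObjectDetection.from_pretrained(model_id, revision="main")
```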
Methods and scripts
| Method | Detection type | Script | Key config |
|---|---|---|---|
| Image classification (ViT, ConvNeXt, etc.) | `image-classification` | `scripts/image_classification.py` | `MODEL_ID`, `REVISION`, `VIDEO_PATH`, optional `TOP_K` in `.env` |
| Closed-vocabulary (DETR, RT-DETR, Conditional DETR) | `closed-vocabulary` | `scripts/object_detection.py` | `MODEL_ID`, `REVISION`, `VIDEO_PATH` in `.env` |
| Open-vocabulary (OWL-ViT) | `open-vocabulary` | `scripts/zero_shot_object_detection.py` | `text_labels` in code; `VIDEO_PATH` in `.env` |
| Open-vocabulary (Grounding DINO) | `open-vocabulary-grounding` | `scripts/grounding_dino.py` | `text_labels` in code; `VIDEO_PATH` in `.env` |
| Embedding extraction (any model) | `embedding` | `scripts/generate_exemplars.py` (offline) | `MODEL_ID`, `REVISION` in `.env` |
Output is written to `frame.data["meta"]` (see Output Structure):

- Object detection (`closed-vocabulary`, `open-vocabulary`, `open-vocabulary-grounding`): `detections` (a list of `{class, rois}` with normalized coordinates) and `detection_confidence`.
- Image classification: only `detection_type`, `task`, `model`, and `classification` (no `detections` or `detection_confidence`).
- Embedding: `embedding` (feature vector) and, optionally, `min_exemplar_distance` (L2 distance to the closest exemplar).
Features
- Supported APIs: Five Hugging Face APIs—image classification, closed-vocabulary object detection, OWL-ViT zero-shot, Grounding DINO, and embedding extraction. Each API supports all Hub models compatible with that API (see table above).
- Detection types: `image-classification`, `closed-vocabulary`, `open-vocabulary`, `open-vocabulary-grounding`, `embedding` via pluggable backends (one backend per API).
- Image classification: Run ViT, ConvNeXt, or any `AutoModelForImageClassification` model with `model_id`, `revision`, `top_k`; output `classification` (label, score).
- Object detection: Run DETR, RT-DETR, etc. with `model_id`, `revision`, `threshold`, `max_detections`; output in `frame.data["meta"]` with `detections` (`{class, rois}`, normalized) and `detection_confidence`.
- Zero-shot detection: OWL-ViT or Grounding DINO with `text_labels` (list of list of str) for open-vocabulary queries.
- Embedding extraction: Extract penultimate-layer feature embeddings from any vision model (classification, detection, or feature extractor). Uses PyTorch forward hooks to capture the last representation before the output head, making it model-agnostic. Supports Hugging Face Transformers and timm via the `model_loader` config option. Optionally computes the minimum L2 distance to exemplar embeddings for similarity-based anomaly detection.
- Standardized output: JSON-serializable payload in `frame.data["meta"]`: object detection writes `detections` and `detection_confidence`; image classification writes only `detection_type`, `task`, `model`, and `classification` (no detections or detection_confidence); embedding writes `embedding` and optionally `min_exemplar_distance` to `frame.data`.
- Visualization: Optional topic (e.g. `viz`) with bounding boxes/labels (detection) or top label + score (classification).
- Frame input: OpenFilter convention (`frame.rw_bgr.image`); fallback to `frame.data[topic]`.
- Device selection: CPU or CUDA.
- Model compatibility: Works with dict and object outputs from processors (e.g. RT-DETR, DETR).
Architecture
The filter follows the OpenFilter pattern with three main stages:
| Stage | Responsibility |
|---|---|
| `setup()` | Parse and validate configuration; resolve the backend by `detection_type`; load processor and model; set device |
| `process()` | Core operation: run backend inference on frame images, attach results, optionally produce a visualization frame |
| `shutdown()` | Clean up resources (unload backend/model) when the filter stops |
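A schematic of the three-stage pattern (a hypothetical skeleton to show the flow, not the filter's actual source; `resolve_backend` is an illustrative helper):

```python
class FilterHuggingfaceVision:
    def setup(self, config):
        # Validate config and resolve one backend per detection_type.
        self.backend = resolve_backend(config.detection_type)  # hypothetical helper
        self.backend.load(config.model_id, config.revision, device=config.device)

    def process(self, frames):
        # Run inference on each frame image and attach results to frame.data["meta"].
        for frame in frames.values():
            results = self.backend.infer(frame.rw_bgr.image)
            frame.data.setdefault("meta", {}).update(results)
        return frames

    def shutdown(self):
        # Release the model so the process can exit cleanly.
        self.backend.unload()
```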
Data Signature
The filter returns processed frames with the following data structure:
Main Frame Data:
- Original frame data is preserved (existing `meta` keys such as `id`, `ts`, `src`, `src_fps` are kept).
- Processing results are added to `frame.data["meta"]`:
  - Object detection: `detections` (list of `{ class, rois }`, normalized to [0, 1]), `detection_confidence`, `detection_type`, `task`, `model`.
  - Image classification: no `detections` nor `detection_confidence`. Only `classification`: `{ classes, confidences, architecture, timestamp, filter_id, model_id, revision, top_k }`, plus `detection_type`, `task`, `model`.

Visualization Topic (when `draw_visualization=True`):

- A separate frame is published on the configured topic (e.g. `viz`).
- The image has bounding boxes and labels drawn; `frame.data["meta"]` preserves upstream meta and includes either the detection fields or `classification` (same shape as the main topic).
Installation
```sh
# Install with development dependencies
make install
```
Configuration
1. Create a `.env` file in the project root (or copy from `env.example` if present).
2. Edit `.env` with your configuration:
```sh
# Required: Hugging Face model id (e.g. PekingU/rtdetr_r50vd)
MODEL_ID=PekingU/rtdetr_r50vd

# Required: Model revision (for reproducibility)
REVISION=main

# Required for pipeline script: path to input video
VIDEO_PATH=./filter_example_video.mp4

# Optional: Detection confidence threshold in [0, 1] (default: 0.3)
THRESHOLD=0.3

# Optional: Visualization (default: false)
DRAW_VISUALIZATION=true

# Optional: Webvis port (default: 8010)
PORT=8010
```
Configuration Matrix
| Variable | Type | Default | Required | Notes |
|---|---|---|---|---|
| `model_id` | string | — | Yes | Hugging Face model id (e.g. `PekingU/rtdetr_r50vd`) |
| `revision` | string | — | Yes | Model revision (reproducibility) |
| `detection_type` | string | `"closed-vocabulary"` | No | `image-classification`, `closed-vocabulary`, `open-vocabulary`, `open-vocabulary-grounding`, or `embedding` |
| `top_k` | int | 5 | No | For `image-classification`: number of top classes to return (1–1000) |
| `text_labels` | list | — | For zero-shot / grounding | List of list of str, e.g. `[["a photo of a cat", "a photo of a dog"]]` |
| `threshold` | float | 0.3 | No | Detection confidence threshold [0, 1] (not used for `image-classification`) |
| `device` | string | `"cpu"` | No | `"cpu"` or `"cuda"` / CUDA device index |
| `max_detections` | int | 100 | No | Maximum number of detections per frame (object detection only) |
| `input_topic` | string | `"main"` | No | Topic to read the frame image from |
| `output_topic` | string | `"main"` | No | Topic for the processed frame |
| `draw_visualization` | bool | false | No | Publish a topic with boxes/labels drawn |
| `visualization_topic` | string | `"viz"` | No | Topic name for the visualization frame |
| `visualization_alpha` | float | 0.7 | No | Overlay alpha (reserved) |
| `visualization_source_topic` | string | — | No | Optional source topic for the visualization image |
| `model_loader` | string | `"transformers"` | No | For `embedding`: `"transformers"` or `"timm"` — how to load the model |
| `exemplar_embeddings_path` | string | — | No | For `embedding`: path to a `.npz` file with reference embeddings |
| `output_embeddings` | bool | true | No | For `embedding`: include the raw embedding vector in frame data |
| `output_distances` | bool | true | No | For `embedding`: include `min_exemplar_distance` (requires exemplars) |
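Putting several of these options together, a representative in-code configuration (values are illustrative; anything omitted falls back to the defaults above):

```python
from filter_huggingface_vision.filter import FilterHuggingfaceVisionConfig

# Illustrative values only.
config = FilterHuggingfaceVisionConfig(
    model_id="PekingU/rtdetr_r50vd",
    revision="main",
    detection_type="closed-vocabulary",
    threshold=0.3,
    max_detections=100,
    device="cpu",
    draw_visualization=True,
    visualization_topic="viz",
)
```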
Usage
Use the script that matches your method (see table above). All scripts run VideoIn → FilterHuggingfaceVision → Webvis and serve the UI at http://localhost:PORT (default 8010).
Image classification pipeline
Run image classification with a ViT, ConvNeXt, or any AutoModelForImageClassification model:
```sh
# In .env: MODEL_ID (e.g. google/vit-base-patch16-224 or facebook/convnext-tiny-224), REVISION=main, VIDEO_PATH, optional TOP_K
python scripts/image_classification.py
```
Output: frame.data["meta"] with detection_type, task, model, and classification (architecture, classes, confidences, etc.). No detections or detection_confidence for classification. Visualization shows the top label + score on the image.
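Downstream code can read these fields directly from the frame; a minimal sketch (assumes `frame` comes from this pipeline):

```python
# Print the top predicted class per frame; the field names match the
# classification example in Output Structure below.
meta = frame.data["meta"]
classes = meta["classification"]["classes"]          # e.g. ["tabby cat", ...]
confidences = meta["classification"]["confidences"]  # parallel list of scores
print(f"top-1: {classes[0]} ({confidences[0]:.2f})")
```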
Closed-vocabulary (object detection pipeline)
Run the pipeline with a fixed-vocabulary model (DETR, RT-DETR, Conditional DETR):
```sh
# Ensure MODEL_ID, REVISION, and VIDEO_PATH are set (e.g. in .env)
python scripts/object_detection.py
```
This will:

- Load video from `VIDEO_PATH`
- Run Hugging Face object detection on each frame (`detection_type=closed-vocabulary`)
- Serve visualization at `http://localhost:8010` (or `PORT`); subscribe to `main` and `viz` when `DRAW_VISUALIZATION` is enabled
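The scripts assemble the pipeline with OpenFilter's `run_multi` pattern; a minimal sketch (module paths for `VideoIn`/`Webvis` are assumed from the OpenFilter runtime and may differ across versions; ports and paths are illustrative):

```python
# Sketch of the VideoIn -> FilterHuggingfaceVision -> Webvis pipeline.
from openfilter.filter_runtime.filter import Filter
from openfilter.filter_runtime.filters.video_in import VideoIn
from openfilter.filter_runtime.filters.webvis import Webvis
from filter_huggingface_vision.filter import FilterHuggingfaceVision

if __name__ == "__main__":
    Filter.run_multi([
        (VideoIn, dict(sources="file://filter_example_video.mp4", outputs="tcp://*:5550")),
        (FilterHuggingfaceVision, dict(
            sources="tcp://localhost:5550",
            outputs="tcp://*:5552",
            model_id="PekingU/rtdetr_r50vd",
            revision="main",
            detection_type="closed-vocabulary",
        )),
        (Webvis, dict(sources="tcp://localhost:5552")),
    ])
```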
Zero-shot object detection (OWL-ViT)
Run the zero-shot script (model and text_labels are set in the script):
```sh
# Set VIDEO_PATH in .env; edit TEXT_LABELS in scripts/zero_shot_object_detection.py if needed
python scripts/zero_shot_object_detection.py
```
Or use the filter with detection_type="open-vocabulary", model google/owlvit-base-patch32, and text_labels (list of list of str):
```python
from filter_huggingface_vision.filter import FilterHuggingfaceVision, FilterHuggingfaceVisionConfig

FilterHuggingfaceVisionConfig(
    ...,
    detection_type="open-vocabulary",
    model_id="google/owlvit-base-patch32",
    revision="main",
    text_labels=[["a photo of a cat", "a photo of a dog"]],
    threshold=0.1,
)
```
Output format is the same: frame.data["meta"] with detections (list of {class, rois} normalized), detection_confidence.
Embedding extraction pipeline
Extract penultimate-layer embeddings from any vision model. Works with classification models, detection models, or pure feature extractors — the backend uses PyTorch forward hooks to capture the last representation before the output head.
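To illustrate the hook-based capture described above, here is a minimal sketch using a Transformers classification model (illustrative only; the actual backend locates the output head generically across architectures):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "google/vit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id).eval()

captured = {}

def grab_head_input(module, inputs, output):
    # inputs[0] is the penultimate-layer representation fed to the head.
    captured["embedding"] = inputs[0].detach()

# ViT checkpoints expose their output head as `.classifier`; other
# architectures name it differently.
model.classifier.register_forward_hook(grab_head_input)

image = Image.new("RGB", (224, 224))  # stand-in for a real video frame
with torch.no_grad():
    model(**processor(images=image, return_tensors="pt"))

embedding = captured["embedding"].squeeze(0)  # penultimate-layer feature vector
```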
```sh
# In .env: MODEL_ID, REVISION, VIDEO_PATH
# For exemplar distance: also set EXEMPLAR_EMBEDDINGS_PATH
```
Or in code:
```python
FilterHuggingfaceVisionConfig(
    detection_type="embedding",
    model_id="facebook/dinov2-small",
    revision="main",
    model_loader="transformers",  # or "timm" for timm models
    exemplar_embeddings_path="./exemplars.npz",  # optional
)
```
Output: frame.data["embedding"] (feature vector) and optionally frame.data["min_exemplar_distance"]. Metadata in frame.data["meta"] with detection_type, task, model.
Generating exemplar embeddings:
```sh
# Set in .env: MODEL_ID, REVISION, IMAGE_DIR (directory of reference images)
python scripts/generate_exemplars.py
# Outputs: exemplars.npz (default, or set OUTPUT_PATH)
```
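For reference, `min_exemplar_distance` is the minimum L2 distance over the stored exemplars; a sketch of the computation (the `.npz` key name `embeddings` is an assumption, not confirmed by the docs):

```python
import numpy as np

# Assumption: the .npz written by generate_exemplars.py holds an array of
# exemplar embeddings under the (hypothetical) key "embeddings".
exemplars = np.load("exemplars.npz")["embeddings"]  # shape (n_exemplars, dim)
embedding = np.asarray(frame.data["embedding"])     # vector emitted by the filter

# Minimum L2 distance to any exemplar, as reported in min_exemplar_distance.
min_dist = float(np.linalg.norm(exemplars - embedding, axis=1).min())
```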
Grounding DINO pipeline
Run open-vocabulary detection with Grounding DINO (model fixed in script; only VIDEO_PATH required in .env):
```sh
# Set VIDEO_PATH in .env (e.g. VIDEO_PATH=./filter_example_video.mp4)
python scripts/grounding_dino.py
```
See docs/supported-models.md for supported Grounding DINO model IDs and config examples.
Using Makefile
```sh
# Run with default pipeline (from Makefile PIPELINE)
make run

# Run unit tests
make test

# Run tests with coverage
make test-coverage
```
Visualization
When draw_visualization=True, the filter publishes an additional frame on the visualization topic (e.g. viz): bounding boxes and labels for object detection, or top label + score for image classification. Webvis subscribes to both main and viz so you can view results overlaid on the video.
Output Structure
All results are written to frame.data["meta"]. Upstream keys (id, ts, src, src_fps) are preserved; the filter adds or updates:
| Field | Type | Description |
|---|---|---|
| `detections` | list | Object detection only. Each item: `{ "class": "<label>", "rois": [[xmin, ymin, xmax, ymax]] }` with coordinates normalized to [0, 1]. Not set for `image-classification`. |
| `detection_confidence` | float | Object detection only. Mean of detection scores. Not set for `image-classification`. |
| `detection_type` | string | Method used: `closed-vocabulary`, `open-vocabulary`, `open-vocabulary-grounding`, or `image-classification`. |
| `task` | string | `object-detection`, `zero-shot-object-detection`, or `image-classification`. |
| `model` | object | `{ "id": "<model_id>", "revision": "<revision>" }` (Hugging Face model). |
| `classification` | object | Image classification only. `{ "classes", "confidences", "architecture", "timestamp", "filter_id", "model_id", "revision", "top_k" }`. Classification output has no `detections` nor `detection_confidence`. |

Embedding output is written to `frame.data` (not nested under `meta`):

| Field | Type | Description |
|---|---|---|
| `embedding` | list[float] | Feature vector from the penultimate layer. Dimensionality depends on the model. |
| `min_exemplar_distance` | float | Only when exemplars are loaded. L2 distance to the closest exemplar embedding. |
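Because `rois` are normalized, consumers must scale them back to pixel coordinates; a minimal sketch (assumes `frame` comes from this pipeline):

```python
# Convert each normalized ROI [xmin, ymin, xmax, ymax] to pixel coordinates.
height, width = frame.rw_bgr.image.shape[:2]  # BGR image per the OpenFilter convention
for det in frame.data["meta"]["detections"]:
    for xmin, ymin, xmax, ymax in det["rois"]:
        box_px = (int(xmin * width), int(ymin * height),
                  int(xmax * width), int(ymax * height))
        print(det["class"], box_px)
```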
Object detection example (frame.data["meta"]):
```json
{
  "id": 38,
  "ts": 1761090922.42,
  "src": "file:///path/to/video.mp4",
  "src_fps": 25.0,
  "detections": [
    { "class": "person", "rois": [[0.12, 0.19, 0.35, 0.46]] }
  ],
  "detection_confidence": 0.95,
  "detection_type": "closed-vocabulary",
  "task": "object-detection",
  "model": { "id": "PekingU/rtdetr_r50vd", "revision": "main" }
}
```
Image classification (frame.data["meta"]):
```json
{
  "id": 38,
  "ts": 1761090922.42,
  "src": "file:///path/to/video.mp4",
  "src_fps": 25.0,
  "detection_type": "image-classification",
  "task": "image-classification",
  "model": { "id": "facebook/convnext-tiny-224", "revision": "main" },
  "classification": {
    "classes": ["tabby cat", "Egyptian cat"],
    "confidences": [0.42, 0.31],
    "architecture": "huggingface",
    "timestamp": 1761090922.42,
    "filter_id": "filter_huggingface_vision",
    "model_id": "facebook/convnext-tiny-224",
    "revision": "main",
    "top_k": 5
  }
}
```
Embedding (frame.data):
```json
{
  "meta": {
    "id": 38,
    "ts": 1761090922.42,
    "detection_type": "embedding",
    "task": "embedding",
    "model": { "id": "facebook/dinov2-small", "revision": "main" }
  },
  "embedding": [0.0123, -0.0456, 0.0789, "..."],
  "min_exemplar_distance": 0.42
}
```
Development
Project Structure
```
filter-huggingface-vision/
├── filter_huggingface_vision/
│   ├── filter.py            # Main filter implementation
│   └── backends/            # One backend per HF API (image_classification, object_detection, owlvit, grounding_dino, embedding)
├── scripts/
│   ├── image_classification.py
│   ├── object_detection.py
│   ├── zero_shot_object_detection.py
│   ├── grounding_dino.py
│   └── generate_exemplars.py  # Offline: generate exemplar embeddings from reference images
├── docs/
│   ├── overview.md
│   ├── object-detection.md
│   └── supported-models.md
├── tests/
└── pyproject.toml
```
Key Dependencies
- `openfilter[all]>=0.1.21` - Filter framework
- `transformers>=4.40.0` - Hugging Face APIs (`AutoImageProcessor` + `AutoModelForImageClassification` / `AutoModelForObjectDetection`, OwlViT, `AutoModelForZeroShotObjectDetection`)
- `torch` - Inference
- `pillow` - Image handling
- `huggingface-hub` - Model loading
- `python-dotenv` - Environment configuration
Testing
```sh
make test
make test-coverage
```
Troubleshooting
Model or revision errors
- Ensure `MODEL_ID` and `REVISION` are set. The model must be compatible with the API for your `detection_type`: e.g. for `image-classification` use a model that loads with `AutoModelForImageClassification` (ViT, ConvNeXt); for `closed-vocabulary` use `AutoModelForObjectDetection` (RT-DETR, DETR). See Supported Hugging Face APIs and docs/supported-models.md.
- Use a specific revision (e.g. `main` or a commit hash) for reproducibility.
CUDA / device
- Set `device` to `"cpu"` if no GPU is available (see the snippet after this list).
- For GPU, use `device="cuda"` or `device=0` (and ensure your PyTorch build has CUDA support).
- Official Docker image (`linux/amd64`): the published `plainsightai/openfilter-huggingface-vision` image installs PyTorch CUDA 12.8 (2.9.1+cu128) with a matching `torchvision` and a pip constraint so later dependency resolution cannot bump the stack to CUDA 13. A local `pip install` from PyPI uses whatever CPU/CUDA wheels you choose; only the Docker build pins CUDA.
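A quick way to pick a device defensively (a standard PyTorch check, not specific to this filter):

```python
import torch

# Fall back to CPU when CUDA is unavailable, then pass the result as `device`.
device = "cuda" if torch.cuda.is_available() else "cpu"
```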
No detections in frame
- Check that the input frame provides an image via `frame.rw_bgr.image` or `frame.data[input_topic]`.
- Lower `threshold` (e.g. 0.2) to see more detections; raise it for fewer false positives.
Visualization not showing
- Set `draw_visualization=True` in the filter config.
- Ensure Webvis (or your client) subscribes to both the main topic and the visualization topic (e.g. `viz`).
Documentation
For more detail, pipeline examples, variable reference, and supported model IDs per method, see the docs/ directory (`overview.md`, `object-detection.md`, `supported-models.md`).
License
See LICENSE file for details.