Hugging Face Vision
A generic filter that uses Hugging Face Transformers for vision (object detection, image classification, and embedding extraction) across video streams and OpenFilter pipelines. The filter uses one backend per Hugging Face API: each `detection_type` maps to a specific processor + model API. Each API supports all models on the Hugging Face Hub that are compatible with it; any model loadable by the same classes will work without code changes.
Supported Hugging Face APIs
We support the following Hugging Face APIs. Each API corresponds to one `detection_type` and accepts any Hub model that works with that API (the examples below are commonly used or tested).
| HF API (processor + model) | detection_type | Example model IDs |
|---|---|---|
| `AutoImageProcessor` + `AutoModelForImageClassification` | `image-classification` | `google/vit-base-patch16-224`, `facebook/convnext-tiny-224` |
| `AutoImageProcessor` + `AutoModelForObjectDetection` | `closed-vocabulary` | `PekingU/rtdetr_r50vd`, `facebook/detr-resnet-50` |
| `OwlViTProcessor` + `OwlViTForObjectDetection` | `open-vocabulary` | `google/owlvit-base-patch32` |
| `AutoProcessor` + `AutoModelForZeroShotObjectDetection` | `open-vocabulary-grounding` | `openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det` |
| `AutoModel` / any `AutoModelFor*` / timm (hook-based) | `embedding` | `facebook/dinov2-small`, `google/vit-base-patch16-224`, `convnext_tiny.dinov3_lvd1689m` (timm) |
Full list and config examples: docs/supported-models.md.
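As a concrete illustration of API compatibility, any checkpoint that loads with an API's processor and model classes can be swapped in directly; a minimal sketch using the Transformers auto classes:

```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection

# Swapping PekingU/rtdetr_r50vd for facebook/detr-resnet-50 (or any other
# checkpoint these auto classes can load) requires no code changes.
model_id = "PekingU/rtdetr_r50vd"
processor = AutoImageProcessor.from_pretrained(model_id, revision="main")
model = AutoModelForObjectDetection.from_pretrained(model_id, revision="main")
```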
Methods and scripts
| Method | Detection type | Script | Key config |
|---|---|---|---|
| Image classification (ViT, ConvNeXt, etc.) | `image-classification` | `scripts/image_classification.py` | `MODEL_ID`, `REVISION`, `VIDEO_PATH`, optional `TOP_K` in `.env` |
| Closed-vocabulary (DETR, RT-DETR, Conditional DETR) | `closed-vocabulary` | `scripts/object_detection.py` | `MODEL_ID`, `REVISION`, `VIDEO_PATH` in `.env` |
| Open-vocabulary (OWL-ViT) | `open-vocabulary` | `scripts/zero_shot_object_detection.py` | `text_labels` in code; `VIDEO_PATH` in `.env` |
| Open-vocabulary (Grounding DINO) | `open-vocabulary-grounding` | `scripts/grounding_dino.py` | `text_labels` in code; `VIDEO_PATH` in `.env` |
| Embedding extraction (any model) | `embedding` | `scripts/generate_exemplars.py` (offline) | `MODEL_ID`, `REVISION` in `.env` |
Output is written to `frame.data["meta"]` (see Output Structure):

- Object detection (`closed-vocabulary`, `open-vocabulary`, `open-vocabulary-grounding`): `detections` (a list of `{class, rois}` with normalized coordinates) and `detection_confidence`.
- Image classification: only `detection_type`, `task`, `model`, and `classification` (no `detections` or `detection_confidence`).
- Embedding: `embedding` (feature vector) and, optionally, `min_exemplar_distance` (L2 distance to the closest exemplar).
Features
- Supported APIs: Five Hugging Face APIs—image classification, closed-vocabulary object detection, OWL-ViT zero-shot, Grounding DINO, and embedding extraction. Each API supports all Hub models compatible with that API (see table above).
- Detection types: `image-classification`, `closed-vocabulary`, `open-vocabulary`, `open-vocabulary-grounding`, `embedding` via pluggable backends (one backend per API).
- Image classification: Run ViT, ConvNeXt, or any `AutoModelForImageClassification` model with `model_id`, `revision`, `top_k`; output `classification` (label, score).
- Object detection: Run DETR, RT-DETR, etc. with `model_id`, `revision`, `threshold`, `max_detections`; output in `frame.data["meta"]` with `detections` (`{class, rois}`, normalized) and `detection_confidence`.
- Zero-shot detection: OWL-ViT or Grounding DINO with `text_labels` (list of list of str) for open-vocabulary queries.
- Embedding extraction: Extract penultimate-layer feature embeddings from any vision model (classification, detection, or feature extractor). Uses PyTorch forward hooks to capture the last representation before the output head, making it model-agnostic. Supports Hugging Face Transformers and timm via the `model_loader` config option. Optionally computes the minimum L2 distance to exemplar embeddings for similarity-based anomaly detection.
- Standardized output: JSON-serializable payload in `frame.data["meta"]`: object detection writes `detections` and `detection_confidence`; image classification writes only `detection_type`, `task`, `model`, and `classification` (no detections or detection_confidence); embedding writes `embedding` and optionally `min_exemplar_distance` to `frame.data`.
- Visualization: Optional topic (e.g. `viz`) with bounding boxes/labels (detection) or top label + score (classification).
- Frame input: OpenFilter convention (`frame.rw_bgr.image`); fallback to `frame.data[topic]`.
- Device selection: CPU or CUDA.
- Model compatibility: Works with dict and object outputs from processors (e.g. RT-DETR, DETR).
Architecture
The filter follows the OpenFilter pattern with three main stages:
| Stage | Responsibility |
|---|---|
| `setup()` | Parse and validate configuration; resolve the backend by `detection_type`; load processor and model; set device |
| `process()` | Core operation: run backend inference on frame images, attach results, optionally produce a visualization frame |
| `shutdown()` | Clean up resources (unload backend/model) when the filter stops |
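A schematic of the three-stage pattern (a hypothetical skeleton to show the flow, not the filter's actual source; `resolve_backend` is an illustrative helper):

```python
class FilterHuggingfaceVision:
    def setup(self, config):
        # Validate config and resolve one backend per detection_type.
        self.backend = resolve_backend(config.detection_type)  # hypothetical helper
        self.backend.load(config.model_id, config.revision, device=config.device)

    def process(self, frames):
        # Run inference on each frame image and attach results to frame.data["meta"].
        for frame in frames.values():
            results = self.backend.infer(frame.rw_bgr.image)
            frame.data.setdefault("meta", {}).update(results)
        return frames

    def shutdown(self):
        # Release the model so the process can exit cleanly.
        self.backend.unload()
```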
Data Signature
The filter returns processed frames with the following data structure:
Main Frame Data:
- Original frame data is preserved (existing `meta` keys such as `id`, `ts`, `src`, `src_fps` are kept).
- Processing results are added to `frame.data["meta"]`:
  - Object detection: `detections` (list of `{ class, rois }`, normalized to [0, 1]), `detection_confidence`, `detection_type`, `task`, `model`.
  - Image classification: no `detections` nor `detection_confidence`. Only `classification`: `{ classes, confidences, architecture, timestamp, filter_id, model_id, revision, top_k }`, plus `detection_type`, `task`, `model`.

Visualization Topic (when `draw_visualization=True`):

- A separate frame is published on the configured topic (e.g. `viz`).
- The image has bounding boxes and labels drawn; `frame.data["meta"]` preserves upstream meta and includes either the detection fields or `classification` (same shape as the main topic).
Installation
```sh
# Install with development dependencies
make install
```
Configuration
1. Create a `.env` file in the project root (or copy from `env.example` if present).
2. Edit `.env` with your configuration:
```sh
# Required: Hugging Face model id (e.g. PekingU/rtdetr_r50vd)
MODEL_ID=PekingU/rtdetr_r50vd

# Required: Model revision (for reproducibility)
REVISION=main

# Required for pipeline script: path to input video
VIDEO_PATH=./filter_example_video.mp4

# Optional: Detection confidence threshold in [0, 1] (default: 0.3)
THRESHOLD=0.3

# Optional: Visualization (default: false)
DRAW_VISUALIZATION=true

# Optional: Webvis port (default: 8010)
PORT=8010
```
Configuration Matrix
| Variable | Type | Default | Required | Notes |
|---|---|---|---|---|
| `model_id` | string | — | Yes | Hugging Face model id (e.g. `PekingU/rtdetr_r50vd`) |
| `revision` | string | — | Yes | Model revision (reproducibility) |
| `detection_type` | string | `"closed-vocabulary"` | No | `image-classification`, `closed-vocabulary`, `open-vocabulary`, `open-vocabulary-grounding`, or `embedding` |
| `top_k` | int | 5 | No | For `image-classification`: number of top classes to return (1–1000) |
| `text_labels` | list | — | For zero-shot / grounding | List of list of str, e.g. `[["a photo of a cat", "a photo of a dog"]]` |
| `threshold` | float | 0.3 | No | Detection confidence threshold [0, 1] (not used for `image-classification`) |
| `device` | string | `"cpu"` | No | `"cpu"` or `"cuda"` / CUDA device index |
| `max_detections` | int | 100 | No | Maximum number of detections per frame (object detection only) |
| `input_topic` | string | `"main"` | No | Topic to read the frame image from |
| `output_topic` | string | `"main"` | No | Topic for the processed frame |
| `draw_visualization` | bool | false | No | Publish a topic with boxes/labels drawn |
| `visualization_topic` | string | `"viz"` | No | Topic name for the visualization frame |
| `visualization_alpha` | float | 0.7 | No | Overlay alpha (reserved) |
| `visualization_source_topic` | string | — | No | Optional source topic for the visualization image |
| `model_loader` | string | `"transformers"` | No | For `embedding`: `"transformers"` or `"timm"` — how to load the model |
| `exemplar_embeddings_path` | string | — | No | For `embedding`: path to a `.npz` file with reference embeddings |
| `output_embeddings` | bool | true | No | For `embedding`: include the raw embedding vector in frame data |
| `output_distances` | bool | true | No | For `embedding`: include `min_exemplar_distance` (requires exemplars) |
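Putting several of these options together, a representative in-code configuration (values are illustrative; anything omitted falls back to the defaults above):

```python
from filter_huggingface_vision.filter import FilterHuggingfaceVisionConfig

# Illustrative values only.
config = FilterHuggingfaceVisionConfig(
    model_id="PekingU/rtdetr_r50vd",
    revision="main",
    detection_type="closed-vocabulary",
    threshold=0.3,
    max_detections=100,
    device="cpu",
    draw_visualization=True,
    visualization_topic="viz",
)
```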
Usage
Use the script that matches your method (see table above). All scripts run VideoIn → FilterHuggingfaceVision → Webvis and serve the UI at http://localhost:PORT (default 8010).
Image classification pipeline
Run image classification with a ViT, ConvNeXt, or any AutoModelForImageClassification model:
```sh
# In .env: MODEL_ID (e.g. google/vit-base-patch16-224 or facebook/convnext-tiny-224), REVISION=main, VIDEO_PATH, optional TOP_K
python scripts/image_classification.py
```
Output: frame.data["meta"] with detection_type, task, model, and classification (architecture, classes, confidences, etc.). No detections or detection_confidence for classification. Visualization shows the top label + score on the image.
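Downstream code can read these fields directly from the frame; a minimal sketch (assumes `frame` comes from this pipeline):

```python
# Print the top predicted class per frame; the field names match the
# classification example in Output Structure below.
meta = frame.data["meta"]
classes = meta["classification"]["classes"]          # e.g. ["tabby cat", ...]
confidences = meta["classification"]["confidences"]  # parallel list of scores
print(f"top-1: {classes[0]} ({confidences[0]:.2f})")
```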
Closed-vocabulary (object detection pipeline)
Run the pipeline with a fixed-vocabulary model (DETR, RT-DETR, Conditional DETR):
```sh
# Ensure MODEL_ID, REVISION, and VIDEO_PATH are set (e.g. in .env)
python scripts/object_detection.py
```
This will:

- Load video from `VIDEO_PATH`
- Run Hugging Face object detection on each frame (`detection_type=closed-vocabulary`)
- Serve visualization at `http://localhost:8010` (or `PORT`); subscribe to `main` and `viz` when `DRAW_VISUALIZATION` is enabled
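The scripts assemble the pipeline with OpenFilter's `run_multi` pattern; a minimal sketch (module paths for `VideoIn`/`Webvis` are assumed from the OpenFilter runtime and may differ across versions; ports and paths are illustrative):

```python
# Sketch of the VideoIn -> FilterHuggingfaceVision -> Webvis pipeline.
from openfilter.filter_runtime.filter import Filter
from openfilter.filter_runtime.filters.video_in import VideoIn
from openfilter.filter_runtime.filters.webvis import Webvis
from filter_huggingface_vision.filter import FilterHuggingfaceVision

if __name__ == "__main__":
    Filter.run_multi([
        (VideoIn, dict(sources="file://filter_example_video.mp4", outputs="tcp://*:5550")),
        (FilterHuggingfaceVision, dict(
            sources="tcp://localhost:5550",
            outputs="tcp://*:5552",
            model_id="PekingU/rtdetr_r50vd",
            revision="main",
            detection_type="closed-vocabulary",
        )),
        (Webvis, dict(sources="tcp://localhost:5552")),
    ])
```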
Zero-shot object detection (OWL-ViT)
Run the zero-shot script (model and text_labels are set in the script):
```sh
# Set VIDEO_PATH in .env; edit TEXT_LABELS in scripts/zero_shot_object_detection.py if needed
python scripts/zero_shot_object_detection.py
```
Or use the filter with detection_type="open-vocabulary", model google/owlvit-base-patch32, and text_labels (list of list of str):
```python
from filter_huggingface_vision.filter import FilterHuggingfaceVision, FilterHuggingfaceVisionConfig

FilterHuggingfaceVisionConfig(
    ...,
    detection_type="open-vocabulary",
    model_id="google/owlvit-base-patch32",
    revision="main",
    text_labels=[["a photo of a cat", "a photo of a dog"]],
    threshold=0.1,
)
```
Output format is the same: frame.data["meta"] with detections (list of {class, rois} normalized), detection_confidence.
Embedding extraction pipeline
Extract penultimate-layer embeddings from any vision model. Works with classification models, detection models, or pure feature extractors — the backend uses PyTorch forward hooks to capture the last representation before the output head.
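To illustrate the hook-based capture described above, here is a minimal sketch using a Transformers classification model (illustrative only; the actual backend locates the output head generically across architectures):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "google/vit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id).eval()

captured = {}

def grab_head_input(module, inputs, output):
    # inputs[0] is the penultimate-layer representation fed to the head.
    captured["embedding"] = inputs[0].detach()

# ViT checkpoints expose their output head as `.classifier`; other
# architectures name it differently.
model.classifier.register_forward_hook(grab_head_input)

image = Image.new("RGB", (224, 224))  # stand-in for a real video frame
with torch.no_grad():
    model(**processor(images=image, return_tensors="pt"))

embedding = captured["embedding"].squeeze(0)  # penultimate-layer feature vector
```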
```sh
# In .env: MODEL_ID, REVISION, VIDEO_PATH
# For exemplar distance: also set EXEMPLAR_EMBEDDINGS_PATH
```
Or in code:
```python
FilterHuggingfaceVisionConfig(
    detection_type="embedding",
    model_id="facebook/dinov2-small",
    revision="main",
    model_loader="transformers",  # or "timm" for timm models
    exemplar_embeddings_path="./exemplars.npz",  # optional
)
```
Output: frame.data["embedding"] (feature vector) and optionally frame.data["min_exemplar_distance"]. Metadata in frame.data["meta"] with detection_type, task, model.
Generating exemplar embeddings:
```sh
# Set in .env: MODEL_ID, REVISION, IMAGE_DIR (directory of reference images)
python scripts/generate_exemplars.py
# Outputs: exemplars.npz (default, or set OUTPUT_PATH)
```
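For reference, `min_exemplar_distance` is the minimum L2 distance over the stored exemplars; a sketch of the computation (the `.npz` key name `embeddings` is an assumption, not confirmed by the docs):

```python
import numpy as np

# Assumption: the .npz written by generate_exemplars.py holds an array of
# exemplar embeddings under the (hypothetical) key "embeddings".
exemplars = np.load("exemplars.npz")["embeddings"]  # shape (n_exemplars, dim)
embedding = np.asarray(frame.data["embedding"])     # vector emitted by the filter

# Minimum L2 distance to any exemplar, as reported in min_exemplar_distance.
min_dist = float(np.linalg.norm(exemplars - embedding, axis=1).min())
```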
Grounding DINO pipeline
Run open-vocabulary detection with Grounding DINO (model fixed in script; only VIDEO_PATH required in .env):
```sh
# Set VIDEO_PATH in .env (e.g. VIDEO_PATH=./filter_example_video.mp4)
python scripts/grounding_dino.py
```
See docs/supported-models.md for supported Grounding DINO model IDs and config examples.
Using Makefile
```sh
# Run with default pipeline (from Makefile PIPELINE)
make run

# Run unit tests
make test

# Run tests with coverage
make test-coverage
```
Visualization
When draw_visualization=True, the filter publishes an additional frame on the visualization topic (e.g. viz): bounding boxes and labels for object detection, or top label + score for image classification. Webvis subscribes to both main and viz so you can view results overlaid on the video.
Output Structure
All results are written to frame.data["meta"]. Upstream keys (id, ts, src, src_fps) are preserved; the filter adds or updates:
| Field | Type | Description |
|---|---|---|
| `detections` | list | Object detection only. Each item: `{ "class": "<label>", "rois": [[xmin, ymin, xmax, ymax]] }` with coordinates normalized to [0, 1]. Not set for `image-classification`. |
| `detection_confidence` | float | Object detection only. Mean of detection scores. Not set for `image-classification`. |
| `detection_type` | string | Method used: `closed-vocabulary`, `open-vocabulary`, `open-vocabulary-grounding`, or `image-classification`. |
| `task` | string | `object-detection`, `zero-shot-object-detection`, or `image-classification`. |
| `model` | object | `{ "id": "<model_id>", "revision": "<revision>" }` (Hugging Face model). |
| `classification` | object | Image classification only. `{ "classes", "confidences", "architecture", "timestamp", "filter_id", "model_id", "revision", "top_k" }`. Classification output has no `detections` nor `detection_confidence`. |

Embedding output is written to `frame.data` (not nested under `meta`):

| Field | Type | Description |
|---|---|---|
| `embedding` | list[float] | Feature vector from the penultimate layer. Dimensionality depends on the model. |
| `min_exemplar_distance` | float | Only when exemplars are loaded. L2 distance to the closest exemplar embedding. |
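Because `rois` are normalized, consumers must scale them back to pixel coordinates; a minimal sketch (assumes `frame` comes from this pipeline):

```python
# Convert each normalized ROI [xmin, ymin, xmax, ymax] to pixel coordinates.
height, width = frame.rw_bgr.image.shape[:2]  # BGR image per the OpenFilter convention
for det in frame.data["meta"]["detections"]:
    for xmin, ymin, xmax, ymax in det["rois"]:
        box_px = (int(xmin * width), int(ymin * height),
                  int(xmax * width), int(ymax * height))
        print(det["class"], box_px)
```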
Object detection example (frame.data["meta"]):
```json
{
  "id": 38,
  "ts": 1761090922.42,
  "src": "file:///path/to/video.mp4",
  "src_fps": 25.0,
  "detections": [
    { "class": "person", "rois": [[0.12, 0.19, 0.35, 0.46]] }
  ],
  "detection_confidence": 0.95,
  "detection_type": "closed-vocabulary",
  "task": "object-detection",
  "model": { "id": "PekingU/rtdetr_r50vd", "revision": "main" }
}
```
Image classification (frame.data["meta"]):
```json
{
  "id": 38,
  "ts": 1761090922.42,
  "src": "file:///path/to/video.mp4",
  "src_fps": 25.0,
  "detection_type": "image-classification",
  "task": "image-classification",
  "model": { "id": "facebook/convnext-tiny-224", "revision": "main" },
  "classification": {
    "classes": ["tabby cat", "Egyptian cat"],
    "confidences": [0.42, 0.31],
    "architecture": "huggingface",
    "timestamp": 1761090922.42,
    "filter_id": "filter_huggingface_vision",
    "model_id": "facebook/convnext-tiny-224",
    "revision": "main",
    "top_k": 5
  }
}
```
Embedding (frame.data):
```json
{
  "meta": {
    "id": 38,
    "ts": 1761090922.42,
    "detection_type": "embedding",
    "task": "embedding",
    "model": { "id": "facebook/dinov2-small", "revision": "main" }
  },
  "embedding": [0.0123, -0.0456, 0.0789, "..."],
  "min_exemplar_distance": 0.42
}
```
Development
Project Structure
```
filter-huggingface-vision/
├── filter_huggingface_vision/
│   ├── filter.py            # Main filter implementation
│   └── backends/            # One backend per HF API (image_classification, object_detection, owlvit, grounding_dino, embedding)
├── scripts/
│   ├── image_classification.py
│   ├── object_detection.py
│   ├── zero_shot_object_detection.py
│   ├── grounding_dino.py
│   └── generate_exemplars.py  # Offline: generate exemplar embeddings from reference images
├── docs/
│   ├── overview.md
│   ├── object-detection.md
│   └── supported-models.md
├── tests/
└── pyproject.toml
```
Key Dependencies
- `openfilter[all]>=0.1.21` - Filter framework
- `transformers>=4.40.0` - Hugging Face APIs (`AutoImageProcessor` + `AutoModelForImageClassification` / `AutoModelForObjectDetection`, OwlViT, `AutoModelForZeroShotObjectDetection`)
- `torch` - Inference
- `pillow` - Image handling
- `huggingface-hub` - Model loading
- `python-dotenv` - Environment configuration
Testing
```sh
make test
make test-coverage
```
Troubleshooting
Model or revision errors
- Ensure `MODEL_ID` and `REVISION` are set. The model must be compatible with the API for your `detection_type`: e.g. for `image-classification` use a model that loads with `AutoModelForImageClassification` (ViT, ConvNeXt); for `closed-vocabulary` use `AutoModelForObjectDetection` (RT-DETR, DETR). See Supported Hugging Face APIs and docs/supported-models.md.
- Use a specific revision (e.g. `main` or a commit hash) for reproducibility.
CUDA / device
- Set `device` to `"cpu"` if no GPU is available (see the snippet after this list).
- For GPU, use `device="cuda"` or `device=0` (and ensure your PyTorch build has CUDA support).
- Official Docker image (`linux/amd64`): the published `plainsightai/openfilter-huggingface-vision` image installs PyTorch CUDA 12.8 (2.9.1+cu128) with a matching `torchvision` and a pip constraint so later dependency resolution cannot bump the stack to CUDA 13. A local `pip install` from PyPI uses whatever CPU/CUDA wheels you choose; only the Docker build pins CUDA.
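A quick way to pick a device defensively (a standard PyTorch check, not specific to this filter):

```python
import torch

# Fall back to CPU when CUDA is unavailable, then pass the result as `device`.
device = "cuda" if torch.cuda.is_available() else "cpu"
```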
No detections in frame
- Check that the input frame provides an image via `frame.rw_bgr.image` or `frame.data[input_topic]`.
- Lower `threshold` (e.g. 0.2) to see more detections; raise it for fewer false positives.
Visualization not showing
- Set `draw_visualization=True` in the filter config.
- Ensure Webvis (or your client) subscribes to both the main topic and the visualization topic (e.g. `viz`).
Documentation
For more detail, pipeline examples, variable reference, and supported model IDs per method, see the docs/ directory (`overview.md`, `object-detection.md`, `supported-models.md`).
License
See LICENSE file for details.