Hugging Face Vision


A generic filter that uses Hugging Face Transformers for vision (object detection and image classification) across video streams and OpenFilter pipelines. The filter uses one backend per Hugging Face API: each detection_type maps to a specific processor + model API, and any Hugging Face Hub model loadable by those same classes works without code changes.

Supported Hugging Face APIs

We support the following Hugging Face APIs. Each API corresponds to one detection_type; each API supports any model from the Hub that works with that API (examples below are commonly used / tested).

| HF API (processor + model) | detection_type | Example model IDs |
| --- | --- | --- |
| AutoImageProcessor + AutoModelForImageClassification | image-classification | google/vit-base-patch16-224, facebook/convnext-tiny-224 |
| AutoImageProcessor + AutoModelForObjectDetection | closed-vocabulary | PekingU/rtdetr_r50vd, facebook/detr-resnet-50 |
| OwlViTProcessor + OwlViTForObjectDetection | open-vocabulary | google/owlvit-base-patch32 |
| AutoProcessor + AutoModelForZeroShotObjectDetection | open-vocabulary-grounding | openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det |

Full list and config examples: docs/supported-models.md.
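
For illustration, the two configs below pair a detection_type with a compatible Hub model from the table above. This is a minimal sketch; any standard OpenFilter wiring fields the config may need (id, sources, outputs, etc.) are omitted here:

from filter_huggingface_vision.filter import FilterHuggingfaceVisionConfig

# Closed-vocabulary detection: any model loadable by AutoModelForObjectDetection
FilterHuggingfaceVisionConfig(
    detection_type="closed-vocabulary",
    model_id="facebook/detr-resnet-50",
    revision="main",
)

# Image classification: any model loadable by AutoModelForImageClassification
FilterHuggingfaceVisionConfig(
    detection_type="image-classification",
    model_id="google/vit-base-patch16-224",
    revision="main",
    top_k=5,
)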

Methods and scripts

| Method | Detection type | Script | Key config |
| --- | --- | --- | --- |
| Image classification (ViT, ConvNeXt, etc.) | image-classification | scripts/image_classification.py | MODEL_ID, REVISION, VIDEO_PATH, optional TOP_K in .env |
| Closed-vocabulary (DETR, RT-DETR, Conditional DETR) | closed-vocabulary | scripts/object_detection.py | MODEL_ID, REVISION, VIDEO_PATH in .env |
| Open-vocabulary (OWL-ViT) | open-vocabulary | scripts/zero_shot_object_detection.py | text_labels in code; VIDEO_PATH in .env |
| Open-vocabulary (Grounding DINO) | open-vocabulary-grounding | scripts/grounding_dino.py | text_labels in code; VIDEO_PATH in .env |

Output is written to frame.data["meta"] (see Output Structure). Object detection (closed-vocabulary, open-vocabulary, open-vocabulary-grounding) writes detections (a list of {class, rois} with normalized coordinates) and detection_confidence; image classification writes only detection_type, task, model, and classification (no detections or detection_confidence).

Features

  • Supported APIs: Four Hugging Face APIs—image classification, closed-vocabulary object detection, OWL-ViT zero-shot, Grounding DINO. Each API supports all Hub models compatible with that API (see table above).
  • Detection types: image-classification, closed-vocabulary, open-vocabulary, open-vocabulary-grounding via pluggable backends (one backend per API).
  • Image classification: Run ViT, ConvNeXt, or any AutoModelForImageClassification model with model_id, revision, top_k; output classifications (label, score).
  • Object detection: Run DETR, RT-DETR, etc. with model_id, revision, threshold, max_detections; output in frame.data["meta"] with detections ({class, rois} normalized), detection_confidence.
  • Zero-shot detection: OWL-ViT or Grounding DINO with text_labels (list of list of str) for open-vocabulary queries.
  • Standardized output: JSON-serializable payload in frame.data["meta"]: object detection writes detections, detection_confidence; image classification writes only detection_type, task, model, and classification (no detections or detection_confidence).
  • Visualization: Optional topic (e.g. viz) with bounding boxes/labels (detection) or top label + score (classification).
  • Frame input: OpenFilter convention (frame.rw_bgr.image); fallback to frame.data[topic].
  • Device selection: CPU or CUDA.
  • Model compatibility: Works with dict and object outputs from processors (e.g. RT-DETR, DETR).

Architecture

The filter follows the OpenFilter pattern with three main stages:

Stage Responsibilities

| Stage | Responsibility |
| --- | --- |
| setup() | Parse and validate configuration; resolve the backend by detection_type; load the processor and model; set the device |
| process() | Run backend inference on frame images, attach results, and optionally produce a visualization frame |
| shutdown() | Clean up resources (unload the backend/model) when the filter stops |
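
To make the setup() stage concrete, here is an illustrative sketch of backend resolution. The transformers classes are the real APIs from the table above; the BACKENDS mapping and the load_backend function are assumptions for illustration, not the filter's actual code:

from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    AutoModelForObjectDetection,
)

# Hypothetical mapping from detection_type to processor/model classes
BACKENDS = {
    "image-classification": (AutoImageProcessor, AutoModelForImageClassification),
    "closed-vocabulary": (AutoImageProcessor, AutoModelForObjectDetection),
}

def load_backend(detection_type, model_id, revision, device="cpu"):
    """Resolve and load the processor/model pair for a detection_type."""
    processor_cls, model_cls = BACKENDS[detection_type]
    processor = processor_cls.from_pretrained(model_id, revision=revision)
    model = model_cls.from_pretrained(model_id, revision=revision).to(device).eval()
    return processor, model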

Data Signature

The filter returns processed frames with the following data structure:

Main Frame Data:

  • Original frame data preserved (existing meta keys such as id, ts, src, src_fps are kept).
  • Processing results added to frame.data["meta"]:
    • Object detection: detections (list of { class, rois } normalized [0,1]), detection_confidence, detection_type, task, model.
    • Image classification: classification: { classes, confidences, architecture, timestamp, filter_id, model_id, revision, top_k }, plus detection_type, task, and model; detections and detection_confidence are not set.

Visualization Topic (when draw_visualization=True):

  • A separate frame is published on the configured topic (e.g. viz).
  • Image has bounding boxes and labels drawn; frame.data["meta"] preserves upstream meta and includes either detection fields or classification (same shape as main).

Installation

# Install with development dependencies
make install

Configuration

  1. Create a .env file in the project root (or copy from env.example if present).

  2. Edit .env with your configuration:

# Required: Hugging Face model id (e.g. PekingU/rtdetr_r50vd)
MODEL_ID=PekingU/rtdetr_r50vd

# Required: Model revision (for reproducibility)
REVISION=main

# Required for pipeline script: path to input video
VIDEO_PATH=./filter_example_video.mp4

# Optional: Detection confidence threshold in [0, 1] (default: 0.3)
THRESHOLD=0.3

# Optional: Visualization (default: false)
DRAW_VISUALIZATION=true

# Optional: Webvis port (default: 8010)
PORT=8010

Configuration Matrix

| Variable | Type | Default | Required | Notes |
| --- | --- | --- | --- | --- |
| model_id | string | (none) | Yes | Hugging Face model id (e.g. PekingU/rtdetr_r50vd) |
| revision | string | (none) | Yes | Model revision (reproducibility) |
| detection_type | string | "closed-vocabulary" | No | image-classification, closed-vocabulary, open-vocabulary, or open-vocabulary-grounding |
| top_k | int | 5 | No | For image-classification: number of top classes to return (1–1000) |
| text_labels | list | (none) | For zero-shot / grounding | List of list of str, e.g. [["a photo of a cat", "a photo of a dog"]] |
| threshold | float | 0.3 | No | Detection confidence threshold [0, 1] (not used for image-classification) |
| device | string | "cpu" | No | "cpu" or "cuda" / CUDA device index |
| max_detections | int | 100 | No | Maximum number of detections per frame (object detection only) |
| input_topic | string | "main" | No | Topic to read frame image from |
| output_topic | string | "main" | No | Topic for processed frame |
| draw_visualization | bool | false | No | Publish a topic with boxes/labels drawn |
| visualization_topic | string | "viz" | No | Topic name for visualization frame |
| visualization_alpha | float | 0.7 | No | Overlay alpha (reserved) |
| visualization_source_topic | string | (none) | No | Optional source topic for viz image |
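
Putting the matrix together, a config for a detection pipeline might look like the sketch below. Only variables documented above are used; any standard OpenFilter wiring fields your pipeline needs are omitted:

from filter_huggingface_vision.filter import FilterHuggingfaceVisionConfig

config = FilterHuggingfaceVisionConfig(
    detection_type="closed-vocabulary",
    model_id="PekingU/rtdetr_r50vd",
    revision="main",
    threshold=0.3,
    max_detections=100,
    device="cpu",               # or "cuda"
    input_topic="main",
    output_topic="main",
    draw_visualization=True,
    visualization_topic="viz",
)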

Usage

Use the script that matches your method (see table above). All scripts run VideoIn → FilterHuggingfaceVision → Webvis and serve the UI at http://localhost:PORT (default 8010).

Image classification pipeline

Run image classification with a ViT, ConvNeXt, or any AutoModelForImageClassification model:

# In .env: MODEL_ID (e.g. google/vit-base-patch16-224 or facebook/convnext-tiny-224), REVISION=main, VIDEO_PATH, optional TOP_K
python scripts/image_classification.py

Output: frame.data["meta"] with detection_type, task, model, and classification (architecture, classes, confidences, etc.). No detections or detection_confidence for classification. Visualization shows the top label + score on the image.
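
Downstream of the filter, the classification payload can be read from the documented keys. A small sketch, assuming a frame carrying the meta described above:

meta = frame.data["meta"]
if meta.get("task") == "image-classification":
    cls = meta["classification"]
    for label, score in zip(cls["classes"], cls["confidences"]):
        print(f"{label}: {score:.2f}")   # e.g. "tabby cat: 0.42"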

Closed-vocabulary (object detection pipeline)

Run the pipeline with a fixed-vocabulary model (DETR, RT-DETR, Conditional DETR):

# Ensure MODEL_ID, REVISION, and VIDEO_PATH are set (e.g. in .env)
python scripts/object_detection.py

This will:

  1. Load video from VIDEO_PATH
  2. Run Hugging Face object detection on each frame (detection_type=closed-vocabulary)
  3. Serve visualization at http://localhost:8010 (or PORT); subscribe to main and viz when DRAW_VISUALIZATION is enabled
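
scripts/object_detection.py is the authoritative wiring. As a rough sketch only, assuming the usual OpenFilter runtime entry points (VideoIn, Webvis, Filter.run_multi_filter_processes) and illustrative topic addresses, the pipeline it runs resembles:

import os
from openfilter.filter_runtime.filter import Filter
from openfilter.filter_runtime.filters.video_in import VideoIn
from openfilter.filter_runtime.filters.webvis import Webvis
from filter_huggingface_vision.filter import FilterHuggingfaceVision

# Illustrative wiring only; ports, topics, and config keys may differ from the script.
Filter.run_multi_filter_processes([
    (VideoIn, dict(id="video_in",
                   sources=f"file://{os.environ['VIDEO_PATH']}",
                   outputs="tcp://*:5550")),
    (FilterHuggingfaceVision, dict(id="hf_vision",
                                   sources="tcp://localhost:5550",
                                   outputs="tcp://*:5552",
                                   model_id=os.environ["MODEL_ID"],
                                   revision=os.environ.get("REVISION", "main"),
                                   detection_type="closed-vocabulary")),
    (Webvis, dict(id="webvis", sources="tcp://localhost:5552")),
])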

Zero-shot object detection (OWL-ViT)

Run the zero-shot script (model and text_labels are set in the script):

# Set VIDEO_PATH in .env; edit TEXT_LABELS in scripts/zero_shot_object_detection.py if needed
python scripts/zero_shot_object_detection.py

Or use the filter with detection_type="open-vocabulary", model google/owlvit-base-patch32, and text_labels (list of list of str):

from filter_huggingface_vision.filter import FilterHuggingfaceVision, FilterHuggingfaceVisionConfig

FilterHuggingfaceVisionConfig(
    ...
    detection_type="open-vocabulary",
    model_id="google/owlvit-base-patch32",
    revision="main",
    text_labels=[["a photo of a cat", "a photo of a dog"]],
    threshold=0.1,
)

Output format is the same: frame.data["meta"] with detections (list of {class, rois} normalized), detection_confidence.

Grounding DINO pipeline

Run open-vocabulary detection with Grounding DINO (model fixed in script; only VIDEO_PATH required in .env):

# Set VIDEO_PATH in .env (e.g. VIDEO_PATH=./filter_example_video.mp4)
python scripts/grounding_dino.py

See docs/supported-models.md for supported Grounding DINO model IDs and config examples.
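
Equivalently, the filter can be configured for grounding directly, mirroring the OWL-ViT example above. A minimal sketch (the query strings are placeholders; other wiring is omitted):

FilterHuggingfaceVisionConfig(
    detection_type="open-vocabulary-grounding",
    model_id="openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det",
    revision="main",
    text_labels=[["a person", "a car"]],
    threshold=0.3,
)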

Using Makefile

# Run with default pipeline (from Makefile PIPELINE)
make run

# Run unit tests
make test

# Run tests with coverage
make test-coverage

Visualization

When draw_visualization=True, the filter publishes an additional frame on the visualization topic (e.g. viz): bounding boxes and labels for object detection, or top label + score for image classification. Webvis subscribes to both main and viz so you can view results overlaid on the video.

Output Structure

All results are written to frame.data["meta"]. Upstream keys (id, ts, src, src_fps) are preserved; the filter adds or updates:

| Field | Type | Description |
| --- | --- | --- |
| detections | list | Object detection only. Each item: { "class": "<label>", "rois": [[xmin, ymin, xmax, ymax]] } with coordinates normalized to [0, 1]. Not set for image-classification. |
| detection_confidence | float | Object detection only. Mean of detection scores. Not set for image-classification. |
| detection_type | string | Method used: closed-vocabulary, open-vocabulary, open-vocabulary-grounding, or image-classification. |
| task | string | object-detection, zero-shot-object-detection, or image-classification. |
| model | object | { "id": "<model_id>", "revision": "<revision>" } (Hugging Face model). |
| classification | object | Image-classification only. { "classes", "confidences", "architecture", "timestamp", "filter_id", "model_id", "revision", "top_k" }. No detections or detection_confidence are written for classification. |

Object detection example (frame.data["meta"]):

{
  "id": 38,
  "ts": 1761090922.42,
  "src": "file:///path/to/video.mp4",
  "src_fps": 25.0,
  "detections": [
    { "class": "person", "rois": [[0.12, 0.19, 0.35, 0.46]] }
  ],
  "detection_confidence": 0.95,
  "detection_type": "closed-vocabulary",
  "task": "object-detection",
  "model": { "id": "PekingU/rtdetr_r50vd", "revision": "main" }
}

Image classification (frame.data["meta"]):

{
  "id": 38,
  "ts": 1761090922.42,
  "src": "file:///path/to/video.mp4",
  "src_fps": 25.0,
  "detection_type": "image-classification",
  "task": "image-classification",
  "model": { "id": "facebook/convnext-tiny-224", "revision": "main" },
  "classification": {
    "classes": ["tabby cat", "Egyptian cat"],
    "confidences": [0.42, 0.31],
    "architecture": "huggingface",
    "timestamp": 1761090922.42,
    "filter_id": "filter_huggingface_vision",
    "model_id": "facebook/convnext-tiny-224",
    "revision": "main",
    "top_k": 5
  }
}

Development

Project Structure

filter-huggingface-vision/
├── filter_huggingface_vision/
│   ├── filter.py              # Main filter implementation
│   └── backends/              # One backend per HF API (image_classification, object_detection, owlvit, grounding_dino)
├── scripts/
│   ├── image_classification.py
│   ├── object_detection.py
│   ├── zero_shot_object_detection.py
│   └── grounding_dino.py
├── docs/
│   ├── overview.md
│   ├── object-detection.md
│   └── supported-models.md
├── tests/
└── pyproject.toml

Key Dependencies

  • openfilter[all]>=0.1.21 - Filter framework
  • transformers>=4.40.0 - Hugging Face APIs (AutoImageProcessor + AutoModelForImageClassification / AutoModelForObjectDetection, OwlViT, AutoModelForZeroShotObjectDetection)
  • torch - Inference
  • pillow - Image handling
  • huggingface-hub - Model loading
  • python-dotenv - Environment configuration

Testing

make test
make test-coverage

Troubleshooting

Model or revision errors

  • Ensure MODEL_ID and REVISION are set. The model must be compatible with the API for your detection_type: e.g. for image-classification use a model that loads with AutoModelForImageClassification (ViT, ConvNeXt); for closed-vocabulary use AutoModelForObjectDetection (RT-DETR, DETR). See Supported Hugging Face APIs and docs/supported-models.md.
  • Use a specific revision (e.g. main or a commit hash) for reproducibility.

CUDA / device

  • Set device to "cpu" if no GPU is available.
  • For GPU, use device="cuda" or device=0 (and ensure PyTorch is built with CUDA).
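
A common pattern is to pick the device based on availability before building the config; a small sketch using PyTorch's standard check:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# pass device=device (or a CUDA index such as 0) into the filter config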

No detections in frame

  • Check that the input frame provides an image via frame.rw_bgr.image or frame.data[input_topic].
  • Lower threshold (e.g. 0.2) to see more detections; increase for fewer false positives.
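
To inspect what the filter actually wrote, read the documented keys from a received frame; a small sketch:

meta = frame.data.get("meta", {})
detections = meta.get("detections", [])
print(len(detections), "detections, mean confidence:", meta.get("detection_confidence"))
for det in detections:
    print(det["class"], det["rois"])   # rois are normalized [xmin, ymin, xmax, ymax]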

Visualization not showing

  • Set draw_visualization=True in the filter config.
  • Ensure Webvis (or your client) subscribes to both the main topic and the visualization topic (e.g. viz).

Documentation

For more detail, pipeline examples, the variable reference, and supported model IDs per method, see docs/overview.md, docs/object-detection.md, and docs/supported-models.md.

License

See LICENSE file for details.
