filter-sam3-detector

License: Apache 2.0 + SAM

OpenFilter implementation for SAM3 (Segment Anything Model 3) object detection with open-set capabilities.

Features

  • Open-Set Detection: Detect objects not in standard training datasets
  • Dual Prompting Modes: Text prompts or exemplar images (few-shot learning)
  • Reference Box Prompts: Positive/negative bounding boxes on the original image (SAM3-style geometric prompts; optional text prompt)
  • Flexible Output: Bounding boxes, segmentation masks, and confidence scores
  • GPU Acceleration: CUDA and MPS (Apple Silicon) support, with automatic CPU fallback
  • Real-time Processing: Processes video streams in real-time
  • Pipeline Integration: Works seamlessly with OpenFilter pipeline architecture
  • Environment Configuration: Full configuration through environment variables
  • Performance Optimized: Configurable detection limits, resolution control
  • Fault Tolerant: Handles errors gracefully, forwards frames on failure
  • Cost Efficient: Local inference, no API costs

Architecture

The filter follows the OpenFilter pattern with three main stages:

Stage Responsibilities

  • setup(): load the SAM3 model from HuggingFace; load and process exemplar images; initialize the device (CUDA/CPU/MPS)
  • process(): run SAM3 inference on frames; extract detections; attach results to frame metadata
  • shutdown(): clean up resources (release model, clear GPU memory) when the filter stops
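
A minimal structural sketch of how these stages map onto a filter class. The method names follow the table above, but the exact OpenFilter signatures are an assumption here; treat this as an outline, not the real implementation (see filter_sam3_detector/filter.py):

from openfilter.filter_runtime.filter import Filter

class SAM3DetectorSketch(Filter):
    def setup(self, config):
        # Load the SAM3 model and processor, load/encode exemplar images,
        # and select the device (cuda / cpu / mps).
        ...

    def process(self, frames):
        # Run SAM3 inference on each frame, extract boxes/scores/masks,
        # and attach them under frame.data['meta'][output_label].
        ...

    def shutdown(self):
        # Release the model and clear GPU memory.
        ...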

Data Signature

The filter returns processed frames with the following data structure:

Frame Metadata:

  • Original frame data preserved
  • Detection results added to frame.data['meta'][output_label]:
    [
      {
        "box": [x1, y1, x2, y2],  # Bounding box coordinates
        "score": 0.95,            # Confidence score (0.0-1.0)
        "mask": [[...]]           # Binary mask as 2D array (optional)
      },
      ...
    ]
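
As an illustration, a downstream consumer could read this structure roughly as follows (a sketch assuming the layout shown above; output_label defaults to "sam3_detections"):

import numpy as np

def summarize_detections(frame, output_label="sam3_detections", min_score=0.5):
    # Detections attached by the SAM3 filter live under frame.data['meta'][output_label].
    detections = frame.data.get("meta", {}).get(output_label, [])
    for det in detections:
        if det["score"] < min_score:
            continue
        x1, y1, x2, y2 = det["box"]
        mask_area = None
        if det.get("mask") is not None:
            mask_area = int(np.asarray(det["mask"]).sum())  # pixels covered by the binary mask
        print(f"box=({x1},{y1})-({x2},{y2}) score={det['score']:.2f} mask_area={mask_area}")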
    

Installation

See INSTALL.md for detailed installation instructions.

Quick install:

# Clone repository
git clone <repository-url>
cd filter-sam3-detector

# Install package
uv pip install -e .

# Or with development dependencies
uv pip install -e ".[dev]"

Get Started

For a first run with Docker Compose, see QUICKSTART.md. It includes examples for:

  • FILTER_TEXT_PROMPT
  • FILTER_TEXT_PROMPTS
  • positive reference boxes and reference images

The quick start uses detached compose commands:

docker compose -f docker-compose.yaml up -d

Configuration

  1. Copy the example environment file:
cp env.example .env
  2. Edit the .env file with your configuration:
# Prompt configuration (choose one)
FILTER_TEXT_PROMPT=person                    # Text prompt for detection
FILTER_EXEMPLARS_PATH=./exemplars/           # Path to exemplar images directory
# FILTER_POSITIVE_BOXES='[[x,y,w,h],...]'    # Reference boxes (positive), JSON array of [x,y,w,h] in pixels
# FILTER_NEGATIVE_BOXES='[[x,y,w,h],...]'    # Reference boxes (negative), JSON array of [x,y,w,h] in pixels

# Model configuration
FILTER_MODEL_ID=facebook/sam3                # HuggingFace model ID
FILTER_DEVICE=cuda                           # Device: cuda, cpu, or mps

# Detection parameters
FILTER_CONFIDENCE_THRESHOLD=0.5              # Minimum confidence (0.0-1.0)
FILTER_MASK_THRESHOLD=0.5                    # Mask binarization threshold
FILTER_MAX_DETECTIONS=100                    # Maximum detections per frame

# Output configuration
FILTER_OUTPUT_MASKS=true                     # Output segmentation masks
FILTER_OUTPUT_BOXES=true                     # Output bounding boxes
FILTER_OUTPUT_SCORES=true                    # Output confidence scores
FILTER_OUTPUT_LABEL=sam3_detections          # Key in frame.data['meta']

# Visualization and debugging
FILTER_VISUALIZE=false                       # Draw detections on frames
# FILTER_VIZ_TOPIC=viz                       # When set: main=original+meta, this topic=drawn frame+meta
FILTER_DEBUG=false                           # Enable debug logging
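
As an illustration of how these values are consumed, the JSON-valued box variables can be parsed with the standard library (a sketch, not the filter's actual parsing code):

import json
import os

# FILTER_POSITIVE_BOXES / FILTER_NEGATIVE_BOXES hold JSON arrays of [x, y, w, h] in pixels.
positive_boxes = json.loads(os.environ.get("FILTER_POSITIVE_BOXES", "[]"))
negative_boxes = json.loads(os.environ.get("FILTER_NEGATIVE_BOXES", "[]"))
confidence_threshold = float(os.environ.get("FILTER_CONFIDENCE_THRESHOLD", "0.5"))

for x, y, w, h in positive_boxes + negative_boxes:
    assert w > 0 and h > 0, "each box is [x, y, width, height] in pixels"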

Configuration Matrix

Variable Type Default Required Notes
text_prompt string None No* Natural language description (e.g., "person", "car")
exemplars_path string None No* Path to directory with exemplar images
model_id string "facebook/sam3" No HuggingFace model ID or local path
device string "cuda" No Device: "cuda", "cpu", or "mps"
confidence_threshold float 0.5 No Minimum confidence (0.0-1.0)
mask_threshold float 0.5 No Mask binarization threshold (0.0-1.0)
max_detections int 100 No Maximum detections per frame
output_masks bool true No Output segmentation masks
output_boxes bool true No Output bounding boxes
output_scores bool true No Output confidence scores
output_label string "sam3_detections" No Key for storing results
visualize bool false No Draw detections on output frames
viz_topic string "" No When set (e.g. viz), main gets original frame + meta; this topic gets drawn frame + meta. Empty = legacy (visualize draws on main).
ref_images string None No Comma-separated paths for positive ref images (pasted on composite). Ignored when positive_boxes or negative_boxes are set.
ref_images_negative string None No Comma-separated paths for negative ref images. Ignored when ref boxes are set.
composite_topic string "" No When set (e.g. composite), publish the composite image (frame + refs) on this topic when REF_IMGS are in use.
debug bool false No Enable debug logging

* When using positive_boxes or negative_boxes, a text prompt is optional (the model can use the placeholder "visual"). Otherwise either text_prompt or exemplars_path must be provided. When using REF_IMGS (ref images), a text prompt is required; REF_IMGS are disabled when ref boxes are set.

Reference box prompts

In single-output mode you can add reference bounding boxes on the original image (no composite): set FILTER_POSITIVE_BOXES and/or FILTER_NEGATIVE_BOXES to a JSON array of boxes, each box [x, y, width, height] in pixels. Positive boxes encourage detections similar to those regions; negative boxes suppress them. Example in .env:

FILTER_POSITIVE_BOXES="[[480, 290, 110, 360], [370, 280, 115, 375]]"
FILTER_NEGATIVE_BOXES="[[100, 100, 50, 200]]"

Text prompt is optional when using reference boxes. With FILTER_VISUALIZE=true, positive ref boxes are drawn in green, negative in red, and detections in blue.

Rule: when FILTER_POSITIVE_BOXES or FILTER_NEGATIVE_BOXES are set, reference images (REF_IMGS) are not used — only the reference-boxes mode on the original image is applied. Set REF_IMGS only when you are not using ref boxes.
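
Since reference boxes are given as [x, y, width, height] while detection results use [x1, y1, x2, y2], converting between the two is straightforward (hypothetical helper, shown only to make the formats concrete):

def xywh_to_xyxy(box):
    # [x, y, width, height] -> [x1, y1, x2, y2]
    x, y, w, h = box
    return [x, y, x + w, y + h]

# The first positive box from the .env snippet above:
print(xywh_to_xyxy([480, 290, 110, 360]))  # [480, 290, 590, 650]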

Reference images (REF_IMGS)

You can pass reference images (positive and/or negative) that are pasted on a composite (frame + refs) for visual prompting. Set FILTER_REF_IMAGES and/or FILTER_REF_IMAGES_NEGATIVE to comma-separated paths (files or directories; directories are expanded to image files). A text prompt is required when using REF_IMGS. To view the composite image in the pipeline, set FILTER_COMPOSITE_TOPIC=composite and ensure the filter outputs include the composite topic (e.g. in Webvis you can open /composite).
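
A rough sketch of how comma-separated reference image paths can be expanded into individual files, mirroring the behaviour described above (the set of recognized image extensions is an assumption):

import os

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed extensions

def expand_ref_images(spec):
    # Split a comma-separated list of files/directories and expand directories to image files.
    paths = []
    for entry in (p.strip() for p in spec.split(",") if p.strip()):
        if os.path.isdir(entry):
            for name in sorted(os.listdir(entry)):
                if os.path.splitext(name)[1].lower() in IMAGE_EXTS:
                    paths.append(os.path.join(entry, name))
        else:
            paths.append(entry)
    return paths

print(expand_ref_images("./exemplars/, extra_ref.png"))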

Usage

Method 1: Using Example Scripts (Recommended)

Scripts read configuration from environment variables (e.g. from a .env file). Copy env.example to .env and set at least VIDEO_PATH and FILTER_TEXT_PROMPT.

Object Detection with Text Prompts

Set the variables in your .env file or pass them inline:

# Set in .env: VIDEO_PATH, FILTER_TEXT_PROMPT, FILTER_OUTPUT_DIR, etc.
python scripts/filter_object_detection.py

Or pass variables inline:

# Detect people in a video
VIDEO_PATH=input.mp4 FILTER_TEXT_PROMPT=person FILTER_OUTPUT_DIR=./results \
  FILTER_CONFIDENCE_THRESHOLD=0.5 python scripts/filter_object_detection.py

# Detect cars with visualization
VIDEO_PATH=traffic.mp4 FILTER_TEXT_PROMPT=car FILTER_CONFIDENCE_THRESHOLD=0.6 \
  FILTER_VISUALIZE=true FILTER_OUTPUT_DIR=./cars python scripts/filter_object_detection.py

# Process multiple videos (run once per video)
VIDEO_PATH=video1.mp4 FILTER_TEXT_PROMPT=dog FILTER_OUTPUT_DIR=./detections \
  python scripts/filter_object_detection.py
# Then VIDEO_PATH=video2.mp4 ... and VIDEO_PATH=video3.mp4 ...

Optional: FILTER_VIDEO_LOOP=true keeps the video looping so frames are still available after the model loads (~14s); useful for short videos.

Detection with Reference Boxes

Use positive and/or negative reference bounding boxes on the frame (SAM3-style geometric prompts) with or without a text prompt:

# In .env set: VIDEO_PATH, and FILTER_POSITIVE_BOXES and/or FILTER_NEGATIVE_BOXES (JSON arrays of [x,y,w,h])
# Optional: FILTER_TEXT_PROMPT for text-guided detection
python scripts/filter_object_detection_exemplar.py

Reference boxes: Set FILTER_POSITIVE_BOXES and/or FILTER_NEGATIVE_BOXES to a JSON array of boxes, each [x, y, width, height] in pixels. Example: FILTER_POSITIVE_BOXES="[[480, 290, 110, 360]]". Text prompt is optional. With FILTER_VISUALIZE=true, ref boxes are drawn in green (positive) and red (negative), detections in blue.

Method 2: Docker Pipeline

Run the complete detection pipeline with Docker Compose. The prebuilt image is published to Docker Hub at plainsightai/openfilter-sam3-detector and is publicly pullable — no auth required.

# 1. Copy your video to the data directory
cp your_video.mp4 data/sample-video.mp4

# 2. Pull the prebuilt image (compose will also pull on first run if
#    missing). Model weights are baked in, so no HF_TOKEN at runtime.
#    Set SAM3_DETECTOR_VERSION to pin to a specific release for
#    reproducibility — defaults to `latest`.
docker pull plainsightai/openfilter-sam3-detector:latest

# 3. Run the pipeline
FILTER_TEXT_PROMPT="person" docker compose up

# 4. View results at http://localhost:8001 (webvis)
# Temporal intervals are streamed to output/intervals.json

Build from source instead

# Required to download the gated SAM3 weights at build time.
export HF_TOKEN="your_huggingface_token"

# HF_TOKEN is passed as a BuildKit secret, not an env var or build arg,
# so it never ends up in an image layer. Compose does not forward the token,
# so invoke `docker build` directly and tag it to match docker-compose.yaml.
docker build --secret id=hf_token,env=HF_TOKEN \
  -t plainsightai/openfilter-sam3-detector:latest .

Pipeline Architecture:

video_in → sam3_detector (with integrated temporal intervals) → webvis
               ↓
         output/intervals.json (streamed)

Requirements:

  • Docker with NVIDIA Container Toolkit
  • CUDA-compatible GPU (sm_50+ including RTX 50-series/Blackwell)
  • HuggingFace account with access to the gated SAM3 weights (only needed when building the image from source)

Environment Variables:

Variable Default Description
HF_TOKEN - HuggingFace token for model access (build-time)
FILTER_TEXT_PROMPT "person" What to detect
FILTER_HALF_LIFE 5.0 Frames for 50% EMA decay
FILTER_PRESENCE_THRESHOLD 0.4 EMA score that triggers presence

Note: SAM3 weights are baked into the image at build time, and the container runs with HF_HUB_OFFLINE=1 / TRANSFORMERS_OFFLINE=1. No network or HF_TOKEN is needed at runtime — the image is safe to run with --network=none.

Output Format (intervals.json):

{
  "intervals": [
    {"start_frame": 23, "end_frame": 69, "label": "person", "present": true, "confidence": 0.95}
  ],
  "total_frames": 463
}
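
The streamed file is plain JSON, so it can be consumed with standard tooling; for example (a sketch based on the format shown above):

import json

with open("output/intervals.json") as f:
    data = json.load(f)

for iv in data["intervals"]:
    if iv["present"]:
        print(f'{iv["label"]}: frames {iv["start_frame"]}-{iv["end_frame"]} '
              f'(confidence {iv["confidence"]:.2f})')
print(f'total frames: {data["total_frames"]}')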

Method 3: Using as a Standalone Filter

# Set environment variables
export FILTER_TEXT_PROMPT="person"
export FILTER_CONFIDENCE_THRESHOLD=0.7
export FILTER_DEVICE=cuda
export FILTER_SOURCES="tcp://127.0.0.1:5555"
export FILTER_OUTPUTS="tcp://127.0.0.1:5556"

# Run the filter
filter-sam3-detector

Method 4: Using in Python Code

from filter_sam3_detector import FilterSAM3Detector
from openfilter.filter_runtime.filter import Filter
from openfilter.filter_runtime.filters.video_in import VideoIn
from openfilter.filter_runtime.filters.recorder import Recorder

# Define pipeline
filters = [
    (VideoIn, {
        "sources": "file://input.mp4",
        "outputs": ["tcp://127.0.0.1:5555"],
    }),
    (FilterSAM3Detector, {
        "sources": "tcp://127.0.0.1:5555",
        "outputs": ["tcp://127.0.0.1:5556"],
        "text_prompt": "person",
        "confidence_threshold": 0.5,
        "device": "cuda",
    }),
    (Recorder, {
        "sources": "tcp://127.0.0.1:5556",
        "path": "detections.jsonl",
        "format": "jsonl",
    }),
]

# Run pipeline
Filter.run_multi(filters)

Temporal Interval Detection

Convert noisy per-frame detections into stable presence/absence intervals using EMA smoothing.

Quick Start (Docker - Recommended)

# Run the integrated pipeline (temporal intervals built into SAM3 detector).
# Set SAM3_DETECTOR_VERSION to pin to a specific release if you need
# reproducibility; defaults to `latest`.
cp your_video.mp4 data/sample-video.mp4
docker pull plainsightai/openfilter-sam3-detector:latest
FILTER_TEXT_PROMPT="person" docker compose up

# Intervals stream to output/intervals.json as detection progresses

Quick Start (Python Script)

# Run on any video with custom prompts
uv run python scripts/run_temporal_intervals.py video.mp4 \
    --prompts "person,hand,cup" \
    --output results.json

Integrated Mode (Recommended)

Enable temporal intervals directly in the SAM3 detector - no separate filter needed:

from filter_sam3_detector import FilterSAM3Detector

# Single filter with integrated temporal tracking
pipeline = [
    (FilterSAM3Detector, {
        "text_prompt": "person",
        "output_label": "detections",
        # Integrated temporal intervals
        "enable_temporal_intervals": True,
        "temporal_streaming_mode": True,  # Emit incrementally
        "temporal_half_life": 5.0,
        "temporal_presence_threshold": 0.4,
        "temporal_output_json_path": "intervals.json",
    }),
]

Separate Filter Mode (Legacy)

For pipelines requiring separate filter stages:

from filter_sam3_detector import FilterSAM3Detector
from filter_sam3_detector.temporal_intervals import TemporalIntervalFilter

# SAM3 detector -> Temporal interval filter
pipeline = [
    (FilterSAM3Detector, {
        "text_prompt": "person",
        "output_label": "detections",
    }),
    (TemporalIntervalFilter, {
        "detection_key": "detections",
        "half_life": 5.0,           # EMA responsiveness (frames)
        "presence_threshold": 0.4,  # Detection threshold
        "output_json_path": "intervals.json",
    }),
]

Output Format

{
  "intervals": [
    {"start_frame": 20, "end_frame": 150, "label": "person", "present": true, "confidence": 0.92},
    {"start_frame": 151, "end_frame": 180, "label": "person", "present": false, "confidence": 0.15}
  ],
  "total_frames": 200
}

Configuration Options

Option Default Description
enable_temporal_intervals false Enable integrated temporal tracking
temporal_streaming_mode false Emit intervals incrementally (vs. at end)
temporal_half_life 5.0 Frames for 50% EMA decay
temporal_presence_threshold 0.4 EMA score to trigger presence
temporal_output_json_path None Path to write intervals JSON
temporal_emit_on_change true Only emit when state changes
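
The smoothing itself boils down to an exponential moving average whose per-frame weight is derived from the half-life; a minimal sketch of that logic (not the filter's actual code) using the defaults above:

half_life = 5.0            # temporal_half_life: frames for the EMA to decay by 50%
presence_threshold = 0.4   # temporal_presence_threshold
alpha = 1 - 0.5 ** (1.0 / half_life)  # per-frame update weight implied by the half-life

ema = 0.0
present = False
# 1 = object detected in the frame, 0 = not detected
for frame_idx, detected in enumerate([0, 1, 1, 1, 0, 0, 1, 1, 1, 1]):
    ema = (1 - alpha) * ema + alpha * detected
    if (ema >= presence_threshold) != present:
        present = not present
        print(f"frame {frame_idx}: present={present} (ema={ema:.2f})")  # state change -> new interval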

Usage Scenarios

1. Person Detection

Set in .env: VIDEO_PATH, FILTER_TEXT_PROMPT=person, FILTER_OUTPUT_DIR, FILTER_CONFIDENCE_THRESHOLD=0.6. Then:

python scripts/filter_object_detection.py

2. Vehicle Detection

Set VIDEO_PATH, FILTER_TEXT_PROMPT=car, FILTER_OUTPUT_DIR, FILTER_RESIZE=480. Then run python scripts/filter_object_detection.py.

3. Detection with Reference Boxes

Use bounding boxes on the frame as positive/negative prompts (with or without text):

# In .env: VIDEO_PATH, FILTER_POSITIVE_BOXES='[[x,y,w,h],...]', FILTER_NEGATIVE_BOXES (optional), FILTER_TEXT_PROMPT (optional)
python scripts/filter_object_detection_exemplar.py

4. Pipeline Integration

Combine with other OpenFilter filters:

from openfilter.filter_runtime.filter import Filter
from openfilter.filter_runtime.filters.video_in import VideoIn
from openfilter.filter_runtime.filters.resize import Resize
from openfilter.filter_runtime.filters.recorder import Recorder
from filter_sam3_detector import FilterSAM3Detector

filters = [
    (VideoIn, {"sources": "file://input.mp4"}),
    (Resize, {"width": 640, "height": 480}),  # Pre-processing
    (FilterSAM3Detector, {"text_prompt": "person"}),
    (Recorder, {"path": "output.jsonl"}),
]

Filter.run_multi(filters)

Output Format

Detections are stored in frame.data['meta'][output_label]:

[
  {
    "box": [x1, y1, x2, y2],  # Bounding box coordinates
    "score": 0.95,            # Confidence score (0.0-1.0)
    "mask": [[...]]           # Binary mask as 2D array (if output_masks=True)
  },
  ...
]

When using the Recorder filter, detections are saved in JSONL format:

{
  "frame_id": 0,
  "meta": {
    "sam3_detections": [
      {
        "box": [100, 150, 200, 250],
        "score": 0.95,
        "mask": [[0, 0, 1, 1, ...]]
      }
    ]
  }
}
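
A short sketch for reading such a file back (one JSON object per line, as written by the Recorder):

import json

with open("detections.jsonl") as f:
    for line in f:
        record = json.loads(line)
        dets = record.get("meta", {}).get("sam3_detections", [])
        print(f"frame {record.get('frame_id')}: {len(dets)} detection(s)")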

Performance Tips

Image Processing

  • Resize Videos: Use --resize 480 for faster processing
  • Limit Detections: Reduce FILTER_MAX_DETECTIONS for better performance
  • Disable Masks: Set FILTER_OUTPUT_MASKS=false to save memory

Device Selection

  • Use GPU: Set FILTER_DEVICE=cuda for 10-50x speedup
  • CPU Fallback: Automatically falls back to CPU if GPU unavailable
  • Apple Silicon: Use FILTER_DEVICE=mps on macOS

Confidence Thresholds

  • Text Prompts: Default 0.5 works well
  • Exemplar-Based: Use 0.3 for better recall
  • High Precision: Use 0.7 or higher
  • High Recall: Use 0.3 or lower

Development

Project Structure

filter-sam3-detector/
├── filter_sam3_detector/
│   ├── __init__.py
│   └── filter.py              # Main filter implementation
├── scripts/                   # Example usage scripts
│   ├── filter_object_detection.py       # Video pipeline (text prompt)
│   ├── filter_object_detection_exemplar.py  # Video pipeline (reference boxes + optional text)
│   └── run_temporal_intervals.py
├── examples/                  # Additional examples
│   └── detect_objects_video.py
├── docs/                      # Documentation
│   ├── API.md
│   ├── configuration.md
│   ├── advanced-usage.md
│   └── performance.md
├── tests/                     # Test files
│   ├── test_filter.py
│   └── test_integration.py
├── sam3/                      # Vendorized SAM3 library
├── env.example               # Environment configuration example
└── pyproject.toml           # Project dependencies

Key Dependencies

  • openfilter[all]>=0.1.0 - Filter framework
  • torch>=2.0.0 - PyTorch for model inference
  • torchvision>=0.15.0 - Image processing
  • transformers>=4.40.0 - HuggingFace model loading
  • opencv-python>=4.8.0 - Image manipulation
  • pillow>=10.0.0 - Image processing
  • numpy>=1.24.0 - Numerical operations

Testing

# Run tests
make test

# Run tests with coverage (pass extra pytest args via PYTEST_ARGS)
make test PYTEST_ARGS="--cov=filter_sam3_detector --cov-report=term"

# Check code quality
make lint

# Format code
make format

Known Issues

Exemplar-Based Detection Not Working

Status: Bug in _load_exemplar_images() - backbone output format handling is incorrect.

Symptoms: When using exemplars_path, you may see warnings like:

WARNING  Failed to load exemplar example.jpg: 'NoneType' object is not subscriptable
ERROR    No exemplar images could be loaded

Root Cause: The code at filter.py:853-858 doesn't properly handle the SAM3 backbone output format. The backbone returns features in a different structure than expected.

Workaround: Use text prompts (text_prompt) instead of exemplar images until this is fixed.

Tracking: This issue affects the few-shot learning functionality. Text-based detection works correctly.

Troubleshooting

Model Loading Issues

Problem: Model fails to load or takes too long

Solutions:

  • Ensure you have sufficient GPU memory (recommended: 8GB+)
  • Use CPU mode if GPU is unavailable: --device cpu
  • Check internet connection (model downloads from HuggingFace on first use)
  • Verify CUDA installation: nvidia-smi

No Detections Found

Problem: Filter runs but finds no objects

Solutions:

  • Lower confidence threshold: --confidence 0.3
  • Try different text prompts (be more specific or more general)
  • For exemplar-based: ensure exemplar images are clear and representative
  • Check that input video has the objects you're looking for

Out of Memory Errors

Problem: CUDA out of memory errors

Solutions:

  • Resize input: --resize 480
  • Reduce max detections: export FILTER_MAX_DETECTIONS=50
  • Disable masks: export FILTER_OUTPUT_MASKS=false
  • Use CPU mode: --device cpu (slower but uses less memory)

Import Errors

Problem: ImportError: cannot import name 'FilterSAM3Detector'

Solutions:

  • Ensure package is installed: uv pip install -e .
  • Check Python version (requires 3.10+)
  • Verify all dependencies are installed
  • Reinstall: uv pip install -e . --force-reinstall

Slow Processing

Problem: Processing is very slow

Solutions:

  • Use GPU: --device cuda
  • Resize videos: --resize 480
  • Reduce max detections
  • Disable masks if not needed
  • Process fewer frames (use sample rate in video input)

Performance Optimization

To improve processing speed (a combined example follows this list):

  1. Use GPU acceleration (FILTER_DEVICE=cuda)
  2. Resize inputs to appropriate resolution (--resize 480)
  3. Limit detections (FILTER_MAX_DETECTIONS=50)
  4. Disable unused outputs (masks if not needed)
  5. Use smaller model variant (if available)
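
Put together, a speed-oriented pipeline might look like the sketch below (parameter names follow the configuration matrix above; the resize dimensions are illustrative):

from openfilter.filter_runtime.filter import Filter
from openfilter.filter_runtime.filters.video_in import VideoIn
from openfilter.filter_runtime.filters.resize import Resize
from filter_sam3_detector import FilterSAM3Detector

filters = [
    (VideoIn, {"sources": "file://input.mp4"}),
    (Resize, {"width": 640, "height": 480}),   # smaller frames -> faster inference
    (FilterSAM3Detector, {
        "text_prompt": "person",
        "device": "cuda",        # GPU acceleration
        "max_detections": 50,    # cap detections per frame
        "output_masks": False,   # skip masks to save memory and time
    }),
    # Add a sink (e.g. Recorder or Webvis) as in the earlier examples.
]

Filter.run_multi(filters)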

Documentation

For more detailed information, configuration examples, and advanced usage scenarios, see the documentation in the docs/ directory: API.md, configuration.md, advanced-usage.md, and performance.md.

License

This project uses dual licensing. The filter wrapper code is licensed under Apache 2.0, and the vendorized SAM3 library (sam3/) is licensed under the SAM License, which includes trade control restrictions. See LICENSING.md for full details.
