filter-sam3-detector
OpenFilter implementation for SAM3 (Segment Anything Model 3) object detection with open-set capabilities.
Features
- Open-Set Detection: Detect objects not in standard training datasets
- Dual Prompting Modes: Text prompts or exemplar images (few-shot learning)
- Reference Box Prompts: Positive/negative bounding boxes on the original image (SAM3-style geometric prompts; optional text prompt)
- Flexible Output: Bounding boxes, segmentation masks, and confidence scores
- GPU Acceleration: CUDA, CPU, and MPS (Apple Silicon) support
- Real-time Processing: Processes video streams in real-time
- Pipeline Integration: Works seamlessly with OpenFilter pipeline architecture
- Environment Configuration: Full configuration through environment variables
- Performance Optimized: Configurable detection limits, resolution control
- Fault Tolerant: Handles errors gracefully, forwards frames on failure
- Cost Efficient: Local inference, no API costs
Architecture
The filter follows the OpenFilter pattern with three main stages:
Stage Responsibilities
| Stage | Responsibility |
|---|---|
| `setup()` | Load SAM3 model from HuggingFace; load and process exemplar images; initialize device (CUDA/CPU/MPS) |
| `process()` | Core operation: run SAM3 inference on frames; extract detections; attach results to frame metadata |
| `shutdown()` | Clean up resources (release model, clear GPU memory) when the filter stops |
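The stage split above can be sketched as a skeleton (illustrative only: the method names follow the table, `SketchSAM3Detector` is a hypothetical class, and the real filter adds model loading, prompt handling, and device management):

```python
class SketchSAM3Detector:
    """Pseudo-filter showing where each lifecycle responsibility lives."""

    def setup(self, config):
        # Load the model once, on the configured device.
        self.device = config.get("device", "cuda")
        self.model = None  # e.g. loaded from HuggingFace by model_id

    def process(self, frames):
        # Run inference per frame and attach detections to metadata.
        for frame in frames.values():
            meta = frame.data.setdefault("meta", {})
            meta["sam3_detections"] = []  # inference results go here
        return frames

    def shutdown(self):
        # Release the model and free GPU memory.
        self.model = None
```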
Data Signature
The filter returns processed frames with the following data structure:
Frame Metadata:
- Original frame data preserved
- Detection results added to `frame.data['meta'][output_label]`:

```
[
  {
    "box": [x1, y1, x2, y2],  # Bounding box coordinates
    "score": 0.95,            # Confidence score (0.0-1.0)
    "mask": [[...]]           # Binary mask as 2D array (optional)
  },
  ...
]
```
Installation
See INSTALL.md for detailed installation instructions.
Quick install:

```bash
# Clone repository
git clone <repository-url>
cd filter-sam3-detector

# Install package
uv pip install -e .

# Or with development dependencies
uv pip install -e ".[dev]"
```
Get Started
For a first run with Docker Compose, including examples for:

- `FILTER_TEXT_PROMPT` / `FILTER_TEXT_PROMPTS`
- positive reference boxes and reference images

use QUICKSTART.md.

The quick start uses detached compose commands:

```bash
docker compose -f docker-compose.yaml up -d
```
Configuration
1. Copy the example environment file:

```bash
cp .env.example .env
```

2. Edit the `.env` file with your configuration:

```bash
# Prompt configuration (choose one)
FILTER_TEXT_PROMPT=person                  # Text prompt for detection
FILTER_EXEMPLARS_PATH=./exemplars/         # Path to exemplar images directory
# FILTER_POSITIVE_BOXES='[[x,y,w,h],...]'  # Reference boxes (positive), JSON array of [x,y,w,h] in pixels
# FILTER_NEGATIVE_BOXES='[[x,y,w,h],...]'  # Reference boxes (negative), JSON array of [x,y,w,h] in pixels

# Model configuration
FILTER_MODEL_ID=facebook/sam3              # HuggingFace model ID
FILTER_DEVICE=cuda                         # Device: cuda, cpu, or mps

# Detection parameters
FILTER_CONFIDENCE_THRESHOLD=0.5            # Minimum confidence (0.0-1.0)
FILTER_MASK_THRESHOLD=0.5                  # Mask binarization threshold
FILTER_MAX_DETECTIONS=100                  # Maximum detections per frame

# Output configuration
FILTER_OUTPUT_MASKS=true                   # Output segmentation masks
FILTER_OUTPUT_BOXES=true                   # Output bounding boxes
FILTER_OUTPUT_SCORES=true                  # Output confidence scores
FILTER_OUTPUT_LABEL=sam3_detections        # Key in frame.data['meta']

# Visualization and debugging
FILTER_VISUALIZE=false                     # Draw detections on frames
# FILTER_VIZ_TOPIC=viz                     # When set: main=original+meta, this topic=drawn frame+meta
FILTER_DEBUG=false                         # Enable debug logging
```
Configuration Matrix
| Variable | Type | Default | Required | Notes |
|---|---|---|---|---|
| `text_prompt` | string | None | No* | Natural language description (e.g., "person", "car") |
| `exemplars_path` | string | None | No* | Path to directory with exemplar images |
| `model_id` | string | "facebook/sam3" | No | HuggingFace model ID or local path |
| `device` | string | "cuda" | No | Device: "cuda", "cpu", or "mps" |
| `confidence_threshold` | float | 0.5 | No | Minimum confidence (0.0-1.0) |
| `mask_threshold` | float | 0.5 | No | Mask binarization threshold (0.0-1.0) |
| `max_detections` | int | 100 | No | Maximum detections per frame |
| `output_masks` | bool | true | No | Output segmentation masks |
| `output_boxes` | bool | true | No | Output bounding boxes |
| `output_scores` | bool | true | No | Output confidence scores |
| `output_label` | string | "sam3_detections" | No | Key for storing results |
| `visualize` | bool | false | No | Draw detections on output frames |
| `viz_topic` | string | "" | No | When set (e.g. `viz`), main gets original frame + meta; this topic gets drawn frame + meta. Empty = legacy (visualize draws on main). |
| `ref_images` | string | None | No | Comma-separated paths for positive ref images (pasted on composite). Ignored when `positive_boxes` or `negative_boxes` are set. |
| `ref_images_negative` | string | None | No | Comma-separated paths for negative ref images. Ignored when ref boxes are set. |
| `composite_topic` | string | "" | No | When set (e.g. `composite`), publish the composite image (frame + refs) on this topic when REF_IMGS are in use. |
| `debug` | bool | false | No | Enable debug logging |
* When using positive_boxes or negative_boxes, a text prompt is optional (the model can use the placeholder "visual"). Otherwise either text_prompt or exemplars_path must be provided. When using REF_IMGS (ref images), a text prompt is required; REF_IMGS are disabled when ref boxes are set.
Reference box prompts
In single-output mode you can add reference bounding boxes on the original image (no composite): set FILTER_POSITIVE_BOXES and/or FILTER_NEGATIVE_BOXES to a JSON array of boxes, each box [x, y, width, height] in pixels. Positive boxes encourage detections similar to those regions; negative boxes suppress them. Example in .env:
```bash
FILTER_POSITIVE_BOXES="[[480, 290, 110, 360], [370, 280, 115, 375]]"
FILTER_NEGATIVE_BOXES="[[100, 100, 50, 200]]"
```
Text prompt is optional when using reference boxes. With FILTER_VISUALIZE=true, positive ref boxes are drawn in green, negative in red, and detections in blue.
Rule: when FILTER_POSITIVE_BOXES or FILTER_NEGATIVE_BOXES are set, reference images (REF_IMGS) are not used — only the reference-boxes mode on the original image is applied. Set REF_IMGS only when you are not using ref boxes.
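Reference boxes use `[x, y, width, height]` in pixels, while detections are reported as `[x1, y1, x2, y2]` corners. A small conversion helper for working with both forms (the `xywh_to_xyxy` name is illustrative, not part of the filter API):

```python
import json

def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to [x1, y1, x2, y2] corner form."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

# Parse the same JSON shape FILTER_POSITIVE_BOXES expects.
positive = json.loads('[[480, 290, 110, 360], [370, 280, 115, 375]]')
corners = [xywh_to_xyxy(b) for b in positive]
```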
Reference images (REF_IMGS)
You can pass reference images (positive and/or negative) that are pasted on a composite (frame + refs) for visual prompting. Set FILTER_REF_IMAGES and/or FILTER_REF_IMAGES_NEGATIVE to comma-separated paths (files or directories; directories are expanded to image files). A text prompt is required when using REF_IMGS. To view the composite image in the pipeline, set FILTER_COMPOSITE_TOPIC=composite and ensure the filter outputs include the composite topic (e.g. in Webvis you can open /composite).
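A sketch of how a comma-separated `FILTER_REF_IMAGES` value might be expanded into image paths — this is an assumption about the expansion logic, and `expand_ref_paths` is not part of the filter API:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def expand_ref_paths(value):
    """Expand a comma-separated list of files/directories into image paths."""
    paths = []
    for raw in value.split(","):
        raw = raw.strip()
        if not raw:
            continue
        p = Path(raw)
        if p.is_dir():
            # Directories are expanded to the image files they contain.
            paths.extend(sorted(q for q in p.iterdir()
                                if q.suffix.lower() in IMAGE_EXTS))
        else:
            paths.append(p)
    return paths
```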
Usage
Method 1: Using Example Scripts (Recommended)
Scripts read configuration from environment variables (e.g. from a .env file). Copy env.example to .env and set at least VIDEO_PATH and FILTER_TEXT_PROMPT.
Object Detection with Text Prompts
Set the variables in a `.env` file or pass them inline:

```bash
# Set in .env: VIDEO_PATH, FILTER_TEXT_PROMPT, FILTER_OUTPUT_DIR, etc.
python scripts/filter_object_detection.py
```

Or pass variables inline:

```bash
# Detect people in a video
VIDEO_PATH=input.mp4 FILTER_TEXT_PROMPT=person FILTER_OUTPUT_DIR=./results \
FILTER_CONFIDENCE_THRESHOLD=0.5 python scripts/filter_object_detection.py

# Detect cars with visualization
VIDEO_PATH=traffic.mp4 FILTER_TEXT_PROMPT=car FILTER_CONFIDENCE_THRESHOLD=0.6 \
FILTER_VISUALIZE=true FILTER_OUTPUT_DIR=./cars python scripts/filter_object_detection.py

# Process multiple videos (run once per video)
VIDEO_PATH=video1.mp4 FILTER_TEXT_PROMPT=dog FILTER_OUTPUT_DIR=./detections \
python scripts/filter_object_detection.py
# Then VIDEO_PATH=video2.mp4 ... and VIDEO_PATH=video3.mp4 ...
```
Optional: FILTER_VIDEO_LOOP=true keeps the video looping so frames are still available after the model loads (~14s); useful for short videos.
Detection with Reference Boxes
Use positive and/or negative reference bounding boxes on the frame (SAM3-style geometric prompts) with or without a text prompt:
```bash
# In .env set: VIDEO_PATH, and FILTER_POSITIVE_BOXES and/or FILTER_NEGATIVE_BOXES (JSON arrays of [x,y,w,h])
# Optional: FILTER_TEXT_PROMPT for text-guided detection
python scripts/filter_object_detection_exemplar.py
```
Reference boxes: Set FILTER_POSITIVE_BOXES and/or FILTER_NEGATIVE_BOXES to a JSON array of boxes, each [x, y, width, height] in pixels. Example: FILTER_POSITIVE_BOXES="[[480, 290, 110, 360]]". Text prompt is optional. With FILTER_VISUALIZE=true, ref boxes are drawn in green (positive) and red (negative), detections in blue.
Method 2: Docker Pipeline
Run the complete detection pipeline with Docker Compose. The prebuilt image is published to Docker Hub at plainsightai/openfilter-sam3-detector and is publicly pullable — no auth required.
```bash
# 1. Copy your video to the data directory
cp your_video.mp4 data/sample-video.mp4

# 2. Pull the prebuilt image (compose will also pull on first run if
#    missing). Model weights are baked in, so no HF_TOKEN at runtime.
#    Set SAM3_DETECTOR_VERSION to pin to a specific release for
#    reproducibility — defaults to `latest`.
docker pull plainsightai/openfilter-sam3-detector:latest

# 3. Run the pipeline
FILTER_TEXT_PROMPT="person" docker compose up

# 4. View results at http://localhost:8001 (webvis)
#    Temporal intervals are streamed to output/intervals.json
```
Build from source instead:

```bash
# Required to download the gated SAM3 weights at build time.
export HF_TOKEN="your_huggingface_token"

# HF_TOKEN is passed as a BuildKit secret, not an env var or build arg,
# so it never ends up in an image layer. Compose does not forward the token,
# so invoke `docker build` directly and tag it to match docker-compose.yaml.
docker build --secret id=hf_token,env=HF_TOKEN \
  -t plainsightai/openfilter-sam3-detector:latest .
```
Pipeline Architecture:

```
video_in → sam3_detector (with integrated temporal intervals) → webvis
                        ↓
            output/intervals.json (streamed)
```
Requirements:
- Docker with NVIDIA Container Toolkit
- CUDA-compatible GPU (sm_50+ including RTX 50-series/Blackwell)
- HuggingFace account with access to gated models
Environment Variables:
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | - | HuggingFace token for model access (build-time) |
| `FILTER_TEXT_PROMPT` | "person" | What to detect |
| `FILTER_HALF_LIFE` | 5.0 | EMA decay rate (frames) |
| `FILTER_PRESENCE_THRESHOLD` | 0.4 | Detection threshold |
Note: SAM3 weights are baked into the image at build time, and the container runs with `HF_HUB_OFFLINE=1`/`TRANSFORMERS_OFFLINE=1`. No network or `HF_TOKEN` is needed at runtime — the image is safe to run with `--network=none`.
Output Format (intervals.json):

```json
{
  "intervals": [
    {"start_frame": 23, "end_frame": 69, "label": "person", "present": true, "confidence": 0.95}
  ],
  "total_frames": 463
}
```
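Given that schema, a downstream consumer can pull out the spans where the label was present in a few lines (a sketch assuming the JSON shape shown above, including inclusive frame ranges):

```python
import json

doc = json.loads("""
{
  "intervals": [
    {"start_frame": 23, "end_frame": 69, "label": "person",
     "present": true, "confidence": 0.95}
  ],
  "total_frames": 463
}
""")

# Keep only spans where the label was actually present.
present = [iv for iv in doc["intervals"] if iv["present"]]

# Frames covered by present intervals (assuming inclusive ranges).
frames_present = sum(iv["end_frame"] - iv["start_frame"] + 1 for iv in present)
```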
Method 3: Using as a Standalone Filter
```bash
# Set environment variables
export FILTER_TEXT_PROMPT="person"
export FILTER_CONFIDENCE_THRESHOLD=0.7
export FILTER_DEVICE=cuda
export FILTER_SOURCES="tcp://127.0.0.1:5555"
export FILTER_OUTPUTS="tcp://127.0.0.1:5556"

# Run the filter
filter-sam3-detector
```
Method 4: Using in Python Code
```python
from filter_sam3_detector import FilterSAM3Detector
from openfilter.filter_runtime.filter import Filter
from openfilter.filter_runtime.filters.video_in import VideoIn
from openfilter.filter_runtime.filters.recorder import Recorder

# Define pipeline
filters = [
    (VideoIn, {
        "sources": "file://input.mp4",
        "outputs": ["tcp://127.0.0.1:5555"],
    }),
    (FilterSAM3Detector, {
        "sources": "tcp://127.0.0.1:5555",
        "outputs": ["tcp://127.0.0.1:5556"],
        "text_prompt": "person",
        "confidence_threshold": 0.5,
        "device": "cuda",
    }),
    (Recorder, {
        "sources": "tcp://127.0.0.1:5556",
        "path": "detections.jsonl",
        "format": "jsonl",
    }),
]

# Run pipeline
Filter.run_multi(filters)
```
Temporal Interval Detection
Convert noisy per-frame detections into stable presence/absence intervals using EMA smoothing.
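The smoothing idea can be sketched in a few lines: each frame's binary detection is blended into an exponential moving average whose weight halves every `half_life` frames, and presence flips when the EMA crosses the threshold. This is illustrative only — the filter's actual update rule may differ in detail:

```python
def make_ema_tracker(half_life=5.0, presence_threshold=0.4):
    """Track presence with an EMA whose weight halves every `half_life` frames."""
    decay = 0.5 ** (1.0 / half_life)
    state = {"ema": 0.0}

    def update(detected):
        # Blend the new observation (1.0 if detected, else 0.0) into the EMA.
        signal = 1.0 if detected else 0.0
        state["ema"] = decay * state["ema"] + (1.0 - decay) * signal
        return state["ema"] >= presence_threshold

    return update

track = make_ema_tracker(half_life=5.0, presence_threshold=0.4)

# A single 1-frame blip is not enough to flip presence on...
blip = track(True)

# ...but a sustained run of detections is.
for _ in range(5):
    sustained = track(True)
```

This is why a brief false positive or a momentary occlusion does not fragment the output intervals.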
Quick Start (Docker - Recommended)
```bash
# Run the integrated pipeline (temporal intervals built into SAM3 detector).
# Set SAM3_DETECTOR_VERSION to pin to a specific release if you need
# reproducibility; defaults to `latest`.
cp your_video.mp4 data/sample-video.mp4
docker pull plainsightai/openfilter-sam3-detector:latest
FILTER_TEXT_PROMPT="person" docker compose up

# Intervals stream to output/intervals.json as detection progresses
```
Quick Start (Python Script)
```bash
# Run on any video with custom prompts
uv run python scripts/run_temporal_intervals.py video.mp4 \
    --prompts "person,hand,cup" \
    --output results.json
```
Integrated Mode (Recommended)
Enable temporal intervals directly in the SAM3 detector - no separate filter needed:
```python
from filter_sam3_detector import FilterSAM3Detector

# Single filter with integrated temporal tracking
pipeline = [
    (FilterSAM3Detector, {
        "text_prompt": "person",
        "output_label": "detections",
        # Integrated temporal intervals
        "enable_temporal_intervals": True,
        "temporal_streaming_mode": True,  # Emit incrementally
        "temporal_half_life": 5.0,
        "temporal_presence_threshold": 0.4,
        "temporal_output_json_path": "intervals.json",
    }),
]
```
Separate Filter Mode (Legacy)
For pipelines requiring separate filter stages:
```python
from filter_sam3_detector import FilterSAM3Detector
from filter_sam3_detector.temporal_intervals import TemporalIntervalFilter

# SAM3 detector -> Temporal interval filter
pipeline = [
    (FilterSAM3Detector, {
        "text_prompt": "person",
        "output_label": "detections",
    }),
    (TemporalIntervalFilter, {
        "detection_key": "detections",
        "half_life": 5.0,             # EMA responsiveness (frames)
        "presence_threshold": 0.4,    # Detection threshold
        "output_json_path": "intervals.json",
    }),
]
```
Output Format
```json
{
  "intervals": [
    {"start_frame": 20, "end_frame": 150, "label": "person", "present": true, "confidence": 0.92},
    {"start_frame": 151, "end_frame": 180, "label": "person", "present": false, "confidence": 0.15}
  ],
  "total_frames": 200
}
```
Configuration Options
| Option | Default | Description |
|---|---|---|
| `enable_temporal_intervals` | false | Enable integrated temporal tracking |
| `temporal_streaming_mode` | false | Emit intervals incrementally (vs. at end) |
| `temporal_half_life` | 5.0 | Frames for 50% EMA decay |
| `temporal_presence_threshold` | 0.4 | EMA score to trigger presence |
| `temporal_output_json_path` | None | Path to write intervals JSON |
| `temporal_emit_on_change` | true | Only emit when state changes |
Usage Scenarios
1. Person Detection
Set in .env: VIDEO_PATH, FILTER_TEXT_PROMPT=person, FILTER_OUTPUT_DIR, FILTER_CONFIDENCE_THRESHOLD=0.6. Then:
```bash
python scripts/filter_object_detection.py
```
2. Vehicle Detection
Set VIDEO_PATH, FILTER_TEXT_PROMPT=car, FILTER_OUTPUT_DIR, FILTER_RESIZE=480. Then run `python scripts/filter_object_detection.py`.
3. Detection with Reference Boxes
Use bounding boxes on the frame as positive/negative prompts (with or without text):
```bash
# In .env: VIDEO_PATH, FILTER_POSITIVE_BOXES='[[x,y,w,h],...]', FILTER_NEGATIVE_BOXES (optional), FILTER_TEXT_PROMPT (optional)
python scripts/filter_object_detection_exemplar.py
```
4. Pipeline Integration
Combine with other OpenFilter filters:
```python
from openfilter.filter_runtime.filter import Filter
from openfilter.filter_runtime.filters.video_in import VideoIn
from openfilter.filter_runtime.filters.resize import Resize
from openfilter.filter_runtime.filters.recorder import Recorder
from filter_sam3_detector import FilterSAM3Detector

filters = [
    (VideoIn, {"sources": "file://input.mp4"}),
    (Resize, {"width": 640, "height": 480}),  # Pre-processing
    (FilterSAM3Detector, {"text_prompt": "person"}),
    (Recorder, {"path": "output.jsonl"}),
]

Filter.run_multi(filters)
```
Output Format
Detections are stored in `frame.data['meta'][output_label]`:

```
[
  {
    "box": [x1, y1, x2, y2],  # Bounding box coordinates
    "score": 0.95,            # Confidence score (0.0-1.0)
    "mask": [[...]]           # Binary mask as 2D array (if output_masks=True)
  },
  ...
]
```
When using the Recorder filter, detections are saved in JSONL format:
```json
{
  "frame_id": 0,
  "meta": {
    "sam3_detections": [
      {
        "box": [100, 150, 200, 250],
        "score": 0.95,
        "mask": [[0, 0, 1, 1, ...]]
      }
    ]
  }
}
```
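Each JSONL line is an independent JSON record, so the file can be consumed line by line. A sketch of filtering for high-confidence detections, assuming records match the shape above:

```python
import json

# Stand-in for the contents of a recorder output file (one record per line).
jsonl = '{"frame_id": 0, "meta": {"sam3_detections": [{"box": [100, 150, 200, 250], "score": 0.95}]}}\n'

high_conf = []
for line in jsonl.splitlines():
    record = json.loads(line)
    # Collect (frame_id, box) pairs for detections above a score cutoff.
    for det in record["meta"].get("sam3_detections", []):
        if det["score"] >= 0.9:
            high_conf.append((record["frame_id"], det["box"]))
```

In a real pipeline the `jsonl` string would be replaced by iterating over `open("detections.jsonl")`.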
Performance Tips
Image Processing
- Resize Videos: Use `--resize 480` for faster processing
- Limit Detections: Reduce `FILTER_MAX_DETECTIONS` for better performance
- Disable Masks: Set `FILTER_OUTPUT_MASKS=false` to save memory
Device Selection
- Use GPU: Set `FILTER_DEVICE=cuda` for 10-50x speedup
- CPU Fallback: Automatically falls back to CPU if GPU unavailable
- Apple Silicon: Use `FILTER_DEVICE=mps` on macOS
Confidence Thresholds
- Text Prompts: Default `0.5` works well
- Exemplar-Based: Use `0.3` for better recall
- High Precision: Use `0.7` or higher
- High Recall: Use `0.3` or lower
Development
Project Structure
```
filter-sam3-detector/
├── filter_sam3_detector/
│   ├── __init__.py
│   └── filter.py                 # Main filter implementation
├── scripts/                      # Example usage scripts
│   ├── filter_object_detection.py           # Video pipeline (text prompt)
│   ├── filter_object_detection_exemplar.py  # Video pipeline (reference boxes + optional text)
│   └── run_temporal_intervals.py
├── examples/                     # Additional examples
│   └── detect_objects_video.py
├── docs/                         # Documentation
│   ├── API.md
│   ├── configuration.md
│   ├── advanced-usage.md
│   └── performance.md
├── tests/                        # Test files
│   ├── test_filter.py
│   └── test_integration.py
├── sam3/                         # Vendorized SAM3 library
├── env.example                   # Environment configuration example
└── pyproject.toml                # Project dependencies
```
Key Dependencies
- `openfilter[all]>=0.1.0` - Filter framework
- `torch>=2.0.0` - PyTorch for model inference
- `torchvision>=0.15.0` - Image processing
- `transformers>=4.40.0` - HuggingFace model loading
- `opencv-python>=4.8.0` - Image manipulation
- `pillow>=10.0.0` - Image processing
- `numpy>=1.24.0` - Numerical operations
Testing
```bash
# Run tests
make test

# Run tests with coverage (pass extra pytest args via PYTEST_ARGS)
make test PYTEST_ARGS="--cov=filter_sam3_detector --cov-report=term"

# Check code quality
make lint

# Format code
make format
```
Known Issues
Exemplar-Based Detection Not Working
Status: Bug in `_load_exemplar_images()` - backbone output format handling is incorrect.
Symptoms: When using exemplars_path, you may see warnings like:
```
WARNING Failed to load exemplar example.jpg: 'NoneType' object is not subscriptable
ERROR No exemplar images could be loaded
```
Root Cause: The code at filter.py:853-858 doesn't properly handle the SAM3 backbone output format. The backbone returns features in a different structure than expected.
Workaround: Use text prompts (text_prompt) instead of exemplar images until this is fixed.
Tracking: This issue affects the few-shot learning functionality. Text-based detection works correctly.
Troubleshooting
Model Loading Issues
Problem: Model fails to load or takes too long
Solutions:
- Ensure you have sufficient GPU memory (recommended: 8GB+)
- Use CPU mode if GPU is unavailable: `--device cpu`
- Check internet connection (model downloads from HuggingFace on first use)
- Verify CUDA installation: `nvidia-smi`
No Detections Found
Problem: Filter runs but finds no objects
Solutions:
- Lower confidence threshold: `--confidence 0.3`
- Try different text prompts (be more specific or more general)
- For exemplar-based: ensure exemplar images are clear and representative
- Check that the input video contains the objects you're looking for
Out of Memory Errors
Problem: CUDA out of memory errors
Solutions:
- Resize input: `--resize 480`
- Reduce max detections: `export FILTER_MAX_DETECTIONS=50`
- Disable masks: `export FILTER_OUTPUT_MASKS=false`
- Use CPU mode: `--device cpu` (slower but uses less memory)
Import Errors
Problem: `ImportError: cannot import name 'FilterSAM3Detector'`
Solutions:
- Ensure the package is installed: `uv pip install -e .`
- Check Python version (requires 3.10+)
- Verify all dependencies are installed
- Reinstall: `uv pip install -e . --force-reinstall`
Slow Processing
Problem: Processing is very slow
Solutions:
- Use GPU: `--device cuda`
- Resize videos: `--resize 480`
- Reduce max detections
- Disable masks if not needed
- Process fewer frames (use sample rate in video input)
Performance Optimization
To improve processing speed:
- Use GPU acceleration (`FILTER_DEVICE=cuda`)
- Resize inputs to an appropriate resolution (`--resize 480`)
- Limit detections (`FILTER_MAX_DETECTIONS=50`)
- Disable unused outputs (masks if not needed)
- Use a smaller model variant (if available)
Documentation
For more detailed information, configuration examples, and advanced usage scenarios, see the comprehensive documentation:
- Installation Guide - Detailed installation instructions
- Quick Start Guide - Get started in minutes
- API Reference - Complete API documentation
- Configuration Guide - Configuration options
- Advanced Usage - Advanced patterns and examples
- Performance Tuning - Optimization guide
- Scripts Documentation - Example scripts usage
License
This project uses dual licensing. The filter wrapper code is licensed under Apache 2.0, and the vendorized SAM3 library (sam3/) is licensed under the SAM License, which includes trade control restrictions. See LICENSING.md for full details.