Local VLM inference engine for video — Apple Silicon, NVIDIA, and CPU

TrioCore

Real-time Vision Intelligence Engine for Apple Silicon

YOLO object detection + VLM scene understanding. One pip install, zero Docker.

Quick Start | Install | API | CLI | SDK | Benchmarks | Architecture | Troubleshooting


What is TrioCore?

Point it at any image, video, or camera and it will detect objects, count people, and describe scenes — all running locally on your Mac, no cloud APIs needed.

Core capabilities:

  • Detect — Find and count objects (people, cars, etc.) in images
  • Describe — Get natural language descriptions of what's happening in a scene
  • Crop-Describe — Detect objects, then describe each one individually
  • REST API — Built-in web server on port 8100 with interactive docs
  • CLI — Simple commands: trio serve, trio analyze, trio webcam

New to computer vision? Key terms explained:

| Term | What it means |
|------|---------------|
| YOLO | "You Only Look Once", a fast object detection model that finds and labels objects in images |
| VLM | Vision Language Model, an AI model that can look at an image and describe it in natural language |
| MLX | Apple's machine learning framework, optimized for M1/M2/M3/M4 chips |
| ONNX | A standard format for ML models that runs on a wide range of hardware |
| ToMe | Token Merging, a technique that makes VLM inference faster by merging similar visual tokens |
| KV cache | Cached attention keys and values; reusing them speeds up processing of sequential video frames |

Quick Start

# 1. Install (Apple Silicon Mac recommended)
pip install 'trio-core[mlx]'

# 2. Check your setup
trio doctor

# 3. Start the server
trio serve

First run note: The first time you run trio serve or trio analyze, the model is downloaded automatically (~2 GB for the default 3B model). This takes 5-20 minutes depending on your connection. Later runs reuse the cached model and skip the download.

Once the server is running, open http://localhost:8100/docs in your browser to explore the API interactively, or try it from the terminal:

# In another terminal — grab any image and detect objects in it
# macOS:
curl -X POST http://localhost:8100/api/inference/detect \
  -H "Content-Type: application/json" \
  -d '{"image_b64": "'$(base64 -i your-photo.jpg)'"}'

# Linux:
curl -X POST http://localhost:8100/api/inference/detect \
  -H "Content-Type: application/json" \
  -d '{"image_b64": "'$(base64 -w0 your-photo.jpg)'"}'

Example response:

{
  "people_count": 3,
  "vehicle_count": 1,
  "by_class": {"person": 3, "car": 1},
  "crops_b64": [{"class": "person", "bbox": [100, 50, 200, 300], "confidence": 0.92}],
  "elapsed_ms": 45
}
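
The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names build_detect_request and detect are illustrative and not part of TrioCore.

```python
import base64
import json
import urllib.request


def build_detect_request(image_bytes, url="http://localhost:8100/api/inference/detect"):
    """Build a JSON POST request that carries the image as base64, like the curl call."""
    payload = {"image_b64": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect(image_path):
    """Send one image to a running `trio serve` instance and return the parsed JSON."""
    with open(image_path, "rb") as f:
        request = build_detect_request(f.read())
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the server running, detect("your-photo.jpg") should return a dictionary shaped like the example response above.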

Or analyze an image directly from the CLI (no server needed):

trio analyze your-photo.jpg -q "How many people are in this image?"

See more in examples/quickstart.py (5 lines) and api_client.py (full API usage).


Install

Requires Python 3.10+.

# Apple Silicon Mac (M1/M2/M3/M4) — recommended, uses Apple's MLX framework
pip install 'trio-core[mlx]'

# Apple Silicon + webcam monitoring
pip install 'trio-core[mlx,webcam]'

# NVIDIA GPU or CPU-only (uses PyTorch/Transformers instead of MLX)
pip install 'trio-core[transformers]'

# For IP/RTSP camera support (macOS)
brew install ffmpeg

Which install do I pick? If you have a Mac with Apple Silicon (2020 or later), use [mlx]. If you have an NVIDIA GPU or a CPU-only machine (for example, a typical Linux server), use [transformers]. Not sure? Run trio device after install to see what hardware was detected.


API Reference

Tip: Once the server is running, visit http://localhost:8100/docs for interactive API documentation where you can try every endpoint from your browser.

Start the server:

trio serve                          # default: 0.0.0.0:8100
trio serve --port 9000              # custom port
TRIO_API_KEY=secret trio serve      # enable Bearer token auth
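
When TRIO_API_KEY is set, clients need to send the key with each request. A minimal sketch, assuming the server expects the conventional "Authorization: Bearer <key>" header (the text above only says "Bearer token auth"). authorized_request is a hypothetical helper, not a TrioCore function.

```python
import json
import os
import urllib.request


def authorized_request(url, payload, api_key=None):
    """Build a JSON POST request, adding a Bearer token when a key is available."""
    # Assumption: the client reuses the TRIO_API_KEY value the server was started with.
    api_key = api_key or os.environ.get("TRIO_API_KEY")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: standard Bearer scheme in the Authorization header.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

The returned object can be passed straight to urllib.request.urlopen.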

POST /api/inference/detect

Run YOLO object detection. Returns counts and bounding boxes.

curl -X POST http://localhost:8100/api/inference/detect \
  -H "Content-Type: application/json" \
  -d '{"image_b64": "<base64 jpeg>", "pad_ratio": 0.15}'

Response:

{
  "people_count": 2,
  "vehicle_count": 1,
  "by_class": {"person": 2, "car": 1},
  "crops_b64": [
    {"class": "person", "bbox": [100, 50, 200, 300], "confidence": 0.92},
    {"class": "car", "bbox": [400, 200, 600, 350], "confidence": 0.87}
  ],
  "elapsed_ms": 42
}
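
A short sketch of how a client might post-process this response. It assumes each bbox is [x1, y1, x2, y2] in pixels, which the example values suggest but the text does not state. summarize_detections is an illustrative helper.

```python
def summarize_detections(response, min_confidence=0.5):
    """Return (class, width, height, confidence) for each sufficiently confident detection."""
    summary = []
    for det in response.get("crops_b64", []):
        if det["confidence"] < min_confidence:
            continue
        # Assumption: bbox is [x1, y1, x2, y2] in pixel coordinates.
        x1, y1, x2, y2 = det["bbox"]
        summary.append((det["class"], x2 - x1, y2 - y1, det["confidence"]))
    return summary


example = {
    "people_count": 2,
    "vehicle_count": 1,
    "by_class": {"person": 2, "car": 1},
    "crops_b64": [
        {"class": "person", "bbox": [100, 50, 200, 300], "confidence": 0.92},
        {"class": "car", "bbox": [400, 200, 600, 350], "confidence": 0.87},
    ],
    "elapsed_ms": 42,
}

print(summarize_detections(example, min_confidence=0.9))
# prints [('person', 100, 250, 0.92)]
```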

POST /api/inference/describe

Run the VLM on an image. Returns a natural-language description.

curl -X POST http://localhost:8100/api/inference/describe \
  -H "Content-Type: application/json" \
  -d '{"image_b64": "<base64 jpeg>", "prompt": "Describe what you see."}'

Response:

{
  "description": "A woman in a red jacket is walking a golden retriever along a tree-lined sidewalk.",
  "elapsed_ms": 380
}

POST /api/inference/crop-describe

Combined pipeline: YOLO detects objects and crops them, then the VLM describes each entity individually before generating a full scene description.

curl -X POST http://localhost:8100/api/inference/crop-describe \
  -H "Content-Type: application/json" \
  -d '{
    "image_b64": "<base64 jpeg>",
    "crops": [
      {"class": "person", "bbox": [100, 50, 200, 300], "confidence": 0.92}
    ],
    "max_crops": 3
  }'

Response:

{
  "description": "1 person: male 30s, blue polo, carrying laptop bag",
  "entities": {"persons": [...], "vehicles": [...]},
  "crop_descriptions": ["person: male 30s, blue polo, carrying laptop bag"],
  "elapsed_ms": 520
}
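
The request body above uses the same crop shape that /api/inference/detect returns, so the two calls can be chained. A minimal sketch, assuming the "crops" field accepts entries copied from the detect response's "crops_b64" list. Trimming to the most confident detections on the client is a choice made for this example, not documented server behavior.

```python
def build_crop_describe_payload(image_b64, detect_response, max_crops=3):
    """Turn a /detect response into a /crop-describe request body."""
    # Keep the most confident detections first so the limit drops the weakest ones.
    crops = sorted(
        detect_response.get("crops_b64", []),
        key=lambda det: det["confidence"],
        reverse=True,
    )
    return {
        "image_b64": image_b64,
        "crops": crops[:max_crops],
        "max_crops": max_crops,
    }
```

The returned dictionary can be serialized with json.dumps and posted to /api/inference/crop-describe.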

GET /api/inference/status

Check which models are loaded.

GET /health

Health check with uptime.
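
Scripts that start the server and then send requests can poll /health first. A sketch under the assumption that a healthy server answers with HTTP 200; the response body is not inspected because its fields are not listed here. wait_until_healthy and its probe parameter are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url="http://localhost:8100/health", timeout_s=60.0,
                       interval_s=1.0, probe=None):
    """Poll the health endpoint until it succeeds or the timeout expires."""

    def http_probe():
        # Any connection error or non-2xx status counts as "not ready yet".
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or http_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Because models are loaded lazily on the first request, a healthy server may still take longer on its first inference call.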


CLI

trio doctor                             # Check setup — run this first!
trio device                             # Show your hardware + recommended model
trio serve                              # Start inference API server (port 8100)
trio analyze photo.jpg -q "What's here?" # Analyze an image (no server needed)
trio analyze video.mp4 -q "Describe"    # Video analysis
trio webcam -w "a person is waving"     # Live webcam monitor with alerts
trio cam --host 192.168.1.100 -p pass   # IP camera monitor
trio bench video.mp4 -n 5              # Benchmark inference speed

trio analyze

trio analyze photo.jpg -q "How many people are in this image?"
trio analyze video.mp4 -q "Describe the scene" --json    # JSON output with metrics
trio analyze photo.jpg -m mlx-community/Qwen2.5-VL-7B-Instruct-4bit  # specific model
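
The --json flag makes the CLI easy to call from scripts. A minimal sketch, assuming that with --json the command writes a single JSON document to stdout and that -q, -m and --json can be combined (they are shown separately above). The helper names are illustrative.

```python
import json
import subprocess


def analyze_command(path, question, model=None):
    """Assemble the `trio analyze` argument list from the flags shown above."""
    command = ["trio", "analyze", path, "-q", question, "--json"]
    if model:
        command += ["-m", model]
    return command


def run_analyze(path, question, model=None):
    """Run the CLI and parse its stdout as JSON."""
    # Assumption: stdout holds exactly one JSON document when --json is given.
    completed = subprocess.run(
        analyze_command(path, question, model),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```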

trio webcam

Live camera monitor with VLM-based alerting. Green = clear, red = alert with audio.

trio webcam -w "someone at the door"         # Built-in webcam
trio webcam -s 1 -w "package on doorstep"    # iPhone Continuity Camera
trio webcam --count                          # Count objects (cumulative)

Python SDK

from trio_core import TrioCore, EngineConfig

# Load with defaults (auto-selects best model for your hardware)
engine = TrioCore()
engine.load()

# Analyze an image or video
result = engine.analyze_video("photo.jpg", "What do you see?")
print(result.text)
print(f"{result.metrics.latency_ms:.0f}ms | {result.metrics.tokens_per_sec:.0f} tok/s")

Configuration

config = EngineConfig(
    model="mlx-community/Qwen2.5-VL-3B-Instruct-4bit",
    tome_enabled=True,       # Token Merging: up to 73% fewer visual tokens
    tome_r=4,
)
engine = TrioCore(config)

Or via environment variables:

TRIO_MODEL=mlx-community/Qwen2.5-VL-3B-Instruct-4bit
TRIO_TOME_ENABLED=true
TRIO_TOME_R=4
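
To get a feel for what tome_r controls, the sketch below counts tokens under the scheme from the ToMe paper, where r tokens are merged in every ViT block and bipartite matching caps a single block at half of the current tokens. Whether TrioCore applies merging in every block is an assumption, and the token and block counts are hypothetical.

```python
def tokens_after_tome(num_tokens, r, num_blocks):
    """Count visual tokens left after merging up to r tokens in each ViT block."""
    remaining = num_tokens
    for _ in range(num_blocks):
        # Bipartite matching can merge at most half of the current tokens per block.
        merged = min(r, remaining // 2)
        remaining -= merged
    return remaining


# Hypothetical example: 256 visual tokens, r=4, 32 ViT blocks.
print(tokens_after_tome(256, 4, 32))
# prints 128
```

The reduction depends on how many tokens the image produces and how deep the ViT is, so the same r gives different percentages for different inputs.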

Supported Models

Tier 1 — Full optimization (native loading + visual token compression + KV reuse)

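| Model | Params | 4-bit VRAM (GB) | ToMe | Compressed | KV Reuse |
|-------|--------|-----------------|------|------------|----------|
| Qwen2.5-VL | 3B, 7B | 1.8-4.5 | yes | yes | yes |
| Qwen3-VL | 2B, 4B, 8B | 1.5-5.0 | -- | yes | yes |
| Qwen3.5 | 0.8-9B | 0.5-5.0 | yes | yes | yes |
| InternVL3 | 1B, 2B | 1.0-1.6 | -- | yes | yes |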

Tier 2 — Inference only (via mlx-vlm)

Gemma 3n, SmolVLM2, Phi-4, FastVLM, and any model supported by mlx-vlm.


Benchmarks

All benchmarks were run on an Apple M3 Ultra with 4-bit quantized models. Accuracy is hardware-independent.

Inference Latency (POPE benchmark, ms/sample)

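| Model | Params | Baseline | Compressed 50% | Speedup |
|-------|--------|----------|----------------|---------|
| Qwen3.5-0.8B | 0.8B | 148ms | 135ms | 1.09x |
| Qwen3.5-2B | 2B | 251ms | 221ms | 1.14x |
| Qwen3-VL-2B | 2B | 275ms | 223ms | 1.23x |
| Qwen2.5-VL-3B | 3B | 354ms | 279ms | 1.27x |
| Qwen2.5-VL-7B | 7B | 522ms | 384ms | 1.36x |
| Qwen3-VL-8B | 8B | 633ms | 503ms | 1.26x |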
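
The speedup column is the ratio of baseline to compressed latency. The sketch below recomputes it for two rows and converts latency to throughput, assuming samples are processed strictly one after another.

```python
def speedup(baseline_ms, compressed_ms):
    """Ratio of baseline latency to compressed latency."""
    return baseline_ms / compressed_ms


def samples_per_second(latency_ms):
    """Throughput if samples are processed sequentially."""
    return 1000.0 / latency_ms


rows = {"Qwen2.5-VL-3B": (354, 279), "Qwen2.5-VL-7B": (522, 384)}
for model, (baseline, compressed) in rows.items():
    print(f"{model}: {speedup(baseline, compressed):.2f}x, "
          f"{samples_per_second(compressed):.1f} samples/s")
# prints:
# Qwen2.5-VL-3B: 1.27x, 3.6 samples/s
# Qwen2.5-VL-7B: 1.36x, 2.6 samples/s
```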

Frame-to-Frame KV Cache Reuse

| Model | Speedup | Method |
|-------|---------|--------|
| Qwen3-VL-4B | 1.71x | KV cache reuse |
| Qwen2.5-VL-3B | 1.57x | KV cache reuse |
| Qwen3.5-0.8B | 1.35x | DeltaNet state snapshot |

Overhead vs raw mlx-vlm

| Metric | mlx-vlm | trio-core | Delta |
|--------|---------|-----------|-------|
| Prefill | 1018ms | 1016ms | -0.2% |
| Decode | 524ms | 513ms | -2.1% |
| Output | -- | bit-identical | -- |

Full accuracy benchmarks (11 models x 6 benchmarks)

POPE — Object Hallucination (100 samples)

| Model | Baseline | Compressed 50% |
|-------|----------|----------------|
| InternVL3-2B | 95% | 94% |
| Qwen2.5-VL-3B | 94% | 75% |
| Qwen3.5-2B | 94% | 93% |
| Qwen3-VL-8B | 91% | 93% |

TextVQA — OCR Reading (50 samples)

| Model | Baseline | Compressed 50% |
|-------|----------|----------------|
| Qwen3.5-2B | 80% | 74% |
| InternVL3-2B | 78% | 72% |
| Qwen3-VL-2B | 76% | 76% |

GQA — Visual Reasoning (50 samples)

| Model | Baseline | Compressed 50% |
|-------|----------|----------------|
| Qwen3.5-2B | 68% | 68% |
| InternVL3-2B | 66% | 66% |
| Qwen3.5-4B | 58% | 64% |

MMBench — Multi-ability (50 samples)

| Model | Baseline | Compressed 50% |
|-------|----------|----------------|
| InternVL3-2B | 98% | 96% |
| Qwen2.5-VL-7B | 96% | 94% |
| Qwen3.5-9B | 96% | 96% |

SurveillanceVQA — Anomaly Detection (1,827 samples)

| Model | Accuracy | F1 | Recall |
|-------|----------|----|--------|
| Qwen2.5-VL-7B | 70.1% | 0.362 | 25.3% |
| Qwen3-VL-8B | 69.0% | 0.395 | 30.2% |
| Qwen3.5-4B | 65.2% | 0.556 | 65.1% |
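
The table reports F1 and recall but not precision. If F1 here is the usual harmonic mean of precision and recall for the anomaly class (an assumption; the averaging method is not stated), precision can be recovered as P = F1 * R / (2R - F1). The results are approximate because the inputs are rounded.

```python
def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 and recall are inconsistent with a valid precision")
    return f1 * recall / denominator


# (F1, recall) pairs copied from the SurveillanceVQA table.
rows = {
    "Qwen2.5-VL-7B": (0.362, 0.253),
    "Qwen3-VL-8B": (0.395, 0.302),
    "Qwen3.5-4B": (0.556, 0.651),
}
for model, (f1, recall) in rows.items():
    print(f"{model}: precision ~ {precision_from_f1(f1, recall):.2f}")
# prints:
# Qwen2.5-VL-7B: precision ~ 0.64
# Qwen3-VL-8B: precision ~ 0.57
# Qwen3.5-4B: precision ~ 0.49
```

Under that assumption, the models with the highest accuracy flag few anomalies but are more often right when they do, while Qwen3.5-4B catches far more anomalies at lower precision.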

Architecture

                           TrioCore
                              |
              +---------------+---------------+
              |                               |
         YOLO Pipeline                   VLM Pipeline
              |                               |
    YOLOv10n ONNX model              Qwen/InternVL (MLX)
    tiled 2x2 detection              native model loading
    ByteTrack tracking               ToMe token compression
              |                       KV cache reuse
              |                               |
              +---------------+---------------+
                              |
                    FastAPI Server (:8100)
                              |
              +-------+-------+-------+
              |       |       |       |
          /detect  /describe  /crop   /status
                              -describe
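
The "tiled 2x2 detection" step in the diagram can be pictured as follows. This is a generic sketch of tiling, not TrioCore's implementation: it assumes non-overlapping tiles and [x1, y1, x2, y2] boxes, and it leaves out merging duplicate detections across tile borders. Both function names are hypothetical.

```python
def tile_boxes(width, height, rows=2, cols=2):
    """Split an image into rows x cols tiles, each returned as (x1, y1, x2, y2)."""
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x1, x2 = c * width // cols, (c + 1) * width // cols
            y1, y2 = r * height // rows, (r + 1) * height // rows
            tiles.append((x1, y1, x2, y2))
    return tiles


def to_full_image(bbox, tile):
    """Shift a tile-local bbox back into full-image coordinates."""
    x1, y1, x2, y2 = bbox
    offset_x, offset_y = tile[0], tile[1]
    return (x1 + offset_x, y1 + offset_y, x2 + offset_x, y2 + offset_y)


tiles = tile_boxes(1920, 1080)
print(tiles[3])
# prints (960, 540, 1920, 1080)
print(to_full_image((10, 20, 110, 220), tiles[3]))
# prints (970, 560, 1070, 760)
```

Running the detector on each tile gives small objects more pixels than a single downscaled pass over the whole frame would.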

Key design decisions

  • No ultralytics — YOLOv10 loaded via ONNX Runtime (MIT license)
  • Native VLM loading — Vendored model code (~3600 lines), bit-identical with mlx-vlm, zero overhead
  • Visual token compression — ToMe merges similar visual tokens in the ViT, reducing prefill by up to 73%
  • KV cache reuse — For sequential frames, reuse KV cache from previous frame (1.7x speedup)
  • Lazy loading — Models loaded on first request, not at server start

Configuration

All settings via environment variables or EngineConfig:

| Variable | Default | Description |
|----------|---------|-------------|
| TRIO_MODEL | Auto-detected | HuggingFace model ID |
| TRIO_TOME_ENABLED | false | Enable Token Merging |
| TRIO_TOME_R | 4 | Tokens merged per ViT block |
| TRIO_COMPRESS_ENABLED | false | Enable visual token compression |
| TRIO_COMPRESS_RATIO | 0.5 | Compression ratio |
| TRIO_API_KEY | (none) | Bearer token for API auth |
| TRIO_YOLO_MODEL | (bundled) | Path to YOLO ONNX model |

See src/trio_core/config.py for all options.
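
For scripts that need to see the same settings, the sketch below reads a few of these variables with the defaults from the table. The accepted boolean spellings and the dictionary keys are assumptions made for this example; src/trio_core/config.py remains the authoritative source.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "TRIO_TOME_ENABLED": "false",
    "TRIO_TOME_R": "4",
    "TRIO_COMPRESS_ENABLED": "false",
    "TRIO_COMPRESS_RATIO": "0.5",
}


def _as_bool(value):
    # Assumption: these spellings count as true; TrioCore's own parser may differ.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def read_settings(environ=None):
    """Read the tuning variables from the environment, falling back to the defaults."""
    environ = os.environ if environ is None else environ

    def get(name):
        return environ.get(name, DEFAULTS[name])

    return {
        "tome_enabled": _as_bool(get("TRIO_TOME_ENABLED")),
        "tome_r": int(get("TRIO_TOME_R")),
        "compress_enabled": _as_bool(get("TRIO_COMPRESS_ENABLED")),
        "compress_ratio": float(get("TRIO_COMPRESS_RATIO")),
    }
```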


OpenClaw Integration

TrioCore can connect to an OpenClaw Gateway as a node for remote camera monitoring via WebSocket.

pip install 'trio-core[claw]'
trio claw --pair -g ws://gateway:18789 --token <secret>
trio claw -g ws://gateway:18789 -c "rtsp://admin:pass@camera/stream"

Troubleshooting

| Problem | Solution |
|---------|----------|
| trio serve hangs on first run | It's downloading the model (~2 GB). Wait for it to finish. Check progress with ls -la ~/.cache/huggingface/ |
| ModuleNotFoundError: mlx | You installed without the [mlx] extra. Run pip install 'trio-core[mlx]' |
| Server starts but curl returns errors | Make sure you're using port 8100 (not 8000). Check with curl http://localhost:8100/health |
| trio analyze says "no model found" | Run trio doctor to check your setup and see which models are available |
| Out of memory on large images | Try a smaller model: trio serve defaults to a 3B model (~2 GB RAM). The 7B model needs ~5 GB |
| Webcam not detected | On macOS, grant Terminal camera access in System Settings > Privacy > Camera |

Run trio doctor to diagnose most issues — it checks Python version, dependencies, hardware, and available models.
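
To watch a first-run download without reading ls output, the cache size can be summed from Python. A minimal sketch, assuming the model lands under ~/.cache/huggingface as the table above suggests. directory_size_bytes is an illustrative helper.

```python
import os

# Assumption: default Hugging Face cache location, as referenced in the table above.
DEFAULT_CACHE = os.path.expanduser("~/.cache/huggingface")


def directory_size_bytes(root):
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # The Hugging Face cache symlinks snapshot files to blobs;
            # skipping links avoids counting the same data twice.
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # A temporary file may be renamed or removed mid-download.
                continue
    return total
```

Calling directory_size_bytes(DEFAULT_CACHE) a few seconds apart shows whether the download is still making progress.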


References

  • ToMe — Bolya et al., "Token Merging: Your ViT But Faster", ICLR 2023. arXiv:2210.09461
  • StreamMem — Du et al., "Streaming KV Cache Management for Video Understanding", 2025. arXiv:2504.08498
  • SurveillanceVQA-589K — Zheng et al., 2025. arXiv:2505.12589

License

Apache 2.0 — see LICENSE.
