Local VLM inference engine for video — Apple Silicon, NVIDIA, and CPU

TrioCore

The fastest local VLM inference engine for Apple Silicon

73% faster prefill · 1.7x frame-to-frame · 68% fewer tokens · Zero Docker

Install

# Apple Silicon (M1-M4)
pipx install 'trio-core[mlx]'         # CLI tool (recommended)
pip install 'trio-core[mlx]'          # or as library in your project

# NVIDIA / CPU
pipx install 'trio-core[transformers]'
pip install 'trio-core[transformers]'

# With webcam/camera support
pipx install 'trio-core[mlx,gui]'

Quick Start

CLI

# Check your hardware
trio device

# Analyze a video
trio analyze video.mp4 -q "What is happening?"

# Live camera monitor — define what to watch in plain English
trio webcam -w "a person is waving"

# Start the API server
trio serve

Live Camera Monitor

# Default: detect if someone is holding something
trio webcam

# Custom watch conditions — just describe what to look for
trio webcam -w "a person is waving"
trio webcam -w "no safety helmet"
trio webcam -w "package missing from doorstep"
trio webcam -w "someone entered the restricted area"

# iPhone as camera (macOS Continuity Camera)
trio webcam -s 1 -w "a person is waving"

# IP camera via RTSP
trio webcam -s "rtsp://admin:pass@192.168.1.100:554/stream" -w "intruder detected"

The monitor auto-calibrates the input resolution to target roughly 500 ms per inference on any Mac. A green overlay means the scene is clear; red means an alert, accompanied by an audio notification. No ML training is needed: just describe what to monitor.
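One way such a calibration step could work is sketched below. This is not TrioCore's actual algorithm: `measure_latency_ms` is a hypothetical callback that runs one inference at a given frame width, and the candidate widths and the fake latency model are illustrative assumptions.

```python
from typing import Callable, Sequence


def calibrate_width(
    measure_latency_ms: Callable[[int], float],
    candidates: Sequence[int] = (1280, 960, 640, 480, 320),  # illustrative widths
    target_ms: float = 500.0,
) -> int:
    """Return the largest candidate width whose measured latency meets the target.

    Falls back to the smallest candidate when none is fast enough.
    """
    widths = sorted(candidates, reverse=True)
    for width in widths:
        if measure_latency_ms(width) <= target_ms:
            return width
    return widths[-1]


# Stand-in for a real timing run (hypothetical): latency grows with pixel count
# for a 16:9 frame.
def fake_latency_ms(width: int) -> float:
    return 0.002 * width * width * 9 / 16


chosen = calibrate_width(fake_latency_ms)
print(chosen)
```

Trying the largest resolution first keeps as much visual detail as the latency budget allows.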

Python

from trio_core import TrioCore

engine = TrioCore()
engine.load()

result = engine.analyze_video("clip.mp4", "What is happening?")
print(result.text)  # "A person is walking through the parking lot..."
print(f"{result.metrics.latency_ms:.0f}ms, {result.metrics.tokens_per_sec:.0f} tok/s")

Auto-optimize (default)

TrioCore automatically applies benchmark-proven optimizations based on the loaded model. No configuration needed — just load and go.

engine = TrioCore()
engine.load()  # auto-applies optimal compression for your model

To disable auto-optimization, pass EngineConfig(auto_optimize=False) to TrioCore.

Manual optimizations

from trio_core import TrioCore, EngineConfig

config = EngineConfig(
    tome_enabled=True,    # merge visual tokens inside ViT (-68% tokens)
    tome_r=4,
    tome_metric="hidden",
)
engine = TrioCore(config)
engine.load()
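For intuition, the token merging that `tome_enabled` turns on can be sketched in plain Python. This is a simplified version of ToMe-style bipartite matching, not TrioCore's implementation: it assumes `r` is the number of tokens merged in one step and that similarity is cosine similarity on token hidden states. It also skips the size-weighted averaging of the full method and does not preserve token order.

```python
import math
from typing import List

Vector = List[float]


def _cosine(u: Vector, v: Vector) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def tome_merge(tokens: List[Vector], r: int) -> List[Vector]:
    """Merge the r most similar (A -> B) token pairs; returns len(tokens) - r tokens."""
    # Alternate tokens into two sets so merges only happen across sets.
    set_a, set_b = tokens[0::2], tokens[1::2]
    r = max(0, min(r, len(set_a)))
    if r == 0 or not set_b:
        return [list(t) for t in tokens]

    # For every token in A, find its most similar partner in B.
    edges = []  # (similarity, index in A, index in B)
    for i, a in enumerate(set_a):
        sims = [_cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        edges.append((sims[j], i, j))

    # Keep only the r strongest edges; those A tokens are merged away.
    edges.sort(key=lambda e: e[0], reverse=True)
    merged = {i: j for _, i, j in edges[:r]}

    # Average each B token with the A tokens merged into it.
    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for i, j in merged.items():
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1
    merged_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]

    # Unmerged A tokens first, then the (possibly averaged) B tokens.
    kept_a = [list(a) for i, a in enumerate(set_a) if i not in merged]
    return kept_a + merged_b


out = tome_merge([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]], r=1)
print(len(out))  # one token fewer than the input
```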

API Server

trio serve --port 8000

Analyze a frame

curl -X POST http://localhost:8000/analyze-frame \
  -H "Content-Type: application/json" \
  -d '{"frame_b64": "<base64 jpeg>", "question": "Is there a person at the door?"}'

{"answer": "Yes, there is a person standing at the door.", "triggered": true, "latency_ms": 487}
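The same request can be sent from Python with only the standard library. The sketch below assumes the server runs on `localhost:8000` as in the curl example and that `frame_b64` is plain base64 of the JPEG bytes; the helper names `build_frame_request` and `analyze_frame` are illustrative and not part of trio-core.

```python
import base64
import json
import urllib.request


def build_frame_request(
    jpeg_bytes: bytes,
    question: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST request for /analyze-frame from raw JPEG bytes."""
    payload = {
        "frame_b64": base64.b64encode(jpeg_bytes).decode("ascii"),
        "question": question,
    }
    return urllib.request.Request(
        f"{base_url}/analyze-frame",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_frame(jpeg_bytes: bytes, question: str, timeout: float = 30.0) -> dict:
    """Send one frame to a running `trio serve` instance and return the parsed JSON."""
    request = build_frame_request(jpeg_bytes, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs a running server and a real JPEG file):
# with open("frame.jpg", "rb") as f:
#     result = analyze_frame(f.read(), "Is there a person at the door?")
# print(result["answer"], result["triggered"], result["latency_ms"])
```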

Analyze a video

curl -X POST http://localhost:8000/v1/video/analyze \
  -H "Content-Type: application/json" \
  -d '{"video": "video.mp4", "prompt": "What is happening?"}'

OpenAI-compatible chat

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": [
      {"type": "video", "video": "video.mp4"},
      {"type": "text", "text": "What is happening?"}
    ]}]
  }'

All endpoints

Endpoint	Method	Description
`/healthz`	GET	Health check
`/health`	GET	Detailed status + config
`/analyze-frame`	POST	Single frame: `{frame_b64, question}` → `{answer, triggered, latency_ms}`
`/v1/video/analyze`	POST	Video file analysis with metrics
`/v1/frames/analyze`	POST	Multi-frame upload (multipart)
`/v1/chat/completions`	POST	OpenAI-compatible (streaming SSE)
`/v1/models`	GET	Loaded model info
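When streaming from `/v1/chat/completions`, a client has to reassemble the text from SSE lines. The parser below is a minimal sketch that assumes the chunks follow the usual OpenAI streaming shape (`choices[].delta.content`, terminated by `data: [DONE]`). The README only states that the endpoint is OpenAI-compatible, so check the exact chunk layout against a real response.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample stream (assumed format, not captured from a real server).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "A person "}}]}',
    'data: {"choices": [{"delta": {"content": "is walking."}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))
```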

How It Works

TrioCore optimizes every stage of VLM inference. Each technique is independent and they compound:

Video → [Dedup] → [Motion Gate] → [ViT + ToMe] → [LLM + FastV] → [KV Reuse] → Answer
         -50%       -80% calls     -68% tokens    -50% tokens     1.7x reuse
        frames      when static    in encoder      in LLM         across frames

Stage	Technique	What it does	Effect
Pre-inference	Temporal dedup	Skip near-identical frames (L2 on 64x64)	-50% frames
Pre-inference	Motion gate	Skip VLM entirely when scene is static	-80% calls
Vision encoder	ToMe	Merge similar visual tokens between ViT blocks	-73% prefill
LLM layers	FastV	Prune low-attention visual tokens from KV cache	-50% tokens
Cross-frame	KV Reuse	Reuse KV cache when frames are visually similar	1.7x speedup
Long video	StreamMem	Bounded KV cache with saliency eviction	constant memory
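As an illustration of the first row, temporal dedup can be sketched as an L2 distance check between small grayscale thumbnails. The code assumes each frame has already been downsampled (for example to 64x64) and flattened to floats in [0, 1]; the threshold value and function names are hypothetical, not TrioCore's.

```python
import math
from typing import Iterable, Iterator, Sequence

Thumb = Sequence[float]  # flattened grayscale thumbnail, e.g. 64 * 64 values


def l2_distance(a: Thumb, b: Thumb) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dedup_indices(thumbs: Iterable[Thumb], threshold: float) -> Iterator[int]:
    """Yield indices of frames that differ enough from the last kept frame."""
    last_kept = None
    for index, thumb in enumerate(thumbs):
        if last_kept is None or l2_distance(thumb, last_kept) > threshold:
            last_kept = thumb
            yield index


# Tiny 4-pixel "thumbnails" keep the example readable.
frames = [[0.0] * 4, [0.01] * 4, [0.5] * 4, [0.5] * 4, [0.9] * 4]
kept = list(dedup_indices(frames, threshold=0.1))
print(kept)
```

Comparing against the last kept frame, rather than the immediately preceding one, stops slow drift from slipping through as a series of small changes.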

Benchmarks

Apple M3 Ultra, 4-bit quantized. Accuracy is hardware-independent (bit-identical output on any Apple Silicon). Latency scales proportionally across devices.

POPE — Object Hallucination (100 samples, yes/no)

Model	Params	Baseline	ToMe r=4	Compressed 50%	FastV
InternVL3-2B	2B	95%	—	94% (-1)	—
Qwen2.5-VL-3B	3B	94%	91% (-3)	75% (-19)	92% (-2)
Qwen3.5-2B	2B	94%	93% (-1)	93% (-1)	—
InternVL3-1B	1B	93%	—	94% (+1)	—
Qwen3.5-0.8B	0.8B	93%	94% (+1)	93% (0)	—
Qwen3-VL-2B	2B	92%	—	92% (0)	0%
Qwen3.5-9B	9B	92%	91% (-1)	90% (-2)	—
Qwen3-VL-8B	8B	91%	—	93% (+2)	75% (-16)
Qwen3-VL-4B	4B	91%	—	88% (-3)	85% (-6)
Qwen2.5-VL-7B	7B	90%	86% (-4)	90% (0)	✗
Qwen3.5-4B	4B	90%	89% (-1)	89% (-1)	—

TextVQA — OCR Reading (50 samples, open-ended)

Model	Params	Baseline	ToMe r=4	Compressed 50%	FastV
Qwen3.5-2B	2B	80%	78% (-2)	74% (-6)	—
InternVL3-2B	2B	78%	—	72% (-6)	—
Qwen3-VL-2B	2B	76%	—	76% (0)	66% (-10)
Qwen2.5-VL-3B	3B	72%	42% (-30)	60% (-12)	40% (-32)
Qwen3-VL-4B	4B	72%	—	72% (0)	56% (-16)
Qwen3.5-0.8B	0.8B	70%	64% (-6)	52% (-18)	—
Qwen3-VL-8B	8B	70%	—	70% (0)	54% (-16)
Qwen2.5-VL-7B	7B	66%	52% (-14)	68% (+2)	✗
Qwen3.5-9B	9B	56%	62% (+6)	56% (0)	—
Qwen3.5-4B	4B	52%	64% (+12)	52% (0)	—
InternVL3-1B	1B	50%	—	50% (0)	—

GQA — Visual Reasoning (50 samples, open-ended)

Model	Params	Baseline	ToMe r=4	Compressed 50%	FastV
Qwen3.5-2B	2B	68%	66% (-2)	68% (0)	—
InternVL3-2B	2B	66%	—	66% (0)	—
Qwen3-VL-4B	4B	66%	—	62% (-4)	50% (-16)
Qwen3.5-0.8B	0.8B	66%	60% (-6)	60% (-6)	—
InternVL3-1B	1B	62%	—	58% (-4)	—
Qwen2.5-VL-3B	3B	58%	54% (-4)	52% (-6)	42% (-16)
Qwen2.5-VL-7B	7B	58%	58% (0)	50% (-8)	—
Qwen3.5-4B	4B	58%	60% (+2)	64% (+6)	—
Qwen3.5-9B	9B	56%	64% (+8)	62% (+6)	—
Qwen3-VL-2B	2B	52%	—	58% (+6)	0%
Qwen3-VL-8B	8B	48%	—	54% (+6)	42% (-6)

MMBench — Multi-ability (50 samples, multiple choice)

Model	Params	Baseline	ToMe r=4	Compressed 50%	FastV
InternVL3-2B	2B	98%	—	96% (-2)	—
Qwen2.5-VL-7B	7B	96%	96% (0)	94% (-2)	—
Qwen3-VL-4B	4B	96%	—	94% (-2)	90% (-6)
Qwen3-VL-8B	8B	96%	—	94% (-2)	78% (-18)
Qwen3.5-9B	9B	96%	90% (-6)	96% (0)	—
Qwen2.5-VL-3B	3B	90%	82% (-8)	86% (-4)	66% (-24)
InternVL3-1B	1B	88%	—	86% (-2)	—
Qwen3-VL-2B	2B	84%	—	80% (-4)	2%
Qwen3.5-2B	2B	82%	82% (0)	82% (0)	—
Qwen3.5-0.8B	0.8B	58%	62% (+4)	54% (-4)	—
Qwen3.5-4B	4B	46%	44% (-2)	36% (-10)	—

MVBench — Video Understanding (12 tasks, 5 samples/task)

Model	Params	Baseline	Compressed 50%
Qwen3-VL-8B	8B	69%	57% (-12)
Qwen3.5-2B	2B	65%	57% (-8)
Qwen2.5-VL-7B	7B	63%	61% (-2)
Qwen3-VL-2B	2B	63%	54% (-9)
Qwen3-VL-4B	4B	63%	54% (-9)
Qwen2.5-VL-3B	3B	61%	59% (-2)
Qwen3.5-0.8B	0.8B	50%	46% (-4)
Qwen3.5-9B	9B	37%	37% (0)
Qwen3.5-4B	4B	2%	2% (0)
InternVL3	1-2B	—	—

— = architecturally incompatible (auto-skipped). ✗ = produces garbage output. ToMe is incompatible with Qwen3-VL (deepstack) and InternVL3 (pixel shuffle). FastV is incompatible with Qwen3.5 (DeltaNet), InternVL3, Qwen2.5-VL-7B (over-prunes), and Qwen3-VL-2B (garbage output). InternVL3 does not support multi-image/video inference, so it has no MVBench results. Qwen3.5-4B has a known 4-bit quantization issue on multiple-choice and video benchmarks (official FP16 MMBench: 89%; our 4-bit: 46%).

Latency — ms/sample (POPE)

Model	Baseline	ToMe r=4	Compressed 50%	FastV	Best Speedup
Qwen3.5-0.8B	148ms	167ms	135ms	—	1.09x
Qwen3.5-2B	251ms	297ms	221ms	—	1.14x
Qwen3-VL-2B	275ms	—	223ms	226ms	1.23x
Qwen2.5-VL-3B	354ms	629ms	279ms	288ms	1.27x
Qwen3.5-4B	407ms	454ms	337ms	—	1.20x
Qwen3-VL-4B	414ms	—	335ms	341ms	1.24x
Qwen2.5-VL-7B	522ms	693ms	384ms	—	1.36x
Qwen3-VL-8B	633ms	—	503ms	516ms	1.26x
Qwen3.5-9B	632ms	694ms	506ms	—	1.25x
InternVL3-1B	677ms	—	577ms	—	1.17x
InternVL3-2B	967ms	—	736ms	—	1.31x
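The Best Speedup column appears to be the baseline latency divided by the fastest optimized variant in the same row. Under that assumption, the arithmetic can be checked as below. The displayed latencies are rounded, so the last digit may differ on some rows.

```python
def best_speedup(baseline_ms: float, *variant_ms: float) -> float:
    """Baseline latency divided by the fastest variant, rounded to 2 decimals."""
    return round(baseline_ms / min(variant_ms), 2)


print(best_speedup(522, 693, 384))  # Qwen2.5-VL-7B row: ToMe 693ms, Compressed 384ms
print(best_speedup(275, 223, 226))  # Qwen3-VL-2B row: Compressed 223ms, FastV 226ms
```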

Frame-to-frame (KV cache reuse, 480p 5-frame video)

Model	Speedup	Architecture
Qwen2.5-VL-3B	1.57x	KV cache reuse
Qwen3-VL-4B	1.71x	KV cache reuse
Qwen3.5-0.8B	1.35x	DeltaNet state snapshot
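The gating decision behind cross-frame reuse can be sketched as a one-entry cache keyed on frame similarity. This is only an illustration: `compute_state` is a hypothetical stand-in for the expensive prefill, the 0.95 cosine threshold is an assumed value, and how TrioCore actually reuses or patches the KV cache is not shown here.

```python
import math
from typing import Any, Callable, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FrameStateCache:
    """Reuse an expensive per-frame state when the new frame is close to the cached one."""

    def __init__(
        self,
        compute_state: Callable[[Sequence[float]], Any],
        min_similarity: float = 0.95,  # assumed threshold
    ) -> None:
        self._compute_state = compute_state
        self._min_similarity = min_similarity
        self._cached: Optional[Tuple[Sequence[float], Any]] = None
        self.hits = 0
        self.misses = 0

    def get(self, thumb: Sequence[float]) -> Any:
        if self._cached is not None:
            cached_thumb, state = self._cached
            if cosine_similarity(thumb, cached_thumb) >= self._min_similarity:
                self.hits += 1
                return state
        # Too different (or nothing cached yet): recompute and replace the entry.
        self.misses += 1
        state = self._compute_state(thumb)
        self._cached = (thumb, state)
        return state


cache = FrameStateCache(compute_state=lambda thumb: {"kv_for": tuple(thumb)})
for thumb in ([1.0, 0.0], [0.99, 0.05], [0.0, 1.0]):
    cache.get(thumb)
print(cache.hits, cache.misses)
```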

Overhead vs mlx-vlm (raw generate loop)

Metric	mlx-vlm	trio-core
Prefill	1018ms	1016ms (-0.2%)
Decode	524ms	513ms (-2.1%)
Output	—	bit-identical

Supported Models

Tier 1 — Full optimization (native loading, all 4 stages)

Model	Size	4-bit	ToMe	FastV	Compressed	KV Reuse
Qwen2.5-VL 3B	3B	1.8G	✓	✓	✓	✓
Qwen2.5-VL 7B	7B	4.5G	✓	✗	✓	✓
Qwen3-VL 2B/4B/8B	2-8B	1.5-5.0G	—	✓	✓	✓
Qwen3.5 0.8B/2B/4B/9B	0.8-9B	0.5-5.0G	✓	—	✓	✓ (DeltaNet)
InternVL3 1B/2B	1-2B	1.0-1.6G	—	—	✓	✓

Tier 2 — Inference only (mlx-vlm, no optimization)

Gemma 3n, SmolVLM2, Phi-4, Gemma 3, FastVLM, and any model supported by mlx-vlm.

Configuration

All settings can be supplied through TRIO_-prefixed environment variables or through EngineConfig:

TRIO_MODEL=mlx-community/Qwen2.5-VL-3B-Instruct-4bit
TRIO_TOME_ENABLED=true
TRIO_TOME_R=4
TRIO_PORT=8000

See trio_core/config.py for all options.
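To show how a TRIO_ prefix can map onto typed settings, here is a small standalone sketch. `DemoConfig` is a hypothetical stand-in rather than `trio_core.EngineConfig`, and its defaults are illustrative. The real field names and parsing rules are in trio_core/config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class DemoConfig:  # hypothetical stand-in, not trio_core.EngineConfig
    model: str = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
    tome_enabled: bool = False  # assumed default
    tome_r: int = 0             # assumed default
    port: int = 8000            # assumed default


def _parse_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def config_from_env(env: Mapping[str, str] = os.environ) -> DemoConfig:
    """Overlay TRIO_* variables from `env` on top of the defaults."""
    config = DemoConfig()
    if "TRIO_MODEL" in env:
        config.model = env["TRIO_MODEL"]
    if "TRIO_TOME_ENABLED" in env:
        config.tome_enabled = _parse_bool(env["TRIO_TOME_ENABLED"])
    if "TRIO_TOME_R" in env:
        config.tome_r = int(env["TRIO_TOME_R"])
    if "TRIO_PORT" in env:
        config.port = int(env["TRIO_PORT"])
    return config


demo = config_from_env({"TRIO_TOME_ENABLED": "true", "TRIO_TOME_R": "4"})
print(demo)
```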

License

Apache 2.0

trio-core 0.4.4
