Local VLM inference engine for video — Apple Silicon, NVIDIA, and CPU
TrioCore
The fastest local VLM inference engine for Apple Silicon
73% lower prefill latency (3.7x) · 1.7x frame-to-frame speedup · 68% fewer visual tokens · Zero Docker
Install | Quick Start | API Server | How It Works | Benchmarks | Models
Install
pip install 'trio-core[mlx]' # Apple Silicon (M1-M4)
pip install 'trio-core[transformers]' # NVIDIA / CPU
Quick Start
CLI
# Check your hardware
trio device
# Analyze a video
trio analyze video.mp4 -q "What is happening?"
# Start the API server
trio serve
Python
from trio_core import TrioCore
engine = TrioCore()
engine.load()
result = engine.analyze_video("clip.mp4", "What is happening?")
print(result.text) # "A person is walking through the parking lot..."
print(f"{result.metrics.latency_ms:.0f}ms, {result.metrics.tokens_per_sec:.0f} tok/s")
With optimizations
from trio_core import TrioCore, EngineConfig
config = EngineConfig(
    tome_enabled=True,  # merge visual tokens inside ViT (-68% tokens)
    tome_r=4,
    tome_metric="hidden",
)
engine = TrioCore(config)
engine.load()
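For intuition about what token merging does, here is a minimal pure-Python sketch of one merge step in the style of ToMe's bipartite matching. It is an illustration only: `merge_tokens` and `cosine` are hypothetical helpers, not part of trio-core, and the exact meaning trio-core gives to `tome_r` and `tome_metric` is not shown here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, r):
    """Merge the r most similar (A, B) pairs; returns len(tokens) - r tokens."""
    a_idx = list(range(0, len(tokens), 2))  # even positions form set A
    b_idx = list(range(1, len(tokens), 2))  # odd positions form set B

    # For each A token, find its single most similar B token.
    best = []
    for i in a_idx:
        sims = [(cosine(tokens[i], tokens[j]), j) for j in b_idx]
        if sims:
            score, j = max(sims)
            best.append((score, i, j))

    # Only the r strongest matches are merged.
    best.sort(reverse=True)
    to_merge = best[:r]
    merged_a = {i for _, i, _ in to_merge}

    # Track running sums and counts so each merge is an average.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for _, i, j in to_merge:
        sums[j] = [x + y for x, y in zip(sums[j], tokens[i])]
        counts[j] += 1

    # Rebuild the sequence in original order, skipping merged A tokens.
    out = []
    for k in range(len(tokens)):
        if k in merged_a:
            continue
        if k in sums:
            out.append([x / counts[k] for x in sums[k]])
        else:
            out.append(list(tokens[k]))
    return out

tokens = [[1, 0], [1, 0.01], [0, 1], [-1, 0]]
print(len(merge_tokens(tokens, r=1)))  # one pair merged, so 3 tokens remain
```

Each call removes at most `r` tokens, so applying a step like this between several encoder blocks compounds into a large reduction in sequence length.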
API Server
trio serve --port 8000
Analyze a frame
curl -X POST http://localhost:8000/analyze-frame \
-H "Content-Type: application/json" \
-d '{"frame_b64": "<base64 jpeg>", "question": "Is there a person at the door?"}'
{"answer": "Yes, there is a person standing at the door.", "triggered": true, "latency_ms": 487}
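The same request can be made from Python using only the standard library. This is a sketch under the assumptions visible in the curl example above (endpoint path, `frame_b64` and `question` fields); the helper names `build_frame_request` and `analyze_frame` are made up for illustration.

```python
import base64
import json
import urllib.request

def build_frame_request(jpeg_bytes, question, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /analyze-frame."""
    payload = {
        "frame_b64": base64.b64encode(jpeg_bytes).decode("ascii"),
        "question": question,
    }
    return urllib.request.Request(
        f"{base_url}/analyze-frame",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def analyze_frame(jpeg_bytes, question, base_url="http://localhost:8000", timeout=30):
    """Send the frame to a running `trio serve` instance and decode the JSON reply."""
    req = build_frame_request(jpeg_bytes, question, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `analyze_frame(open("door.jpg", "rb").read(), "Is there a person at the door?")` should return a dict with the `answer`, `triggered`, and `latency_ms` keys shown above (`door.jpg` is a placeholder file name).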
Analyze a video
curl -X POST http://localhost:8000/v1/video/analyze \
-H "Content-Type: application/json" \
-d '{"video": "video.mp4", "prompt": "What is happening?"}'
OpenAI-compatible chat
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": [
{"type": "video", "video": "video.mp4"},
{"type": "text", "text": "What is happening?"}
]}]
}'
All endpoints
| Endpoint | Method | Description |
|---|---|---|
| /healthz | GET | Health check |
| /health | GET | Detailed status + config |
| /analyze-frame | POST | Single frame: {frame_b64, question} → {answer, triggered, latency_ms} |
| /v1/video/analyze | POST | Video file analysis with metrics |
| /v1/frames/analyze | POST | Multi-frame upload (multipart) |
| /v1/chat/completions | POST | OpenAI-compatible (streaming SSE) |
| /v1/models | GET | Loaded model info |
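For the streaming chat endpoint, a client has to read server-sent events line by line. The sketch below assumes the stream follows the usual OpenAI chat-completions streaming shape (`data: {json}` lines, text under `choices[].delta.content`, and a final `data: [DONE]`); the README only states that the endpoint is OpenAI-compatible, so verify the chunk format against your server. `iter_sse_text` is a hypothetical helper name.

```python
import json

def iter_sse_text(lines):
    """Yield text fragments from an iterable of SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text

sample = [
    b'data: {"choices":[{"delta":{"content":"A person"}}]}\n',
    b"\n",
    b'data: {"choices":[{"delta":{"content":" walks."}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_sse_text(sample)))  # A person walks.
```

Because the function accepts any iterable of lines, it can be fed an HTTP response object opened with `urllib.request.urlopen`, which iterates over raw byte lines.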
How It Works
TrioCore optimizes every stage of VLM inference. Each technique is independent and they compound:
Video → [Dedup] → [Motion Gate] → [ViT + ToMe] → [LLM + FastV] → [KV Reuse] → Answer
        -50%      -80% calls      -68% tokens    -50% tokens     1.7x reuse
        frames    when static     in encoder     in LLM          across frames
| Stage | Technique | What it does | Effect |
|---|---|---|---|
| Pre-inference | Temporal dedup | Skip near-identical frames (L2 on 64x64) | -50% frames |
| Pre-inference | Motion gate | Skip VLM entirely when scene is static | -80% calls |
| Vision encoder | ToMe | Merge similar visual tokens between ViT blocks | -68% tokens, -73% prefill latency |
| LLM layers | FastV | Prune low-attention visual tokens from KV cache | -50% tokens |
| Cross-frame | KV Reuse | Reuse KV cache when frames are visually similar | 1.7x speedup |
| Long video | StreamMem | Bounded KV cache with saliency eviction | constant memory |
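The temporal dedup row can be sketched in a few lines: shrink each frame to a small thumbnail, measure the L2 distance to the last kept frame, and skip the frame if the distance is below a threshold. This is a minimal pure-Python illustration, not trio-core's implementation; the function names, the block-average downsampling, and the threshold value are all assumptions.

```python
import math

def thumbnail(frame, size=64):
    """Block-average a 2D grayscale frame (list of rows) down to size x size."""
    h, w = len(frame), len(frame[0])
    thumb = []
    for ty in range(size):
        y0 = ty * h // size
        y1 = max((ty + 1) * h // size, y0 + 1)
        for tx in range(size):
            x0 = tx * w // size
            x1 = max((tx + 1) * w // size, x0 + 1)
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            thumb.append(sum(block) / len(block))
    return thumb

def l2(a, b):
    """Euclidean distance between two flat thumbnails."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dedup_frames(frames, threshold=1.0, size=64):
    """Return indices of frames that differ enough from the last kept frame."""
    kept, last = [], None
    for idx, frame in enumerate(frames):
        thumb = thumbnail(frame, size)
        if last is None or l2(thumb, last) > threshold:
            kept.append(idx)
            last = thumb
    return kept

dark = [[0] * 8 for _ in range(8)]
bright = [[10] * 8 for _ in range(8)]
print(dedup_frames([dark, dark, bright, bright], size=4))  # [0, 2]
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents slow drift from slipping through as a long chain of individually small changes.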
Benchmarks
Measured on an Apple M3 Pro with 4-bit quantized models.
Prefill speed (1080p single frame)
| Model | Baseline | + ToMe r=4 | Speedup |
|---|---|---|---|
| Qwen2.5-VL-3B | 1,808ms (748 tokens) | 490ms (242 tokens) | 3.7x |
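The headline percentages follow directly from this row. A quick check of the arithmetic, using only the numbers in the table:

```python
baseline_ms, tome_ms = 1808, 490
baseline_tokens, tome_tokens = 748, 242

speedup = baseline_ms / tome_ms                       # ratio of prefill times
latency_reduction = 1 - tome_ms / baseline_ms         # fraction of time saved
token_reduction = 1 - tome_tokens / baseline_tokens   # fraction of tokens removed

print(f"{speedup:.1f}x speedup, "
      f"{latency_reduction:.0%} lower prefill latency, "
      f"{token_reduction:.0%} fewer tokens")
# 3.7x speedup, 73% lower prefill latency, 68% fewer tokens
```

Note that "73%" describes the reduction in prefill time, which is the same measurement as the 3.7x speedup expressed differently.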
Quality (POPE, 100 images)
| Model | Baseline | + ToMe r=4 |
|---|---|---|
| Qwen2.5-VL-3B | 92% | 81% |
| Qwen3-VL-4B | 91% | 91% (zero loss) |
Frame-to-frame (480p, 5-frame video)
| Model | Speedup | Mechanism |
|---|---|---|
| Qwen2.5-VL-3B | 1.57x | KV cache reuse |
| Qwen3-VL-4B | 1.71x | KV cache reuse |
| Qwen3.5-0.8B | 1.35x | DeltaNet state snapshot |
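The idea behind frame-to-frame reuse can be illustrated with a similarity-gated cache: if the new frame's features are close enough to the cached frame's, return the cached result instead of recomputing. This is a conceptual sketch only. The class name, the cosine-similarity gate, and the threshold are assumptions, and a real KV cache stores per-layer attention tensors rather than a single value.

```python
import math

class SimilarityGatedCache:
    """Reuse a cached value while incoming features stay similar to the cached key."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._key = None
        self._value = None
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, features, compute):
        """Return the cached value on a hit, otherwise call compute(features)."""
        if self._key is not None and self._cosine(features, self._key) >= self.threshold:
            self.hits += 1
            return self._value
        self.misses += 1
        self._key, self._value = list(features), compute(features)
        return self._value

cache = SimilarityGatedCache(threshold=0.95)
expensive = lambda f: sum(f)  # stand-in for the costly prefill
cache.get([1.0, 0.0, 0.0], expensive)    # miss: first frame
cache.get([0.99, 0.01, 0.0], expensive)  # hit: nearly identical frame
cache.get([0.0, 1.0, 0.0], expensive)    # miss: scene changed
print(cache.hits, cache.misses)  # 1 2
```

The key is only refreshed on a miss, so a slowly changing scene eventually crosses the threshold and triggers a full recompute.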
Overhead vs mlx-vlm (raw generate loop)
| Metric | mlx-vlm | trio-core |
|---|---|---|
| Prefill | 1018ms | 1016ms (-0.2%) |
| Decode | 524ms | 513ms (-2.1%) |
| Output | — | bit-identical |
Supported Models
Tier 1: Full optimization (native loading, up to all 4 stages; see per-model support below)
| Model | Size | 4-bit | ToMe | FastV | KV Reuse | StreamMem |
|---|---|---|---|---|---|---|
| Qwen2.5-VL 3B/7B | 3-7B | 1.8-4.5G | ✓ | ✓ | ✓ | ✓ |
| Qwen3-VL 2B/4B/8B | 2-8B | 1.5-5.0G | ✓ | ✓ | ✓ | ✓ |
| Qwen3.5 0.8B/2B/4B/9B | 0.8-9B | 0.5-5.0G | ✓ | ✓ | ✓ (DeltaNet) | ✓ |
| InternVL3 1B/2B | 1-2B | 1.0-1.6G | — | ✓ | ✓ | ✓ |
| nanoLLaVA-1.5 | 1B | 1.0G | ✓ | ✓ | ✓ | ✓ |
Tier 2 — Inference only (mlx-vlm, no optimization)
Gemma 3n, SmolVLM2, Phi-4, Gemma 3, FastVLM, and any model supported by mlx-vlm.
Configuration
All settings can be supplied through TRIO_-prefixed environment variables or through EngineConfig:
TRIO_MODEL=mlx-community/Qwen2.5-VL-3B-Instruct-4bit
TRIO_TOME_ENABLED=true
TRIO_TOME_R=4
TRIO_PORT=8000
See trio_core/config.py for all options.
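To show how prefixed environment variables typically map onto config fields, here is a small stand-alone sketch. It does not reproduce trio_core/config.py: `SketchConfig`, `config_from_env`, and the default values are hypothetical, and only the four variable names listed above come from this README.

```python
import os
from dataclasses import dataclass

@dataclass
class SketchConfig:
    # Placeholder defaults for illustration; the real defaults may differ.
    model: str = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
    tome_enabled: bool = False
    tome_r: int = 0
    port: int = 8000

def _parse_bool(value):
    """Accept common truthy spellings such as 'true' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}

def config_from_env(environ=None, prefix="TRIO_"):
    """Overlay PREFIX + FIELD_NAME variables onto the defaults."""
    environ = os.environ if environ is None else environ
    parsers = {"model": str, "tome_enabled": _parse_bool, "tome_r": int, "port": int}
    cfg = SketchConfig()
    for name, parse in parsers.items():
        raw = environ.get(prefix + name.upper())
        if raw is not None:
            setattr(cfg, name, parse(raw))
    return cfg

cfg = config_from_env({"TRIO_TOME_ENABLED": "true", "TRIO_TOME_R": "4"})
print(cfg.tome_enabled, cfg.tome_r, cfg.port)  # True 4 8000
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the process environment.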
License
Apache 2.0