
Efficient on-edge V-JEPA 2.x video encoder with a streaming/causal R&D track for temporal video understanding.


Saccade


Most of a video is redundant (a hallway camera sees nearly the same frame thousands of times), yet a video model normally pays full price for every frame. That makes running V-JEPA 2 live on edge hardware (cameras, robots, on-device apps) expensive.

Saccade fixes that by spending compute only on what changes, the way your eyes spend detail on fixations and predict across the jumps in between (a saccade). It turns a frozen V-JEPA 2 encoder into a streaming model that keeps a live embedding cheaply.

How it works

Saccade adds two things to V-JEPA's encoder:

  1. A streaming, causal encoder. Attention is made block-causal and backed by a per-layer KV-cache, so a new frame is encoded once and reuses cached history instead of re-running a sliding window. V-JEPA 2's 3D rotary embeddings are ported into the causal path, so the cached step reproduces the block-causal (masked) full-attention output exactly rather than approximating it; the RoPE port itself matches the reference encoder to about 0.1% relative error.
  2. Surprise-gating. A cheap novelty test (the encoder's own patch-embedding front-end) decides whether an incoming clip is actually new. Predictable clips skip the transformer and reuse the last representation; only real changes pay full price. Compute follows the scene, not the clock.
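The exactness claim for the cache can be illustrated with a toy single-head, single-layer attention in plain Python. This is only a sketch, not Saccade's implementation: the function names, the block size of 2 tokens, and the random toy vectors are assumptions, and RoPE and the per-layer cache structure are left out.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def masked_full_attention(qs, ks, vs, block):
    """Recompute from scratch: token i sees its own block and all earlier blocks."""
    out = []
    for i, q in enumerate(qs):
        visible = (i // block + 1) * block  # index just past the end of token i's block
        out.append(attend(q, ks[:visible], vs[:visible]))
    return out

def cached_streaming_attention(qs, ks, vs, block):
    """Process one block at a time, appending its keys and values to a cache."""
    k_cache, v_cache, out = [], [], []
    for start in range(0, len(qs), block):
        k_cache.extend(ks[start:start + block])
        v_cache.extend(vs[start:start + block])
        for q in qs[start:start + block]:
            out.append(attend(q, k_cache, v_cache))
    return out

rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]  # toy q = k = v

full = masked_full_attention(tokens, tokens, tokens, block=2)
cached = cached_streaming_attention(tokens, tokens, tokens, block=2)
max_diff = max(abs(a - b) for fa, ca in zip(full, cached) for a, b in zip(fa, ca))
print("max |full - cached| =", max_diff)
```

Both paths feed the same keys and values, in the same order, to the same arithmetic, so the difference is exactly zero here. The cached path simply never revisits old tokens as queries.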

Figure: Surprise-gated streaming encoder. Left: the learned gate holds fidelity far better than a pixel-difference gate as compute drops. Right: encoder compute auto-scales from ~2% on static video to 100% on fast motion.

Around that core sits a measured edge toolkit: post-training quantization (int8/int4), token reduction (ToMe, PruneVid-style temporal merge), fused attention + torch.compile, distillation to a smaller ViT-S student, ONNX export, and an async decode-and-infer pipeline.
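As a rough illustration of the quantization item, the sketch below round-trips a toy weight row through symmetric per-tensor int8 and measures the cosine similarity. The scheme, the function names, and the synthetic weights are assumptions chosen for illustration; the library's actual quantization method is not described here, and the cosine in the results is measured on embeddings, not on a weight round trip.

```python
import math

def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

weights = [math.sin(0.37 * i) for i in range(1, 257)]  # toy stand-in for a weight row
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(f"cosine(original, int8 round trip) = {cosine(weights, restored):.6f}")
```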

Results

Measured on an RTX 5070 Ti, fp16, batch 1. This is a single GPU rather than the Jetson target, so read these numbers as a correctness check and an upper bound on edge performance.

Efficiency and streaming

| What | Result |
| --- | --- |
| ViT-L encoder (16 frames @ 256 px) | 95.7 ms, 10.5 embeds/s, 738 MB |
| Surprise-gated streaming (real video) | 84% of embeddings skipped, 5.7x faster |
| Streaming per-frame update | 22.8 ms vs 188.8 ms full re-encode = 8.3x |
| RoPE-port correctness | causal attention matches HF to rel 0.001; cache step exact (0.0) |
| Fused attention | SDPA 3.94x, SDPA + torch.compile 4.53x vs eager (374 -> 82 ms) |
| int8 quantization | 30% less memory at cosine 0.9999 |
| Token reduction | 1.3x to 2.3x speedup (accuracy/speed knob) |
| ONNX export | exact parity vs PyTorch (cosine 1.00000) |
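The gating numbers can be sanity-checked with a back-of-envelope cost model. The model below is an assumption, not something the benchmark scripts are known to use: every clip pays a fixed gate cost (as a fraction of one full encode), and only non-skipped clips pay the full encode.

```python
def gated_speedup(skip_rate, gate_cost):
    """Speedup over always encoding, if each clip costs gate_cost + (1 - skip_rate)."""
    return 1.0 / (gate_cost + (1.0 - skip_rate))

def implied_gate_cost(skip_rate, speedup):
    """Invert the model: per-clip overhead that would explain a measured speedup."""
    return 1.0 / speedup - (1.0 - skip_rate)

ceiling = gated_speedup(0.84, 0.0)        # a free gate with 84% of clips skipped
overhead = implied_gate_cost(0.84, 5.7)   # overhead consistent with the reported 5.7x

print(f"ceiling with a free gate: {ceiling:.2f}x")
print(f"implied per-clip overhead: {overhead:.1%} of a full encode")
```

Under this model an 84% skip rate caps the speedup at about 6.25x, and the reported 5.7x corresponds to roughly 1.5% per-clip overhead. The real overhead may be split differently (for example between the gate and data movement), so treat this as a consistency check only.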

Install

Requires Python 3.10+ and a CUDA GPU.

With uv (recommended; pulls the cu128 torch build for Blackwell/sm_120 automatically, configured in pyproject.toml):

uv sync                # creates .venv and installs deps (incl. cu128 torch)
uv sync --extra dev    # add test + figure tooling (pytest, matplotlib, seaborn)

With pip (the PyPI name is saccadic; it imports as saccade):

pip install saccadic                                    # latest release from PyPI
pip install git+https://github.com/Khushiyant/saccade   # bleeding edge from main
# or, from a clone, for development:
pip install -e .

On recent GPUs (Blackwell/sm_120) install the matching torch first, so pip does not pull an incompatible build: pip install torch --index-url https://download.pytorch.org/whl/cu128. On Jetson, use the JetPack-provided torch/decord/tensorrt wheels.
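To confirm which torch build ended up in the environment, a small check like the following can help. The helper name `describe_torch` is hypothetical and not part of Saccade; it uses only standard torch attributes and returns early if torch is not installed.

```python
def describe_torch():
    """Summarize the installed torch build; safe to call when torch is missing."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    info = {
        "installed": True,
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        major, minor = torch.cuda.get_device_capability(0)
        info["device"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = f"sm_{major}{minor}"
    return info

print(describe_torch())
```

On a Blackwell card with the cu128 build, `cuda_build` should read `12.8` and `compute_capability` should read `sm_120`.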

Usage

Saccade is a library: it turns video into embeddings you feed to your own head (a classifier, retrieval index, anomaly score). Pick the mode that matches your input.

One-shot, when you have a clip and want its embedding:

import torch
from saccade import load_encoder, ModelConfig

enc = load_encoder(ModelConfig(checkpoint="vitl", frames=16, resolution=256,
                               device="cuda", dtype="float16"))
clip = torch.rand(1, 16, 3, 256, 256, device="cuda", dtype=torch.float16)  # [B,T,C,H,W]
emb = enc.embed(clip)        # [1, 1024] -> feed to your task head

Surprise-gated streaming, when you have a live feed and want to skip redundant clips (the efficiency win: ~84% of clips skipped on real footage):

from saccade import SurpriseGatedEncoder

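# `enc` is the encoder loaded in the one-shot example; `stream` is any iterable
# of clips and `my_head` is your own downstream function.
gate = SurpriseGatedEncoder(enc, tau=0.015)   # tau is the compute/fidelity knob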
gate.reset()
for clip in stream:                           # each clip: [1, T, 3, H, W]
    emb, info = gate.step(clip)
    if info["encoded"]:                       # False -> scene unchanged, last emb reused
        my_head(emb)                          # only run downstream work when it is new
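Conceptually, a gate of this kind compares a cheap feature of the incoming clip against a reference and only calls the expensive encoder when the change exceeds tau. The class below is a toy sketch of that idea, not SurpriseGatedEncoder itself: `ToySurpriseGate`, the relative-L2 surprise score, the `"surprise"` key in the info dict, and the choice to compare against the last encoded clip are all assumptions.

```python
import math

class ToySurpriseGate:
    """Illustrative gate: re-encode only when cheap features move more than tau."""

    def __init__(self, cheap_features, encode, tau):
        self.cheap_features = cheap_features  # inexpensive novelty signal
        self.encode = encode                  # expensive encoder call
        self.tau = tau
        self.reset()

    def reset(self):
        self.ref = None       # cheap features at the last full encode
        self.last_emb = None  # embedding reused while the scene is unchanged

    def step(self, clip):
        feats = self.cheap_features(clip)
        surprise = float("inf")  # the first clip is always encoded
        if self.ref is not None:
            diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(feats, self.ref)))
            norm = math.sqrt(sum(b * b for b in self.ref)) + 1e-8
            surprise = diff / norm
        if surprise < self.tau:
            return self.last_emb, {"encoded": False, "surprise": surprise}
        self.ref, self.last_emb = feats, self.encode(clip)
        return self.last_emb, {"encoded": True, "surprise": surprise}

calls = []

def encode(clip):
    calls.append(clip)  # count how often the expensive path runs
    return [2.0 * x for x in clip]

gate = ToySurpriseGate(cheap_features=lambda c: c, encode=encode, tau=0.05)
clips = [[1.0, 1.0], [1.0, 1.01], [1.0, 1.0], [3.0, -1.0], [3.0, -1.0]]
flags = [gate.step(c)[1]["encoded"] for c in clips]
print(flags, f"{len(calls)} of {len(clips)} clips encoded")
```

Only the first clip and the scene change trigger the encoder here, so two of the five toy clips are encoded. Comparing against the last encoded clip rather than the previous clip means slow drift eventually accumulates enough surprise to trigger a refresh.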

Exact causal streaming, when you want a per-frame running embedding backed by a KV-cache:

from saccade import StreamingEncoder, StreamingConfig

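# `enc` is the encoder loaded in the one-shot example; `frames` is any iterable
# of [3, H, W] frame tensors.
stream = StreamingEncoder(enc, StreamingConfig())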
stream.reset()
for frame in frames:                          # each frame: [3, H, W]
    emb = stream.step(frame)                  # emits a 1024-d embedding once a tubelet completes
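The "once a tubelet completes" behaviour follows from how the patch embedding groups consecutive frames. A minimal sketch of that buffering is shown below; `TubeletBuffer` is a hypothetical helper, and the tubelet size of 2 frames is an assumption about the V-JEPA 2 configuration. What StreamingEncoder.step returns between tubelets is not stated here, so check the library before relying on it.

```python
class TubeletBuffer:
    """Collect frames and release them in groups of `tubelet_size`."""

    def __init__(self, tubelet_size=2):
        self.tubelet_size = tubelet_size
        self.frames = []

    def push(self, frame):
        """Return a completed tubelet, or None while one is still filling up."""
        self.frames.append(frame)
        if len(self.frames) < self.tubelet_size:
            return None
        tubelet, self.frames = self.frames, []
        return tubelet

buf = TubeletBuffer(tubelet_size=2)
emitted = []
for frame_id in range(5):  # integers stand in for [3, H, W] frames
    tubelet = buf.push(frame_id)
    if tubelet is not None:
        emitted.append(tubelet)
print(emitted)  # the fifth frame is still waiting for its partner
```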

To finetune the causal adapter on your own video, apply_causal_lora(enc, StreamingConfig()) converts the encoder in place and returns the trainable LoRA parameters.
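For readers new to LoRA, the general idea is sketched below in plain Python: the pretrained weight W stays frozen and a low-rank product B·A is added on top. This shows the standard LoRA formulation, not the internals of apply_causal_lora; the rank of 8, alpha of 16, and the tiny matrices are assumptions for illustration.

```python
def matvec(m, x):
    return [sum(w * v for w, v in zip(row, x)) for row in m]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weight
A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]                   # rank 2, small init
B = [[0.0, 0.0] for _ in range(3)]                       # zero init: no change at step 0
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=16, rank=2)
print(y == matvec(W, x))  # zero-initialized B leaves the frozen model's output unchanged

full = 1024 * 1024
lora = lora_param_count(1024, 1024, rank=8)
print(f"{lora} trainable vs {full} frozen per 1024x1024 projection ({lora / full:.2%})")
```

Because B starts at zero, finetuning begins from the frozen encoder's behaviour and only the small A and B matrices receive gradients.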

In every mode emb is a 1024-d vector; attach your own linear probe, retrieval, or threshold on top. Saccade gives you the cheap live representation; the task head is yours.
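As one possible shape for such a head, the sketch below does nearest-neighbour retrieval and a simple threshold test with cosine similarity. Everything in it is illustrative: the helper names, the 0.9 threshold, and the 4-d toy vectors standing in for 1024-d embeddings are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(query, index):
    """Return (label, similarity) for the stored embedding closest to the query."""
    return max(((label, cosine(query, emb)) for label, emb in index.items()),
               key=lambda pair: pair[1])

def is_anomalous(emb, reference, threshold=0.9):
    """Flag an embedding whose similarity to a reference falls below the threshold."""
    return cosine(emb, reference) < threshold

index = {
    "empty hallway": [1.0, 0.0, 0.0, 0.0],
    "person walking": [0.0, 1.0, 0.0, 0.0],
}
query = [0.9, 0.1, 0.0, 0.0]  # in practice: emb from any of the modes above

label, sim = nearest(query, index)
print(label, round(sim, 3))
print(is_anomalous(query, index["empty hallway"]))
print(is_anomalous([0.0, 0.0, 1.0, 0.0], index["empty hallway"]))
```

With a real encoder you would convert each embedding to a list or keep the math in torch; the logic is the same.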

Reproduce

The numbers above come from these scripts (run on an RTX 5070 Ti):

uv run python scripts/real_eval.py            # encoder latency + streaming
uv run python scripts/bench_fused_attn.py     # eager vs SDPA vs torch.compile
uv run python scripts/bench_surprise_gate.py  # surprise-gating Pareto
uv run python scripts/verify_rope.py          # RoPE-port correctness checks
uv run python scripts/make_figures.py         # render result figures
uv run python scripts/demo.py --video clip.mp4 --stride 4 --tau 0.015  # annotated demo
uv run pytest                                 # unit tests

Layout: the library lives in src/saccade/ (with streaming/ for the causal attention, KV-cache, LoRA-to-causal, streaming encoder and surprise gate); scripts/ holds the benchmarks and demo; tests/ the unit tests; configs/ example run configs.

Status and limitations

Measured and verified:

  • Encoder latency/throughput/memory, fused attention, token reduction, int8 quantization.
  • Streaming: the KV-cache step reproduces masked full attention exactly; the ported 3D-RoPE matches the reference encoder to ~0.1%.
  • 37 unit tests pass (core correctness plus the novel features); distillation and the robustness finetune train on the real model.

Not yet done (needs external resources, not code):

  • Task accuracy. SSv2 top-1 has not been run (the dataset is gated). The probe train/eval harness works on a synthetic proxy; there are no accuracy-vs-SOTA numbers yet.
  • On-device. Only a single GPU was used; Jetson latency and a TensorRT engine still need the actual device.
  • Streaming accuracy. The causal encoder is numerically exact through the cache, but a LoRA finetune on real video is still needed to close the across-depth causal-vs-bidirectional gap.

References

  • V-JEPA 2, Assran et al., 2025 (arXiv:2506.09985). Checkpoints facebook/vjepa2-* on Hugging Face; Saccade loads facebook/vjepa2-vitl-fpc64-256 by default.
  • Closest streaming prior art: VL-JEPA, OmniStream, Recurrent Video MAE, CarelessWhisper.

License

MIT, see LICENSE.
