Research preview — the open-source cognition layer for goal-driven, proactive vision agents.

percept

The open-source cognition layer for goal-driven, proactive vision agents.

You state a goal in plain language — "nudge me when I drink a coffee", "tell me when the kettle boils" — and percept turns a live audio-visual stream into an agent that:

  • reasons over entities across time — tracked things with stable ids, attributes carrying provenance (observed vs you-told-me) and freshness;
  • fires proactively, but only on the rising edge of a condition becoming true;
  • refuses to guess — a three-state gate maps known → act · not → silent · unknown → ask, so it never delivers a confidently-wrong nag.
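The last two bullets can be sketched in a few lines of plain Python. This is an illustrative toy, not percept's gate: `Known` and `ToyGate` are hypothetical names, and how a frame gets classified as yes, no, or unknown is deliberately left out.

```python
from enum import Enum


class Known(Enum):
    """Hypothetical three-state reading of a condition for one frame."""
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"


class ToyGate:
    """Toy gate: acts only when the state changes (edge-triggered)."""

    def __init__(self) -> None:
        self._last = Known.NO  # assume the condition starts out false

    def step(self, state: Known) -> str:
        changed = state is not self._last
        self._last = state
        if not changed:
            return "silent"      # level held: never repeat an alert
        if state is Known.YES:
            return "fire"        # rising edge into known-yes
        if state is Known.UNKNOWN:
            return "ask"         # cannot tell: ask instead of guessing
        return "silent"          # known-no


gate = ToyGate()
stream = [Known.NO, Known.YES, Known.YES, Known.UNKNOWN, Known.NO, Known.YES]
actions = [gate.step(s) for s in stream]
print(actions)  # ['silent', 'fire', 'silent', 'ask', 'silent', 'fire']
```

The second `YES` in a row stays silent, which is the difference between rising-edge and level-triggered firing. A real gate would also need a policy for a yes → unknown → yes sequence; this toy simply fires again.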

The wedge is temporal cognition (entity memory + the three-state gate + reasoning over events), which a raw VLM-in-a-loop and a per-frame pipeline both lack. percept builds on frontier models for perception behind vendor-neutral seams.

⚠️ Research preview — v0.1.0. Published for real-life testing and feedback, not for production. The cognition core (gate, entity graph, executor, events, scheduler, consent) runs and is tested; the benchmark card below is a v0.5 DRAFT. APIs may change between 0.1.x releases — pin a version. Issues and feedback welcome. (Status.)

The envelope (refusals, stated proudly)

  • Assistant-class, never the safety mechanism. percept informs a human who stays responsible. It is never the thing standing between a person and harm on a clock it can't guarantee. Out: driver-drowsiness intervention, turn detection. In: a post-hoc driving debrief.
  • Watches the user's own world, with the user as beneficiary — never a non-consenting third party to the wearer's advantage. Out: covert analysis of someone you're negotiating with; card-counting against a live casino. In: a play-money practice trainer.

These refusals are a feature. Do not lead with surveillance demos.

Install

pip install percept-vision        # core: PURE STDLIB — runs offline with fake backends, no keys

The core has zero dependencies and runs with deterministic fake backends, so installing the package and running a script works with no API keys. Frontier backends are opt-in extras:

pip install "percept-vision[gemini]"     # GeminiVision
pip install "percept-vision[claude]"     # AnthropicVision (Claude)
pip install "percept-vision[deepgram]"   # Deepgram STT + TTS (voice)
pip install "percept-vision[audio]"      # numpy fast-path for acoustic/edge
pip install "percept-vision[all]"        # everything

Requires Python >= 3.10. The package name is percept-vision; the import name is percept (import percept).

60-second quickstart (no keys)

Fully offline — fake backends, deterministic, nothing to configure. This script runs as-is:

import asyncio
from percept import Percept, Goal

async def main():
    # Fake backends by default — offline, no keys. discover_plugins=False skips plugin lookup.
    agent = Percept.create(discover_plugins=False)

    agent.add_goal(Goal(
        id="caffeine",
        condition="the user is drinking coffee",
        say="Heads up — stepping back from caffeine?",
    ))

    # Each frame is judged; the gate fires ONCE on the rising edge.
    # ("sip-coffee" is a token the fake vision backend scripts as a confident YES.)
    fires = await agent.perceive_judged("sip-coffee")
    for ev in fires:                       # ev is a FireEvent
        print(ev.action, ev.goal_id, ev.text)   # -> fire caffeine Heads up — stepping back from caffeine?

    # The same frame again does NOT re-fire — rising-edge, not level-triggered.
    print(await agent.perceive_judged("sip-coffee"))   # -> []

asyncio.run(main())

perceive_judged(frame) returns a list of FireEvent(goal_id, action, text, entity_id, verdict), where action is "fire" or "ask". With real eyes, swap the fake for a frontier vision backend — the cognition layer above is unchanged:

agent = Percept.create(vision="gemini")          # needs percept-vision[gemini] + GEMINI_API_KEY
# frame is now real image bytes; everything downstream (gate, graph, executor) is identical.

Backends are selected by name ("fake" · "gemini" · "claude") or by passing an adapter instance, and can also be set via env (PERCEPT_VISION_BACKEND, etc.).
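One plausible way to resolve a backend name is sketched below. The names "fake", "gemini", "claude" and the variable PERCEPT_VISION_BACKEND come from the text above; the function `resolve_vision_backend` and the precedence order (explicit argument, then environment, then "fake") are assumptions made for illustration, not percept's documented behaviour.

```python
import os
from typing import Mapping, Optional

KNOWN_VISION_BACKENDS = ("fake", "gemini", "claude")


def resolve_vision_backend(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: pick a backend name.

    Assumed precedence: explicit argument > PERCEPT_VISION_BACKEND > "fake".
    """
    env = os.environ if env is None else env
    name = explicit or env.get("PERCEPT_VISION_BACKEND") or "fake"
    name = name.strip().lower()
    if name not in KNOWN_VISION_BACKENDS:
        raise ValueError(f"unknown vision backend: {name!r}")
    return name


print(resolve_vision_backend(env={}))                                    # fake
print(resolve_vision_backend(env={"PERCEPT_VISION_BACKEND": "gemini"}))  # gemini
print(resolve_vision_backend("claude", env={"PERCEPT_VISION_BACKEND": "gemini"}))  # claude
```

Defaulting to "fake" keeps the offline path the zero-configuration one, which matches the quickstart.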

Architecture — two layers

percept cleanly splits the eyes from the brain. Perception is one stateless seam; everything stateful and proactive is cognition.

┌───────────────────────── COGNITION — the "brain" ──────────────────────────┐
│  three-state GATE       fire (known-yes) · ask (unknown) · silent (known-no) │
│      ▲ rising edge: fires only on false→true, with a refractory (no nag)      │
│  EXECUTOR               one firing path (sense · key · accumulate);           │
│                         transitions A→B, counting, deadlines/absence, verify  │
│  ENTITY GRAPH           stable ids; attributes with provenance + freshness    │
│                         (observed vs you-told-me)                             │
│  EDGE DETECTOR REGISTRY opt-in cheap signals propose; the brain counts/gates  │
│  general · deterministic · cheap · STATEFUL (has memory)                      │
└──────────────────────────────────────────────────────────────────────────────┘
                          ▲  feeds on  Verdict{satisfied, confidence}
┌───────────────────────── PERCEPTION — the "eyes" ──────────────────────────┐
│  vision.judge(condition, frame) -> Verdict        ★ ONE seam                  │
│  GeminiVision / AnthropicVision · stateless · one frame · ~1s/call cloud      │
└──────────────────────────────────────────────────────────────────────────────┘

The gate turns a noisy verdict stream into at most one alert on the rising edge, and falls back to ASK rather than guess when confidence is unreliable. The executor is the single firing path for every concern shape (watch, transition, count, deadline/absence, verification). The entity graph carries memory: stable ids and attributes that know whether they were observed or asserted by the user, and whether they're still fresh.
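To make "provenance + freshness" concrete, here is a minimal sketch of an entity whose attributes remember where they came from and when they expire. `ToyEntity`, `ToyAttribute`, the `ttl_s` field, and the "observed" / "told" labels are hypothetical; percept's entity graph may model this differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToyAttribute:
    value: Any
    provenance: str    # "observed" (seen by perception) or "told" (asserted by the user)
    updated_at: float  # timestamp in seconds
    ttl_s: float       # assumed time-to-live before the value counts as stale

    def is_fresh(self, now: float) -> bool:
        return (now - self.updated_at) <= self.ttl_s


@dataclass
class ToyEntity:
    entity_id: str     # stable id that survives across frames
    attributes: Dict[str, ToyAttribute] = field(default_factory=dict)

    def set(self, name: str, value: Any, provenance: str, now: float, ttl_s: float) -> None:
        self.attributes[name] = ToyAttribute(value, provenance, now, ttl_s)

    def get(self, name: str, now: float) -> Optional[ToyAttribute]:
        attr = self.attributes.get(name)
        if attr is None or not attr.is_fresh(now):
            return None  # missing or stale: report unknown rather than guess
        return attr


mug = ToyEntity("mug-1")
mug.set("contents", "coffee", provenance="told", now=0.0, ttl_s=600.0)
mug.set("location", "desk", provenance="observed", now=0.0, ttl_s=30.0)

print(mug.get("contents", now=60.0).provenance)  # told
print(mug.get("location", now=60.0))             # None
```

Returning `None` for a stale attribute gives the gate an explicit "unknown" to react to, instead of a value that merely looks current.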

The edge detector registry is the edge-proposes · brain-counts · VLM-confirms split: cheap, opt-in detectors emit timed boundaries that the brain accumulates — counting reps without spending a vision token per frame. Three reference skills ship as registered detectors:

  • motion-periodicity — frame-diff motion peaks (rep boundaries);
  • acoustic-onset — energy spikes on the existing mic stream (no new capture path);
  • pose-openness — BlazePose-based rep peaks (opt-in: percept-harness[pose], MediaPipe).
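The "edge proposes, brain counts" idea can be illustrated with a toy motion-periodicity detector: a cheap function proposes boundaries from a per-frame motion series, and the count is simply the number of accepted boundaries. `propose_peaks`, its `threshold` and `min_gap` parameters, and the sample series are all made up for this sketch; the shipped detectors are not reproduced here.

```python
from typing import List, Sequence


def propose_peaks(motion: Sequence[float], threshold: float, min_gap: int) -> List[int]:
    """Return indices of local maxima at or above `threshold`,
    at least `min_gap` frames apart (hypothetical edge detector)."""
    peaks: List[int] = []
    for i in range(1, len(motion) - 1):
        is_local_max = motion[i - 1] < motion[i] >= motion[i + 1]
        if not is_local_max or motion[i] < threshold:
            continue
        if peaks and i - peaks[-1] < min_gap:
            continue  # too close to the previous boundary: ignore
        peaks.append(i)
    return peaks


# Toy frame-diff energy per frame (invented values, not benchmark data).
motion = [0.0, 0.1, 0.6, 0.2, 0.05, 0.1, 0.7, 0.65, 0.1, 0.0, 0.15, 0.5, 0.1]

boundaries = propose_peaks(motion, threshold=0.4, min_gap=3)  # edge proposes
rep_count = len(boundaries)                                   # brain counts
print(boundaries, rep_count)  # [2, 6, 11] 3
```

No vision call happens in this loop; under the split described above, a VLM would only be asked to confirm once the cheap count looks interesting.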

A full request trace — "tell me when the milk is about to boil over" — is in docs/e2e-flow-milk-boilover.md, including an honest map of where the perception ceiling bites.

The three packages

| package | what | install |
|---|---|---|
| percept-vision | the cognition core: gate, entity graph, executor, events, scheduler, consent, fakes. Pure stdlib. | pip install percept-vision (packages/percept-vision/) |
| percept-harness | server-side transport shell + tier-0 salience gate (WatchSpec down / Tier0Signal up); the reference home of the edge detector skills. | packages/percept-harness/ |
| @percept/edge | on-device reactive edge in JS/WASM: VAD + motion gate over the same WatchSpec/Tier0Signal wire-contract. | packages/percept-edge/ (npm @percept/edge) |

Benchmark — Percept Benchmark v1 (v0.5 DRAFT)

The benchmark holds the backbone fixed and measures the orchestration delta across a raw → core → e2e config ladder (no composite score — a vector of headlines). The current card is a v0.5 DRAFT (sampled, N≈12/track, gemini-2.5-flash) on the DeepMind Perception Test (CC-BY-4.0), the private golden ambiguity corpus, and RepCount-A; the e2e-relational accuracy cell is modeled (flagged ~), so the card is stamped DRAFT.

| track | measured (DRAFT, N≈12) | reading |
|---|---|---|
| counting (RepCount-A) | pose OBO 0.33 vs 0.29 (TransRAC, CVPR'22); nMAE 0.80 vs 0.44 | training-free pose ties a trained baseline on OBO, but its failures on low-amplitude actions cost it on MAE |
| acoustic (PT Sound Loc., unseen source) | vision-only recall 0.00 → 0.25–0.33 fused; fusion-FP 0–0.17 | the recall-flip: fusion rescues ~a third of sound events vision-only never hears (honest at scale, not the single-clip 1.0) |
| relational (golden) | confidence AUROC ≈ 0.51 (≈ chance); ASK-rate ≈ 0.4 | the VLM's confidence does not separate right from wrong → justifies the ASK discipline over threshold-tuning |
| timeliness (PT action onsets) | P-PAUC ≈ 60; reaction λ p50 ≈ 1s | first real proactive-timeliness numbers (adapted PAUC) |
| edge event-recall (PT onsets) | 0.00 @ thr 0.10 ⚠️ | the motion-gate escalates on none of the subtle-action onsets (motion ~0.03–0.06 < 0.10) → e2e ≠ core there; a real calibration finding |
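For readers unfamiliar with the counting metrics, the sketch below computes OBO (off-by-one accuracy) and a ground-truth-normalized MAE as they are commonly defined in the repetition-counting literature. That percept's eval uses exactly these forms is an assumption, and the numbers are toy values, not benchmark data.

```python
from typing import Sequence


def obo_accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of clips whose predicted count is within 1 of the true count."""
    hits = sum(1 for p, t in zip(pred, truth) if abs(p - t) <= 1)
    return hits / len(truth)


def normalized_mae(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Mean of |pred - truth| / truth; assumes every true count is non-zero."""
    return sum(abs(p - t) / t for p, t in zip(pred, truth)) / len(truth)


# Toy counts for four clips (invented).
truth = [10, 4, 8, 5]
pred = [9, 4, 2, 10]

print(obo_accuracy(pred, truth))              # 0.5
print(round(normalized_mae(pred, truth), 4))  # 0.4625
```

The two metrics can disagree: a method that is often within one rep but occasionally far off keeps a decent OBO while its normalized MAE grows, which is the pattern the counting row describes.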

A v0.5 DRAFT, not a RELEASE claim. Two findings the benchmark surfaced that a self-congratulatory eval would hide: the flat confidence AUROC (why the gate refuses to guess) and the edge event-recall of 0 on subtle actions (the motion-gate is mis-calibrated for them). Full plan, data, and references: packages/percept-vision/eval/BENCHMARK_PLAN.md.
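Why a flat confidence AUROC argues against threshold-tuning can be seen with a small rank-based AUROC. It is the probability that a randomly chosen correct answer carries higher confidence than a randomly chosen wrong one, with ties counted as half. The function and the toy inputs are illustrative only and are not taken from the benchmark.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Pairwise (Mann-Whitney style) AUROC of confidence as a predictor of correctness."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Confidence that tracks correctness: any threshold between 0.2 and 0.8 works.
print(auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))  # 1.0

# Confidence unrelated to correctness: no threshold separates right from wrong.
print(auroc([0.9, 0.8, 0.9, 0.8], [True, False, False, True]))  # 0.5
```

At an AUROC near 0.5, every confidence threshold lets through roughly the same mix of right and wrong answers, so moving the threshold cannot fix it.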

Layout

packages/percept-vision/    the SDK (pip install percept-vision; import percept)
packages/percept-harness/   server-side tier-0 edge reference + detector skills
packages/percept-edge/      @percept/edge — on-device JS/WASM edge
docs/                       architecture & flow traces
eval/                       golden corpus (benchmark plan: packages/percept-vision/eval/)
Spec/                       the spec + implementation plan
Makefile                    test · eval-live · e2e · bench · check · reproduce

Docs

  • Quickstart: the 60-second offline snippet above; make test runs the fake-only unit lane.
  • Architecture / concepts: docs/e2e-flow-milk-boilover.md (the two layers, the firing path, the perception ceiling) and Spec/ (the full spec + phase plan).
  • Backends: select by name (fake · gemini · claude · deepgram) or env (PERCEPT_*_BACKEND); add via the percept.backends entry-point group. Extras: [gemini] · [claude] · [deepgram] · [audio].
  • Benchmark: eval/BENCHMARK_PLAN.md.

Status

v0.1.0 — early / first public release. The cognition core runs and is tested offline with fakes (the L1 lane, no keys); the frontier backends (Gemini, Claude, Deepgram) and the edge packages are wired behind their seams. The benchmark is a v0.5 DRAFT card. We are now in real-life testing — APIs and numbers may change. Issues and contributions welcome.

License

Apache-2.0.
