Recorded-video eval tools for smart-glasses apps
GlassKit CLI
This is the GlassKit command-line package. Its first command family is glasskit eval, a recorded-video evaluator for smart-glasses apps.
Smart-glasses apps are hard to test by hand because the input is physical, visual, and timing-sensitive. Given a recording of a workflow, glasskit eval lets you label the important moments and rerun the same checks whenever your prompts, model, parser, or app logic changes. Your adapter owns the app-specific call; the CLI handles video decoding, timestamp sampling, comparisons, reports, failure artifacts, and quality gates.
The current implementation assumes you use a uv-managed Python pipeline. If your use case does not fit the current model, please open an issue. We want to expand support based on real app needs.
Why Use glasskit eval?
- Turn real camera recordings into repeatable tests instead of relying on memory, screenshots, or manual replay.
- Test the vision path that users actually depend on: frames, prompts, model calls, response parsing, app logic, and thresholds.
- Label only stable moments in the video and skip ambiguous transitions that would make a test noisy.
- Keep app-specific behavior in your adapter while reusing the CLI for video handling, comparison modes, JSON reports, and failure images.
- Run the same suite locally and in CI with exit codes that distinguish setup errors from quality-gate failures.
What You Need
- A short recording of the workflow you want to evaluate.
- An expected.yaml file that says which timestamps or ranges should be checked and what each result should be.
- A Python adapter function or object that receives decoded frames and returns JSON-like observations.
- A uv command environment that can install glasskit.ai and your app's runtime dependencies.
Install
Use the CLI from the app repository that contains the eval suite, adapter, and app dependencies. Running from that directory keeps imports and relative paths predictable. If your app uses a .env file, pass it through the command runner, for example with uv run --env-file .env; the CLI does not load .env files by itself. With uv, install the published glasskit.ai package into the command environment and invoke the glasskit console script in one command:
cd path/to/your-app
uv run --with glasskit.ai glasskit eval --help
If you already installed the package into the project environment (for example, with uv add --dev glasskit.ai), you can drop the --with glasskit.ai option and run uv run glasskit eval ..., or call glasskit eval ... directly from an activated environment.
Run help when you need the exact options for the installed version:
uv run --with glasskit.ai glasskit --help
uv run --with glasskit.ai glasskit eval --help
uv run --with glasskit.ai glasskit eval run --help
Quick Start
Run these commands from your app repository so local imports, adapter files, and relative asset paths resolve the same way they do in your app. Add --env-file .env to uv run if your adapter expects environment variables from that file.
1. Create a Case from a Recording
uv run --with glasskit.ai glasskit eval init-case \
--suite eval-suite \
--case fold-step-001 \
--video path/to/recording.mp4 \
--target step_1 \
--label "Step 1"
This creates eval-suite/fold-step-001/, copies the recording into the case directory, and writes a starter expected.yaml.
2. Label the Moments You Care About
Edit eval-suite/fold-step-001/expected.yaml so the timestamps and expected values match the video. Start with one or two unambiguous samples:
version: 1
video: "video.mp4"
sampling:
  every_s: 0.5
targets:
  step_1:
    label: Step 1
    samples:
      - at: 2.0
        field: matches
        expect: true
      - range: [3.0, 5.0]
        field: matches
        expect: true
Use at for a single moment and range for a stable window. Avoid transition frames until you specifically want to measure transition behavior.
3. Check the Wiring with a Fake Adapter
Create eval_adapter.py in your app repository:
def evaluate_sample(sample, target):
    return {
        "target": target.id,
        "timestamp_s": sample.timestamp_s,
        "matches": target.id == "step_1",
    }
This adapter does not judge the image. It only proves that the suite, video decoding, field extraction, comparison, and command wiring work before you connect a model backend.
4. Validate and Inspect the Schedule
uv run --with glasskit.ai glasskit eval validate --suite eval-suite --adapter eval_adapter.py:evaluate_sample
uv run --with glasskit.ai glasskit eval list-samples --suite eval-suite
5. Run the Eval
Run the fake adapter first:
uv run --with glasskit.ai glasskit eval run \
--adapter eval_adapter.py:evaluate_sample \
--suite eval-suite \
--min-pass-rate 1.0
After that passes, replace the fake adapter body with the real call into your app or model backend and run with the options you want for local debugging or CI:
uv run --with glasskit.ai glasskit eval run \
--adapter eval_adapter.py:create_evaluator \
--suite eval-suite \
--min-pass-rate 0.9 \
--output-json tmp/eval-results.json \
--save-failures \
--artifacts-dir tmp/eval-artifacts
The command exits 0 when all quality gates pass, 1 when the eval ran but one or more gates failed, and 2 for setup or runtime errors such as invalid YAML, unreadable videos, or adapter failures that are not being collected with --keep-going.
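For CI scripts that wrap the command, the exit codes above can be mapped to readable outcomes. The following is a minimal sketch, not part of the CLI: the helper names describe_exit_code and run_eval are hypothetical, and only the three exit codes and the command line itself come from this document.

```python
import subprocess

# Exit-code meanings as documented for `glasskit eval run`.
EXIT_MEANINGS = {
    0: "all quality gates passed",
    1: "eval ran, but one or more gates failed",
    2: "setup or runtime error",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit code into a short human-readable outcome."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_eval(extra_args: list[str]) -> int:
    """Hypothetical wrapper: run the eval and print what the exit code means."""
    command = ["uv", "run", "--with", "glasskit.ai", "glasskit", "eval", "run", *extra_args]
    completed = subprocess.run(command)
    print(describe_exit_code(completed.returncode))
    return completed.returncode
```

A CI job might treat exit code 1 as a quality regression to review and exit code 2 as a broken pipeline to fix first.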
Recommended App Repo Layout
Keep the eval suite next to the adapter and app code that it exercises. This makes imports and relative asset paths predictable, and keeps environment handling close to the code that needs it.
your-app/
  eval_adapter.py
  adapter-config.yaml
  eval-suite/
    fold-step-001/
      video.mp4
      expected.yaml
    fold-step-002/
      video.mp4
      expected.yaml
Commit the suite files that your team should share. Keep secrets in environment variables or uncommitted environment files, not in expected.yaml or adapter-config.yaml.
Eval Suite Layout
An eval suite is a directory containing one or more case directories. Each case has an expected.yaml file and either one video file in the case directory or a video: path in expected.yaml.
eval-suite/
  fold-step-001/
    video.mp4
    expected.yaml
  fold-step-002/
    camera-recording.mov
    expected.yaml
A single-case suite is also supported by placing expected.yaml directly in the suite directory.
Supported video suffixes are .mp4, .mov, .m4v, .webm, and .mkv. Timestamps in expected.yaml are seconds from the start of the clip, even when the container stores non-zero presentation timestamps internally.
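The rule that a case without a video: path needs exactly one supported video file can be sketched as follows. This is an illustration only: find_case_video is a hypothetical helper, and matching suffixes case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path

# Suffixes listed as supported in this document.
SUPPORTED_SUFFIXES = {".mp4", ".mov", ".m4v", ".webm", ".mkv"}


def find_case_video(case_dir: Path) -> Path:
    """Return the single supported video in case_dir, or raise ValueError."""
    candidates = sorted(
        path
        for path in case_dir.iterdir()
        # Assumption: suffix matching ignores case (".MOV" counts as ".mov").
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one video in {case_dir}, found {len(candidates)}"
        )
    return candidates[0]
```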
Writing expected.yaml
Here is a representative case file:
version: 1
video: video.mp4
description: Fold step 1 should be detected after the crease is completed.
sampling:
  every_s: 0.5
workflow:
  targets:
    - id: step_1
      label: Step 1
      prompt_id: origami.step_1
targets:
  step_1:
    label: Step 1
    config:
      reference_image: assets/step_1.png
    samples:
      - range: [0.0, 6.8]
        expect: false
      - range: [7.4, 11.8]
        expect: true
  step_2:
    label: Step 2
    samples:
      - at: [4.0, 6.0]
        expect: false
thresholds:
  min_pass_rate: 0.9
  max_failures: 2
  per_target:
    step_1:
      min_pass_rate: 0.95
Ranges are interpreted as [start, end). With sampling.every_s: 0.5, range: [7.4, 8.6] expands to samples at 7.4, 7.9, and 8.4 seconds. Only declared range and at samples are evaluated; unlabeled gaps are skipped. A sample block must contain exactly one of range or at.
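The half-open expansion described above can be reproduced with a few lines of Python. This is a sketch of the arithmetic, not the CLI's implementation: expand_range is a hypothetical name, and rounding to six decimals is an assumption made here to keep floating-point drift out of the result.

```python
def expand_range(start: float, end: float, every_s: float = 0.5) -> list[float]:
    """Expand [start, end) into timestamps spaced every_s seconds apart."""
    if every_s <= 0:
        raise ValueError("every_s must be positive")
    timestamps = []
    index = 0
    while True:
        # Multiply instead of accumulating so rounding errors do not pile up.
        t = round(start + index * every_s, 6)
        if t >= end:  # the end of the range is exclusive
            break
        timestamps.append(t)
        index += 1
    return timestamps


print(expand_range(7.4, 8.6))  # [7.4, 7.9, 8.4]
```

Because the end is exclusive, a range of [3.0, 5.0] yields samples up to 4.5 and never 5.0.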
Use ranges for stable windows where the expected answer should be unchanged. Use at for isolated moments or when a transition is too short to sample safely. Avoid labeling ambiguous transition frames unless the ambiguity is exactly what you want to measure.
Case Fields
- version must be 1.
- video is an optional path to the case video, resolved relative to the case directory. If it is omitted, the case directory must contain exactly one supported video file.
- description is optional and only for humans.
- sampling.every_s sets the default sample interval for range blocks in the case. The default is 0.5 seconds.
- workflow.targets is optional metadata matched to targets by each entry's id. Each entry must have id; label is optional; extra fields are passed to the adapter as target.config unless overridden by targets.<id>.config.
- targets.<target_id>.label is optional display text for reports.
- targets.<target_id>.config is optional adapter-specific metadata for that target. This is where you can put prompt ids, reference image paths, class names, or other app-level data that the core CLI should not know about.
- targets.<target_id>.samples is the required list of labeled sample blocks.
- thresholds is optional case-level gating. It can contain min_pass_rate, max_failures, and per_target.<target_id>.min_pass_rate.
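The way workflow.targets extras and targets.<id>.config combine into target.config can be pictured as a dictionary merge. The sketch below assumes a per-key override, which this document implies but does not spell out; merge_target_config is a hypothetical helper name.

```python
def merge_target_config(workflow_entry: dict, target_block: dict) -> dict:
    """Build target.config from a workflow.targets entry and a targets.<id> block."""
    # Extra fields are everything except the reserved id and label keys.
    merged = {
        key: value
        for key, value in workflow_entry.items()
        if key not in ("id", "label")
    }
    # Assumption: keys in targets.<id>.config win over workflow extras one by one.
    merged.update(target_block.get("config") or {})
    return merged


workflow_entry = {"id": "step_1", "label": "Step 1", "prompt_id": "origami.step_1"}
target_block = {"label": "Step 1", "config": {"reference_image": "assets/step_1.png"}}
print(merge_target_config(workflow_entry, target_block))
```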
Expected Values and Comparison
The adapter returns a JSON-like value: null, boolean, number, string, array, or object. Each sample's expect value is compared with that returned value.
By default, booleans, strings, and null use exact comparison. Numbers use numeric comparison with zero tolerance unless you set one. Arrays and objects use exact comparison unless you choose another mode.
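As a rough illustration of those defaults, the mode can be inferred from the type of the expect value. default_compare_mode is a hypothetical helper written for this example; the CLI's own inference may differ in detail.

```python
def default_compare_mode(expect) -> str:
    """Pick the comparison mode this document describes as the default."""
    # bool is a subclass of int in Python, so rule it out before the number check.
    if expect is None or isinstance(expect, (bool, str)):
        return "exact"
    if isinstance(expect, (int, float)):
        return "numeric"  # zero tolerance unless the sample sets one
    return "exact"  # arrays and objects


print(default_compare_mode(True))   # exact
print(default_compare_mode(0.75))   # numeric
```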
Use field when the adapter returns a structured object but the sample only cares about one nested value:
targets:
  detector:
    samples:
      - at: 2.0
        field: result.matches
        expect: true
Field paths are dot-separated. Mapping keys are matched by name, and list indexes can be addressed with non-negative numeric path parts such as detections.0.label.
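A field path lookup along those lines might look like the sketch below. resolve_field is a hypothetical name, and raising KeyError for a missing part is an assumption chosen for the example.

```python
def resolve_field(observation, path: str):
    """Walk a dot-separated path through nested dicts and lists."""
    current = observation
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]  # mapping keys are matched by name
        elif (
            isinstance(current, list)
            and part.isascii()
            and part.isdigit()  # non-negative numeric parts address list indexes
            and int(part) < len(current)
        ):
            current = current[int(part)]
        else:
            raise KeyError(f"missing field {path!r} (stopped at {part!r})")
    return current


observation = {"result": {"matches": True}, "detections": [{"label": "paper"}]}
print(resolve_field(observation, "detections.0.label"))  # paper
```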
Supported compare.mode values are:
- exact requires the observed value to equal expect.
- numeric requires both values to be numbers and allows tolerance.
- json_subset requires every key and value in expect to be present in the observed object. For arrays, each expected item must match at least one observed item.
- set_equals compares arrays as unordered sets.
- set_contains_any passes when at least one expected array item is present in the observed array.
- set_contains_all passes when every expected array item is present in the observed array.
For example:
targets:
  score:
    samples:
      - at: 1.0
        expect: 0.75
        compare:
          mode: numeric
          tolerance: 0.05
  metadata:
    samples:
      - at: 1.0
        expect:
          result:
            matches: true
        compare:
          mode: json_subset
  objects:
    samples:
      - at: 1.0
        expect: ["paper", "crease"]
        compare:
          mode: set_contains_all
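To make the comparison modes concrete, here is a small sketch that follows the descriptions in this section. The function names compare and _subset are hypothetical, the edge-case behavior (for example, returning False for non-list inputs to the set modes) is an assumption, and the real CLI may report reasons rather than a bare boolean.

```python
def _is_number(value) -> bool:
    # Exclude bool, which Python treats as a subclass of int.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _subset(expected, observed) -> bool:
    """True when every key, value, and array item in expected appears in observed."""
    if isinstance(expected, dict):
        return isinstance(observed, dict) and all(
            key in observed and _subset(value, observed[key])
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return isinstance(observed, list) and all(
            any(_subset(item, candidate) for candidate in observed)
            for item in expected
        )
    return expected == observed


def compare(observed, expected, mode: str = "exact", tolerance: float = 0.0) -> bool:
    if mode == "exact":
        return observed == expected
    if mode == "numeric":
        if not (_is_number(observed) and _is_number(expected)):
            return False
        return abs(observed - expected) <= tolerance
    if mode == "json_subset":
        return _subset(expected, observed)
    if mode in ("set_equals", "set_contains_any", "set_contains_all"):
        if not (isinstance(observed, list) and isinstance(expected, list)):
            return False
        hits = [item in observed for item in expected]
        if mode == "set_contains_any":
            return any(hits)
        if mode == "set_contains_all":
            return all(hits)
        # set_equals: containment must hold in both directions.
        return all(hits) and all(item in expected for item in observed)
    raise ValueError(f"unknown compare mode: {mode}")
```

List membership is used instead of Python sets so that unhashable items such as nested objects can still be compared.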
Suite-Level Thresholds
Put thresholds that should apply to the selected run as a whole in suite.yaml at the suite root. Suite-level min_pass_rate and max_failures gates are evaluated against the combined selected results. Suite-level per_target entries apply across the selected samples for each target id:
thresholds:
  min_pass_rate: 0.9
  max_failures: 5
  per_target:
    step_1:
      min_pass_rate: 0.95
    step_2:
      min_pass_rate: 0.85
Every run also includes an adapter_errors gate. The run only succeeds if the adapter produced no runtime or comparison errors and every configured quality gate passed.
Failed comparisons are intentionally controlled by quality gates instead of a built-in default failure policy. If you run without --min-pass-rate, --min-target-pass-rate, --max-failures, or YAML thresholds, failed comparisons are reported in the summary and JSON output but do not make the command exit nonzero; adapter, runtime, or comparison errors still fail through the adapter_errors gate. Configure a pass-rate or max-failures gate for CI or any run where failed observations should fail the command.
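The interaction between failed comparisons, errors, and gates can be sketched in a few lines. This is an illustration under stated assumptions, not the CLI's code: evaluate_gates and exit_code are hypothetical names, the pass rate is assumed to be passed samples divided by all samples (including errored ones), and collected errors are assumed to produce exit code 1 through the adapter_errors gate.

```python
def evaluate_gates(passed, failed, errored, min_pass_rate=None, max_failures=None):
    """Return a mapping of gate name to True (passed) or False (failed)."""
    total = passed + failed + errored
    # The adapter_errors gate is always present.
    gates = {"adapter_errors": errored == 0}
    if min_pass_rate is not None:
        rate = passed / total if total else 0.0  # assumption: empty runs score 0.0
        gates["min_pass_rate"] = rate >= min_pass_rate
    if max_failures is not None:
        gates["max_failures"] = failed <= max_failures
    return gates


def exit_code(gates) -> int:
    """0 when every gate passed, 1 otherwise (setup errors would be 2)."""
    return 0 if all(gates.values()) else 1


# Without any pass-rate or max-failures gate, failed comparisons alone do not fail the run.
print(exit_code(evaluate_gates(passed=8, failed=2, errored=0)))                     # 0
print(exit_code(evaluate_gates(passed=8, failed=2, errored=0, min_pass_rate=0.9)))  # 1
```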
CLI gates are useful for one-off CI jobs or local experiments:
uv run --with glasskit.ai glasskit eval run \
--adapter eval_adapter.py:create_evaluator \
--suite eval-suite \
--min-pass-rate 0.9 \
--min-target-pass-rate 0.85 \
--max-failures 3
--min-pass-rate and --max-failures override suite-level YAML values for the run. Because those flags define a run-level pass/fail policy, case-level YAML gates are not applied when either flag is set. --min-target-pass-rate applies the same target pass-rate gate to every target present in the selected results; when it is set, suite-level per_target gates are replaced by the uniform CLI gate. With --case, suite-level per-target gates for targets outside the selected case are skipped.
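The precedence rules in the paragraph above can be summarized as a small resolution function. This is a sketch only: resolve_run_gates and its return shape are hypothetical, and it assumes each CLI flag overrides only its own suite-level value.

```python
def resolve_run_gates(
    suite_thresholds,
    *,
    min_pass_rate=None,         # value of --min-pass-rate, if given
    max_failures=None,          # value of --max-failures, if given
    min_target_pass_rate=None,  # value of --min-target-pass-rate, if given
    target_ids=(),              # target ids present in the selected results
):
    run = dict(suite_thresholds)
    per_target = dict(run.pop("per_target", {}))
    # Assumption: each flag replaces only the matching suite-level value.
    if min_pass_rate is not None:
        run["min_pass_rate"] = min_pass_rate
    if max_failures is not None:
        run["max_failures"] = max_failures
    if min_target_pass_rate is not None:
        # The uniform CLI gate replaces suite-level per_target entries.
        per_target = {tid: {"min_pass_rate": min_target_pass_rate} for tid in target_ids}
    else:
        # Gates for targets outside the selected results are skipped.
        per_target = {tid: gate for tid, gate in per_target.items() if tid in target_ids}
    # Case-level YAML gates are dropped when either run-level flag is set.
    apply_case_gates = min_pass_rate is None and max_failures is None
    return {"run": run, "per_target": per_target, "apply_case_gates": apply_case_gates}
```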
Commands
glasskit eval init-case
init-case creates a case directory, copies the source video into it when needed, and writes a starter expected.yaml:
uv run --with glasskit.ai glasskit eval init-case \
--suite eval-suite \
--case fold-step-001 \
--video recordings/fold-step-001.mp4 \
--target step_1 \
--label "Step 1"
The case name must be a single directory name under the suite. If the source video is already inside the case directory, the generated video: path is written relative to the case directory. Use --force to overwrite an existing expected.yaml or case video.
glasskit eval validate
validate checks suite structure, YAML schema, video readability, sample timestamps, and optional adapter importability:
uv run --with glasskit.ai glasskit eval validate --suite eval-suite
uv run --with glasskit.ai glasskit eval validate --suite eval-suite --adapter eval_adapter.py:create_evaluator
uv run --with glasskit.ai glasskit eval validate --suite eval-suite --case fold-step-001
Use validation before long or paid model evals. It catches most local mistakes without decoding sample frames or calling evaluate. Passing --adapter imports, constructs, and closes the adapter, so adapter setup side effects can still run.
glasskit eval list-samples
list-samples prints the expanded sample schedule:
uv run --with glasskit.ai glasskit eval list-samples --suite eval-suite
uv run --with glasskit.ai glasskit eval list-samples --suite eval-suite --case fold-step-001
This is the quickest way to confirm that your ranges, point samples, fields, and explicit comparison modes expand as intended. When a sample omits compare.mode, the Mode column is blank because the default mode is inferred when the sample is evaluated.
glasskit eval run
run decodes sample frames, calls the adapter, compares results, applies gates, prints a summary, and optionally writes JSON and failure artifacts:
uv run --with glasskit.ai glasskit eval run \
--adapter eval_adapter.py:create_evaluator \
--suite eval-suite \
--case fold-step-001 \
--adapter-config adapter-config.yaml \
--keep-going \
--verbose \
--output-json tmp/eval-results.json \
--save-failures \
--artifacts-dir tmp/eval-artifacts
- --case limits the run to one case directory by name.
- --adapter-config reads a YAML or JSON object and passes it to the adapter factory.
- --keep-going records adapter or comparison errors as errored sample results instead of aborting the run on the first error.
- --verbose prints every sample result as it is produced and sets AdapterConfig.verbose for the adapter.
- --output-json writes a machine-readable report with summary counts, elapsed run duration, gate results, and per-sample observations. The final console summary also shows the elapsed duration.
- --save-failures saves failed sample frames and per-result JSON files. If --artifacts-dir is omitted, artifacts are written under .glasskit-artifacts in the suite directory.
- --allow-empty allows suites or cases with no samples. This is mainly useful while drafting a suite, not for real quality gates.
Writing an Adapter
An adapter is the bridge between the generic CLI and your app. It receives decoded video frames plus target metadata, calls your app or model backend, and returns a JSON-like observation for each sample.
Pass an adapter as <module-or-file>:<callable>. The module side can be a Python import path such as my_app.eval_adapter or a file path such as eval_adapter.py. The callable side can name a function, class, or nested attribute such as create_evaluator or EvalAdapters.fold_checker.
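To show how a <module-or-file>:<callable> spec could be resolved, here is a sketch built on the standard importlib machinery. The helper names are hypothetical, and treating any module part that ends in .py as a file path is an assumption; the CLI's actual resolution logic is not documented here.

```python
import importlib
import importlib.util
from functools import reduce
from pathlib import Path


def split_adapter_spec(spec: str) -> tuple[str, str]:
    """Split 'module-or-file:callable' on the last colon."""
    module_part, sep, attr_path = spec.rpartition(":")
    if not sep or not module_part or not attr_path:
        raise ValueError(f"expected <module-or-file>:<callable>, got {spec!r}")
    return module_part, attr_path


def load_module(module_part: str):
    """Import by file path (assumed when it ends in .py) or by import path."""
    if module_part.endswith(".py"):
        path = Path(module_part).resolve()
        module_spec = importlib.util.spec_from_file_location(path.stem, path)
        if module_spec is None or module_spec.loader is None:
            raise ImportError(f"cannot load adapter file {path}")
        module = importlib.util.module_from_spec(module_spec)
        module_spec.loader.exec_module(module)
        return module
    return importlib.import_module(module_part)


def resolve_attr(root, attr_path: str):
    """Follow nested attributes such as 'EvalAdapters.fold_checker'."""
    return reduce(getattr, attr_path.split("."), root)


def load_adapter(spec: str):
    module_part, attr_path = split_adapter_spec(spec)
    return resolve_attr(load_module(module_part), attr_path)
```

Splitting on the last colon keeps file paths that contain a colon, such as Windows drive letters, intact.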
The recommended shape is a factory that accepts one required AdapterConfig argument and returns an evaluator object:
from __future__ import annotations

import os
from typing import Any

# call_model_backend and close_model_client stand in for your app's own
# functions; import them from your app code.


def create_evaluator(config: Any) -> "FoldEvaluator":
    settings = dict(config.config)
    return FoldEvaluator(
        api_key=os.environ["MODEL_API_KEY"],
        model=settings.get("model", "default-model"),
        verbose=bool(config.verbose),
    )


class FoldEvaluator:
    def __init__(self, *, api_key: str, model: str, verbose: bool) -> None:
        self._api_key = api_key
        self._model = model
        self._verbose = verbose

    async def evaluate(self, sample: Any, target: Any) -> bool:
        image = sample.image
        target_id = target.id
        prompt_id = target.config.get("prompt_id", target_id)
        return await call_model_backend(
            api_key=self._api_key,
            model=self._model,
            image=image,
            prompt_id=prompt_id,
            timestamp_s=sample.timestamp_s,
        )

    async def close(self) -> None:
        await close_model_client()
No-argument factories and evaluator classes are also supported, but they will not receive AdapterConfig. If the factory needs --adapter-config, --artifacts-dir, --verbose, or the suite path, give it one required argument.
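One way to picture the distinction between factories that receive AdapterConfig and those that do not is a signature check. The sketch below uses inspect.signature from the standard library; required_positional_count and build_evaluator are hypothetical names, and the CLI may detect this differently.

```python
import inspect


def required_positional_count(factory) -> int:
    """Count positional parameters that have no default value."""
    positional = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    return sum(
        1
        for param in inspect.signature(factory).parameters.values()
        if param.kind in positional and param.default is inspect.Parameter.empty
    )


def build_evaluator(factory, adapter_config):
    """Pass adapter_config only to factories that declare one required argument."""
    if required_positional_count(factory) == 1:
        return factory(adapter_config)
    return factory()
```

For a class, inspect.signature reports the constructor parameters without self, so the same check covers evaluator classes.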
evaluate(sample, target) may be synchronous or asynchronous. If the evaluator also implements evaluate_many(samples, target), the runner calls it once per target and uses the returned list as the observations for that target's samples. evaluate_many must return exactly one observation for each input sample in the same order.
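The contract for evaluate and evaluate_many can be illustrated with a small runner that accepts both synchronous and asynchronous evaluators. observe_target is a hypothetical helper written for this example, not the CLI's runner, and raising ValueError on a length mismatch is an assumption.

```python
import asyncio
import inspect


async def _maybe_await(value):
    """Await coroutines and other awaitables; pass plain values through."""
    return await value if inspect.isawaitable(value) else value


async def observe_target(evaluator, samples, target):
    """Collect one observation per sample, preferring evaluate_many when present."""
    if hasattr(evaluator, "evaluate_many"):
        observations = list(await _maybe_await(evaluator.evaluate_many(samples, target)))
        if len(observations) != len(samples):
            raise ValueError(
                f"evaluate_many returned {len(observations)} observations "
                f"for {len(samples)} samples"
            )
        return observations
    return [await _maybe_await(evaluator.evaluate(sample, target)) for sample in samples]


class DoublingEvaluator:
    """Toy synchronous evaluator used only to exercise the runner."""

    def evaluate(self, sample, target):
        return sample * 2


print(asyncio.run(observe_target(DoublingEvaluator(), [1, 2, 3], target=None)))  # [2, 4, 6]
```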
close() is optional and may be synchronous or asynchronous. Use it to close HTTP clients, model sessions, or temporary resources.
Adapter Inputs
The factory receives AdapterConfig when it declares one required argument. AdapterConfig has these fields:
- suite_path is the resolved path to the eval suite.
- config is the object loaded from --adapter-config; it is an empty mapping when the option is omitted.
- artifacts_dir is the path from --artifacts-dir, or None when the option is omitted.
- verbose mirrors --verbose.
The evaluator receives a sample object with these fields:
- image is a decoded RGB PIL.Image.Image for the requested timestamp.
- timestamp_s is the requested sample timestamp in seconds from the start of the clip.
- frame_index is the decoded video frame index chosen for that timestamp.
- sample_index is the case-local sample index.
- video_path is the source video path as a string.
- case_name is the case directory name.
The evaluator also receives a target object with these fields:
- id is the target id from expected.yaml.
- index is the target's zero-based order in the case file.
- label is the optional target label.
- config is the merged target metadata from workflow.targets and targets.<id>.config.
Simple Function Adapters
For smoke checks or fake local evals, the adapter target can be a function whose first two positional arguments are either image, target_id or sample, target:
def evaluate_frame(image, target_id):
    return target_id == "step_1"


async def evaluate_sample(sample, target):
    return {
        "target": target.id,
        "bright": sample.image.convert("L").getextrema()[1] > 180,
    }
Function adapters are useful for testing suite wiring because they do not need a model backend. Production adapters should usually use the object shape so they can reuse clients and close resources cleanly.
Adapter Config Files
Use --adapter-config for values that should not live in expected.yaml, such as backend URLs, model names, thresholds owned by the adapter, or local asset paths:
api_url: "https://example.test/v1"
model: "vision-checker"
jpeg_quality: 90
The CLI only parses the file as YAML or JSON and passes the resulting object to AdapterConfig.config; it does not expand environment variables inside the file. Read secrets directly from environment variables in the adapter and use --adapter-config for non-secret runtime settings.
Adapter Return Values
Return the smallest stable value that answers the target. For a binary detector, return true or false. For a classifier, return a string label. For richer workflows, return an object and use field or json_subset in expected.yaml.
Good observations are deterministic, JSON-like, and easy to inspect in the JSON report:
return {
    "result": {
        "matches": True,
        "confidence": 0.94,
        "label": "folded",
    }
}
Avoid returning SDK objects, dataclasses, images, bytes, or other values that cannot be serialized to JSON. They make reports harder to read and may fail when written to --output-json.
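An adapter can guard against this before returning by attempting a JSON round trip. The following is a minimal sketch using the standard json module; ensure_json_like is a hypothetical helper, and rejecting NaN and infinity via allow_nan=False is a choice made for this example rather than a documented CLI rule.

```python
import json


def ensure_json_like(observation):
    """Return observation unchanged, or raise TypeError if it cannot be serialized."""
    try:
        # allow_nan=False also rejects NaN and infinity, which are not valid JSON.
        json.dumps(observation, allow_nan=False)
    except (TypeError, ValueError) as exc:
        raise TypeError(f"observation is not JSON-serializable: {exc}") from exc
    return observation


ensure_json_like({"result": {"matches": True, "confidence": 0.94, "label": "folded"}})
```

Calling this at the end of evaluate surfaces a bad return value at the sample that produced it instead of when the report is written.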
Practical Adapter Pattern
A model-backed adapter usually follows this flow:
import os

# make_async_client, encode_image, and parse_response stand in for your app's
# own helpers; import them from your app code.


def create_evaluator(config):
    settings = dict(config.config)
    return MyEvaluator(
        api_key=settings.get("api_key") or os.environ["MODEL_API_KEY"],
        api_url=settings.get("api_url", "https://api.example.test"),
        model=settings.get("model", "vision-model"),
    )


class MyEvaluator:
    def __init__(self, *, api_key, api_url, model):
        self._client = make_async_client(api_key=api_key, base_url=api_url)
        self._model = model

    async def evaluate_many(self, samples, target):
        # One observation per sample, in the same order as the input.
        return [await self.evaluate(sample, target) for sample in samples]

    async def evaluate(self, sample, target):
        prompt = target.config.get("prompt", f"Check {target.id}.")
        image_payload = encode_image(sample.image)
        response = await self._client.check(
            model=self._model,
            prompt=prompt,
            image=image_payload,
        )
        return parse_response(response)

    async def close(self):
        await self._client.aclose()
Keep retries, response parsing, prompt construction, and backend-specific error handling in the adapter. Keep generic eval semantics in expected.yaml and the CLI.
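As one example of retry handling that belongs in the adapter, the sketch below wraps an async backend call with exponential backoff. with_retries and its parameters are hypothetical and not part of GlassKit; real adapters would usually narrow retry_on to the backend's transient error types.

```python
import asyncio


async def with_retries(call, *, attempts=3, base_delay_s=0.5, retry_on=(Exception,)):
    """Await call() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the error reach the runner
            await asyncio.sleep(base_delay_s * 2 ** (attempt - 1))
```

Inside evaluate, the backend call could then be written as await with_retries(lambda: client.check(...)), where client.check is whatever coroutine function your backend exposes.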
Debugging Failed Runs
Start with validation:
uv run --with glasskit.ai glasskit eval validate --suite eval-suite --adapter eval_adapter.py:create_evaluator
Then list samples and run one case:
uv run --with glasskit.ai glasskit eval list-samples --suite eval-suite --case fold-step-001
uv run --with glasskit.ai glasskit eval run --suite eval-suite --case fold-step-001 --adapter eval_adapter.py:create_evaluator --verbose
If the adapter is unstable or expensive, add --keep-going --save-failures --output-json tmp/eval-results.json. The saved failure images show exactly what frame the adapter saw, and the JSON report includes the raw observation, extracted field, comparison mode, and reason for each sample.
Common issues:
- adapter target not found means the <module-or-file>:<callable> path imported successfully but the callable name could not be resolved.
- adapter import failed usually means the working directory, PYTHONPATH, or app environment does not include the adapter's dependencies.
- video file does not exist means the video: path is wrong or is being resolved from the case directory differently than expected.
- sample ... exceeds video duration means a timestamp is beyond the readable video duration. Check the recording length and the units in expected.yaml.
- missing field means the adapter returned a value that does not contain the sample's field path.
- invalid_observation: adapter returned null means the adapter returned None for a sample whose expected value was not null.
Technical Details
For contributor-oriented implementation notes, see AGENTS.md.