autojudge

Smarter experiment evaluation for autoresearch. Replaces eyeballing val_bpb with statistical verdicts that account for noise floor, Pareto efficiency, and trend context.

Install

pip install autojudge

Usage

# Evaluate the latest experiment
autojudge --results results.tsv --run-log run.log

# JSON output for scripting
autojudge --results results.tsv --format json | jq '.verdict'

# One-line verdict
autojudge --results results.tsv --quiet

Verdicts

Verdict       Meaning                                     Exit Code
STRONG_KEEP   Improvement well above noise floor (3x+)    0
KEEP          Improvement likely real (1.5-3x noise)      0
MARGINAL      Improvement within noise (0.5-1.5x)         0
RETEST        Indistinguishable from noise                0
DISCARD       Regression detected                         2
CRASH         OOM or runtime error                        2

Exit code 1 indicates an input error (file not found or parse failure) rather than a verdict.

Scripting

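# Auto-evaluate, then commit or revert; exit code 1 (input error) leaves the repo untouched
autojudge --results results.tsv --run-log run.log; rc=$?
if [ "$rc" -eq 0 ]; then git commit -m "keep"
elif [ "$rc" -eq 2 ]; then git reset --hard HEAD~1; fi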
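
For scripts that need to tell the three documented exit codes apart, a small Python wrapper is one option. The sketch below is not part of autojudge: the function names and the "commit" / "revert" / "abort" labels are hypothetical, and only the CLI flags and exit codes come from this README.

```python
import subprocess

# Exit codes as documented in the Verdicts section.
EXIT_ACCEPT = 0       # STRONG_KEEP, KEEP, MARGINAL, RETEST
EXIT_INPUT_ERROR = 1  # file not found, parse failure
EXIT_REJECT = 2       # DISCARD, CRASH


def action_for_exit_code(code: int) -> str:
    """Map an autojudge exit code to a follow-up action (labels are hypothetical)."""
    if code == EXIT_ACCEPT:
        return "commit"
    if code == EXIT_REJECT:
        return "revert"
    # Exit code 1 or anything unexpected: do not touch the repository.
    return "abort"


def judge_and_act(results: str = "results.tsv", run_log: str = "run.log") -> str:
    """Run the autojudge CLI and return the action; the caller decides how to apply it."""
    proc = subprocess.run(["autojudge", "--results", results, "--run-log", run_log])
    return action_for_exit_code(proc.returncode)
```

Keeping the mapping separate from the git commands means an input error never triggers a reset.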

JSON Output

{
  "verdict": "KEEP",
  "confidence": 0.82,
  "delta_pct": -1.01,
  "noise_floor": 0.02,
  "on_pareto_frontier": true,
  "suggestion": "Improvement looks real. Commit and continue."
}
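
As a rough illustration of consuming this output from Python instead of jq, the sketch below parses the sample payload shown above. The `summarize` helper is hypothetical; only the field names are taken from the sample.

```python
import json

# Sample payload copied from the JSON Output section.
SAMPLE = """
{
  "verdict": "KEEP",
  "confidence": 0.82,
  "delta_pct": -1.01,
  "noise_floor": 0.02,
  "on_pareto_frontier": true,
  "suggestion": "Improvement looks real. Commit and continue."
}
"""


def summarize(payload: str) -> str:
    """Build a one-line summary from an autojudge JSON report (hypothetical helper)."""
    report = json.loads(payload)
    return (
        f"{report['verdict']} "
        f"(confidence {report['confidence']:.2f}, delta {report['delta_pct']:+.2f}%)"
    )


print(summarize(SAMPLE))  # KEEP (confidence 0.82, delta -1.01%)
```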

How It Works

  • Estimates noise floor from pairwise differences between consecutive keeps
  • Scores improvement confidence as a ratio of delta to noise floor
  • Tracks Pareto frontier (val_bpb vs memory efficiency)
  • Detects streaks, plateaus, and diminishing returns
  • Parses run.log for OOM warnings, memory pressure, and training metrics
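
The first three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not autojudge's actual implementation: the median is an assumed estimator for the noise floor, the symmetric -0.5 cutoff for regressions is an assumption, and the Pareto check assumes lower is better for both val_bpb and memory use. The positive thresholds follow the Verdicts table.

```python
from statistics import median


def noise_floor(kept_bpb: list[float]) -> float:
    """Median absolute difference between consecutive kept val_bpb values (assumed estimator)."""
    diffs = [abs(b - a) for a, b in zip(kept_bpb, kept_bpb[1:])]
    if not diffs:
        raise ValueError("need at least two kept results")
    return median(diffs)


def verdict(best_bpb: float, new_bpb: float, floor: float) -> str:
    """Classify a run by the ratio of its improvement to the noise floor."""
    if floor <= 0:
        raise ValueError("noise floor must be positive")
    ratio = (best_bpb - new_bpb) / floor  # positive means val_bpb went down (better)
    if ratio >= 3.0:
        return "STRONG_KEEP"
    if ratio >= 1.5:
        return "KEEP"
    if ratio >= 0.5:
        return "MARGINAL"
    if ratio > -0.5:  # assumption: small regressions are treated as noise too
        return "RETEST"
    return "DISCARD"


def on_pareto_frontier(candidate: tuple[float, float],
                       others: list[tuple[float, float]]) -> bool:
    """True if no other (val_bpb, memory) point is at least as good on both axes and better on one."""
    c_bpb, c_mem = candidate
    for bpb, mem in others:
        if bpb <= c_bpb and mem <= c_mem and (bpb < c_bpb or mem < c_mem):
            return False
    return True
```

For example, with kept values `[1.000, 0.990, 0.985, 0.975]` the estimated floor is about 0.01, so a new run at 0.955 against a best of 0.975 gives a ratio near 2 and lands in the KEEP band.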

Requirements

  • Python >= 3.10
  • A results.tsv file from autoresearch

License

MIT
