autojudge

Smarter experiment evaluation for autoresearch. Replaces eyeballing val_bpb with statistical verdicts that account for noise floor, Pareto efficiency, and trend context.

Install

pip install autojudge

Usage

# Evaluate the latest experiment
autojudge --results results.tsv --run-log run.log

# JSON output for scripting
autojudge --results results.tsv --format json | jq '.verdict'

# One-line verdict
autojudge --results results.tsv --quiet

Verdicts

Verdict       Meaning                                     Exit code
STRONG_KEEP   Improvement well above noise floor (3x+)    0
KEEP          Improvement likely real (1.5-3x noise)      0
MARGINAL      Improvement within noise (0.5-1.5x)         0
RETEST        Indistinguishable from noise                0
DISCARD       Regression detected                         2
CRASH         OOM or runtime error                        2

Exit code 1 indicates an input error (file not found or parse failure).
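The thresholds in the table can be read as a ratio of improvement to noise floor. The following is a minimal sketch of that mapping, not autojudge's actual code: the function name `classify`, the handling of values exactly on a boundary, and the choice to treat any worsening of val_bpb as DISCARD are all assumptions made for illustration.

```python
def classify(val_bpb, prev_best, noise_floor, crashed=False):
    """Map an experiment result to (verdict, exit_code).

    Hypothetical helper: the thresholds and exit codes come from the table,
    everything else (boundaries, regression rule) is an assumption.
    """
    if crashed:
        return "CRASH", 2
    if noise_floor <= 0:
        raise ValueError("noise_floor must be positive")

    # Lower val_bpb is better, so a positive value here is an improvement.
    improvement = prev_best - val_bpb
    if improvement < 0:
        # Assumption: any worsening counts as a regression.
        return "DISCARD", 2

    ratio = improvement / noise_floor
    if ratio >= 3.0:
        return "STRONG_KEEP", 0
    if ratio >= 1.5:
        return "KEEP", 0
    if ratio >= 0.5:
        return "MARGINAL", 0
    return "RETEST", 0


# An improvement of 0.04 against a noise floor of 0.02 is a 2x ratio.
print(classify(3.87, 3.91, 0.02))
```

Note that four of the six verdicts exit with 0, so a script that only checks for success will also accept MARGINAL and RETEST results.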

Scripting

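# Auto-evaluate, then commit or revert based on the exit code
autojudge --results results.tsv --run-log run.log
case $? in
    0) git commit -m "keep" ;;                # STRONG_KEEP, KEEP, MARGINAL, RETEST
    2) git reset --hard HEAD~1 ;;             # DISCARD or CRASH
    *) echo "autojudge: input error" >&2 ;;   # exit 1: leave the working tree alone
esac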

JSON Output

{
  "verdict": "KEEP",
  "confidence": 0.82,
  "val_bpb": 3.87,
  "prev_best": 3.91,
  "delta": -0.04,
  "delta_pct": -1.01,
  "noise_floor": 0.02,
  "on_pareto_frontier": true,
  "suggestion": "Good improvement in val_bpb. This experiment is on the Pareto frontier. Good progress.",
  "...": "plus efficiency, trends, pareto_frontier, experiment details"
}
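As a rough illustration of consuming this output from Python, the sketch below parses a payload with the fields shown above and recomputes the improvement-to-noise ratio. The helper name `summarize` is hypothetical, and the payload is a trimmed copy of the example rather than real tool output.

```python
import json

# Trimmed copy of the example payload above (only the documented fields).
payload = """
{
  "verdict": "KEEP",
  "confidence": 0.82,
  "val_bpb": 3.87,
  "prev_best": 3.91,
  "delta": -0.04,
  "noise_floor": 0.02,
  "on_pareto_frontier": true
}
"""


def summarize(raw):
    """Return (verdict, improvement / noise_floor) from a JSON string."""
    result = json.loads(raw)
    # delta is negative when val_bpb improved, so flip the sign.
    ratio = -result["delta"] / result["noise_floor"]
    return result["verdict"], ratio


verdict, ratio = summarize(payload)
print(f"{verdict}: improvement is {ratio:.1f}x the noise floor")
# prints "KEEP: improvement is 2.0x the noise floor"
```

A 2x ratio falls inside the 1.5-3x band listed for KEEP in the verdict table.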

How It Works

  • Estimates noise floor from pairwise differences between consecutive keeps
  • Scores improvement confidence as a ratio of delta to noise floor
  • Tracks Pareto frontier (val_bpb vs memory efficiency)
  • Detects streaks, plateaus, and diminishing returns
  • Parses run.log for OOM warnings, memory pressure, and training metrics
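The first and third steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the median of absolute differences is one possible estimator (the README only says "pairwise differences between consecutive keeps"), and memory is assumed to be a peak-usage figure where lower is better.

```python
from statistics import median


def estimate_noise_floor(kept_val_bpb):
    """Median absolute difference between consecutive kept results.

    Assumption: the choice of the median as the estimator is illustrative.
    """
    diffs = [abs(b - a) for a, b in zip(kept_val_bpb, kept_val_bpb[1:])]
    if not diffs:
        raise ValueError("need at least two kept experiments")
    return median(diffs)


def pareto_frontier(points):
    """Return the non-dominated (val_bpb, memory) points.

    Assumption: lower is better on both axes. A point is dominated if
    another point is no worse on both axes and strictly better on one.
    """
    frontier = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append(p)
    return frontier


# Made-up values for illustration only.
keeps = [4.00, 3.97, 3.95, 3.91]
print(round(estimate_noise_floor(keeps), 4))

experiments = [(3.91, 10.0), (3.87, 12.0), (3.95, 11.0), (3.87, 14.0)]
print(pareto_frontier(experiments))
```

In the second call, (3.95, 11.0) is dominated by (3.91, 10.0) and (3.87, 14.0) is dominated by (3.87, 12.0), so only the other two points remain on the frontier.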

Requirements

  • Python >= 3.10
  • A results.tsv file from autoresearch

License

MIT
