
AutoMetrics

Automatically induce evaluation metrics that approximate human judgment from fewer than 100 labels.


AutoMetrics takes a small set of human-labeled examples (thumbs, Likert, or pairwise; under 100 data points) and produces a single interpretable evaluator for your task. It synthesizes candidate criteria with LLM judges, retrieves complementary metrics from a curated 48-metric bank, and composes them with PLS regression. Across the five tasks in the paper it beats LLM-as-a-judge baselines by up to +33.4% Kendall τ, and in an agentic-task case study it matches the performance of a verifiable reward.

(Figure: AutoMetrics pipeline)


Install

pip install autometrics-ai

Base install requires only Python 3.9+. Heavy dependencies (Java 21, pyserini, pylate, bert_score, …) are loaded lazily — they're needed only if you opt into features that use them.
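Lazy loading of optional dependencies usually follows a deferred-import pattern: the heavy module is imported only when a feature that needs it is first invoked. A minimal sketch of that idea (illustrative only; the function names here are not AutoMetrics' actual internals):

```python
import importlib

def require(module_name, feature):
    """Import an optional dependency on first use, failing with a clear message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature!r} needs the optional dependency {module_name!r}; "
            f"install it (e.g. via a package extra) to use this feature."
        ) from exc

def bert_score_metric(candidates, references):
    # The heavy import happens here, not at package import time.
    bert_score = require("bert_score", feature="BERTScore metric")
    return bert_score.score(candidates, references, lang="en")
```

With this pattern, `pip install autometrics-ai` stays light, and the cost of a dependency is paid only by users who call into the feature that needs it.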

Quickstart

export OPENAI_API_KEY="sk-..."
python examples/tutorial.py

Builds a tiny custom dataset, generates a handful of LLM-judge metrics for your task, fits PLS to your human scores, and writes an interactive HTML report to artifacts/. No Java, no GPU, no bank dependencies required for this path.

How it works

  1. Generate. Propose task-specific candidate metrics: single-criterion, rubric, example-based, and MIPROv2-optimized LLM judges (10, 5, 1, and 1 by default, respectively).
  2. Retrieve. Rank the generated candidates alongside the 48-metric MetricBank (ColBERT → LLM reranker) and keep the top k=30.
  3. Regress. Fit Partial Least Squares on the training set to select n=5 predictive metrics and learn their weights.
  4. Report. Emit (a) the aggregated metric as a Python class you can import, (b) a Metric Card per generated metric, and (c) an HTML report card with coefficients, correlation, robustness, runtime, and per-example feedback.

For datasets of ≤100 rows, AutoMetrics runs in generated-only mode by default, skipping the metric bank entirely.
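The select-and-weight idea in step 3 can be pictured with a toy sketch. This uses correlation-based selection and weighting as a crude stand-in for PLS; all names are illustrative, not AutoMetrics' API:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

def compose(metric_scores, human_scores, n=2):
    """Keep the n candidate metrics most correlated with human scores and
    weight them by that correlation (a stand-in for PLS selection/fitting)."""
    corr = {name: pearson(scores, human_scores)
            for name, scores in metric_scores.items()}
    top = sorted(corr, key=lambda k: abs(corr[k]), reverse=True)[:n]
    total = sum(abs(corr[k]) for k in top) or 1.0
    weights = {k: corr[k] / total for k in top}

    def aggregated(example_scores):
        # The final evaluator: a weighted sum over the selected metrics.
        return sum(w * example_scores[k] for k, w in weights.items())

    return weights, aggregated
```

The real pipeline fits Partial Least Squares over all retained candidates and keeps n=5 metrics, but the output has the same shape: a small set of named metrics with learned weights, interpretable at a glance.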

See the paper (ICLR 2026) for the full method, ablations, and case study.

Examples

File                                     Scope                                            Requires
examples/tutorial.py                     Dead-simple 8-row demo, generated-only           OPENAI_API_KEY
examples/autometrics_simple_example.py   Full pipeline with defaults on HelpSteer         + Java 21, bank extras
examples/autometrics_example.py          Custom generators, retriever, regressor, priors  + your own config

Narrative walkthrough: examples/TUTORIAL.md.

Use on your own data

import dspy, pandas as pd
from autometrics.autometrics import Autometrics
from autometrics.dataset.Dataset import Dataset

df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "input":  ["prompt 1", "prompt 2", "prompt 3"],
    "output": ["response 1", "response 2", "response 3"],
    "score":  [4.5, 3.2, 4.8],
})
dataset = Dataset(
    dataframe=df, name="MyTask",
    data_id_column="id", input_column="input", output_column="output",
    target_columns=["score"], ignore_columns=["id"], metric_columns=[],
    reference_columns=[], task_description="Describe your task in one sentence.",
)

llm = dspy.LM("openai/gpt-4o-mini")
results = Autometrics().run(
    dataset=dataset, target_measure="score",
    generator_llm=llm, judge_llm=llm,
)

final = results["regression_metric"]       # an importable Metric
final.predict(dataset)                     # scores on any Dataset with same schema

Requirements

Component                                             Needed for
Python ≥ 3.9                                          everything
OPENAI_API_KEY (or any LiteLLM-compatible endpoint)   LLM-based generation and judging
Java 21                                               BM25 retrieval over the full MetricBank (pyserini)
GPU                                                   some bank metrics (reward models, large BERTScore); CPU works for generated-only

Repository layout

autometrics/
├── autometrics.py            Pipeline orchestrator
├── dataset/                  Dataset interface + built-in tasks
├── metrics/                  MetricBank (48 metrics) + generated metric scaffolds
├── generator/                LLM judge proposers (single, rubric, examples, G-Eval, optimized)
├── recommend/                Retrievers (BM25, ColBERT, LLMRec, Pipelined)
├── aggregator/regression/    PLS (default), Lasso, Ridge, ElasticNet, HotellingPLS
└── util/report_card.py       HTML report generator
examples/                     Tutorial scripts and walkthroughs

Optional extras

Install extras for metric-bank components with heavier dependencies:
pip install "autometrics-ai[bert-score,rouge,bleurt]"
pip install "autometrics-ai[reward-models,gpu]"
pip install "autometrics-ai[mauve,parascore,lens,fasttext]"

Individual clusters: fasttext, lens, parascore, bert-score, bleurt, moverscore, rouge, meteor, infolm, mauve, spacy, hf-evaluate, reward-models, readability, gpu. See pyproject.toml for the full mapping. Metrics whose dependencies are missing are skipped with a warning — no extra install is strictly required.
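The skip-on-missing-dependency behavior can be pictured as a guard around each bank metric. A hypothetical sketch (the mapping and function names are illustrative, not AutoMetrics' actual code):

```python
import importlib.util
import warnings

# Hypothetical mapping of bank metrics to the module each one needs.
METRIC_DEPS = {
    "bertscore":  "bert_score",
    "rouge":      "rouge_score",
    "word_count": None,  # pure-Python, always available
}

def available_metrics(metric_deps=METRIC_DEPS):
    """Keep metrics whose dependency is importable; warn about the rest."""
    usable = []
    for name, module in metric_deps.items():
        if module is None or importlib.util.find_spec(module) is not None:
            usable.append(name)
        else:
            warnings.warn(f"Skipping metric {name!r}: missing dependency {module!r}")
    return usable
```

`importlib.util.find_spec` checks availability without actually importing the heavy module, so startup stays fast even when many optional metrics are installed.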

Citation

@inproceedings{ryan2026autometrics,
  title   = {AutoMetrics: Approximate Human Judgments with Automatically Generated Evaluators},
  author  = {Ryan, Michael J and Zhang, Yanzhe and Salunkhe, Amol and Chu, Yi and Xu, Di and Yang, Diyi},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year    = {2026},
  url     = {https://openreview.net/forum?id=ymJuBifPUy}
}

License

MIT — see LICENSE.
