
AutoMetrics

Automatically induce evaluation metrics that approximate human judgment from fewer than 100 labels.


AutoMetrics takes a small set of human-labeled examples (thumbs, Likert, or pairwise; under 100 data points) and produces a single interpretable evaluator for your task. It synthesizes candidate criteria with LLM judges, retrieves complementary metrics from a curated 48-metric bank, and composes them with PLS regression. Across the five tasks in the paper it beats LLM-as-a-judge baselines by up to +33.4% Kendall τ, and in an agentic-task case study it matches the performance of a verifiable reward.

(Figure: AutoMetrics pipeline)


Install

pip install autometrics-ai

Base install requires only Python 3.9+. Heavy dependencies (Java 21, pyserini, pylate, bert_score, …) are loaded lazily — they're needed only if you opt into features that use them.
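Lazy loading of optional dependencies usually follows a deferred-import pattern: the heavy module is imported only when a feature that needs it is first invoked. A minimal sketch of that idea (illustrative only; the function names here are not AutoMetrics' actual internals):

```python
import importlib

def require(module_name, feature):
    """Import an optional dependency on first use, failing with a clear message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature!r} needs the optional dependency {module_name!r}; "
            f"install it (e.g. via a package extra) to use this feature."
        ) from exc

def bert_score_metric(candidates, references):
    # The heavy import happens here, not at package import time.
    bert_score = require("bert_score", feature="BERTScore metric")
    return bert_score.score(candidates, references, lang="en")
```

With this pattern, `pip install autometrics-ai` stays light, and the cost of a dependency is paid only by users who call into the feature that needs it.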

Quickstart

export OPENAI_API_KEY="sk-..."
python examples/tutorial.py

Builds a tiny custom dataset, generates a handful of LLM-judge metrics for your task, fits PLS to your human scores, and writes an interactive HTML report to artifacts/. No Java, no GPU, no bank dependencies required for this path.

How it works

  1. Generate. Propose task-specific candidate metrics: single-criterion, rubric, example-based, and MIPROv2-optimized LLM judges (10, 5, 1, and 1 by default, respectively).
  2. Retrieve. Rank the generated candidates alongside the 48-metric MetricBank (ColBERT → LLM reranker) and keep the top k=30.
  3. Regress. Fit Partial Least Squares on the training set to select n=5 predictive metrics and learn their weights.
  4. Report. Emit (a) the aggregated metric as a Python class you can import, (b) a Metric Card per generated metric, and (c) an HTML report card with coefficients, correlation, robustness, runtime, and per-example feedback.

For datasets of ≤100 rows, AutoMetrics runs in generated-only mode by default, skipping the metric bank entirely.
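The select-and-weight idea in step 3 can be pictured with a toy sketch. This uses correlation-based selection and weighting as a crude stand-in for PLS; all names are illustrative, not AutoMetrics' API:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

def compose(metric_scores, human_scores, n=2):
    """Keep the n candidate metrics most correlated with human scores and
    weight them by that correlation (a stand-in for PLS selection/fitting)."""
    corr = {name: pearson(scores, human_scores)
            for name, scores in metric_scores.items()}
    top = sorted(corr, key=lambda k: abs(corr[k]), reverse=True)[:n]
    total = sum(abs(corr[k]) for k in top) or 1.0
    weights = {k: corr[k] / total for k in top}

    def aggregated(example_scores):
        # The final evaluator: a weighted sum over the selected metrics.
        return sum(w * example_scores[k] for k, w in weights.items())

    return weights, aggregated
```

The real pipeline fits Partial Least Squares over all retained candidates and keeps n=5 metrics, but the output has the same shape: a small set of named metrics with learned weights, interpretable at a glance.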

See the paper (ICLR 2026) for the full method, ablations, and case study.

Examples

File                                     Scope                                            Requires
examples/tutorial.py                     Dead-simple 8-row demo, generated-only           OPENAI_API_KEY
examples/autometrics_simple_example.py   Full pipeline with defaults on HelpSteer         + Java 21, bank extras
examples/autometrics_example.py          Custom generators, retriever, regressor, priors  + your own config

Narrative walkthrough: examples/TUTORIAL.md.

Use on your own data

import dspy, pandas as pd
from autometrics.autometrics import Autometrics
from autometrics.dataset.Dataset import Dataset

df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "input":  ["prompt 1", "prompt 2", "prompt 3"],
    "output": ["response 1", "response 2", "response 3"],
    "score":  [4.5, 3.2, 4.8],
})
dataset = Dataset(
    dataframe=df, name="MyTask",
    data_id_column="id", input_column="input", output_column="output",
    target_columns=["score"], ignore_columns=["id"], metric_columns=[],
    reference_columns=[], task_description="Describe your task in one sentence.",
)

llm = dspy.LM("openai/gpt-4o-mini")
results = Autometrics().run(
    dataset=dataset, target_measure="score",
    generator_llm=llm, judge_llm=llm,
)

final = results["regression_metric"]       # an importable Metric
final.predict(dataset)                     # scores on any Dataset with same schema

Requirements

Component                                             Needed for
Python ≥ 3.9                                          everything
OPENAI_API_KEY (or any LiteLLM-compatible endpoint)   LLM-based generation and judging
Java 21                                               BM25 retrieval over the full MetricBank (pyserini)
GPU                                                   some bank metrics (reward models, large BERTScore); CPU works for generated-only

Repository layout

autometrics/
├── autometrics.py            Pipeline orchestrator
├── dataset/                  Dataset interface + built-in tasks
├── metrics/                  MetricBank (48 metrics) + generated metric scaffolds
├── generator/                LLM judge proposers (single, rubric, examples, G-Eval, optimized)
├── recommend/                Retrievers (BM25, ColBERT, LLMRec, Pipelined)
├── aggregator/regression/    PLS (default), Lasso, Ridge, ElasticNet, HotellingPLS
└── util/report_card.py       HTML report generator
examples/                     Tutorial scripts and walkthroughs

Optional extras

Install extras for metric-bank components with heavier dependencies:
pip install "autometrics-ai[bert-score,rouge,bleurt]"
pip install "autometrics-ai[reward-models,gpu]"
pip install "autometrics-ai[mauve,parascore,lens,fasttext]"

Individual clusters: fasttext, lens, parascore, bert-score, bleurt, moverscore, rouge, meteor, infolm, mauve, spacy, hf-evaluate, reward-models, readability, gpu. See pyproject.toml for the full mapping. Metrics whose dependencies are missing are skipped with a warning — no extra install is strictly required.
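The skip-on-missing-dependency behavior can be pictured as a guard around each bank metric. A hypothetical sketch (the mapping and function names are illustrative, not AutoMetrics' actual code):

```python
import importlib.util
import warnings

# Hypothetical mapping of bank metrics to the module each one needs.
METRIC_DEPS = {
    "bertscore":  "bert_score",
    "rouge":      "rouge_score",
    "word_count": None,  # pure-Python, always available
}

def available_metrics(metric_deps=METRIC_DEPS):
    """Keep metrics whose dependency is importable; warn about the rest."""
    usable = []
    for name, module in metric_deps.items():
        if module is None or importlib.util.find_spec(module) is not None:
            usable.append(name)
        else:
            warnings.warn(f"Skipping metric {name!r}: missing dependency {module!r}")
    return usable
```

`importlib.util.find_spec` checks availability without actually importing the heavy module, so startup stays fast even when many optional metrics are installed.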

Citation

@inproceedings{ryan2026autometrics,
  title   = {AutoMetrics: Approximate Human Judgments with Automatically Generated Evaluators},
  author  = {Ryan, Michael J and Zhang, Yanzhe and Salunkhe, Amol and Chu, Yi and Xu, Di and Yang, Diyi},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year    = {2026},
  url     = {https://openreview.net/forum?id=ymJuBifPUy}
}

License

MIT — see LICENSE.
