# frontierlag

Audit the capability gap between frontier AI models and the models tested in academic papers.
Paste a DOI. Get a report: what model the paper tested, where it sat relative to the frontier at evaluation date, what configuration the paper disclosed, and whether the paper fails all three audit dimensions at the pre-registered thresholds from the companion study.
```console
$ pip install frontierlag
$ frontierlag check 10.1038/s41591-024-03425-5
```
This package is the software companion to Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation (Gringras and Salahshoor, 2026). The audit dataset embedded here is the frozen snapshot used in that paper; updates ship as point releases.
## What it does

`frontierlag` classifies published AI-capability evaluations against the pre-registered audit dimensions from Gringras (2026): three primary dimensions (the H5 compound-failure outcome), one secondary magnitude (the capability-elicitation shortfall), and one tertiary transparency vector (temporal, tier, elicitation):
| Dimension | What it captures |
|---|---|
| Capability failure | `eci_gap ≥ 12` ECI, anchored to the mean observed within-family major-generation jump on the frozen April-2026 Epoch snapshot. |
| Elicitation failure | OR-of-three: reasoning mode undisclosed for a reasoning-capable model, OR tool use undisclosed for a tool-capable model, OR scaffolding undisclosed where a scaffolded baseline existed at evaluation date. AND-of-three reported alongside as a strict-conjunction sensitivity. |
| Interpretive failure | AND-of-two (pre-registered primary): no human comparator AND `conclusion_framing = ai_generic`. OR-of-two reported alongside as the inclusive sensitivity. Admissibility filter: tasks with machine-verifiable references (oracle code tests, MATH, exact-match QA) have the comparator signal suppressed. |
A paper flagged on all three at the pre-registered thresholds is a compound failure (pre-reg §2.2 H5).
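The boolean structure in the table above can be sketched as plain Python. This is an illustrative sketch, not the package's internal code: the dataclass and its field names are hypothetical, and only the threshold (12 ECI) and the OR-of-three / AND-of-two combinations come from the table.

```python
from dataclasses import dataclass

ECI_GAP_THRESHOLD = 12  # pre-registered capability-failure cutoff, in ECI points


@dataclass
class AuditInputs:
    eci_gap: float
    reasoning_mode_undisclosed: bool    # counts only for a reasoning-capable model
    tool_use_undisclosed: bool          # counts only for a tool-capable model
    scaffolding_undisclosed: bool       # counts only where a scaffolded baseline existed
    human_comparator_present: bool
    conclusion_framing: str             # e.g. "ai_generic"
    comparator_signal_suppressed: bool  # admissibility filter for oracle-verifiable tasks


def compound_failure(x: AuditInputs) -> bool:
    """Pre-registered H5 outcome: all three dimensions flagged."""
    capability = x.eci_gap >= ECI_GAP_THRESHOLD
    # Elicitation: pre-registered OR-of-three.
    elicitation = (x.reasoning_mode_undisclosed
                   or x.tool_use_undisclosed
                   or x.scaffolding_undisclosed)
    # Interpretive: pre-registered AND-of-two, suppressed for admissible oracle tasks.
    interpretive = (not x.comparator_signal_suppressed
                    and not x.human_comparator_present
                    and x.conclusion_framing == "ai_generic")
    return capability and elicitation and interpretive
```

Note how the admissibility filter acts as a veto on the interpretive dimension only: a machine-verifiable task can still fail on capability and elicitation.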
The package also returns:

- `capability_elicitation_shortfall` — the secondary magnitude `eci_gap × (1 − config_elicitation_index)`, capturing the interaction between capability distance and configuration under-disclosure.
- A three-component transparency vector `(temporal_gap_months, tier_gap_count, elicitation_gap_fraction)` — readers do their own weighting.
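The shortfall formula is simple enough to show directly. The sketch below assumes `config_elicitation_index` is the fraction of applicable disclosure fields that were disclosed, with `None` marking a field that does not apply to the model — that convention is an assumption, not something this README specifies.

```python
def capability_elicitation_shortfall(eci_gap: float, disclosures: dict) -> float:
    """eci_gap × (1 − config_elicitation_index).

    Assumed convention: True = disclosed, False = undisclosed,
    None = field not applicable to this model.
    """
    applicable = [v for v in disclosures.values() if v is not None]
    # With nothing applicable, treat disclosure as zero (maximal shortfall).
    index = sum(applicable) / len(applicable) if applicable else 0.0
    return eci_gap * (1 - index)


# A gap of 18 ECI with one of two applicable fields disclosed -> 9.0
capability_elicitation_shortfall(18.0, {"reasoning_mode": True, "tool_use": False})
```

A fully disclosed configuration drives the shortfall to zero regardless of the capability gap, which is the intended interaction: the magnitude only grows when distance and opacity co-occur.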
The package does not estimate counterfactual capability; it does not claim "the paper's conclusion would have been X had the authors used Y." Descriptive, not normative: the audit documents structural lag; it does not rank authors or score papers as "bad research."
## Quick start

```python
import frontierlag as fl

# By DOI (hits the frozen corpus if the paper is in the audit; otherwise
# resolves the publication date via CrossRef and leaves you to supply the model).
report = fl.check("10.1038/s41591-024-03425-5")
print(report.to_text())

# Override / supply fields for a paper not in the frozen corpus.
report = fl.check(
    "10.1000/your-doi",
    primary_model="GPT-4",
    evaluation_date="2024-06-01",
    configuration_disclosures={
        "model_version_exact": True,
        "access_date": True,
        "reasoning_mode": None,
        "tool_use": False,
    },
)

# Audit already-extracted metadata.
from frontierlag import audit, PaperMetadata

m = PaperMetadata(
    primary_model="GPT-3.5",
    publication_date="2025-07-01",
    evaluation_date="2025-05-01",
    configuration_disclosures={"reasoning_mode": False, "tool_use": False},
    human_comparator_present=False,
    conclusion_framing="ai_generic",
    task_admissibility="expected",
    domain="medicine",
)

report = audit(m)  # default: AND-of-two pre-registered primary
print(report.compound_failure)                  # pre-registered binary
print(report.capability_elicitation_shortfall)  # secondary magnitude
print((report.temporal_gap_months, report.tier_gap_count, report.elicitation_gap_fraction))

# Provenance for false-positive diagnosis.
diag = audit(m, return_provenance=True).provenance
print(diag["classifications"]["compound_failure_prereg"])
print(diag["inputs"])

# Individual lookups.
fl.lookup_model("claude-3.5-sonnet")
fl.get_frontier_at_date("2025-06-01")
fl.list_known_models()
```
## CLI

```console
frontierlag check <DOI>            # audit a paper
frontierlag lookup <MODEL>         # single-model metadata
frontierlag frontier <YYYY-MM-DD>  # frontier at a date
frontierlag models                 # list known canonical names
frontierlag info                   # version + data-freeze date
```

Every command accepts `--json` for machine-readable output. `frontierlag check` accepts `--model`, `--eval-date`, and `--config-file` to override or supply fields a paper does not otherwise provide.
## Data freeze

The embedded dataset is frozen at `FREEZE_DATE = 2026-04-01`. Every report prints this date at the top so readers know how stale the comparison is. Updates ship as point releases.
| File | Source |
|---|---|
| `data/eci_scores.csv` | Epoch AI Capabilities Index snapshot (Epoch AI, 2026) |
| `data/monthly_frontier_trajectory.csv` | Derived from ECI + model release dates |
| `data/model_version_lookup.json` | Maintainer-curated, cross-checked against Epoch AI model tracker |
| `data/frozen_audit.json` | Audit-dataset DOI lookup index |
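The staleness warning printed at the top of each report reduces to simple calendar arithmetic against the freeze date. A minimal sketch, assuming whole-month granularity (the function name is illustrative, not the package's API):

```python
from datetime import date

FREEZE_DATE = date(2026, 4, 1)  # data-freeze date printed atop every report


def months_since_freeze(today: date) -> int:
    """Whole calendar months between the frozen snapshot and a given date."""
    return (today.year - FREEZE_DATE.year) * 12 + (today.month - FREEZE_DATE.month)
```

Readers comparing a report against today's frontier should mentally add this lag to any `temporal_gap_months` the report computes, since the frozen trajectory ends at the freeze date.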
## Install

```console
pip install frontierlag
```

Requires Python ≥ 3.9. Runtime dependencies are `requests` and `pyyaml`; no heavy scientific stack.
## Companion artefacts

- Empirical audit paper — *Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation* (Gringras and Salahshoor, 2026).
- Reporting checklist — VERSIO-AI v1.2.
- Pre-registration — Open Science Framework, 10.17605/OSF.IO/7XM3D.
- Live web tool — https://frontierlag.org.
## Citation

```bibtex
@software{gringras2026frontierlag,
  author  = {Gringras, David and Salahshoor, Misha},
  title   = {frontierlag: A {Python} package for auditing the capability gap of published {AI} evaluations},
  year    = {2026},
  version = {1.0.0},
  url     = {https://frontierlag.org}
}
```
## License

MIT. See `LICENSE`.