
frontierlag

Audit the capability gap between frontier AI and the models tested in academic papers.

Paste a DOI. Get a report: what model the paper tested, where it sat relative to the frontier at evaluation date, what configuration the paper disclosed, and whether the paper fails all three audit dimensions at the pre-registered thresholds from the companion study.

$ pip install frontierlag
$ frontierlag check 10.1038/s41591-024-03425-5

This package is the software companion to Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation (Gringras and Salahshoor, 2026). The audit dataset embedded here is the frozen snapshot used in that paper; updates ship as point releases.


What it does

frontierlag classifies published AI-capability evaluations against the pre-registered audit dimensions from Gringras (2026): three primary dimensions (the H5 compound-failure outcome), one secondary magnitude (capability-elicitation shortfall), and one tertiary transparency vector (temporal, tier, elicitation).

  • Capability failure — eci_gap ≥ 12 ECI, anchored to the mean observed within-family major-generation jump on the frozen April-2026 Epoch snapshot.
  • Elicitation failure — OR-of-three: reasoning-mode undisclosed for a reasoning-capable model, OR tool-use undisclosed for a tool-capable model, OR scaffolding undisclosed where a scaffolded baseline existed at evaluation date. AND-of-three is reported alongside as a strict-conjunction sensitivity.
  • Interpretive failure — AND-of-two (pre-registered primary): no human comparator AND conclusion_framing = ai_generic. OR-of-two is reported alongside as the inclusive sensitivity. Admissibility filter: tasks with machine-verifiable references (oracle code tests, MATH, exact-match QA) have the comparator signal suppressed.

A paper flagged on all three at the pre-registered thresholds is a compound failure (pre-reg §2.2 H5).
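As an illustration only (not the package's internals; every function and field name below is hypothetical), the three-dimension logic and the H5 conjunction described above could be sketched as:

```python
def elicitation_failure(disclosures, model_caps):
    """OR-of-three: any relevant configuration axis left undisclosed (None)."""
    return any([
        model_caps.get("reasoning_capable") and disclosures.get("reasoning_mode") is None,
        model_caps.get("tool_capable") and disclosures.get("tool_use") is None,
        model_caps.get("scaffolded_baseline_existed") and disclosures.get("scaffolding") is None,
    ])

def interpretive_failure(human_comparator_present, conclusion_framing,
                         comparator_admissible=True):
    """AND-of-two primary; the comparator leg is suppressed for
    machine-verifiable tasks (comparator_admissible=False)."""
    no_comparator = comparator_admissible and not human_comparator_present
    return no_comparator and conclusion_framing == "ai_generic"

def compound_failure(eci_gap, disclosures, model_caps,
                     human_comparator_present, conclusion_framing,
                     comparator_admissible=True):
    """All three dimensions at the pre-registered thresholds (pre-reg §2.2 H5)."""
    capability = eci_gap >= 12
    return (capability
            and elicitation_failure(disclosures, model_caps)
            and interpretive_failure(human_comparator_present,
                                     conclusion_framing,
                                     comparator_admissible))
```

Note how the admissibility filter works in this sketch: with `comparator_admissible=False`, the missing-comparator leg can never fire, so the AND-of-two primary cannot flag the paper.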

The package also returns:

  • capability_elicitation_shortfall — the secondary magnitude eci_gap × (1 - config_elicitation_index), capturing the interaction between capability distance and configuration under-disclosure.
  • Three-component vector (temporal_gap_months, tier_gap_count, elicitation_gap_fraction) — readers do their own weighting.
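The secondary magnitude is a simple product; a minimal sketch (the function name is hypothetical, and the real computation lives inside the package):

```python
def capability_elicitation_shortfall(eci_gap, config_elicitation_index):
    """Secondary magnitude: capability distance scaled by configuration
    under-disclosure. config_elicitation_index is the disclosed fraction
    of relevant configuration axes, in [0, 1]."""
    if not 0.0 <= config_elicitation_index <= 1.0:
        raise ValueError("config_elicitation_index must lie in [0, 1]")
    return eci_gap * (1.0 - config_elicitation_index)
```

For example, a paper 20 ECI behind the frontier that disclosed a quarter of the relevant configuration axes gets 20 × 0.75 = 15, while full disclosure zeroes the shortfall regardless of the gap.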

The package does not estimate counterfactual capability; it does not claim "the paper's conclusion would have been X if they had used Y." Descriptive, not normative: the audit documents structural lag; it does not rank authors or score papers as "bad research."


Quick start

import frontierlag as fl

# By DOI (hits the frozen corpus if the paper is in the audit; otherwise
# resolves publication date via CrossRef and leaves you to supply the model).
report = fl.check("10.1038/s41591-024-03425-5")
print(report.to_text())

# Override / supply fields for a paper not in the frozen corpus.
report = fl.check(
    "10.1000/your-doi",
    primary_model="GPT-4",
    evaluation_date="2024-06-01",
    configuration_disclosures={
        "model_version_exact": True,
        "access_date": True,
        "reasoning_mode": None,
        "tool_use": False,
    },
)

# Audit already-extracted metadata.
from frontierlag import audit, PaperMetadata
m = PaperMetadata(
    primary_model="GPT-3.5",
    publication_date="2025-07-01",
    evaluation_date="2025-05-01",
    configuration_disclosures={"reasoning_mode": False, "tool_use": False},
    human_comparator_present=False,
    conclusion_framing="ai_generic",
    task_admissibility="expected",
    domain="medicine",
)
report = audit(m)  # default: AND-of-two pre-registered primary
print(report.compound_failure)                 # pre-registered binary
print(report.capability_elicitation_shortfall) # secondary magnitude
print((report.temporal_gap_months, report.tier_gap_count, report.elicitation_gap_fraction))

# Provenance for false-positive diagnosis.
diag = audit(m, return_provenance=True).provenance
print(diag["classifications"]["compound_failure_prereg"])
print(diag["inputs"])

# Individual lookups.
fl.lookup_model("claude-3.5-sonnet")
fl.get_frontier_at_date("2025-06-01")
fl.list_known_models()

CLI

frontierlag check <DOI>               audit a paper
frontierlag lookup <MODEL>            single-model metadata
frontierlag frontier <YYYY-MM-DD>     frontier at a date
frontierlag models                    list known canonical names
frontierlag info                      version + data-freeze date

Every command accepts --json for machine-readable output. frontierlag check accepts --model, --eval-date, and --config-file to override or supply fields a paper does not otherwise provide.


Data freeze

The embedded dataset is frozen at FREEZE_DATE = 2026-04-01. Every report prints this at the top so readers know how stale the comparison is. Updates ship as point releases.
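A reader can gauge that staleness by hand; a minimal stdlib-only sketch (no frontierlag API assumed), counting whole calendar months since the freeze:

```python
from datetime import date

FREEZE_DATE = date(2026, 4, 1)  # the data-freeze date stated above

def months_since_freeze(on: date) -> int:
    """Whole calendar months between the freeze date and a later date."""
    return (on.year - FREEZE_DATE.year) * 12 + (on.month - FREEZE_DATE.month)
```

Reading a report in October 2026, for instance, means the embedded frontier comparison is six months old.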

  • data/eci_scores.csv — Epoch AI Capabilities Index snapshot (Epoch AI, 2026)
  • data/monthly_frontier_trajectory.csv — derived from ECI + model release dates
  • data/model_version_lookup.json — maintainer-curated, cross-checked against the Epoch AI model tracker
  • data/frozen_audit.json — audit-dataset DOI lookup index

Install

pip install frontierlag

Requires Python ≥ 3.9. Runtime dependencies are requests and pyyaml; no heavy scientific stack.


Companion artefacts

  • Empirical audit paper — Frontier Lag (Gringras and Salahshoor, 2026).
  • Reporting checklist — VERSIO-AI v1.2.
  • Pre-registration — Open Science Framework, 10.17605/OSF.IO/7XM3D.
  • Live web tool — https://frontierlag.org.

Citation

@software{gringras2026frontierlag,
  author  = {Gringras, David and Salahshoor, Misha},
  title   = {frontierlag: A {Python} package for auditing the capability gap of published {AI} evaluations},
  year    = {2026},
  version = {1.0.0},
  url     = {https://frontierlag.org}
}

License

MIT. See LICENSE.
