anvil-eval

A research-first, evaluation-first inference library.

Maintainers

galoaa.b

Project description

anvil

Status: alpha (v0.4.0). DoLa contrastive decoding, the CaaS LLM tier, MultiTurnFewshot, the Classify request type, per-request logits processors (vLLM + HF), HiddenStateSpec activation capture, real dataset SHAs in manifests, the lm-eval task shim, and CI on Python 3.11/3.12 are all live. CUDA wheels (cu121/cu128/cu130) are published on GitHub Releases; the CPU wheel is on PyPI.

What this is

Anvil is not trying to be the fastest inference engine. vLLM and SGLang win throughput. Anvil's identity is correctness, reproducibility, and research ergonomics:

Every run produces a content-hashed Manifest: two runs with the same manifest must produce identical numbers, byte-for-byte.
Every chat template, tokenization, sampler, and image input is a versioned, hashed object — not a string loaded from a file at runtime.
Day-zero new-model coverage via a transformers slow path; popular architectures graduate to a fast path.
Per-request logits processors and hidden-state extraction are stable public APIs (the V0-vLLM API, restored). DoLa ships out of the box.
A preflight CaaS agent (rule engine + 15-entry KB + LLM fallback tier) runs before every major run and catches silent failures. It either fixes them or refuses to publish a manifest for a run that hit a silent regression.

See docs/design.md for the full design rationale.
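
As a rough illustration of what a content-hashed manifest implies, the sketch below serializes a dict of run settings to canonical JSON and hashes it with SHA-256. The field names and the `manifest_digest` helper are hypothetical and exist only for this example; Anvil's actual Manifest schema and hashing scheme live in docs/design.md and may differ.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of `manifest`."""
    # sort_keys and fixed separators make the encoding independent of dict
    # ordering and whitespace, so equal content always yields equal bytes.
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest fields, for illustration only.
run_a = {"model_sha": "abc123", "sampler": {"temperature": 0.0, "seed": 0}, "tasks": ["gsm8k"]}
run_b = {"tasks": ["gsm8k"], "sampler": {"seed": 0, "temperature": 0.0}, "model_sha": "abc123"}
run_c = dict(run_a, sampler={"temperature": 0.7, "seed": 0})

print(manifest_digest(run_a) == manifest_digest(run_b))  # True: same content, different key order
print(manifest_digest(run_a) == manifest_digest(run_c))  # False: the sampler changed
```

Hashing a canonical encoding, rather than whatever bytes happen to be on disk, is what lets two machines agree that they ran the same configuration.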

Install

uv pip install anvil-eval

The CPU wheel ships to PyPI. CUDA wheels (cu121, cu128, cu130) are attached to each GitHub Release and can be installed directly:

# Example: CUDA 12.1
pip install https://github.com/bishoymoussa/anvil/releases/download/v0.4.0/anvil_eval-0.4.0-py3-none-any-cu121.whl

Import name: the Python package is still import anvil — only the PyPI distribution name is anvil-eval.

For development:

uv venv .venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[dev]"

Optional extras: .[vllm] for the vLLM backend, .[multimodal] for video/audio, .[xgrammar] for tool calling, .[mcp] for the MCP server.

MCP server — let your AI agent run evaluations

Install the extra and add Anvil to your Claude Desktop or Claude Code config:

pip install 'anvil-eval[mcp]'

{
  "mcpServers": {
    "anvil": { "command": "anvil", "args": ["mcp"] }
  }
}

Your agent now has five tools: anvil_list_tasks, anvil_eval, anvil_manifest_diff, anvil_manifest_verify, and anvil_doctor. Ask it naturally:

"Run MMLU 5-shot on Llama-3.1-8B and save the manifest to run.json"
"Compare run.json with last_week.json and explain the score gap"
"Check my environment for any CUDA or token issues"

For a local HTTP endpoint instead of stdio:

anvil mcp --http   # listens on localhost:8765
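
If your client config already lists other MCP servers, the Anvil entry from the JSON snippet in this section has to be merged in rather than pasted over the file. A minimal sketch of that merge follows; `add_anvil_server` and the `other-server` entry are hypothetical, and where the config file lives depends on your client.

```python
import json


def add_anvil_server(config: dict) -> dict:
    """Return a copy of `config` with the Anvil MCP entry added under mcpServers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Same entry as in the README snippet: launch `anvil mcp` over stdio.
    servers["anvil"] = {"command": "anvil", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical pre-existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
updated = add_anvil_server(existing)
print(json.dumps(updated, indent=2))
```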

Quickstart

import anvil

result = anvil.eval(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tasks=["mmlu", "gsm8k", "humaneval"],
)
print(result.scores)               # {"mmlu": {...}, "gsm8k": {...}, ...}
result.manifest.save("run.json")

# CLI equivalent
anvil eval --model meta-llama/Llama-3.1-8B-Instruct \
           --tasks mmlu,gsm8k,humaneval \
           --output ./run.json

# Verify reproducibility
anvil manifest verify run.json

# Diff two runs to find which fields explain a score gap
anvil manifest diff run.json other.json
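
To make the idea behind a manifest diff concrete, here is a small sketch that walks two nested dicts and reports the dotted paths whose values differ. It is not the implementation behind `anvil manifest diff`; the `diff_manifests` helper and the field names are hypothetical and assume the saved manifest is plain JSON.

```python
def diff_manifests(a: dict, b: dict, prefix: str = "") -> dict:
    """Map dotted field paths to (a_value, b_value) wherever the two dicts differ."""
    changes = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse so nested settings are reported field by field.
            changes.update(diff_manifests(left, right, prefix=f"{path}."))
        elif left != right:
            changes[path] = (left, right)
    return changes


# Hypothetical manifests, for illustration only.
run = {"model_sha": "abc", "sampler": {"temperature": 0.0, "top_p": 1.0}}
other = {
    "model_sha": "abc",
    "sampler": {"temperature": 0.7, "top_p": 1.0},
    "chat_template_hash": "f00",
}

for path, (old, new) in diff_manifests(run, other).items():
    print(f"{path}: {old!r} -> {new!r}")
```

A field missing on one side shows up as `None`, which is usually enough to spot a setting that one run recorded and the other did not.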

DoLa — contrastive decoding out of the box

DoLa (Chuang et al. 2023) reduces hallucinations by contrasting logits from late layers vs. early layers at each decoding step. In Anvil it's a drop-in logits_processor:

from anvil.research import DoLa
from anvil.primitives import Generate, Sampler

# `engine` is an Anvil engine instance created beforehand (setup not shown here).
result = engine.generate([
    Generate(
        prompt="The capital of France is",
        sampler=Sampler.greedy(),
        logits_processors=(DoLa(mature_layer=-1, premature_layers=(0, 12, 24)),),
    )
])
print(result[0].text)

The engine calls DoLa.bind(model) automatically before generation to cache lm_head.weight, then runs a step-by-step loop threading hidden states into the processor at every token.
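
For readers who want to see the contrast step itself, the following pure-Python sketch follows the method described in the DoLa paper for a single decoding step: pick the premature layer whose distribution is furthest (by Jensen-Shannon divergence) from the mature layer, then score each plausible token by the difference of log-probabilities. It is an illustration, not Anvil's code; `dola_scores`, its arguments, and the toy logits are assumptions made for this example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def js_divergence(log_p, log_q):
    """Jensen-Shannon divergence between two distributions given as log-probs."""
    p = [math.exp(x) for x in log_p]
    q = [math.exp(x) for x in log_q]
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def dola_scores(mature_logits, premature_logits_by_layer, alpha=0.1):
    """Return (chosen_layer, scores) for one decoding step."""
    log_mature = log_softmax(mature_logits)
    candidates = {
        layer: log_softmax(logits)
        for layer, logits in premature_logits_by_layer.items()
    }
    # Dynamic layer selection: contrast against the most divergent early layer.
    chosen = max(candidates, key=lambda layer: js_divergence(log_mature, candidates[layer]))
    log_premature = candidates[chosen]
    # Plausibility constraint: tokens whose mature probability is below
    # alpha * (max mature probability) are masked out with -inf.
    cutoff = math.log(alpha) + max(log_mature)
    scores = [
        lm - lp if lm >= cutoff else float("-inf")
        for lm, lp in zip(log_mature, log_premature)
    ]
    return chosen, scores


# Toy 3-token vocabulary; layer 0 agrees with the mature layer, layer 12 does not.
chosen, scores = dola_scores(
    mature_logits=[2.0, 1.0, -3.0],
    premature_logits_by_layer={0: [2.0, 1.0, -3.0], 12: [2.0, -1.0, -3.0]},
)
print(chosen, scores)
```

In this toy case the token whose probability grew most between the early and late layer ends up with the highest score, while the implausible third token is masked.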

CaaS preflight agent

Before any major run, Anvil runs a quality sentinel and a preflight check. If the rule engine + KB cannot diagnose the failure, it escalates to the LLM tier — a small coder model (Claude Haiku by default) that proposes a structured fix:

export ANTHROPIC_API_KEY=sk-ant-...
anvil eval --model my-model --tasks mmlu
# [caas/llm-tier] Proposed fix (confidence=0.82):
#   type=set_engine_flag flag=trust_remote_code value=true
#   Rationale: model config requires trust_remote_code=True
# Apply? [y/N]

LLM-proposed fixes always require explicit user confirmation. --caas=ci mode refuses them by construction. Disable entirely with ANVIL_LLM_TIER_DISABLED=1.
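
The escalation order described above (rule engine, then KB, then an LLM proposal that needs confirmation and is refused in CI mode) can be sketched as below. Every name here (`Fix`, `resolve`, the tier callables, the failure strings) is hypothetical; this shows the control flow only and is not Anvil's internal API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Tier = Callable[[str], Optional[str]]


@dataclass
class Fix:
    description: str
    source: str  # "rule", "kb", or "llm"


def resolve(
    failure: str,
    rules: Tier,
    kb: Tier,
    llm_tier: Optional[Tier],
    mode: str = "interactive",
    confirm: Callable[[Fix], bool] = lambda fix: False,
) -> Optional[Fix]:
    """Try deterministic tiers first; fall back to a confirmed LLM proposal."""
    for source, tier in (("rule", rules), ("kb", kb)):
        description = tier(failure)
        if description is not None:
            return Fix(description, source)
    # CI mode never accepts an LLM proposal, and the tier may be disabled.
    if mode == "ci" or llm_tier is None:
        return None
    description = llm_tier(failure)
    if description is None:
        return None
    fix = Fix(description, "llm")
    # An LLM proposal is only returned after explicit confirmation.
    return fix if confirm(fix) else None


def rules(failure):
    return "pin tokenizer revision" if failure == "tokenizer_drift" else None


def kb(failure):
    return None


def llm(failure):
    return "set trust_remote_code=true"


print(resolve("tokenizer_drift", rules, kb, llm))
print(resolve("unknown_failure", rules, kb, llm, mode="ci"))
print(resolve("unknown_failure", rules, kb, llm, confirm=lambda fix: True))
```

Defaulting `confirm` to a function that always declines keeps the unsafe path opt-in, which mirrors the "Apply? [y/N]" prompt shown above.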

Multi-turn fewshot (instruct models)

Standard single-turn fewshot loses 5–15pp on MMLU for instruct models because the model never sees the pattern of short-answer assistant turns. MultiTurnFewshot fixes this by packing each exemplar as its own user/assistant exchange:

from anvil.tasks.base import MultiTurnFewshot

# MMLU is the built-in MMLU task class; import it from wherever your Anvil version exposes it.
class MyMMLU(MultiTurnFewshot, MMLU):
    name = "mmlu_multiturn"
    # Each fewshot example → user/assistant message pair
    # Final question → user message scored against "A"/"B"/"C"/"D"
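
The packing described above can be pictured with a few lines of plain Python. This is a sketch of the message layout only, not the MultiTurnFewshot implementation; `build_multiturn_messages` and the `question`/`answer` keys are hypothetical names.

```python
def build_multiturn_messages(exemplars, question):
    """Pack each few-shot exemplar as its own user/assistant exchange."""
    messages = []
    for exemplar in exemplars:
        messages.append({"role": "user", "content": exemplar["question"]})
        # The short assistant turn is what teaches the answer format.
        messages.append({"role": "assistant", "content": exemplar["answer"]})
    # The final question is left open for the model to answer.
    messages.append({"role": "user", "content": question})
    return messages


exemplars = [
    {"question": "2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6", "answer": "B"},
    {"question": "3 * 3 = ?\nA. 9\nB. 6\nC. 12\nD. 3", "answer": "A"},
]
messages = build_multiturn_messages(exemplars, "5 - 1 = ?\nA. 3\nB. 5\nC. 4\nD. 6")
for message in messages:
    print(message["role"], "|", message["content"].splitlines()[0])
```

With two exemplars this yields five messages: two user/assistant pairs followed by the open user turn.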

Multimodal

from PIL import Image
import anvil

m = anvil.load("Qwen/Qwen2.5-VL-7B-Instruct")
out = m.generate(messages=[{"role": "user", "content": [
    {"type": "image", "image": Image.open("cat.png")},
    {"type": "text",  "text": "What is in this image?"},
]}])
print(out.text)
print(out.image_token_counts)      # per-image vision-token counts

Custom modalities (RNA, audio, embeddings, anything)

from transformers import AutoModel
import anvil

model = anvil.load_custom(
    model_id="multimolecule/rnafm",
    model_class=AutoModel,
)

@anvil.register_task
class RNAFunctionRegression(anvil.Task):
    name = "rna_function_v1"
    dataset = "myorg/rna-function-set"

    def doc_to_request(self, doc):
        return anvil.Embed(input=doc["sequence"], pool="mean", layer=-1)

    def request_to_prediction(self, response, doc):
        return response.embedding

    def aggregate(self, predictions, docs):
        # your metric, your call — Spearman, Ridge probe, anything
        ...

result = anvil.eval(model=model, tasks=["rna_function_v1"])
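
The `aggregate` hook above is left open on purpose. As one possible metric, here is a dependency-free Spearman rank correlation. It assumes each prediction has already been reduced to a single scalar per document (for example by a probe fitted on the embeddings), which the task snippet does not show; the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(predictions, targets):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predictions), average_ranks(targets))


print(spearman([0.1, 0.4, 0.9], [10.0, 20.0, 30.0]))
print(spearman([0.1, 0.4, 0.9], [30.0, 20.0, 10.0]))
```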

Migrating from lm-evaluation-harness

# Before:
lm_eval --model vllm \
    --model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
    --tasks mmlu_pro,arc_challenge \
    --apply_chat_template \
    --num_fewshot 5 \
    --output_path ./out

# After:
anvil eval --model Qwen/Qwen2.5-7B-Instruct \
    --lm-eval-tasks mmlu_pro.yaml,arc_challenge.yaml \
    --n-fewshot 5 \
    --output ./run.json

# Validate the migration:
anvil eval --model Qwen/Qwen2.5-7B-Instruct \
    --lm-eval-tasks arc_challenge.yaml \
    --compare-with-lm-eval \
    --output ./run.json

OpenAI-compatible server

anvil serve --model Qwen/Qwen2.5-7B-Instruct --port 8000

# Drop-in replacement for the OpenAI client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-checked")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

Tool calling is constrained-decoding-driven (one grammar; no per-model --tool-call-parser flag matrix).

Diagnosing your environment

anvil doctor
# anvil         ok    anvil 0.4.0
# python        ok    Python 3.11.15
# cuda          warn  CUDA not available — torch wheel may not match driver
# transformers  ok    transformers 4.57.6
# vllm          warn  vLLM is not installed
# hf_token      warn  HF_TOKEN is not set
# ...

anvil doctor --json    # machine-readable for CI
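
For a quick gate in a script, the human-readable report can be parsed directly. The sketch below assumes the three-column layout (check name, status, detail) shown in the sample output above; `parse_doctor_report` is a hypothetical helper, and the schema of the `--json` output is not documented here, so it is not used.

```python
def parse_doctor_report(text):
    """Split each report line into (name, status, detail) tuples."""
    checks = []
    for raw in text.splitlines():
        # Drop the leading "# " that the README uses to show output as comments.
        parts = raw.lstrip("# ").strip().split(None, 2)
        if len(parts) < 2:
            continue  # blank lines and the trailing "..." carry no check
        name, status = parts[0], parts[1]
        detail = parts[2] if len(parts) == 3 else ""
        checks.append((name, status, detail))
    return checks


# Lines copied from the sample report above.
sample = """\
# anvil         ok    anvil 0.4.0
# python        ok    Python 3.11.15
# vllm          warn  vLLM is not installed
# hf_token      warn  HF_TOKEN is not set
# ...
"""

checks = parse_doctor_report(sample)
warnings = [name for name, status, _ in checks if status == "warn"]
print(warnings)
```

A CI job could then fail when `warnings` contains a check it cares about, such as `hf_token` for gated models.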

Design pillars

Research as a first-class user. Per-request logits processors (including DoLa), hidden-state extraction, structured output, and custom decoding strategies are stable, versioned public APIs.
Datasets and benchmarks integrate in five lines. A versioned task spec, batched evaluation primitives that drive the engine at full throughput, and a built-in library of the benchmarks that actually matter.
Day-zero model support, by default. New HuggingFace architectures load via the transformers backend the day they drop. The top architectures have a fast path.
Reproducibility by construction. Every run produces a manifest with the model SHA, dataset SHA, chat-template hash, sampler params, library version, and tokenizer version. Two runs with the same manifest produce identical numbers.
CaaS preflight agent. Rule engine + curated KB + LLM fallback tier runs before every major run, catches silent failures, and surfaces them as a reviewable diff.

Built-in benchmarks

Tier 1, built in: GSM8K (M0); MMLU, MMLU-MultiTurn, and HumanEval+ (M1); MMMU (M4). Tier 2: lm-evaluation-harness imports for the rest of the catalog. Tier 3: custom tasks for any modality.

Milestones

M0 — HF slow path, GSM8K, manifest emitted.
M1 — vLLM wrapper + ChatTemplate canonicalization + MMLU/HumanEval+.
M2 — Manifest canonical JSON + sign/verify/diff/replay/strip-caas.
M3 — CaaS rule engine + 15-entry KB + 10-case test corpus (70% auto-resolve, 0% false positive).
M4 — Multimodal (Qwen2.5-VL fast-path marker + MMMU + VLM-aware preflight).
M5 — lm-eval-harness shim + custom non-text modality (RNA example).
M6 — uv wheels (cu121/cu128/cu130), 5 fast paths, OpenAI-compatible serve, anvil doctor.
v0.2.0 — MultiTurnFewshot, Classify request type, HiddenStateSpec, per-request logits processors (HF + vLLM), real dataset SHAs, lm-eval task name resolution.
v0.3.0 — DoLa contrastive decoding, CaaS LLM tier (Anthropic + OpenAI-compatible).
v0.4.0 — MCP server (anvil mcp); five tools for AI research agents.

License

Apache-2.0. See LICENSE.

Anvil: the same manifest produces the same number, today, tomorrow, and on someone else's machine.

Release history

0.4.0 (this version): May 15, 2026
0.3.1: May 15, 2026
0.3.0: May 15, 2026
0.2.0: May 14, 2026

Download files

Source Distribution

anvil_eval-0.4.0.tar.gz (359.4 kB)

Uploaded May 15, 2026 Source

Built Distribution

anvil_eval-0.4.0-py3-none-any.whl (180.5 kB)

Uploaded May 15, 2026 Python 3

File details

Details for the file anvil_eval-0.4.0.tar.gz.

File metadata

Download URL: anvil_eval-0.4.0.tar.gz
Upload date: May 15, 2026
Size: 359.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for anvil_eval-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`969762e1950c0364aa0177429f06637268ae976b77c5dcc9baba586d71361362`
MD5	`4e4d78a139a042b4f5268479ef904262`
BLAKE2b-256	`c8669c6d0258e0f0dcf76515069f7dd1fa596b6a4bc78d407e96d6f1d5aa146f`
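
A downloaded file can be checked against the SHA256 digest listed above with the standard library. The sketch below is one way to do it; `sha256_of_file` is a hypothetical helper and the file path is wherever you saved the sdist.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest published above for anvil_eval-0.4.0.tar.gz.
EXPECTED = "969762e1950c0364aa0177429f06637268ae976b77c5dcc9baba586d71361362"

# Usage, once the file has been downloaded to the current directory:
# assert sha256_of_file("anvil_eval-0.4.0.tar.gz") == EXPECTED
```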

Provenance

The following attestation bundles were made for anvil_eval-0.4.0.tar.gz:

Publisher: wheels.yml on bishoymoussa/anvil

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: anvil_eval-0.4.0.tar.gz
- Subject digest: 969762e1950c0364aa0177429f06637268ae976b77c5dcc9baba586d71361362
- Sigstore transparency entry: 1549130121
- Sigstore integration time: May 15, 2026
Source repository:
- Permalink: bishoymoussa/anvil@928beb2763c7fdcfada4966e6968e7201b4130bc
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/bishoymoussa
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: wheels.yml@928beb2763c7fdcfada4966e6968e7201b4130bc
- Trigger Event: push

File details

Details for the file anvil_eval-0.4.0-py3-none-any.whl.

File metadata

Download URL: anvil_eval-0.4.0-py3-none-any.whl
Upload date: May 15, 2026
Size: 180.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for anvil_eval-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5253526e2aeea9c339c304eb5e49196ef3834d690b3308e5cfcf4891420371df`
MD5	`ed301f7f49a4c33bbf5183e36ccff38d`
BLAKE2b-256	`05f78f4b45be644dd0c8bf8f4d0d80f78f0dfb6b22b239da5dcaf433a21f2fb5`

Provenance

The following attestation bundles were made for anvil_eval-0.4.0-py3-none-any.whl:

Publisher: wheels.yml on bishoymoussa/anvil

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: anvil_eval-0.4.0-py3-none-any.whl
- Subject digest: 5253526e2aeea9c339c304eb5e49196ef3834d690b3308e5cfcf4891420371df
- Sigstore transparency entry: 1549130163
- Sigstore integration time: May 15, 2026
Source repository:
- Permalink: bishoymoussa/anvil@928beb2763c7fdcfada4966e6968e7201b4130bc
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/bishoymoussa
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: wheels.yml@928beb2763c7fdcfada4966e6968e7201b4130bc
- Trigger Event: push
