A research-first, evaluation-first inference library.
Status: alpha (v0.2.0). All seven v0 milestones (M0–M6) are implemented and passing. Per-request logits processors (vLLM + HF), real dataset SHAs in manifests, and CI on Python 3.11/3.12 are live. Built per the design manuscript in docs/design.md.
Not yet in alpha: multi-turn fewshot, the Classify request type, DoLa (v0.5). The CaaS LLM tier is v1.
What this is
Anvil is not trying to be the fastest inference engine. vLLM and SGLang win on throughput. Anvil's identity is correctness, reproducibility, and research ergonomics:
- Every run produces a content-hashed Manifest: two runs with the same manifest must produce identical numbers, byte-for-byte.
- Every chat template, tokenization, sampler, and image input is a versioned, hashed object, not a string loaded from a file at runtime.
- Day-zero new-model coverage via a transformers slow path; popular architectures graduate to a fast path.
- Per-request logits processors and hidden-state extraction are stable public APIs (the V0-vLLM API, restored).
- A preflight CaaS agent (rule engine + curated 15-entry KB) runs before every major run, catches the silent failures (missing chat template, EOS misconfigured, OOM-from-bad-config), and either fixes them or refuses to publish a manifest that crossed a silent regression.
See docs/design.md for the full design rationale.
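The content-hashed manifest idea in the list above can be illustrated with a minimal sketch. This is not Anvil's implementation: the `manifest_digest` helper and the field names are hypothetical, and the canonicalization (sorted keys, compact separators, UTF-8) is one common convention assumed here for illustration.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of `manifest`."""
    # Sorted keys and fixed separators make the byte encoding independent of
    # dict insertion order, so equal manifests always hash to the same digest.
    canonical = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest fields, for illustration only.
run_a = {"model_sha": "abc123", "sampler": {"temperature": 0.0, "seed": 0}}
run_b = {"sampler": {"seed": 0, "temperature": 0.0}, "model_sha": "abc123"}
run_c = {"model_sha": "abc123", "sampler": {"temperature": 0.7, "seed": 0}}

assert manifest_digest(run_a) == manifest_digest(run_b)  # key order is irrelevant
assert manifest_digest(run_a) != manifest_digest(run_c)  # any field change shows up
```

Hashing a canonical encoding rather than the file on disk means whitespace or key-order differences between two saved manifests do not change the digest.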
Install
uv pip install anvil-eval
Wheels ship for cu121, cu128, cu130, plus a CPU fallback. The pure-Python install always works against any torch ≥ 2.4 / CUDA ≥ 12.1.
Import name: the Python package is still imported as anvil (import anvil); only the PyPI distribution name is anvil-eval.
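Because the distribution name and the import name differ, version lookups go through the distribution name. A small sketch using the standard library's `importlib.metadata`; the `installed_version` helper is a hypothetical name introduced here:

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Query by the PyPI distribution name ("anvil-eval"), not the import name ("anvil").
print(installed_version("anvil-eval"))
```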
For development:
uv venv .venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[dev]"
Optional extras: .[vllm] for the vLLM backend, .[multimodal] for video/audio, .[xgrammar] for tool calling.
Quickstart
import anvil
result = anvil.eval(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tasks=["mmlu", "gsm8k", "humaneval"],
)
print(result.scores) # {"mmlu": {...}, "gsm8k": {...}, ...}
result.manifest.save("run.json")
# CLI equivalent
anvil eval --model meta-llama/Llama-3.1-8B-Instruct \
--tasks mmlu,gsm8k,humaneval \
--output ./run.json
# Verify reproducibility
anvil manifest verify run.json
# Diff two runs to find which fields explain a score gap
anvil manifest diff run.json other.json
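To make the idea behind a manifest diff concrete, here is a minimal sketch that walks two nested dictionaries and reports the dotted field paths whose values differ. It is an illustration only, not the behavior of `anvil manifest diff`; the `diff_manifests` helper, the `MISSING` marker, and the manifest fields are all hypothetical.

```python
from typing import Any

MISSING = "<missing>"  # placeholder for a field present on only one side


def diff_manifests(a: dict, b: dict, prefix: str = "") -> dict[str, tuple[Any, Any]]:
    """Map each dotted field path that differs to its (a, b) pair of values."""
    changes: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left = a.get(key, MISSING)
        right = b.get(key, MISSING)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse so that only the leaf fields that changed are reported.
            changes.update(diff_manifests(left, right, prefix=f"{path}."))
        elif left != right:
            changes[path] = (left, right)
    return changes


# Hypothetical manifests, for illustration only.
run = {"model_sha": "abc123", "sampler": {"temperature": 0.0, "top_p": 1.0}}
other = {
    "model_sha": "abc123",
    "sampler": {"temperature": 0.7, "top_p": 1.0},
    "library_version": "0.2.0",
}

for field, (old, new) in diff_manifests(run, other).items():
    print(f"{field}: {old!r} -> {new!r}")
```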
Multimodal
from PIL import Image
import anvil
m = anvil.load("Qwen/Qwen2.5-VL-7B-Instruct")
out = m.generate(messages=[{"role": "user", "content": [
    {"type": "image", "image": Image.open("cat.png")},
    {"type": "text", "text": "What is in this image?"},
]}])
print(out.text)
print(out.image_token_counts) # per-image vision-token counts
Custom modalities (RNA, audio, embeddings, anything)
from transformers import AutoModel
import anvil
model = anvil.load_custom(
    model_id="multimolecule/rnafm",
    model_class=AutoModel,
)

@anvil.register_task
class RNAFunctionRegression(anvil.Task):
    name = "rna_function_v1"
    dataset = "myorg/rna-function-set"

    def doc_to_request(self, doc):
        return anvil.Embed(input=doc["sequence"], pool="mean", layer=-1)

    def request_to_prediction(self, response, doc):
        return response.embedding

    def aggregate(self, predictions, docs):
        # your metric, your call: Spearman, Ridge probe, anything
        ...

result = anvil.eval(model=model, tasks=["rna_function_v1"])
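As one example of a metric that could go inside an aggregate hook, here is a dependency-free sketch of Spearman rank correlation. It assumes the predictions have already been reduced to one scalar score per document (how an embedding becomes a scalar is task-specific and not shown), and the helper names are hypothetical rather than part of Anvil.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5
```

With this in place, an aggregate method could return something like `spearman(scores, labels)`, where both lists are built from `predictions` and `docs` in whatever way the task defines.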
Migrating from lm-evaluation-harness
# Before:
lm_eval --model vllm \
--model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
--tasks mmlu_pro,arc_challenge \
--apply_chat_template \
--num_fewshot 5 \
--output_path ./out
# After:
anvil eval --model Qwen/Qwen2.5-7B-Instruct \
--lm-eval-tasks mmlu_pro.yaml,arc_challenge.yaml \
--n-fewshot 5 \
--output ./run.json
# Validate the migration:
anvil eval --model Qwen/Qwen2.5-7B-Instruct \
--lm-eval-tasks arc_challenge.yaml \
--compare-with-lm-eval \
--output ./run.json
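This section does not say what --compare-with-lm-eval reports. One plausible way to validate a migration by hand is to compare per-task scores from both tools within a tolerance. The sketch below assumes each tool's result has been reduced to a flat task-to-score mapping; the `score_mismatches` helper and the numbers are hypothetical.

```python
import math


def score_mismatches(anvil_scores: dict, lm_eval_scores: dict, abs_tol: float = 1e-6) -> dict:
    """Return {task: (anvil, lm_eval)} for tasks whose scores disagree or are missing."""
    mismatches = {}
    for task in sorted(set(anvil_scores) | set(lm_eval_scores)):
        ours = anvil_scores.get(task)
        theirs = lm_eval_scores.get(task)
        if ours is None or theirs is None:
            mismatches[task] = (ours, theirs)  # task evaluated by only one tool
        elif not math.isclose(ours, theirs, rel_tol=0.0, abs_tol=abs_tol):
            mismatches[task] = (ours, theirs)
    return mismatches


# Made-up scores, for illustration only.
print(score_mismatches({"arc_challenge": 0.5}, {"arc_challenge": 0.52}))
```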
OpenAI-compatible server
anvil serve --model Qwen/Qwen2.5-7B-Instruct --port 8000
# Drop-in replacement for the OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-checked")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
Tool calling is constrained-decoding-driven (one grammar; no per-model --tool-call-parser flag matrix).
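For reference, a tool-calling request in the OpenAI chat-completions format carries a `tools` list of JSON-schema function descriptions. The sketch below only builds and prints such a request body; it assumes `anvil serve` accepts the standard `tools` field, which this section implies but does not spell out, and the `get_weather` tool and `build_tool_request` helper are hypothetical.

```python
import json


def build_tool_request(model: str, user_text: str) -> dict:
    """Build a chat-completions request body that offers one function tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [weather_tool],
    }


body = build_tool_request("Qwen/Qwen2.5-7B-Instruct", "What is the weather in Cairo?")
print(json.dumps(body, indent=2))
```

With the OpenAI Python client, the same `tools` list would be passed as the `tools` argument to `client.chat.completions.create`.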
Diagnosing your environment
anvil doctor
# anvil ok anvil 0.0.1
# python ok Python 3.11.15
# cuda warn CUDA not available — torch wheel may not match driver
# transformers ok transformers 4.57.6
# vllm warn vLLM is not installed
# hf_token warn HF_TOKEN is not set
# ...
anvil doctor --json # machine-readable for CI
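The schema of the --json output is not documented in this section, so the sketch below parses the human-readable lines shown above instead and picks out checks by status. The `parse_doctor_report` and `names_with_status` helpers are hypothetical, and the "fail" status is an assumption, since only ok and warn appear in the sample output.

```python
KNOWN_STATUSES = {"ok", "warn", "fail"}  # "fail" is assumed; the sample shows only ok/warn


def parse_doctor_report(text: str) -> dict:
    """Parse 'name status message' lines into {name: (status, message)}."""
    checks = {}
    for line in text.splitlines():
        parts = line.lstrip("# ").split(None, 2)
        if len(parts) < 2 or parts[1] not in KNOWN_STATUSES:
            continue  # skip blank lines, "...", and anything unrecognized
        message = parts[2] if len(parts) == 3 else ""
        checks[parts[0]] = (parts[1], message)
    return checks


def names_with_status(checks: dict, status: str) -> list:
    """Return check names with the given status, in report order."""
    return [name for name, (found, _) in checks.items() if found == status]


SAMPLE = """\
# anvil ok anvil 0.0.1
# python ok Python 3.11.15
# vllm warn vLLM is not installed
# hf_token warn HF_TOKEN is not set
# ...
"""

print(names_with_status(parse_doctor_report(SAMPLE), "warn"))
```

A CI job could, for instance, exit non-zero when `names_with_status(checks, "fail")` is non-empty while tolerating warnings.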
Design pillars
- Research as a first-class user. Per-request logits processors, hidden-state extraction, structured output, and custom decoding strategies are stable, versioned public APIs.
- Datasets and benchmarks integrate in five lines. A versioned task spec, batched evaluation primitives that drive the engine at full throughput, and a built-in library of the benchmarks that actually matter.
- Day-zero model support, by default. New HuggingFace architectures load via the transformers backend the day they drop. The top architectures have a fast path.
- Reproducibility by construction. Every run produces a manifest with the model SHA, dataset SHA, chat-template hash, sampler params, library version, and tokenizer version. Two runs with the same manifest produce identical numbers.
- CaaS preflight agent. A small rule engine + curated known-issue database runs a smoke test before any major run, catches silent failures, and surfaces them as a diff for review.
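The preflight pillar can be pictured as a list of small rules run over a run configuration before anything expensive starts. The following is a minimal sketch of that pattern, not the CaaS agent itself: the config keys, rule names, and `Finding` type are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Finding:
    rule: str
    message: str


def missing_chat_template(cfg: dict) -> Optional[Finding]:
    # A chat model evaluated without its template degrades silently.
    if cfg.get("is_chat_model") and not cfg.get("chat_template"):
        return Finding("missing_chat_template", "chat model has no chat template")
    return None


def eos_misconfigured(cfg: dict) -> Optional[Finding]:
    # An EOS id outside the vocabulary can never be sampled, so generation never stops early.
    eos = cfg.get("eos_token_id")
    if eos is None or not (0 <= eos < cfg.get("vocab_size", 0)):
        return Finding("eos_misconfigured", f"eos_token_id={eos!r} is not a valid token id")
    return None


RULES: list[Callable[[dict], Optional[Finding]]] = [missing_chat_template, eos_misconfigured]


def preflight(cfg: dict) -> list[Finding]:
    """Run every rule and collect the findings; an empty list means no rule fired."""
    return [finding for rule in RULES if (finding := rule(cfg)) is not None]


# Hypothetical configuration with a missing chat template.
bad = {"is_chat_model": True, "chat_template": "", "eos_token_id": 2, "vocab_size": 32000}
for finding in preflight(bad):
    print(f"{finding.rule}: {finding.message}")
```

Keeping each rule a plain function makes it easy to add a known-issue entry without touching the runner.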
Built-in benchmarks (v0)
Built in: GSM8K (M0), MMLU + HumanEval+ (M1), MMMU (M4). Tier 2 covers the rest of the catalog through lm-evaluation-harness imports. Tier 3 is custom tasks for any modality.
Milestones
The build proceeded milestone-by-milestone (docs/design.md §16.10), all green:
- M0 — HF slow path, GSM8K, manifest emitted.
- M1 — vLLM wrapper + ChatTemplate canonicalization + MMLU/HumanEval+.
- M2 — Manifest canonical JSON + sign/verify/diff/replay/strip-caas.
- M3 — CaaS rule engine + 15-entry KB + 10-case test corpus (70% auto-resolve, 0% false positive).
- M4 — Multimodal (Qwen2.5-VL fast-path marker + MMMU + VLM-aware preflight).
- M5 — lm-eval-harness shim + custom non-text modality (RNA example).
- M6 — uv wheels (cu121/cu128/cu130), 5 fast paths, OpenAI-compatible serve, anvil doctor.
License
Apache-2.0. See LICENSE.
Anvil — the same manifest produces the same number, today, tomorrow, and on someone else's machine.