Longevity LLM benchmark CLI — Estimathon-style evaluation for aging-biology tasks

Murphy — Longevity Benchmark CLI

Evaluate any LLM on aging-biology tasks using an Estimathon-style benchmark. Models submit intervals [min, max] for numerical questions, receive only binary feedback (GOOD / BAD), and manage a shared submission budget across all problems. Non-numerical tasks (binary, multiclass, ternary, generation) are scored with standard accuracy / F1.


Install

cd longivity_hack
pip install -r requirements.txt

First-time setup

Run the setup wizard inside the chat to configure keys and verify dataset access:

python cli.py

Then type:

/setup

The wizard walks through:

  1. Anthropic API key — required for the chat interface and --provider anthropic runs
  2. HuggingFace token — required for LongeBench dataset; wizard verifies live access
  3. OpenAI API key — optional

Keys are saved to ~/.longevity/config.json and masked on input.

LongeBench is a gated dataset. Before your token will work, visit huggingface.co/datasets/insilicomedicine/longebench and click Request access. Approval is usually instant. Then re-run /setup to verify.


Interactive chat (recommended)

python cli.py          # opens chat directly
python cli.py chat     # same thing

Type naturally — Claude calls the right tools. Type / to see all commands with Tab autocomplete.

Command        Args                        Description
/setup                                     Configure API keys + verify HuggingFace access
/help                                      Show all commands
/benchmark     [model] [provider] [tasks]  Quick-run with current defaults
/question_set  [source] [limit]            Preview tasks
/status        [model] [provider]          Check model connectivity
/model         [id]                        Show or set benchmark model
/provider      [name]                      Show or set provider
/tasks         [source]                    Show or set default task source
/think                                     Toggle chain-of-thought traces
/config        [key] [value]               View or set a config value
/clear                                     Clear conversation history
/exit                                      Exit

Running benchmarks

Full LongeBench — mixed mode (recommended)

Runs Estimathon on numerical tasks and one-shot accuracy on categorical tasks:

python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks longebench \
  --mode mixed \
  --limit 50

Estimathon only (numerical tasks)

python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks sample \
  --mode estimathon \
  --think

One-shot baseline

python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks longebench \
  --mode one-shot \
  --limit 100

Against the L-LLM endpoint

python cli.py run \
  --model longevity-llm \
  --provider endpoint \
  --endpoint https://saujlffcxf20v74m.us-east-2.aws.endpoints.huggingface.cloud \
  --api-key <hf-token> \
  --tasks longebench \
  --mode mixed \
  --limit 50

Estimathon rules

score = (10 + Σ floor(max/min) over GOOD final answers) × 2^(N − number of GOOD final answers), where N is the number of problems
  • Only the last submission per problem counts
  • Refining a GOOD interval is a voluntary bet — if the new interval misses, you lose that problem
  • Feedback is binary only: GOOD or BAD — no "too high / too low"
  • Default budget: floor(1.38 × N) slips (submissions), shared across all N problems
  • Lower score is better
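
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in runner.py: the names slip_budget and estimathon_score are hypothetical, and it assumes an interval is GOOD when min > 0 and the true value lies within [min, max], as in a standard Estimathon.

```python
import math


def slip_budget(n_problems: int, factor: float = 1.38) -> int:
    """Default number of slips shared across all problems."""
    return math.floor(factor * n_problems)


def estimathon_score(final_intervals, truths):
    """Score a finished session (lower is better).

    final_intervals: one (lo, hi) tuple per problem, or None if the
                     problem was never attempted. Only the last
                     submission per problem should be passed in.
    truths:          the true value for each problem.
    """
    n_problems = len(truths)
    ratio_sum = 0
    n_good = 0
    for interval, truth in zip(final_intervals, truths):
        if interval is None:
            continue
        lo, hi = interval
        # Assumption: GOOD means a positive lower bound that brackets the truth.
        if lo > 0 and lo <= truth <= hi:
            ratio_sum += math.floor(hi / lo)
            n_good += 1
    # Every problem without a GOOD final answer doubles the score.
    return (10 + ratio_sum) * 2 ** (n_problems - n_good)


if __name__ == "__main__":
    intervals = [(10, 20), (1, 5), None]
    truths = [15, 7, 3]
    print(slip_budget(len(truths)), estimathon_score(intervals, truths))
```

In this example only the first interval is GOOD, so the score is (10 + 2) × 2^2 = 48. Each problem left without a GOOD final answer doubles the total, which is why a failed refinement is costly.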

Refinement accuracy is the key signal: of all voluntary bets on GOOD intervals, what fraction paid off? Random guessing wins ~50%, so a rate well above 50% suggests genuine biological reasoning rather than luck.
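
One way to judge whether a refinement accuracy is meaningfully above 50% is a one-sided exact binomial test. The sketch below is an assumption about how one might analyse the results, not part of the CLI. The function names are hypothetical, and it takes the ~50% random baseline stated above as the null hypothesis.

```python
from math import comb


def refinement_accuracy(outcomes):
    """Fraction of refinement bets that stayed GOOD.

    outcomes: list of booleans, True when the refined interval was GOOD.
    Returns None when no refinement bets were made.
    """
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


def binomial_tail(wins: int, bets: int, p: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(bets, p): a one-sided p-value."""
    return sum(
        comb(bets, k) * p**k * (1 - p) ** (bets - k)
        for k in range(wins, bets + 1)
    )


if __name__ == "__main__":
    outcomes = [True] * 8 + [False] * 2
    print(refinement_accuracy(outcomes))
    print(binomial_tail(sum(outcomes), len(outcomes)))
```

With few bets, even a high hit rate can be compatible with guessing: 8 wins out of 10 gives a tail probability of 56/1024, roughly 0.055.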

Two-track scoring in mixed mode

Track       Task formats                 Scoring
Estimathon  regression, pairwise         Interval score + refinement accuracy
One-shot    binary, multiclass, ternary  Exact-match accuracy
One-shot    generation (gene lists)      Token F1 ≥ 0.5 = correct
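
For the generation row, a token-level F1 with a 0.5 threshold could look like the sketch below. The tokenisation (lowercasing, splitting on commas and whitespace) and the names token_f1 and is_correct are assumptions made for illustration; the benchmark's own normalisation may differ.

```python
from collections import Counter


def _tokens(text: str):
    # Assumption: gene lists are separated by commas and/or whitespace.
    return text.lower().replace(",", " ").split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        # Two empty answers match; one empty answer does not.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both lists.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_correct(prediction: str, reference: str, threshold: float = 0.5) -> bool:
    """A generation answer counts as correct when F1 reaches the threshold."""
    return token_f1(prediction, reference) >= threshold


if __name__ == "__main__":
    print(token_f1("TP53, FOXO3, SIRT1", "FOXO3 SIRT1 MTOR APOE"))
```

Here two of three predicted genes appear among four reference genes, giving precision 2/3, recall 1/2, and F1 = 4/7, which clears the 0.5 threshold.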

Providers

--provider  Connects to                Credential
anthropic   Anthropic API              anthropic.api_key / ANTHROPIC_API_KEY
endpoint    Any OpenAI-compatible URL  --api-key + --endpoint
hf          HuggingFace Inference API  hf.token / HF_TOKEN
openai      OpenAI API                 openai.api_key / OPENAI_API_KEY
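
The credential column pairs a config key with an environment variable. A lookup along those lines might be sketched as follows. This is not the code in config.py: the function name resolve_credential is hypothetical, the precedence (environment variable first, then config file) is an assumption, and so is reading a dotted key such as anthropic.api_key as a nested JSON object.

```python
import json
import os
from pathlib import Path

# Config key and environment variable per provider, as listed in the table.
CREDENTIALS = {
    "anthropic": ("anthropic.api_key", "ANTHROPIC_API_KEY"),
    "hf": ("hf.token", "HF_TOKEN"),
    "openai": ("openai.api_key", "OPENAI_API_KEY"),
}

DEFAULT_CONFIG = Path.home() / ".longevity" / "config.json"


def resolve_credential(provider, config_path=DEFAULT_CONFIG, environ=None):
    """Return the credential for a provider, or None if none is found."""
    environ = os.environ if environ is None else environ
    config_key, env_var = CREDENTIALS[provider]

    # Assumed precedence: a non-empty environment variable wins.
    if environ.get(env_var):
        return environ[env_var]

    try:
        node = json.loads(Path(config_path).read_text())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed config file.
        return None

    # Assumed layout: "anthropic.api_key" -> {"anthropic": {"api_key": "..."}}
    for part in config_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


if __name__ == "__main__":
    print(resolve_credential("hf", environ={"HF_TOKEN": "hf_example"}))
```

The endpoint provider is left out because it takes its credential from --api-key on the command line rather than from a stored key.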

Task sources

--tasks             Loads
sample              7 built-in tasks (no network required)
longebench          Full LongeBench benchmark (HuggingFace, gated)
longebench:extra    LongeBench extra split
path/to/file.jsonl  Local JSONL file

Output

Results are written to results.jsonl. Fields include:

Estimathon track

  • final_score — Estimathon score (lower is better)
  • n_good_final / n_problems — problems solved
  • slips_used / total_budget
  • refinement_accuracy — fraction of refinement bets that succeeded
  • slip_log — every submission with GOOD/BAD, width factor, score delta
  • think — per-slip chain-of-thought trace (with --think)

One-shot track

  • correct — boolean per task
  • f1 — for generation tasks
  • by_format — accuracy breakdown per format
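
To inspect a run afterwards, the file can be read line by line with the standard library. The sketch below assumes each line of results.jsonl is one JSON object and that the Estimathon fields listed above sit at the top level of that object. The exact record layout is not documented here, so the helper names and that layout are hypothetical.

```python
import json
from pathlib import Path


def load_results(path="results.jsonl"):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records


def summarize(record: dict) -> str:
    """One-line summary of whichever Estimathon fields are present."""
    parts = []
    if "final_score" in record:
        parts.append(f"score={record['final_score']}")
    if "n_good_final" in record and "n_problems" in record:
        parts.append(f"solved={record['n_good_final']}/{record['n_problems']}")
    if "slips_used" in record and "total_budget" in record:
        parts.append(f"slips={record['slips_used']}/{record['total_budget']}")
    if record.get("refinement_accuracy") is not None:
        parts.append(f"refinement_accuracy={record['refinement_accuracy']:.2f}")
    return ", ".join(parts)


if __name__ == "__main__":
    if Path("results.jsonl").exists():
        for rec in load_results():
            print(summarize(rec))
```

Fields that are absent from a record are simply left out of the summary, so the same helper works on one-shot records without raising.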

Project structure

longivity_hack/
├── cli.py                  Typer entry point
├── requirements.txt
├── idea.md                 Benchmark design document
├── devlog.md               Development log
├── CLAUDE.md               Developer guide for teammates
└── benchmark/
    ├── chat.py             Interactive chat UI (Claude tool-use, /setup wizard, slash autocomplete)
    ├── runner.py           Estimathon session, one-shot eval, run_mixed()
    ├── loader.py           Task loading — sample / LongeBench / local JSONL
    ├── client.py           Unified model client (all providers)
    ├── config.py           ~/.longevity/config.json
    └── results.py          JSONL writer / reader
