Longevity LLM benchmark CLI — Estimathon-style evaluation for aging-biology tasks

These details have not been verified by PyPI

Project links

Repository

Project description

Murphy — Longevity Benchmark CLI

Evaluate any LLM on aging-biology tasks using an Estimathon-style benchmark. Models submit intervals [min, max] for numerical questions, receive only binary feedback (GOOD / BAD), and manage a shared submission budget across all problems. Non-numerical tasks (binary, multiclass, ternary, generation) are scored with standard accuracy / F1.

Install

cd longivity_hack
pip install -r requirements.txt

First-time setup

Run the setup wizard inside the chat to configure keys and verify dataset access:

python cli.py

Then type:

/setup

The wizard walks through:

Anthropic API key — required for the chat interface and --provider anthropic runs
HuggingFace token — required for LongeBench dataset; wizard verifies live access
OpenAI API key — optional

Keys are saved to ~/.longevity/config.json and masked on input.

LongeBench is a gated dataset. Before your token will work, visit huggingface.co/datasets/insilicomedicine/longebench and click Request access. Approval is usually instant. Then re-run /setup to verify.

Interactive chat (recommended)

python cli.py          # opens chat directly
python cli.py chat     # same thing

Type naturally — Claude calls the right tools. Type / to see all commands with Tab autocomplete.

Command	Args	Description
`/setup`		Configure API keys + verify HuggingFace access
`/help`		Show all commands
`/benchmark`	`[model] [provider] [tasks]`	Quick-run with current defaults
`/question_set`	`[source] [limit]`	Preview tasks
`/status`	`[model] [provider]`	Check model connectivity
`/model`	`[id]`	Show or set benchmark model
`/provider`	`[name]`	Show or set provider
`/tasks`	`[source]`	Show or set default task source
`/think`		Toggle chain-of-thought traces
`/config`	`[key] [value]`	View or set a config value
`/clear`		Clear conversation history
`/exit`		Exit

Running benchmarks

Full LongeBench — mixed mode (recommended)

Runs Estimathon on numerical tasks and one-shot accuracy on categorical tasks:

python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks longebench \
  --mode mixed \
  --limit 50

Estimathon only (numerical tasks)

python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks sample \
  --mode estimathon \
  --think

One-shot baseline

python cli.py run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks longebench \
  --mode one-shot \
  --limit 100

Against the L-LLM endpoint

python cli.py run \
  --model longevity-llm \
  --provider endpoint \
  --endpoint https://saujlffcxf20v74m.us-east-2.aws.endpoints.huggingface.cloud \
  --api-key <hf-token> \
  --tasks longebench \
  --mode mixed \
  --limit 50

Estimathon rules

score = (10 + Σ floor(max/min) for GOOD final answers) × 2^(N − # good final answers)

Only the last submission per problem counts
Refining a GOOD interval is a voluntary bet — if the new interval misses, you lose that problem
Feedback is binary only: GOOD or BAD — no "too high / too low"
Default budget: floor(1.38 × N) slips across all N problems
Lower score is better

Refinement accuracy — the key signal: of all voluntary bets on GOOD intervals, what fraction paid off? Random guessing wins ~50%. Significantly above 50% means genuine biological reasoning.

Two-track scoring in mixed mode

Track	Task formats	Scoring
Estimathon	regression, pairwise	Interval score + refinement accuracy
One-shot	binary, multiclass, ternary	Exact-match accuracy
One-shot	generation (gene lists)	Token F1 ≥ 0.5 = correct

Providers

`--provider`	Connects to	Credential
`anthropic`	Anthropic API	`anthropic.api_key` / `ANTHROPIC_API_KEY`
`endpoint`	Any OpenAI-compatible URL	`--api-key` + `--endpoint`
`hf`	HuggingFace Inference API	`hf.token` / `HF_TOKEN`
`openai`	OpenAI API	`openai.api_key` / `OPENAI_API_KEY`

Task sources

`--tasks`	Loads
`sample`	7 built-in tasks — no network required
`longebench`	Full LongeBench benchmark (HuggingFace, gated)
`longebench:extra`	LongeBench extra split
`path/to/file.jsonl`	Local JSONL file

Output

Results written to results.jsonl. Fields include:

Estimathon track

final_score — Estimathon score (lower is better)
n_good_final / n_problems — problems solved
slips_used / total_budget
refinement_accuracy — fraction of refinement bets that succeeded
slip_log — every submission with GOOD/BAD, width factor, score delta
think — per-slip chain-of-thought trace (with --think)

One-shot track

correct — boolean per task
f1 — for generation tasks
by_format — accuracy breakdown per format

Project structure

longivity_hack/
├── cli.py                  Typer entry point
├── requirements.txt
├── idea.md                 Benchmark design document
├── devlog.md               Development log
├── CLAUDE.md               Developer guide for teammates
└── benchmark/
    ├── chat.py             Interactive chat UI (Claude tool-use, /setup wizard, slash autocomplete)
    ├── runner.py           Estimathon session, one-shot eval, run_mixed()
    ├── loader.py           Task loading — sample / LongeBench / local JSONL
    ├── client.py           Unified model client (all providers)
    ├── config.py           ~/.longevity/config.json
    └── results.py          JSONL writer / reader

Project details

These details have not been verified by PyPI

Project links

Repository

Release history Release notifications | RSS feed

0.3.1

May 24, 2026

This version

0.2.0

May 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

murthy_bench-0.2.0.tar.gz (40.8 kB view details)

Uploaded May 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

murthy_bench-0.2.0-py3-none-any.whl (43.4 kB view details)

Uploaded May 24, 2026 Python 3

File details

Details for the file murthy_bench-0.2.0.tar.gz.

File metadata

Download URL: murthy_bench-0.2.0.tar.gz
Upload date: May 24, 2026
Size: 40.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for murthy_bench-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`972e814c54ff04143ea8f9024e31200fc64459f8b403042428c280b97b85ccb8`
MD5	`9645e1fa5048fe1d28970389428118c1`
BLAKE2b-256	`d3b55d4c92da9ea1f19c5dc965ea297f839526f601c672ca2db6b5633d3ad56b`

See more details on using hashes here.

File details

Details for the file murthy_bench-0.2.0-py3-none-any.whl.

File metadata

Download URL: murthy_bench-0.2.0-py3-none-any.whl
Upload date: May 24, 2026
Size: 43.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for murthy_bench-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ae838637487eb7a50c9d798f67eda93deb76f59bb753b14ec2bdc5a3adfbbe50`
MD5	`cfd292650d4b10556912c36b5373a2a7`
BLAKE2b-256	`22627f56e0107101ce5df1b90180794c0260775b065e4159eb8f41d472bce1a9`

See more details on using hashes here.

murthy-bench 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Murphy — Longevity Benchmark CLI

Install

First-time setup

Interactive chat (recommended)

Running benchmarks

Full LongeBench — mixed mode (recommended)

Estimathon only (numerical tasks)

One-shot baseline

Against the L-LLM endpoint

Estimathon rules

Two-track scoring in mixed mode

Providers

Task sources

Output

Project structure

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes