Longevity LLM benchmark CLI — Estimathon-style evaluation for aging-biology tasks

These details have not been verified by PyPI

Project links

Repository

Project description

Murphy — Longevity Benchmark CLI

Evaluate any LLM on aging-biology tasks using an Estimathon-style benchmark. Models submit intervals [min, max] for numerical questions, receive only binary feedback (GOOD / BAD), and manage a shared submission budget across all problems. Non-numerical tasks (binary, multiclass, ternary, generation) are scored with standard accuracy / F1.

Install

pip install murthy-bench

Quick start

murthy

On first launch, Murphy runs a setup wizard to configure your API keys and verify dataset access. Keys are saved to ~/.longevity/config.json and masked on input.

The wizard walks through:

Anthropic API key — required for the chat interface and --provider anthropic runs
HuggingFace token — required for LongeBench dataset; wizard verifies live access
OpenAI API key — optional

LongeBench is a gated dataset. Before your HF token will work, visit huggingface.co/datasets/insilicomedicine/longebench and click Request access. Approval is usually instant. Then re-run /setup to re-verify.

You can re-run setup at any time from inside the chat:

/setup

Set keys manually

murthy config set anthropic.api_key  sk-ant-...
murthy config set hf.token           hf_...
murthy config list

Interactive chat

murthy

Type naturally — Claude calls the right tools. Type / to see all commands with Tab autocomplete.

Command	Args	Description
`/setup`		Re-run the API key wizard
`/help`		Show all commands
`/test`	`[model]`	Estimathon trial: 20 LongeBench tasks, 40-slip budget
`/benchmark`	`[model] [provider] [tasks]`	Quick-run with current defaults
`/explore`		Show all unique LongeBench task types + Estimathon compatibility
`/question_set`	`[source] [limit]`	Preview tasks
`/model`	`list \| search \| <id>`	List, search, or set benchmark model
`/batch`	`<models> [provider]`	Benchmark multiple models in sequence
`/add`	`<model_id> \| refresh`	Add model to list or refresh from HuggingFace
`/provider`	`[name]`	Show or set provider
`/tasks`	`[source]`	Show or set default task source
`/think`		Toggle chain-of-thought traces
`/status`	`[model] [provider]`	Check model connectivity
`/config`	`[key] [value]`	View or set a config value
`/clear`		Clear conversation history
`/exit`		Exit

Running benchmarks (CLI)

Full LongeBench — mixed mode (recommended)

murthy run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks longebench \
  --mode mixed \
  --limit 50

Estimathon only (numerical tasks)

murthy run \
  --model claude-sonnet-4-6 \
  --provider anthropic \
  --tasks sample \
  --mode estimathon \
  --think

Against the L-LLM endpoint

murthy run \
  --model longevity-llm \
  --provider endpoint \
  --endpoint https://saujlffcxf20v74m.us-east-2.aws.endpoints.huggingface.cloud \
  --api-key <hf-token> \
  --tasks longebench \
  --mode mixed \
  --limit 50

Estimathon rules

score = (10 + Σ floor(max/min) for GOOD final answers) × 2^(N − # good final answers)

Only the last submission per problem counts
Refining a GOOD interval is a voluntary bet — if the new interval misses, you lose that problem
Feedback is binary only: GOOD or BAD — no "too high / too low"
Default budget: floor(18/13 × N) slips across all N problems (matching real Estimathon's 18-slip / 13-problem ratio)
Lower score is better

Refinement accuracy — the key signal: of all voluntary bets on GOOD intervals, what fraction paid off? Random guessing wins ~50%. Significantly above 50% means genuine biological reasoning.

Two-track scoring in mixed mode

Track	Task formats	Scoring
Estimathon	regression	Interval score + refinement accuracy
One-shot	binary, multiclass, ternary	Exact-match accuracy
One-shot	generation (gene lists)	Token F1 ≥ 0.5 = correct

Providers

`--provider`	Connects to	Credential
`anthropic`	Anthropic API	`anthropic.api_key` / `ANTHROPIC_API_KEY`
`endpoint`	Any OpenAI-compatible URL	`--api-key` + `--endpoint`
`hf`	HuggingFace Inference API	`hf.token` / `HF_TOKEN`
`openai`	OpenAI API	`openai.api_key` / `OPENAI_API_KEY`

Task sources

`--tasks`	Loads
`sample`	7 built-in tasks — no network required
`longebench`	Full LongeBench benchmark (HuggingFace, gated)
`longebench:extra`	LongeBench extra split
`path/to/file.jsonl`	Local JSONL file

Output

Results written to results.jsonl. Fields include:

Estimathon track

final_score — Estimathon score (lower is better)
n_good_final / n_problems — problems solved
slips_used / total_budget
refinement_accuracy — fraction of refinement bets that succeeded
slip_log — every submission with GOOD/BAD, width factor, score delta
think — per-slip chain-of-thought trace (with --think)

One-shot track

correct — boolean per task
f1 — for generation tasks
by_format — accuracy breakdown per format

Project structure

longivity_hack/
├── cli.py                  Typer entry point (run / chat / status / tasks / config / group / compare)
├── hf_llm_models.csv       300+ HuggingFace models for /model search and /add refresh
├── idea.md                 Benchmark design rationale
├── devlog.md               Development log
├── CLAUDE.md               Developer guide for contributors
└── benchmark/
    ├── chat.py             Interactive chat UI (first-run wizard, slash commands)
    ├── runner.py           Estimathon session, one-shot eval, run_mixed()
    ├── loader.py           Task loading — sample / LongeBench / local JSONL
    ├── client.py           Unified model client (all providers)
    ├── config.py           ~/.longevity/config.json
    ├── results.py          JSONL writer / reader
    └── model_manager.py    CSV-backed model browser (/model, /add, /batch)

For a local dev setup (cloning and running from source), see SETUP.md.

Project details

These details have not been verified by PyPI

Project links

Repository

Release history Release notifications | RSS feed

This version

0.3.1

May 24, 2026

0.2.0

May 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

murthy_bench-0.3.1.tar.gz (53.3 kB view details)

Uploaded May 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

murthy_bench-0.3.1-py3-none-any.whl (56.0 kB view details)

Uploaded May 24, 2026 Python 3

File details

Details for the file murthy_bench-0.3.1.tar.gz.

File metadata

Download URL: murthy_bench-0.3.1.tar.gz
Upload date: May 24, 2026
Size: 53.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for murthy_bench-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`72ffb4cec42a17414e152d4c2d3ab792ce48819ad627cf5ab3d635fce74bdfeb`
MD5	`81a97fbc0298f9ac3689c48f144bc286`
BLAKE2b-256	`8a315c75887b078a12bd9f049d0afaf7c6f5ef23f4608cbb76a05ae8f0ee88aa`

See more details on using hashes here.

Provenance

The following attestation bundles were made for murthy_bench-0.3.1.tar.gz:

Publisher: publish.yml on OhhMoo/murthy-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: murthy_bench-0.3.1.tar.gz
- Subject digest: 72ffb4cec42a17414e152d4c2d3ab792ce48819ad627cf5ab3d635fce74bdfeb
- Sigstore transparency entry: 1624682163
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: OhhMoo/murthy-bench@8397a7bcecfc768e464bee0a03e4b9f4103776bc
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/OhhMoo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8397a7bcecfc768e464bee0a03e4b9f4103776bc
- Trigger Event: push

File details

Details for the file murthy_bench-0.3.1-py3-none-any.whl.

File metadata

Download URL: murthy_bench-0.3.1-py3-none-any.whl
Upload date: May 24, 2026
Size: 56.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for murthy_bench-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bff783bafc7ab8209b008c8ef5ea4c99a07df161f0bbcac1483357e1bdb3f7b4`
MD5	`115cdfccd689589bafe9f93cfe20f06a`
BLAKE2b-256	`05c1e77eb0db2a4cc1e526c90963ea981b7243890a37cdb8c7093a42070d1547`

See more details on using hashes here.

Provenance

The following attestation bundles were made for murthy_bench-0.3.1-py3-none-any.whl:

Publisher: publish.yml on OhhMoo/murthy-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: murthy_bench-0.3.1-py3-none-any.whl
- Subject digest: bff783bafc7ab8209b008c8ef5ea4c99a07df161f0bbcac1483357e1bdb3f7b4
- Sigstore transparency entry: 1624682168
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: OhhMoo/murthy-bench@8397a7bcecfc768e464bee0a03e4b9f4103776bc
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/OhhMoo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8397a7bcecfc768e464bee0a03e4b9f4103776bc
- Trigger Event: push

murthy-bench 0.3.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Murphy — Longevity Benchmark CLI

Install

Quick start

Set keys manually

Interactive chat

Running benchmarks (CLI)

Full LongeBench — mixed mode (recommended)

Estimathon only (numerical tasks)

Against the L-LLM endpoint

Estimathon rules

Two-track scoring in mixed mode

Providers

Task sources

Output

Project structure

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance