
A clinical evaluation framework for large language models.


Krisis

Clinical evaluation framework for testing LLM safety behavior in medical reasoning.

Krisis evaluates not only whether an LLM is correct, but whether it knows when to abstain, defer, or express uncertainty in high-stakes clinical tasks.

Why Krisis

Krisis grew out of Cady AI, an earlier chronic kidney disease (CKD) detection chatbot presented at a national AI hackathon. Cady AI used a model trained on the UCI Chronic Kidney Disease dataset to predict CKD/not-CKD, return class probabilities, and identify which lab results pushed risk upward.

That project exposed the next safety question: as LLMs become more fluent in clinical reasoning, can they recognize cases where they should not answer confidently? Krisis turns that question into a reusable evaluation framework: a human-in-the-loop style check on whether LLMs can defer, abstain, and express uncertainty before their outputs are trusted.

What Krisis Does

Krisis provides:

  • clinical task suites that produce structured patient records
  • a unified API backend for OpenAI, Anthropic, Grok, Gemini, and other OpenRouter-routed models
  • batched and concurrent benchmark execution
  • retry/backoff handling for transient provider failures
  • structured parsing of model predictions, confidence, and abstentions
  • abstention-aware metrics beyond plain accuracy
  • text, full JSON, and metrics-only JSON reports
  • execution metadata such as runtime, throughput, batch size, concurrency, and token usage
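
The retry/backoff item above can be illustrated with a generic pattern. This is a minimal sketch of exponential backoff with jitter, not Krisis' actual implementation: `TransientProviderError`, `call_with_backoff`, and all default values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class TransientProviderError(Exception):
    """Hypothetical stand-in for a retryable failure (timeout, HTTP 429/5xx)."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientProviderError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt (0.5s, 1s, 2s, ...), capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries so concurrent workers do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the retry schedule can be exercised in tests without waiting.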

Research Status And Limitations

Krisis v0.2 currently includes one implemented suite: Chronic Kidney Disease (CKD), based on the UCI CKD dataset.

Supported CKD tasks:

  • detection: CKD vs not CKD
  • staging: CKD stage classification
  • progression: synthetic progression stress test

Important limitations:

  • CKD is the only available suite in v0.2.
  • The UCI CKD dataset is small and cross-sectional.
  • Progression is synthetic because the source dataset is not longitudinal.
  • Krisis is for research and evaluation only. It is not a medical device and must not be used to diagnose or treat patients.
  • Results depend on model version, prompts, provider behavior, dataset quality, and benchmark settings.

Installation

Install Krisis:

pip install krisis

Install API model support:

pip install "krisis[api]"

Then create an API key from OpenRouter and set it locally:

export OPENROUTER_API_KEY="..."

Hugging Face support will use the hf extra when implemented:

pip install "krisis[hf]"

Quickstart

Warning: Krisis v0.2 includes only the CKD suite. The UCI CKD CSV is not bundled with the package; download it locally and pass its path to CKDSuite through data_path.

from krisis.backends.api import APIBackend
from krisis.benchmark import Benchmark
from krisis.data.base import FeatureSet, SuiteConfig, Task
from krisis.data.ckd.suite import CKDSuite
from krisis.results.report import format_report

suite = CKDSuite(
    config=SuiteConfig(
        features=FeatureSet.FULL,
        task=Task.DETECTION,
        seed=42,
        n_synthetic=80,
        test_size=0.2,
    ),
    data_path="datasets/ckd/ckd_full.csv",
)

backend = APIBackend(
    model="openai/gpt-5.5",
    api_key="YOUR_OPENROUTER_API_KEY",
    reasoning_effort="low",
)

result = Benchmark(
    suite,
    backend,
    batch_size=8,
    max_concurrency=2,
).run()

print(format_report(result))
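
Since the installation steps export OPENROUTER_API_KEY, the key can be read from the environment rather than pasted into the script. A minimal sketch follows; `load_openrouter_key` is a hypothetical helper, not part of Krisis.

```python
import os


def load_openrouter_key(env_var="OPENROUTER_API_KEY"):
    """Return the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running the benchmark")
    return key
```

The returned value would then be passed as `api_key=load_openrouter_key()` in place of the placeholder string in the quickstart.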

Outputs

Krisis supports three report styles.

Text report:

from krisis.results.report import format_report

print(format_report(result))

Full JSON report:

from krisis.results.report import format_json_report

print(format_json_report(result, include_results=True))

Metrics-only JSON report for plotting/model comparison:

from krisis.results.report import format_metrics_json_report

print(format_metrics_json_report(result))

The execution block includes benchmark runtime and operational metadata:

{
  "batch_size": 8,
  "max_concurrency": 2,
  "n_input_records": 160,
  "n_api_batches": 20,
  "elapsed_seconds": 42.18,
  "records_per_second": 3.79,
  "input_tokens": 12000,
  "output_tokens": 2400,
  "token_total": 14400
}
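
The derived fields in this example are consistent with simple arithmetic on the raw counts. The sketch below reproduces them under two assumptions that the source does not confirm: one API call per batch, and throughput measured as input records divided by wall-clock time. `execution_summary` is a hypothetical function name.

```python
import math


def execution_summary(n_input_records, batch_size, elapsed_seconds, input_tokens, output_tokens):
    """Recompute the derived execution fields from the raw counts (assumed formulas)."""
    return {
        # Assumes one API call per batch; the last batch may be smaller.
        "n_api_batches": math.ceil(n_input_records / batch_size),
        # Assumes throughput is records over wall-clock seconds, rounded to 2 places.
        "records_per_second": round(n_input_records / elapsed_seconds, 2),
        "token_total": input_tokens + output_tokens,
    }


summary = execution_summary(160, 8, 42.18, 12000, 2400)
print(summary)
# {'n_api_batches': 20, 'records_per_second': 3.79, 'token_total': 14400}
```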

Core Concepts

  • Suite: prepares a clinical dataset/task and returns patient records.
  • Backend: adapts a model provider to Krisis' standard response shape.
  • Benchmark: runs records through a backend with batching, concurrency, and retries.
  • Metric: scores model behavior across correctness, uncertainty, and deferral.
  • Report: serializes results as text or JSON for review, plotting, or papers.
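
To make the Benchmark concept concrete, here is a minimal sketch of batching with bounded concurrency using only the standard library. It is an illustration of the general pattern, not Krisis' code: `chunked`, `run_batches`, and the `predict_batch` callable are hypothetical, and a thread pool is assumed because provider calls are I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def run_batches(records, predict_batch, batch_size=8, max_concurrency=2):
    """Send batches to predict_batch with at most max_concurrency in flight."""
    batches = list(chunked(records, batch_size))
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in input order, even if batches finish out of order.
        results_per_batch = list(pool.map(predict_batch, batches))
    # Flatten back to one result per record.
    return [result for batch in results_per_batch for result in batch]
```

Keeping results in input order means each output can be matched to its patient record by position.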

Metrics

Krisis includes:

  • Accuracy
  • Balanced Accuracy
  • Selective Accuracy (answered only)
  • Abstention Rate
  • Answer Rate / Coverage
  • Deferral Alignment
  • Expected Calibration Error
  • Brier Score where applicable
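
The two calibration metrics in the list can be sketched from their standard definitions. This is not Krisis' implementation: the function names, the choice of 10 equal-width bins, and the assumption that abstained records are filtered out before calling are all illustrative.

```python
def brier_score(probabilities, labels):
    """Mean squared gap between the predicted positive-class probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(probabilities)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin by confidence, then average |accuracy - mean confidence| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        index = min(int(p * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[index].append((p, c))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece
```

For example, four answers each given at confidence 0.9 with three of them correct yield an ECE of about 0.15, the gap between stated confidence (0.9) and observed accuracy (0.75).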

Selective accuracy is computed over answered records only, so it separates how often the model was right when it answered from how often it chose not to answer, which the abstention rate reports.
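
A minimal sketch of how these abstention-aware quantities relate, assuming each record is a dict with `prediction`, `label`, and `abstained` keys. That record shape, the `selective_metrics` name, and the choice to count abstentions as incorrect in plain accuracy are assumptions for illustration, not Krisis' documented behavior.

```python
def selective_metrics(records):
    """Compute coverage, abstention rate, and accuracy with and without abstentions."""
    total = len(records)
    answered = [r for r in records if not r["abstained"]]
    correct = sum(1 for r in answered if r["prediction"] == r["label"])
    return {
        "coverage": len(answered) / total,
        "abstention_rate": (total - len(answered)) / total,
        # Undefined when the model abstained on every record.
        "selective_accuracy": correct / len(answered) if answered else None,
        # Assumption: an abstention counts as not correct in plain accuracy.
        "accuracy": correct / total,
    }


example = [
    {"prediction": "ckd", "label": "ckd", "abstained": False},
    {"prediction": "notckd", "label": "notckd", "abstained": False},
    {"prediction": "ckd", "label": "notckd", "abstained": False},
    {"prediction": None, "label": "ckd", "abstained": True},
]
metrics = selective_metrics(example)
```

Here the model answers three of four records (coverage 0.75) and gets two right, so selective accuracy is 2/3 while plain accuracy under the stated assumption is 0.5.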

Model Backends

Route  Backend     Example model
API    APIBackend  openai/gpt-5.5
API    APIBackend  anthropic/claude-opus-4.7
API    APIBackend  x-ai/grok-4.3
API    APIBackend  google/gemini-3.5-flash

All backends return the same structured fields:

prediction
abstained
confidence
raw_response
input_tokens
output_tokens
total_tokens
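
One way to picture this shared shape is as a small dataclass. The sketch below is hypothetical: the class name `ModelResponse`, the field types, and deriving `total_tokens` from the two token counts are assumptions, and the real Krisis type may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResponse:  # hypothetical name, not necessarily Krisis' class
    prediction: Optional[str]    # parsed label, or None when the model abstained
    abstained: bool              # True if the model declined to answer
    confidence: Optional[float]  # self-reported confidence in [0, 1], if provided
    raw_response: str            # unparsed provider output, kept for auditing
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        # Assumption: total is the sum of input and output tokens.
        return self.input_tokens + self.output_tokens
```

Keeping `raw_response` alongside the parsed fields lets a reviewer check how a prediction or abstention was extracted from the model's text.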

Citation

If you use Krisis in research, please cite it as software:

@software{watila_krisis_2026,
  author = {Watila, Emmanuel},
  title = {Krisis: A Clinical Evaluation Framework for Large Language Models},
  year = {2026},
  version = {0.2.0},
  url = {https://github.com/devsgnr/krisis}
}

License

Apache-2.0

