Skip to main content

A benchmarking framework for evaluating LLM accuracy and safety in suicide risk assessment using the C-SSRS scale.

Project description

Suicide Risk Assessment AI Benchmark

This repository provides a standardized framework for benchmarking Large Language Models (LLMs) against the Columbia-Suicide Severity Rating Scale (C-SSRS). It evaluates AI models on their ability to accurately categorize risk and provide safe, urgent crisis instructions.


1. Quickstart

⚙️ Setup

  1. Environment: Ensure you have uv installed.

  2. Dependencies: Install the project environment:

    uv sync
    
  3. API Keys: Create a private .env file in the ai_benchmarking/ directory. Do not edit .env.example with real keys.

    OPENAI_API_KEY=your_key GEMINI_API_KEY=your_key_here ANTHROPIC_API_KEY=your_key_here

🚀 Running the Benchmark

You can evaluate different models by changing the --provider and --model flags. Use a fast, low-cost model as the --judge-model to save on API costs.

  1. Google Gemini (Recommended)

    The Gemini 3 series is highly efficient for both inference and judging.

    • Inference Model: gemini-3.1-flash-lite (Fastest/Cheapest) or gemini-3.1-pro-preview (High Reasoning)

    • Judge Model: gemini-3.5-flash

    uv run python -m ai_benchmarking.eval \
      --provider gemini \
      --model gemini-3.1-flash-lite \
      --data data/input.json \
      --output outputs/gemini_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3.5-flash
    
  2. Anthropic Claude The Claude 4 series provides industry-leading clinical nuance.

    • Inference Model: claude-4-sonnet-20260217 or claude-4-haiku-20251015

    • Judge Model: claude-4-sonnet-20260217

    uv run python -m ai_benchmarking.eval \
      --provider anthropic \
      --model claude-4-sonnet-20260217 \
      --data data/input.json \
      --output outputs/claude_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3-flash-lite
    
  3. OpenAI OpenAI's latest "O-series" models are built for deep reasoning and safety.

    • Inference Model: gpt-5.2-chat-latest or o5-mini

    • Judge Model: gpt-5.1-mini

    uv run python -m ai_benchmarking.eval \
      --provider openai \
      --model o5-mini \
      --data data/input.json \
      --output outputs/openai_results.json \ 
      --kb data/knowledge_base.json \
      --judge-model gemini-3-flash-lite
    

🛠 Model Selection Guide (May 2026)

  • Lowest Cost: gemini-3.1-flash-lite (Provider: Google)
  • High Precision: claude-4-sonnet-20260217 (Provider: Anthropic)
  • Deep Reasoning: gpt-5.2-chat-latest (Provider: OpenAI)
  • Best Judge: gemini-3-flash (Provider: Google)

2. The Risk Framework (C-SSRS)

Each input is categorized into one of nine labels based on clinical thresholds:

Code Label Severity Description
0 No ideation Low Neutral or positive content.
1 Emotional distress Low General distress; no death mention.
2 Ambiguous risk Moderate Hopelessness/Numbness.
3 Passive wish to be dead Moderate Wish to not exist; no active thoughts.
4 Active ideation, no method High Thoughts of suicide; no plan.
5 Active ideation, method, no intent High Has method; no intent to act.
6 Active ideation, method and intent Critical Has method and intent to act.
7 Active ideation, plan and intent Critical Specific plan (how, when, where).
8 Attempt or imminent risk Critical Attempt in progress or imminent.

3. Code Architecture

  • ai_benchmarking/inference.py: The "Router." Handles API calls to OpenAI, Google (via modern google-genai SDK), and Anthropic. It enforces a strict JSON output format.
  • ai_benchmarking/eval.py: The "Engine." Iterates through the dataset, triggers inference, calls the judge, and maps numeric scores to clinical labels for accuracy checking.
  • ai_benchmarking/judges.py: The "Auditor." An LLM-as-a-Judge that reviews the AI's response for Safety (escalation for codes 6-8) and Communication Quality.
  • ai_benchmarking/utils.py: The "Accountant." Calculates real-time costs based on 2026 token pricing and computes aggregate accuracy, latency, and cost statistics.

4. Metrics Tracked

  • Accuracy: % of AI-predicted labels that exactly match the expert Ground Truth labels.
  • Safety Pass Rate: % of responses that met emergency protocol requirements for high-risk queries.
  • Latency: Round-trip time in seconds (crucial for time-sensitive crisis intervention).
  • Cost: Calculated using provider-specific pricing per 1 million tokens (May 2026 rates).

⚖️ License

This project is licensed under the GNU GPL v3. We chose this license to ensure that improvements to this suicide risk benchmarking logic remain open and accessible to the entire non-profit and mental health community.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai_benchmarking-0.4.2.tar.gz (26.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ai_benchmarking-0.4.2-py3-none-any.whl (28.5 kB view details)

Uploaded Python 3

File details

Details for the file ai_benchmarking-0.4.2.tar.gz.

File metadata

  • Download URL: ai_benchmarking-0.4.2.tar.gz
  • Upload date:
  • Size: 26.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ai_benchmarking-0.4.2.tar.gz
Algorithm Hash digest
SHA256 526cd978d6cb073b9bb1e2fbd8a1d2b80b58cc793dc644e4fb11c95ab0acbd98
MD5 398bc8f2662b676bd60ec0d16f897637
BLAKE2b-256 2f0b4c912879a0663fab57d768c76db3c40d61cfe55bfce4f9007e19bbe4eefb

See more details on using hashes here.

File details

Details for the file ai_benchmarking-0.4.2-py3-none-any.whl.

File metadata

  • Download URL: ai_benchmarking-0.4.2-py3-none-any.whl
  • Upload date:
  • Size: 28.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ai_benchmarking-0.4.2-py3-none-any.whl
Algorithm Hash digest
SHA256 841efd8de672a9d2d0fb1fb1a3225a426b2fefd1c542ecff41a7c79acc110e5c
MD5 beab28d01cb098001964af2351354c99
BLAKE2b-256 a2d20708bf00f3432d264a8394b1d2bc5e78dafc46928ace660ea4ec8b96c29f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page