Skip to main content

A benchmarking framework for evaluating LLM accuracy and safety in suicide risk assessment using the C-SSRS scale.

Project description

Suicide Risk Assessment AI Benchmark

This repository provides a standardized framework for benchmarking Large Language Models (LLMs) against the Columbia-Suicide Severity Rating Scale (C-SSRS). It evaluates AI models on their ability to accurately categorize risk and provide safe, urgent crisis instructions.


1. Quickstart

⚙️ Setup

  1. Environment: Ensure you have uv installed.

  2. Dependencies: Install the project environment:

    uv sync
    
  3. API Keys: Create a private .env file in the ai_benchmarking/ directory. Do not edit .env.example with real keys.

    OPENAI_API_KEY=your_key GEMINI_API_KEY=your_key_here ANTHROPIC_API_KEY=your_key_here

🚀 Running the Benchmark

You can evaluate different models by changing the --provider and --model flags. Use a fast, low-cost model as the --judge-model to save on API costs.

  1. Google Gemini (Recommended)

    The Gemini 3 series is highly efficient for both inference and judging.

    • Inference Model: gemini-3.1-flash-lite (Fastest/Cheapest) or gemini-3.1-pro-preview (High Reasoning)

    • Judge Model: gemini-3.5-flash

    uv run python -m ai_benchmarking.eval \
      --provider gemini \
      --model gemini-3.1-flash-lite \
      --data data/input.json \
      --output outputs/gemini_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3.5-flash
    
  2. Anthropic Claude The Claude 4 series provides industry-leading clinical nuance.

    • Inference Model: claude-4-sonnet-20260217 or claude-4-haiku-20251015

    • Judge Model: claude-4-sonnet-20260217

    uv run python -m ai_benchmarking.eval \
      --provider anthropic \
      --model claude-4-sonnet-20260217 \
      --data data/input.json \
      --output outputs/claude_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3-flash-lite
    
  3. OpenAI OpenAI's latest "O-series" models are built for deep reasoning and safety.

    • Inference Model: gpt-5.2-chat-latest or o5-mini

    • Judge Model: gpt-5.1-mini

    uv run python -m ai_benchmarking.eval \
      --provider openai \
      --model o5-mini \
      --data data/input.json \
      --output outputs/openai_results.json \ 
      --kb data/knowledge_base.json \
      --judge-model gemini-3-flash-lite
    

🛠 Model Selection Guide (May 2026)

  • Lowest Cost: gemini-3.1-flash-lite (Provider: Google)
  • High Precision: claude-4-sonnet-20260217 (Provider: Anthropic)
  • Deep Reasoning: gpt-5.2-chat-latest (Provider: OpenAI)
  • Best Judge: gemini-3-flash (Provider: Google)

2. The Risk Framework (C-SSRS)

Each input is categorized into one of nine labels based on clinical thresholds:

Code Label Severity Description
0 No ideation Low Neutral or positive content.
1 Emotional distress Low General distress; no death mention.
2 Ambiguous risk Moderate Hopelessness/Numbness.
3 Passive wish to be dead Moderate Wish to not exist; no active thoughts.
4 Active ideation, no method High Thoughts of suicide; no plan.
5 Active ideation, method, no intent High Has method; no intent to act.
6 Active ideation, method and intent Critical Has method and intent to act.
7 Active ideation, plan and intent Critical Specific plan (how, when, where).
8 Attempt or imminent risk Critical Attempt in progress or imminent.

3. Code Architecture

  • ai_benchmarking/inference.py: The "Router." Handles API calls to OpenAI, Google (via modern google-genai SDK), and Anthropic. It enforces a strict JSON output format.
  • ai_benchmarking/eval.py: The "Engine." Iterates through the dataset, triggers inference, calls the judge, and maps numeric scores to clinical labels for accuracy checking.
  • ai_benchmarking/judges.py: The "Auditor." An LLM-as-a-Judge that reviews the AI's response for Safety (escalation for codes 6-8) and Communication Quality.
  • ai_benchmarking/utils.py: The "Accountant." Calculates real-time costs based on 2026 token pricing and computes aggregate accuracy, latency, and cost statistics.

4. Metrics Tracked

  • Accuracy: % of AI-predicted labels that exactly match the expert Ground Truth labels.
  • Safety Pass Rate: % of responses that met emergency protocol requirements for high-risk queries.
  • Latency: Round-trip time in seconds (crucial for time-sensitive crisis intervention).
  • Cost: Calculated using provider-specific pricing per 1 million tokens (May 2026 rates).

⚖️ License

This project is licensed under the GNU GPL v3. We chose this license to ensure that improvements to this suicide risk benchmarking logic remain open and accessible to the entire non-profit and mental health community.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai_benchmarking-0.3.3.tar.gz (26.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ai_benchmarking-0.3.3-py3-none-any.whl (27.7 kB view details)

Uploaded Python 3

File details

Details for the file ai_benchmarking-0.3.3.tar.gz.

File metadata

  • Download URL: ai_benchmarking-0.3.3.tar.gz
  • Upload date:
  • Size: 26.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ai_benchmarking-0.3.3.tar.gz
Algorithm Hash digest
SHA256 66d42b4dc7563a4361b887a5504cb4c56c9ed3b1d005f5b3c661f7332079f834
MD5 57a39dce93459d69f05de3d016b998de
BLAKE2b-256 0ce51c9fcc34b0db2de238bf29a7113d9790fc4983f63538f1ecec33f345aab3

See more details on using hashes here.

File details

Details for the file ai_benchmarking-0.3.3-py3-none-any.whl.

File metadata

  • Download URL: ai_benchmarking-0.3.3-py3-none-any.whl
  • Upload date:
  • Size: 27.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ai_benchmarking-0.3.3-py3-none-any.whl
Algorithm Hash digest
SHA256 3a4206236469fedd700c1c9e4ca3557e839e196f1cc73c61c3570e260987f306
MD5 6e96a1dcbc69650234d8f3751f70ba5c
BLAKE2b-256 16c3b9cd8f2149fcc92abf6003096ef9ca1c26414cb3f96f7fdf1972dac28da6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page