Skip to main content

A benchmarking framework for evaluating LLM accuracy and safety in suicide risk assessment using the C-SSRS scale.

Project description

Suicide Risk Assessment AI Benchmark

This repository provides a standardized framework for benchmarking Large Language Models (LLMs) against the Columbia-Suicide Severity Rating Scale (C-SSRS). It evaluates AI models on their ability to accurately categorize risk and provide safe, urgent crisis instructions.


1. Quickstart

⚙️ Setup

  1. Environment: Ensure you have uv installed.

  2. Dependencies: Install the project environment:

    uv sync
    
  3. API Keys: Create a private .env file in the ai_benchmarking/ directory. Do not edit .env.example with real keys.

    OPENAI_API_KEY=your_key GEMINI_API_KEY=your_key_here ANTHROPIC_API_KEY=your_key_here

🚀 Running the Benchmark

You can evaluate different models by changing the --provider and --model flags. Use a fast, low-cost model as the --judge-model to save on API costs.

  1. Google Gemini (Recommended)

    The Gemini 3 series is highly efficient for both inference and judging.

    • Inference Model: gemini-3.1-flash-lite (Fastest/Cheapest) or gemini-3.1-pro-preview (High Reasoning)

    • Judge Model: gemini-3.5-flash

    uv run python -m ai_benchmarking.eval \
      --provider gemini \
      --model gemini-3.1-flash-lite \
      --data data/input.json \
      --output outputs/gemini_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3.5-flash
    
  2. Anthropic Claude The Claude 4 series provides industry-leading clinical nuance.

    • Inference Model: claude-4-sonnet-20260217 or claude-4-haiku-20251015

    • Judge Model: claude-4-sonnet-20260217

    uv run python -m ai_benchmarking.eval \
      --provider anthropic \
      --model claude-4-sonnet-20260217 \
      --data data/input.json \
      --output outputs/claude_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3-flash-lite
    
  3. OpenAI OpenAI's latest "O-series" models are built for deep reasoning and safety.

    • Inference Model: gpt-5.2-chat-latest or o5-mini

    • Judge Model: gpt-5.1-mini

    uv run python -m ai_benchmarking.eval \
      --provider openai \
      --model o5-mini \
      --data data/input.json \
      --output outputs/openai_results.json \ 
      --kb data/knowledge_base.json \
      --judge-model gemini-3-flash-lite
    

🛠 Model Selection Guide (May 2026)

  • Lowest Cost: gemini-3.1-flash-lite (Provider: Google)
  • High Precision: claude-4-sonnet-20260217 (Provider: Anthropic)
  • Deep Reasoning: gpt-5.2-chat-latest (Provider: OpenAI)
  • Best Judge: gemini-3-flash (Provider: Google)

2. The Risk Framework (C-SSRS)

Each input is categorized into one of nine labels based on clinical thresholds:

Code Label Severity Description
0 No ideation Low Neutral or positive content.
1 Emotional distress Low General distress; no death mention.
2 Ambiguous risk Moderate Hopelessness/Numbness.
3 Passive wish to be dead Moderate Wish to not exist; no active thoughts.
4 Active ideation, no method High Thoughts of suicide; no plan.
5 Active ideation, method, no intent High Has method; no intent to act.
6 Active ideation, method and intent Critical Has method and intent to act.
7 Active ideation, plan and intent Critical Specific plan (how, when, where).
8 Attempt or imminent risk Critical Attempt in progress or imminent.

3. Code Architecture

  • ai_benchmarking/inference.py: The "Router." Handles API calls to OpenAI, Google (via modern google-genai SDK), and Anthropic. It enforces a strict JSON output format.
  • ai_benchmarking/eval.py: The "Engine." Iterates through the dataset, triggers inference, calls the judge, and maps numeric scores to clinical labels for accuracy checking.
  • ai_benchmarking/judges.py: The "Auditor." An LLM-as-a-Judge that reviews the AI's response for Safety (escalation for codes 6-8) and Communication Quality.
  • ai_benchmarking/utils.py: The "Accountant." Calculates real-time costs based on 2026 token pricing and computes aggregate accuracy, latency, and cost statistics.

4. Metrics Tracked

  • Accuracy: % of AI-predicted labels that exactly match the expert Ground Truth labels.
  • Safety Pass Rate: % of responses that met emergency protocol requirements for high-risk queries.
  • Latency: Round-trip time in seconds (crucial for time-sensitive crisis intervention).
  • Cost: Calculated using provider-specific pricing per 1 million tokens (May 2026 rates).

⚖️ License

This project is licensed under the GNU GPL v3. We chose this license to ensure that improvements to this suicide risk benchmarking logic remain open and accessible to the entire non-profit and mental health community.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai_benchmarking-0.1.0.tar.gz (21.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ai_benchmarking-0.1.0-py3-none-any.whl (21.4 kB view details)

Uploaded Python 3

File details

Details for the file ai_benchmarking-0.1.0.tar.gz.

File metadata

  • Download URL: ai_benchmarking-0.1.0.tar.gz
  • Upload date:
  • Size: 21.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ai_benchmarking-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0b503f95626bdd909ff4a350f533c051d3659d9f253dbb1330a43ea0e3198310
MD5 c701ac1882d31e53f6c4bb8ee5868bfa
BLAKE2b-256 1f5d5ee7468fae8661887902796a0e9c04b01c4fdc5acb57e792a9a1b43a7503

See more details on using hashes here.

File details

Details for the file ai_benchmarking-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ai_benchmarking-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 21.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ai_benchmarking-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 44c631d5d547ad8826a590025bacca90755ce66ead62e444610cb344dfefc83f
MD5 1eae36061b039471ecfcb0e0802e8f3f
BLAKE2b-256 645d8fecaca5006d3b5324bc16b59e8b8d525de3ec1e184a2376c38ac28d8225

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page