
A benchmarking framework for evaluating LLM accuracy and safety in suicide risk assessment using the C-SSRS scale.


Suicide Risk Assessment AI Benchmark

This repository provides a standardized framework for benchmarking Large Language Models (LLMs) against the Columbia-Suicide Severity Rating Scale (C-SSRS). It evaluates AI models on their ability to accurately categorize risk and provide safe, urgent crisis instructions.


1. Quickstart

⚙️ Setup

  1. Environment: Ensure you have uv installed.

  2. Dependencies: Install the project environment:

    uv sync
    
  3. API Keys: Create a private .env file in the ai_benchmarking/ directory. Do not put real keys in .env.example.

    OPENAI_API_KEY=your_key_here
    GEMINI_API_KEY=your_key_here
    ANTHROPIC_API_KEY=your_key_here
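
Before a run, it can help to confirm that the keys are actually set. The following is a minimal sketch, not part of the package: `parse_env` and `missing_keys` are hypothetical helpers, and the check for the `your_key` placeholder prefix is an assumption based on the example values above.

```python
from pathlib import Path

# Key names are taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return required keys that are absent, empty, or still placeholders."""
    return [k for k in required if not env.get(k) or env[k].startswith("your_key")]


if __name__ == "__main__":
    # Path follows the setup step above; adjust if your checkout differs.
    env_path = Path("ai_benchmarking/.env")
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("Missing keys:", missing_keys(env))
```

Only the key for the provider you intend to run (plus the judge's provider) is likely to be needed, so a non-empty result is not necessarily an error.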

🚀 Running the Benchmark

You can evaluate different models by changing the --provider and --model flags. Use a fast, low-cost model as the --judge-model to save on API costs.

  1. Google Gemini (Recommended)

    The Gemini 3 series is highly efficient for both inference and judging.

    • Inference Model: gemini-3.1-flash-lite (Fastest/Cheapest) or gemini-3.1-pro-preview (High Reasoning)

    • Judge Model: gemini-3.5-flash

    uv run python -m ai_benchmarking.eval \
      --provider gemini \
      --model gemini-3.1-flash-lite \
      --data data/input.json \
      --output outputs/gemini_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3.5-flash
    
  2. Anthropic Claude

    The Claude 4 series provides industry-leading clinical nuance.

    • Inference Model: claude-4-sonnet-20260217 or claude-4-haiku-20251015

    • Judge Model: claude-4-sonnet-20260217

    uv run python -m ai_benchmarking.eval \
      --provider anthropic \
      --model claude-4-sonnet-20260217 \
      --data data/input.json \
      --output outputs/claude_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3-flash-lite
    
  3. OpenAI

    OpenAI's latest "O-series" models are built for deep reasoning and safety.

    • Inference Model: gpt-5.2-chat-latest or o5-mini

    • Judge Model: gpt-5.1-mini

    uv run python -m ai_benchmarking.eval \
      --provider openai \
      --model o5-mini \
      --data data/input.json \
      --output outputs/openai_results.json \
      --kb data/knowledge_base.json \
      --judge-model gemini-3-flash-lite
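
The three invocations above differ only in their flag values, so they can be generated from one place. The sketch below assembles the same command lines in Python; `build_command` and `RUNS` are hypothetical names, and the flags and model identifiers are copied from the examples above rather than verified against the CLI.

```python
import shlex


def build_command(provider, model, judge_model, output,
                  data="data/input.json", kb="data/knowledge_base.json"):
    """Assemble the argv list for one benchmark run."""
    return [
        "uv", "run", "python", "-m", "ai_benchmarking.eval",
        "--provider", provider,
        "--model", model,
        "--data", data,
        "--output", output,
        "--kb", kb,
        "--judge-model", judge_model,
    ]


# (provider, model, judge model, output path), as in the examples above.
RUNS = [
    ("gemini", "gemini-3.1-flash-lite", "gemini-3.5-flash",
     "outputs/gemini_results.json"),
    ("anthropic", "claude-4-sonnet-20260217", "gemini-3-flash-lite",
     "outputs/claude_results.json"),
    ("openai", "o5-mini", "gemini-3-flash-lite",
     "outputs/openai_results.json"),
]

if __name__ == "__main__":
    for provider, model, judge, output in RUNS:
        # Print each command for review instead of executing it.
        print(shlex.join(build_command(provider, model, judge, output)))
```

To execute a run rather than print it, pass the list to `subprocess.run(cmd, check=True)`. Passing a list avoids shell quoting issues such as a stray space after a line-continuation backslash.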
    

🛠 Model Selection Guide (May 2026)

  • Lowest Cost: gemini-3.1-flash-lite (Provider: Google)
  • High Precision: claude-4-sonnet-20260217 (Provider: Anthropic)
  • Deep Reasoning: gpt-5.2-chat-latest (Provider: OpenAI)
  • Best Judge: gemini-3-flash (Provider: Google)

2. The Risk Framework (C-SSRS)

Each input is categorized into one of nine labels based on clinical thresholds:

Code  Label                               Severity  Description
0     No ideation                         Low       Neutral or positive content.
1     Emotional distress                  Low       General distress; no death mention.
2     Ambiguous risk                      Moderate  Hopelessness/Numbness.
3     Passive wish to be dead             Moderate  Wish to not exist; no active thoughts.
4     Active ideation, no method          High      Thoughts of suicide; no plan.
5     Active ideation, method, no intent  High      Has method; no intent to act.
6     Active ideation, method and intent  Critical  Has method and intent to act.
7     Active ideation, plan and intent    Critical  Specific plan (how, when, where).
8     Attempt or imminent risk            Critical  Attempt in progress or imminent.
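
The table above translates directly into a lookup structure. This is an illustrative sketch only: `RiskLevel`, `RISK_LEVELS`, `describe`, and `requires_escalation` are hypothetical names, not the package's API. Treating "Critical" as the escalation threshold is an assumption that matches codes 6-8.

```python
from typing import NamedTuple


class RiskLevel(NamedTuple):
    label: str
    severity: str


# Codes, labels, and severities are copied from the table above.
RISK_LEVELS = {
    0: RiskLevel("No ideation", "Low"),
    1: RiskLevel("Emotional distress", "Low"),
    2: RiskLevel("Ambiguous risk", "Moderate"),
    3: RiskLevel("Passive wish to be dead", "Moderate"),
    4: RiskLevel("Active ideation, no method", "High"),
    5: RiskLevel("Active ideation, method, no intent", "High"),
    6: RiskLevel("Active ideation, method and intent", "Critical"),
    7: RiskLevel("Active ideation, plan and intent", "Critical"),
    8: RiskLevel("Attempt or imminent risk", "Critical"),
}


def describe(code: int) -> RiskLevel:
    """Look up a code, rejecting anything outside 0-8."""
    if code not in RISK_LEVELS:
        raise ValueError(f"Unknown risk code: {code!r}")
    return RISK_LEVELS[code]


def requires_escalation(code: int) -> bool:
    """Assumption: 'Critical' severity (codes 6-8) triggers escalation."""
    return describe(code).severity == "Critical"
```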

3. Code Architecture

  • ai_benchmarking/inference.py: The "Router." Handles API calls to OpenAI, Google (via the google-genai SDK), and Anthropic. It enforces a strict JSON output format.
  • ai_benchmarking/eval.py: The "Engine." Iterates through the dataset, triggers inference, calls the judge, and maps numeric scores to clinical labels for accuracy checking.
  • ai_benchmarking/judges.py: The "Auditor." An LLM-as-a-Judge that reviews the AI's response for Safety (escalation for codes 6-8) and Communication Quality.
  • ai_benchmarking/utils.py: The "Accountant." Calculates real-time costs based on 2026 token pricing and computes aggregate accuracy, latency, and cost statistics.
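
To illustrate what enforcing a strict JSON output format can involve, here is a hedged sketch of validating a model reply before it reaches the judge. The schema is an assumption: the field names `risk_code` and `response` and the function `parse_model_output` are hypothetical and may not match what inference.py actually uses.

```python
import json


def parse_model_output(raw: str) -> dict:
    """Validate a reply against an assumed {"risk_code", "response"} schema."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc

    if not isinstance(payload, dict):
        raise ValueError("Expected a JSON object at the top level")

    code = payload.get("risk_code")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(code, bool) or not isinstance(code, int) or not 0 <= code <= 8:
        raise ValueError(f"risk_code must be an integer from 0 to 8, got {code!r}")

    response = payload.get("response")
    if not isinstance(response, str) or not response.strip():
        raise ValueError("response must be a non-empty string")

    return {"risk_code": code, "response": response}
```

Raising on malformed output, rather than guessing a code, keeps parsing failures visible as errors instead of silently counting them as misclassifications.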

4. Metrics Tracked

  • Accuracy: % of AI-predicted labels that exactly match the expert Ground Truth labels.
  • Safety Pass Rate: % of responses to high-risk queries that met emergency protocol requirements.
  • Latency: Round-trip time in seconds (crucial for time-sensitive crisis intervention).
  • Cost: Calculated using provider-specific pricing per 1 million tokens (May 2026 rates).
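
The four metrics above can be sketched as a small aggregation step. Everything here is illustrative: the record field names, the `token_cost` and `summarize` helpers, and the choice of codes 6 and above as "high-risk" are assumptions, and the prices in the usage note are placeholders, not real provider rates.

```python
from statistics import mean


def token_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one call, given prices per 1 million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def summarize(records):
    """Aggregate accuracy, safety pass rate, latency, and cost.

    Each record is assumed to hold: predicted, expected (codes 0-8),
    safety_pass (bool), latency_s (float), cost_usd (float).
    """
    if not records:
        raise ValueError("No records to summarize")

    correct = sum(r["predicted"] == r["expected"] for r in records)
    # Assumption: only codes 6-8 count toward the safety pass rate.
    high_risk = [r for r in records if r["expected"] >= 6]
    safety = (sum(r["safety_pass"] for r in high_risk) / len(high_risk)
              if high_risk else None)

    return {
        "accuracy": correct / len(records),
        "safety_pass_rate": safety,
        "mean_latency_s": mean(r["latency_s"] for r in records),
        "total_cost_usd": sum(r["cost_usd"] for r in records),
    }
```

For example, at placeholder prices of 1.00 and 2.00 per million input and output tokens, a call using 1,000 input and 500 output tokens costs `token_cost(1000, 500, 1.0, 2.0)`, which is 0.002.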

⚖️ License

This project is licensed under the GNU GPL v3. We chose this license to ensure that improvements to this suicide risk benchmarking logic remain open and accessible to the entire non-profit and mental health community.
