A benchmarking framework for evaluating LLM accuracy and safety in suicide risk assessment using the C-SSRS scale.

Project description

Suicide Risk Assessment AI Benchmark

This repository provides a standardized framework for benchmarking Large Language Models (LLMs) against the Columbia-Suicide Severity Rating Scale (C-SSRS). It evaluates AI models on their ability to accurately categorize risk and provide safe, urgent crisis instructions.

1. Quickstart

⚙️ Setup

Environment: Ensure you have uv installed.
Dependencies: Install the project environment:
```
uv sync
```
API Keys: Create a private .env file in the ai_benchmarking/ directory. Do not edit .env.example with real keys.

OPENAI_API_KEY=your_key GEMINI_API_KEY=your_key_here ANTHROPIC_API_KEY=your_key_here

🚀 Running the Benchmark

You can evaluate different models by changing the --provider and --model flags. Use a fast, low-cost model as the --judge-model to save on API costs.

Google Gemini (Recommended)

The Gemini 3 series is highly efficient for both inference and judging.

Inference Model: gemini-3.1-flash-lite (Fastest/Cheapest) or gemini-3.1-pro-preview (High Reasoning)
Judge Model: gemini-3.5-flash

uv run python -m ai_benchmarking.eval \
  --provider gemini \
  --model gemini-3.1-flash-lite \
  --data data/input.json \
  --output outputs/gemini_results.json \
  --kb data/knowledge_base.json \
  --judge-model gemini-3.5-flash

Anthropic Claude The Claude 4 series provides industry-leading clinical nuance.

Inference Model: claude-4-sonnet-20260217 or claude-4-haiku-20251015
Judge Model: claude-4-sonnet-20260217

uv run python -m ai_benchmarking.eval \
  --provider anthropic \
  --model claude-4-sonnet-20260217 \
  --data data/input.json \
  --output outputs/claude_results.json \
  --kb data/knowledge_base.json \
  --judge-model gemini-3-flash-lite

OpenAI OpenAI's latest "O-series" models are built for deep reasoning and safety.

Inference Model: gpt-5.2-chat-latest or o5-mini
Judge Model: gpt-5.1-mini

uv run python -m ai_benchmarking.eval \
  --provider openai \
  --model o5-mini \
  --data data/input.json \
  --output outputs/openai_results.json \ 
  --kb data/knowledge_base.json \
  --judge-model gemini-3-flash-lite

🛠 Model Selection Guide (May 2026)

Lowest Cost: gemini-3.1-flash-lite (Provider: Google)
High Precision: claude-4-sonnet-20260217 (Provider: Anthropic)
Deep Reasoning: gpt-5.2-chat-latest (Provider: OpenAI)
Best Judge: gemini-3-flash (Provider: Google)

2. The Risk Framework (C-SSRS)

Each input is categorized into one of nine labels based on clinical thresholds:

Code	Label	Severity	Description
0	No ideation	Low	Neutral or positive content.
1	Emotional distress	Low	General distress; no death mention.
2	Ambiguous risk	Moderate	Hopelessness/Numbness.
3	Passive wish to be dead	Moderate	Wish to not exist; no active thoughts.
4	Active ideation, no method	High	Thoughts of suicide; no plan.
5	Active ideation, method, no intent	High	Has method; no intent to act.
6	Active ideation, method and intent	Critical	Has method and intent to act.
7	Active ideation, plan and intent	Critical	Specific plan (how, when, where).
8	Attempt or imminent risk	Critical	Attempt in progress or imminent.

3. Code Architecture

ai_benchmarking/inference.py: The "Router." Handles API calls to OpenAI, Google (via modern google-genai SDK), and Anthropic. It enforces a strict JSON output format.
ai_benchmarking/eval.py: The "Engine." Iterates through the dataset, triggers inference, calls the judge, and maps numeric scores to clinical labels for accuracy checking.
ai_benchmarking/judges.py: The "Auditor." An LLM-as-a-Judge that reviews the AI's response for Safety (escalation for codes 6-8) and Communication Quality.
ai_benchmarking/utils.py: The "Accountant." Calculates real-time costs based on 2026 token pricing and computes aggregate accuracy, latency, and cost statistics.

4. Metrics Tracked

Accuracy: % of AI-predicted labels that exactly match the expert Ground Truth labels.
Safety Pass Rate: % of responses that met emergency protocol requirements for high-risk queries.
Latency: Round-trip time in seconds (crucial for time-sensitive crisis intervention).
Cost: Calculated using provider-specific pricing per 1 million tokens (May 2026 rates).

⚖️ License

This project is licensed under the GNU GPL v3. We chose this license to ensure that improvements to this suicide risk benchmarking logic remain open and accessible to the entire non-profit and mental health community.

Project details

Release history Release notifications | RSS feed

0.5.0

May 28, 2026

0.4.3

May 28, 2026

0.4.2

May 27, 2026

0.4.1

May 27, 2026

0.4.0

May 27, 2026

0.3.3

May 27, 2026

0.3.2

May 27, 2026

0.3.1

May 27, 2026

0.3.0

May 27, 2026

0.2.2

May 26, 2026

0.2.1

May 26, 2026

This version

0.2.0

May 26, 2026

0.1.0

May 25, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai_benchmarking-0.2.0.tar.gz (24.0 kB view details)

Uploaded May 26, 2026 Source

File details

Details for the file ai_benchmarking-0.2.0.tar.gz.

File metadata

Download URL: ai_benchmarking-0.2.0.tar.gz
Upload date: May 26, 2026
Size: 24.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for ai_benchmarking-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`f489667f7df864742bfc47b2bf8c8ed6a2a97ecdeee4f607d03b62e1ce3267e4`
MD5	`314d169751df56f160e9e354c24186cd`
BLAKE2b-256	`92125024697c10f473265bda91ae87cac0632b6f00bcc2cea7911621b9309d1b`

See more details on using hashes here.

ai-benchmarking 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Suicide Risk Assessment AI Benchmark

1. Quickstart

⚙️ Setup

🚀 Running the Benchmark

🛠 Model Selection Guide (May 2026)

2. The Risk Framework (C-SSRS)

3. Code Architecture

4. Metrics Tracked

⚖️ License

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes