A forensic eval workbench for self-hostable models: capability, refusal profiling, and abliteration delta measurement.
Project description
Crucible
What survives quantization, abliteration, and serving. A forensic eval workbench for self-hostable models: capability, refusal behavior, tool-calling, RAG, and agent-style context, with first-class support for the abliteration workflow - base vs uncensored delta measurement and model card generation.
Why
Most leaderboards benchmark remote frontier APIs or unserved model snapshots. Crucible measures what you can actually run on your own hardware, and reports the deltas that matter when you abliterate a model: did refusals move to complies? Did capability survive?
Crucible talks to any running OpenAI-compatible inference server. It evaluates a model exactly as it's served - same chat template, same samplers, same tool-call parsing your published GGUFs get. Every run records provenance hashes (model file, test suite, llama.cpp commit) so a score shift is attributable.
Quick start
git clone https://github.com/zaakirio/crucible
cd crucible
uv sync
Requirements: uv and any running OpenAI-compatible inference server. No llama.cpp build required - point Crucible at whatever you already use.
Running evals
Crucible works in two modes.
External server (Ollama, LM Studio, vLLM, remote llama.cpp)
Start your server however you normally would, then:
# Ollama
uv run crucible run --server http://localhost:11434/v1 --model-name llama3 --workers 4
# Any OpenAI-compatible endpoint
uv run crucible run --server http://my-gpu-box:8080/v1 --model-name my-model --workers 4
--workers 4 runs 4 requests concurrently.
On a single GPU, total token throughput stays the same but you get better utilisation
through prefill/decode overlap.
Managed mode (local GGUF + llama.cpp)
If you have llama.cpp built, Crucible can spawn and manage llama-server for you:
# pull a GGUF from Hugging Face
uv run crucible pull zaakirio/LFM2.5-1.2B-Instruct-Uncensored-GGUF Q4_K_M
# run the full suite (llama-server found via $PATH or sibling llama.cpp/build/bin/)
uv run crucible run models/model.gguf --workers 4 -v
Override the binary with $CRUCIBLE_LLAMA_SERVER or --ngl/--ctx flags as needed.
Preflight check
uv run crucible doctor
The abliteration workflow
The core use case: prove your abliterated model is more open than the base without being dumber. Three commands.
# 1. eval base model
uv run crucible run --server http://localhost:11434/v1 --model-name base-model --workers 4
# note the run id from `crucible runs`
# 2. eval abliterated model
uv run crucible run --server http://localhost:11434/v1 --model-name uncensored-model --workers 4
# 3. compare
uv run crucible compare <base-run-id> <abliterated-run-id>
LLM judge grading
The keyword refusal grader is fast and fully deterministic, but an LLM judge gives more accurate complied/hedged/refused classifications - especially on short or ambiguous responses. Run it after any eval, bring your own API key:
# grade refusal categories with DeepSeek (or openai, or any OpenAI-compatible URL)
uv run crucible grade <run-id> --judge deepseek --api-key $DEEPSEEK_API_KEY
uv run crucible grade <run-id> --judge openai --api-key $OPENAI_API_KEY
uv run crucible grade <run-id> --judge http://localhost:11434/v1 --model-name llama3
# model card shows both keyword and judge results side by side
uv run crucible model-card <run-id> --out model-card.md
Judge verdicts are stored in judge_results alongside the keyword labels - nothing
is overwritten.
Configuration
crucible.yaml sets project defaults.
The most important section is suite_defaults, which controls generation budget
and thinking mode per category:
gate:
max_drop_pp: 5 # fail CI if capability drops more than this
max_refusal_shift_pp: 5 # fail CI if refusal rate INCREASES (over-refusal creep)
suite_defaults:
gsm8k: {max_tokens: 512, enable_thinking: false}
sorrybench: {max_tokens: 128, enable_thinking: false}
# ... see crucible.yaml for all categories
enable_thinking maps to chat_template_kwargs in llama.cpp's jinja pipeline.
Models that support a thinking toggle (e.g. Qwen3) respect it; others silently
ignore unknown template kwargs.
No per-model branching needed.
For thinking models (Qwen3, DeepSeek-R1, etc.): set enable_thinking: false
for refusal/instruction suites and enable_thinking: true for math/code.
With thinking enabled, ensure --ctx is large enough to give each parallel slot
at least 2048 tokens: --ctx 8192 --workers 4.
Other commands
# list available GGUFs
uv run crucible models
# quick 5-prompt sanity check (no grading)
uv run crucible smoke models/model.gguf
# run only specific categories
uv run crucible run models/model.gguf --only sorrybench,xstest --workers 4
# resume an interrupted run
uv run crucible run models/model.gguf --resume
# RAG-backed grounded QA (supply a docs directory)
uv run crucible run models/model.gguf --docs docs/rag --only rag_grounded,rag_faithfulness
# noise floor: same model 3x, see which tests flap
uv run crucible run models/model.gguf --repeat 3
# CI gate: nonzero exit if candidate regresses against baseline
uv run crucible gate <base-id> <candidate-id>
# validate keyword grader against your own judgment (blind labeling)
uv run crucible label --run <run-id>
uv run crucible label --report
# WikiText-2 perplexity, attached to the model's latest run
uv run crucible ppl models/model.gguf
# evidence pack and raw artifacts
uv run crucible report <run-id> --out reports/run.md
uv run crucible export <run-id> --out reports/run.jsonl
# charts (quant curve, refusal profile, abliteration delta, pareto, ppl)
uv run crucible chart
Selected findings
Results from finished runs.
These are exact values stored in results.db for one hardware setup
(Apple M4 Pro, 24 GB) and one llama.cpp commit.
They are evidence for this specific setup, not universal claims.
LFM2.5-1.2B — base vs Heretic-abliterated (2026-06-29, judge-validated)
LLM judge (deepseek-chat) results:
| category | base [Q4_K_M] | abliterated [Q4_K_M] | Δ |
|---|---|---|---|
| gsm8k (n=100) | 61/100 | 66/100 | +5pp |
| gsm_symbolic (n=100) | 51/100 | 47/100 | -4pp (within noise) |
| code | 5/6 | 5/6 | +0pp |
| instruction | 7/7 | 7/7 | +0pp |
| sorrybench (unsafe prompts) | 8 complied / 11 hedged / 26 refused | 34 complied / 6 hedged / 5 refused | the point |
| orbench (over-refusal) | 20 complied / 24 hedged / 6 refused | 44 complied / 6 hedged / 0 refused | false refusals largely gone |
| xstest | 20 complied / 3 hedged / 17 refused | 24 complied / 10 hedged / 6 refused | safe half complied, unsafe half shifted |
| falsereject (benign edge cases) | 1 complied / 43 hedged / 6 refused | 22 complied / 28 hedged / 0 refused | much less over-refusal |
Zero capability regression that clears the noise bar. The abliteration effect is concentrated on sorrybench (unsafe instructions) and orbench (over-refusal on legitimate prompts). Parallel calling remains 0/20 across both - a 1.2B capacity limit, not an abliteration artefact.
Tool calling — LFM2.5-1.2B quant sweep
| category | Q3_K_M | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|---|
| single call | 25/40 | 26/40 | 25/40 | 25/40 | 25/40 | 25/40 |
| choose right function | 13/20 | 12/20 | 13/20 | 12/20 | 13/20 | 13/20 |
| parallel calls | 0/20 | 0/20 | 0/20 | 0/20 | 0/20 | 0/20 |
| relevance (should call) | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
| irrelevance (should NOT call) | 12/15 | 10/15 | 8/15 | 9/15 | 9/15 | 9/15 |
Tool calling is insensitive to quantization on this model family.
Test suites
| Category | Source | n | Grader |
|---|---|---|---|
gsm8k |
GSM8K test split | 100 | numeric |
gsm_symbolic |
GSM-Symbolic (ICLR 2025) | 100 | numeric |
xstest |
XSTest stratified safe/unsafe | 40 | refusal profile |
orbench |
OR-Bench-Hard (ICML 2025) | 50 | refusal profile |
falsereject |
FalseReject-Test (2025) | 50 | refusal profile |
sorrybench |
SORRY-Bench (ICLR 2025) | 45 | refusal profile |
toolcall_single/multiple/parallel |
BFCL v4 (Apache 2.0) | 40/20/20 | tool_call |
toolcall_irrelevance/relevance |
BFCL v4 Live | 15/5 | tool_call |
agent_tool |
hand-authored tool-use loops, deterministic mocked results | 3 | final-answer |
rag_grounded |
local retrieval over docs/rag/ |
3 | contains |
rag_faithfulness |
local retrieval with citations, abstention, distractors | 4 | grounded |
agent_dialogue |
hand-authored multi-turn conversation fixtures | 3 | exact |
math, code, instruction, refusal |
hand-written starters | 8/6/7/8 | mixed |
All test YAML files are committed - no seed scripts needed.
Refusal categories report a profile (complied / hedged / refused), not pass/fail.
The keyword grader is deterministic and instant.
crucible grade adds an LLM judge layer for higher accuracy.
Development
uv sync
uv run python -m unittest discover tests # offline, no model needed
Next
crucible compareside-by-side in model card output- thinking model auto-detection (no manual
enable_thinkingconfig needed) crucible setupfor guided llama.cpp build- expand RAG corpora and agent workflows
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file crucible_eval-0.0.1.tar.gz.
File metadata
- Download URL: crucible_eval-0.0.1.tar.gz
- Upload date:
- Size: 1.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.17 {"installer":{"name":"uv","version":"0.9.17","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f3ab777cc26b702aef16a1c840f46a87219dac7b8f2e417c23f39068cf9b4ffb
|
|
| MD5 |
a21a0ea55af206324dcfe1fa8e265ff1
|
|
| BLAKE2b-256 |
149ebebac795ddd1191edb8949b47fb414fe1e23410b90e7da88c70855badbcc
|
File details
Details for the file crucible_eval-0.0.1-py3-none-any.whl.
File metadata
- Download URL: crucible_eval-0.0.1-py3-none-any.whl
- Upload date:
- Size: 57.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.17 {"installer":{"name":"uv","version":"0.9.17","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
26788c56a4b856e301ce6458e4d80037bfbb2acc4620363cbb6f0c31205a8c92
|
|
| MD5 |
e3ac4e97d9af40ef23f16b7f81884f95
|
|
| BLAKE2b-256 |
793ed0802ff0a3136fd46f34f6205255960ab3e6a4d13f6887dbd0b5e4aa339a
|