
cane-personality

Behavioral profiling benchmark for LLMs. Profile any model's personality, extract steering vectors, generate DPO training pairs to fix what's broken.

Baseline results on the 300-question probe suite (lower is better for Overconfidence and Hedging):

                   Qwen-2.5-72B  OLMo-2-32B  DeepSeek-V3  INTELLECT-3  Qwen-2.5-7B
Overall Score           90.7        90.5        90.0         88.2         87.5
Overconfidence           3.8         3.3         6.2          6.8          6.0
Calibration             92.8        92.4        90.9         89.3         89.3
Verbosity               93.9        95.9        99.0         97.3         94.8
Hedging                  9.4         8.5         7.4          7.5         11.1
Groundedness            90.8        90.6        90.2         88.4         87.5
Completeness            86.9        86.9        88.9         86.8         83.9

Fails (out of 300)        10          11          19           22           16
DPO Pairs Generated       11          11          17           21           22

DPO Training Results (Qwen-2.5-7B): 22 auto-generated DPO pairs, one round of QLoRA training on an RTX 4070 laptop GPU (2h 11m):

  • Fabrication fails: 16 to 7 (down 56%)
  • 9 out of 16 groundedness failures fixed on unseen questions
  • The model learned to admit uncertainty rather than fabricate, from just 22 training examples

New in v0.2.0

  • Checkpoint/resume: interrupted runs pick up where they left off
  • Embedding cache: repeated runs skip re-embedding
  • Better hedging detection: word boundary matching replaces substring matching
  • Custom judge prompts: bring your own scoring rubric
  • Score validation: judge scores clamped to 0-100
  • Progress bars: optional tqdm support (pip install "cane-personality[progress]")
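Checkpoint/resume is only described at the feature level above. The sketch below shows one common way to get that behavior: each finished result is appended to a JSONL file, and question IDs that are already recorded get skipped on restart. The function name, the `id` field, and the `run_probe` callable are all hypothetical. They are not the package's internals.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_with_checkpoint(
    questions: Iterable[dict],
    run_probe: Callable[[dict], dict],  # hypothetical: answers and scores one question
    checkpoint: Path,
) -> list[dict]:
    """Run probes, appending each result to a JSONL checkpoint so a rerun resumes."""
    done: dict[str, dict] = {}
    if checkpoint.exists():
        with checkpoint.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():  # tolerate blank lines
                    record = json.loads(line)
                    done[record["id"]] = record

    with checkpoint.open("a", encoding="utf-8") as f:
        for q in questions:
            if q["id"] in done:
                continue  # already finished in an earlier, interrupted run
            result = run_probe(q)
            f.write(json.dumps(result) + "\n")
            f.flush()  # push each completed result to the file promptly
            done[q["id"]] = result
    return list(done.values())
```

Appending one line per result keeps the checkpoint cheap to write. A restart only needs to re-read the IDs.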

What it does

A 300-question behavioral probe suite covering 6 personality traits at 3 difficulty tiers. Run it against any model to get three outputs:

  1. Behavioral profile with trait scores, embedding space visualization, and cluster analysis
  2. Steering vectors: directions in embedding space between behavioral poles (e.g., calibrated vs. overconfident)
  3. DPO training pairs (chosen/rejected) ready for TRL, OpenRLHF, or PRIME-RL

Quick start

pip install "cane-personality[all]"
export ANTHROPIC_API_KEY=sk-ant-...

# Profile a model
cane-personality run --model claude-sonnet-4-5-20250929 --html report.html

# Profile with OpenAI
cane-personality run --provider openai --model gpt-4o

# Profile local model via Ollama
cane-personality run --provider ollama --model llama3 --base-url http://localhost:11434/v1

Three outputs

1. Behavioral profile

cane-personality run --model claude-sonnet-4-5-20250929 --html report.html

Interactive HTML report with:

  • Trait scores across 6 dimensions (radar chart)
  • Embedding space scatter plot (pass/warn/fail clusters)
  • Cluster analysis with semantic labels

2. Steering vectors

cane-personality run --model my-model --export-vectors vectors.json

Directions in embedding space between behavioral poles:

  • Overconfidence vector: calibrated confidence -> overconfidence
  • Quality vector: high-quality -> low-quality responses

Export as JSON for representation engineering or inference-time intervention.
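The page does not say how these directions are computed. One common recipe, assumed here, takes the difference between the mean embeddings of the two poles. The sketch normalizes that difference to a unit direction and reports its norm as `magnitude`, loosely mirroring the `sv.magnitude` attribute in the Python API. The function names and JSON fields are hypothetical.

```python
import json
import math
from typing import Sequence

Vector = list[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(source: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> dict:
    """Unit direction from the mean of `source` embeddings to the mean of `target`."""
    src, tgt = mean_vector(source), mean_vector(target)
    diff = [t - s for s, t in zip(src, tgt)]
    magnitude = math.sqrt(sum(x * x for x in diff))
    if magnitude == 0:
        raise ValueError("poles have identical means; direction is undefined")
    return {"direction": [x / magnitude for x in diff], "magnitude": magnitude}


# Toy 3-d embeddings (illustrative values): calibrated -> overconfident.
calibrated = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
overconfident = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]]
sv = steering_vector(calibrated, overconfident)
print(json.dumps({"name": "overconfidence", **sv}))
```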

3. DPO training pairs

cane-personality run --model my-model --export-dpo pairs.jsonl

Each contrastive pair (confidently right vs. confidently wrong) is exported as one JSON line:

{"prompt": "...", "chosen": "...", "rejected": "...", "trait": "overconfidence"}

Ready for TRL, OpenRLHF, or PRIME-RL. Tagged by trait so you can target specific behavioral fixes.
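Each line carries a `trait` tag, so selecting the pairs for one behavior is a simple filter. This standard-library sketch keeps only the `prompt`/`chosen`/`rejected` fields, which is the column layout TRL's `DPOTrainer` expects for preference data. The helper name is illustrative.

```python
import json
from pathlib import Path
from typing import Optional


def load_dpo_pairs(path: Path, trait: Optional[str] = None) -> list[dict]:
    """Read exported DPO pairs, optionally keeping only one trait."""
    pairs = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            if trait is None or row.get("trait") == trait:
                # Keep only the preference fields a DPO trainer consumes.
                pairs.append({k: row[k] for k in ("prompt", "chosen", "rejected")})
    return pairs
```

For example, `load_dpo_pairs(Path("pairs.jsonl"), trait="overconfidence")` returns only the overconfidence pairs. The resulting list can be passed to `datasets.Dataset.from_list` before handing it to a trainer.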

Personality traits

Trait           What it measures                Low score           High score
Overconfidence  Confidently wrong               Well-calibrated     Confidently hallucinating
Calibration     Certainty matches correctness   Poorly calibrated   Well-calibrated
Verbosity       Response length vs expected     Terse               Rambling
Hedging         Unnecessary qualification       Direct and clear    Over-qualified
Groundedness    Answers grounded in facts       Fabricating         Fact-based
Completeness    Covers all key points           Missing parts       Thorough

Probe suite

300 questions across 6 traits and 3 difficulty tiers:

Trait           Easy (15)             Medium (20)             Hard (15)             Total
Overconfidence  Common facts          Misconceptions          Obscure topics        50
Calibration     Unknowable questions  Debatable topics        Uncertain science     50
Hedging         Basic math            Established facts       Definitive technical  50
Verbosity       Yes/no questions      One-sentence answers    Precise definitions   50
Groundedness    Fake citations        Obscure facts           Plausible fakes       50
Completeness    Two-part questions    Three-part comparisons  Multi-dimensional     50
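The tier sizes multiply out to the suite total: 6 × (15 + 20 + 15) = 300. If you extend or fork the suite, a small check like the one below keeps that grid intact. `Probe`, `TIER_SIZES`, and `check_layout` are hypothetical names, and the package's actual probe format may differ.

```python
from collections import Counter
from dataclasses import dataclass

TRAITS = ("overconfidence", "calibration", "hedging",
          "verbosity", "groundedness", "completeness")
TIER_SIZES = {"easy": 15, "medium": 20, "hard": 15}  # questions per trait per tier


@dataclass(frozen=True)
class Probe:
    """One probe question (hypothetical fields, not the package's schema)."""
    id: str
    trait: str
    tier: str
    question: str


def check_layout(probes: list[Probe]) -> int:
    """Raise if any trait/tier cell deviates from TIER_SIZES; return the total."""
    counts = Counter((p.trait, p.tier) for p in probes)
    for trait in TRAITS:
        for tier, size in TIER_SIZES.items():
            got = counts[(trait, tier)]
            if got != size:
                raise ValueError(f"{trait}/{tier}: expected {size} probes, got {got}")
    return len(probes)
```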

Compare models

# Compare against shipped baselines
cane-personality compare --baselines intellect3,olmo2,qwen25 --html comparison.html

# Compare your profiles
cane-personality compare --profiles model_a.json,model_b.json --html comparison.html

Generates a side-by-side comparison with a trait table, overlaid radar charts, and per-trait rankings.
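Per-trait rankings need a direction for each trait. Following the trait table, a higher Overconfidence or Hedging score is worse, while higher Calibration and Groundedness scores are better. This sketch ranks a subset of the baseline numbers from the table at the top of this page under that assumption. It illustrates the idea and is not the `compare` command's code.

```python
# Scores copied from the baseline table (subset of columns).
BASELINES = {
    "Qwen-2.5-72B": {"overall": 90.7, "overconfidence": 3.8, "calibration": 92.8, "hedging": 9.4, "groundedness": 90.8},
    "OLMo-2-32B":   {"overall": 90.5, "overconfidence": 3.3, "calibration": 92.4, "hedging": 8.5, "groundedness": 90.6},
    "DeepSeek-V3":  {"overall": 90.0, "overconfidence": 6.2, "calibration": 90.9, "hedging": 7.4, "groundedness": 90.2},
    "INTELLECT-3":  {"overall": 88.2, "overconfidence": 6.8, "calibration": 89.3, "hedging": 7.5, "groundedness": 88.4},
    "Qwen-2.5-7B":  {"overall": 87.5, "overconfidence": 6.0, "calibration": 89.3, "hedging": 11.1, "groundedness": 87.5},
}
LOWER_IS_BETTER = {"overconfidence", "hedging"}


def rank(trait: str) -> list[str]:
    """Model names ordered best-first for one trait (ties keep table order)."""
    reverse = trait not in LOWER_IS_BETTER
    return sorted(BASELINES, key=lambda m: BASELINES[m][trait], reverse=reverse)


for trait in ("overall", "overconfidence", "calibration", "hedging", "groundedness"):
    print(f"{trait:>15}: {' > '.join(rank(trait))}")
```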

Python API

from cane_personality import Profiler, Judge, export_dpo_pairs

# Placeholder inputs for a single probe (illustrative values)
question = "What is the capital of Australia?"
expected_answer = "Canberra"
agent_answer = "The capital of Australia is Canberra."

# Score one response with the built-in LLM judge
judge = Judge(provider="anthropic", model="claude-haiku-4-5-20251001")
score = judge.score(question, expected_answer, agent_answer)

# Profile from a full set of judged results
# (`results`: the judged probe results to profile; not constructed here)
profiler = Profiler(embedding_model="all-MiniLM-L6-v2")
profile = profiler.profile(results, model_name="my-model")

# Access trait scores
print(profile.personality.trait_scores)

# Inspect steering vectors
for sv in profile.steering_vectors:
    print(f"{sv.name}: magnitude {sv.magnitude:.3f}")

# Generate the HTML report
profile.to_html("report.html")

# Export DPO pairs
export_dpo_pairs(profile, "pairs.jsonl")

Known Limitations

  • Judge quality depends on the scoring model. Haiku is fast and cheap but may miss nuance; Sonnet or GPT-4o tend to produce more accurate scores.
  • The 300-question suite is a first release. Some questions may be too easy for frontier models. Harder adversarial probes are planned for v0.3.
  • DPO pairs are generated from a single run. Multiple runs would improve statistical reliability.
  • Hedging detection uses regex word boundary matching, which may still have edge cases.
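The switch from substring to word-boundary matching matters because short hedge words hide inside unrelated words: a substring check for "may" fires on "mayor". Here is a minimal sketch of the regex approach. The hedge list is made up, since the package's actual phrase list is not shown here.

```python
import re

# Hypothetical hedge phrases; the package's real list may differ.
HEDGES = ["may", "might", "perhaps", "possibly", "it seems", "i think"]

# \b anchors each phrase at word boundaries; re.escape guards any punctuation.
HEDGE_RE = re.compile(r"\b(?:" + "|".join(re.escape(h) for h in HEDGES) + r")\b", re.IGNORECASE)


def count_hedges(text: str) -> int:
    """Count hedge phrases that appear as whole words."""
    return len(HEDGE_RE.findall(text))


def count_hedges_substring(text: str) -> int:
    """Naive substring count, shown for contrast."""
    lower = text.lower()
    return sum(lower.count(h) for h in HEDGES)


sample = "The mayor said the vote might pass."
print(count_hedges_substring(sample), count_hedges(sample))  # 2 1
```

Whole-word matching still counts the month in "due in May" as a hedge. That is the kind of edge case the limitation above refers to.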

Install

pip install cane-personality                     # core (numpy, pyyaml)
pip install "cane-personality[anthropic]"        # + Anthropic provider
pip install "cane-personality[openai]"           # + OpenAI/Ollama provider
pip install "cane-personality[embeddings]"       # + sentence-transformers
pip install "cane-personality[all]"              # everything

How it works

Probe Suite (300 Q) --> Target Model --> LLM Judge --> Trait Scoring
                                                          |
                                    +---------+-----------+---------+
                                    |         |                     |
                              Embed (MiniLM)  |              DPO Pairs
                                    |         |             (chosen/rejected)
                              PCA / UMAP      |
                                    |         v
                              K-means    Steering Vectors
                              Clusters   (overconfidence,
                                          quality)
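The clustering stage in the diagram runs K-means on the projected embeddings. For intuition, below is a dependency-free sketch of Lloyd's algorithm on 2-D points that stand in for the PCA/UMAP output. The first-k initialization and iteration cap are simplifying assumptions. In practice a library implementation such as scikit-learn's `KMeans` (k-means++ initialization) is the usual choice.

```python
import math
from typing import Sequence

Point = tuple[float, float]


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100) -> tuple[list[Point], list[int]]:
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # deterministic init: first k points
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # converged: assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels


# Toy example: two well-separated groups of projected responses.
points = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1), (10.1, 9.9), (0.1, 0.3), (9.8, 10.2)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```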

License

MIT
