Dual-layer scoring framework combining automated checks (left brain) with manual qualitative grading (right brain) and reconciliation.

Maintainers

Swiftwing21

Project description

ScoreRift

The product isn't the scoring. It's watching the gap.

Automated checks exist everywhere. Manual reviews exist everywhere. What doesn't exist is a system that runs both continuously, compares them over time, and alerts you the moment they disagree.

That disagreement — the divergence — is where the real information lives.

  LEFT BRAIN (Auto)              RIGHT BRAIN (Manual)
  ─────────────────              ────────────────────
  pytest pass rate    ──┐    ┌── Human grade (A)
  ruff lint score     ──┤    ├── LLM review findings
  semgrep scan        ──┤    ├── User feedback (4.2/5)
  endpoint health     ──┘    └── Team notes
                        │    │
                        ▼    ▼
                   ┌───────────┐
                   │ THE GAP   │  <── this is the product
                   └─────┬─────┘
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
           Aligned    DIVERGED    Failing
           (quiet)    (signal)     (alarm)

When both brains agree, everything's fine — move on. When they disagree, something interesting just happened: either reality changed and the reviewer hasn't caught up, or the reviewer sees something the automation can't. Either way, that gap is worth investigating.

What Divergence Actually Catches

What happened	Auto says	Manual says	Gap means
Tests pass but manual grade went stale	A	C+ (stale)	Auto is right. The manual grade expired; the reviewer hasn't looked since the refactor.
Reviewer bumped grade after "looks good" review	B+	A	Manual is optimistic. Auto sees real issues the reviewer glossed over.
Security vuln in dependency	D	A	Auto caught it. Manual grade was set before the CVE dropped.
"Feels slow" but metrics are fine	A	B-	Manual is right. Users feel something automation can't measure.
Big refactor, nothing broke	A	B (cautious)	Auto is right. Reviewer is still nervous from the refactor but tests confirm it's solid.
Compliance review happened	B+	A (with notes)	Manual is right. External auditor validated things auto can't check.

None of these are caught by running either brain alone. The signal is in the disagreement.

Quick Start

pip install scorerift

scorerift init                      # create DB + baseline sidecar
scorerift register --preset python  # 8 dimensions for Python projects
scorerift run light                 # first scan (~2s)
scorerift status                    # view scores + divergences

Dimension                  Auto   Grade  Manual  Status
-----------------------------------------------------------------
  test_coverage            0.930      A      —   ok
  lint_score               1.000      S      —   ok
  security                 0.850     A-     A+   DIVERGED
  type_coverage            0.720     B-      —   ok

Overall: A- (0.876)    Divergences: 1

That DIVERGED on security is the system working. Auto scored 0.85 (A-), but someone manually graded it A+. Who's wrong? That's the conversation worth having.

How It Works

Divergence detection fires when |auto_score - manual_score| > 0.15 AND auto-confidence is above 50%. Low-confidence dimensions (like UX at 30%) can't trigger divergence — the system doesn't argue with humans until it has enough data to form a real opinion.
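
The rule above can be sketched in a few lines. This is a minimal illustration, not ScoreRift's actual implementation: `Reading` and `is_diverged` are hypothetical names, and the manual grade is assumed to be already converted to a 0-1 score. Only the 0.15 gap and the 50% confidence floor come from the text.

```python
from dataclasses import dataclass
from typing import Optional

GAP_THRESHOLD = 0.15   # gap size stated in the text
MIN_CONFIDENCE = 0.50  # auto-confidence floor stated in the text


@dataclass
class Reading:
    """One dimension's scores (hypothetical structure for illustration)."""
    name: str
    auto_score: float
    manual_score: Optional[float]  # None when no manual grade exists
    confidence: float              # auto-confidence, 0-1


def is_diverged(reading: Reading) -> bool:
    """Return True when auto and manual disagree enough to raise a signal."""
    if reading.manual_score is None:
        return False  # nothing to compare against
    if reading.confidence <= MIN_CONFIDENCE:
        return False  # auto does not have enough data to argue
    return abs(reading.auto_score - reading.manual_score) > GAP_THRESHOLD


readings = [
    Reading("security", 0.60, 0.90, confidence=0.60),
    Reading("ux", 0.30, 0.90, confidence=0.30),
    Reading("lint_score", 0.80, 0.90, confidence=0.90),
]
for r in readings:
    print(r.name, "DIVERGED" if is_diverged(r) else "ok")
```

In this sketch the `ux` reading has the largest gap but stays quiet, because its confidence is below the floor.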

Three resolution paths:

- Update manual grade: you agree with auto, fix the sidecar
- Acknowledge: you disagree with auto, dismiss the alert (visible but dimmed)
- Re-audit: run a deeper tier or request an LLM review for a second opinion

Ratchet rules prevent silent regression: set a floor grade, and if the score drops below it, the system flags it. Ratchets are advisory — they warn, not block.
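
An advisory ratchet of this kind might look like the sketch below. It is an assumption-laden illustration: `ratchet_warnings` is a hypothetical helper, and floors are expressed as 0-1 scores rather than letter grades to keep the example self-contained.

```python
from typing import Dict, List


def ratchet_warnings(scores: Dict[str, float], floors: Dict[str, float]) -> List[str]:
    """Return one warning per dimension whose score fell below its floor.

    Advisory only: the function reports regressions and never raises.
    """
    warnings = []
    for name, floor in floors.items():
        score = scores.get(name)
        if score is not None and score < floor:
            warnings.append(f"{name}: {score:.3f} is below ratchet floor {floor:.3f}")
    return warnings


current = {"security": 0.70, "lint_score": 1.00}
floors = {"security": 0.80, "lint_score": 0.90}
for line in ratchet_warnings(current, floors):
    print("WARNING", line)
```

Because the result is a list of messages instead of an exception, the caller decides whether a regression should fail a build or only show up in a report.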

Six defense layers prevent the system from lying to you:

Layer	What it catches
Functional test scoring	Grade inflation (scores from tests, not file existence)
Grade expiry	Stale optimism (old manual grades display as expired)
Cross-validation	Optimistic reviewers (divergence when manual >> auto)
Git diff detection	Silent drift (code changed since last manual review)
External scanners	Blind spots (semgrep, PyPI drift — independent signals)
Ratchet rules	Backsliding (a score that drops below its floor is flagged until the floor is explicitly edited)
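
As an illustration of the grade expiry layer, the sketch below marks a manual grade as expired once it is older than a maximum age. The 30-day window and the `is_expired` helper are hypothetical; the text does not state how long a manual grade stays valid.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical validity window; the real value is not given in the text.
MAX_GRADE_AGE = timedelta(days=30)


def is_expired(graded_at: datetime, now: datetime,
               max_age: timedelta = MAX_GRADE_AGE) -> bool:
    """Return True when a manual grade is older than max_age."""
    return now - graded_at > max_age


now = datetime(2026, 4, 3, tzinfo=timezone.utc)
recent = datetime(2026, 3, 20, tzinfo=timezone.utc)
stale = datetime(2026, 1, 15, tzinfo=timezone.utc)
print(is_expired(recent, now))  # graded 14 days earlier
print(is_expired(stale, now))   # graded 78 days earlier
```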

The Python Preset (8 real checks, 0 stubs)

Dimension	What it runs	Confidence
test_coverage	`pytest` pass rate	95%
lint_score	`ruff check` error count	90%
type_coverage	`mypy` error count	85%
dep_freshness	`pip list --outdated`	85%
doc_coverage	AST docstring ratio + README existence	80%
security	semgrep SAST or ruff S rules (fallback)	60%
complexity	radon or AST McCabe analysis (fallback)	80%
import_hygiene	`ruff check --select I,F401`	85%

Every check handles missing tools gracefully (returns 0.5 with a note, not a crash). Confidence determines how much weight divergence detection gives each dimension.
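
The same fallback can be applied to your own checks. The wrapper below is a sketch under stated assumptions: `guard_missing_tool` is a hypothetical helper, not part of the ScoreRift API, and it only mirrors the behavior described above (a neutral 0.5 plus a note when the tool is absent).

```python
import shutil
from typing import Callable, Tuple

Check = Callable[[], Tuple[float, dict]]

NEUTRAL_SCORE = 0.5  # value the text says is returned when a tool is missing


def guard_missing_tool(tool: str, check: Check) -> Check:
    """Wrap a check so a missing CLI tool yields a neutral score, not a crash."""
    def wrapped() -> Tuple[float, dict]:
        if shutil.which(tool) is None:  # tool is not on PATH
            return NEUTRAL_SCORE, {"note": f"{tool} not found on PATH"}
        return check()
    return wrapped


def needs_tool() -> Tuple[float, dict]:
    # Stand-in for a real check; it is never called when the tool is missing.
    raise RuntimeError("tool invocation would happen here")


safe_check = guard_missing_tool("no-such-tool-scorerift-demo", needs_tool)
print(safe_check())
```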

Wrap Your Existing Tools

ScoreRift doesn't replace your tooling. It sits on top of it.

A dimension's check is just Callable[[], tuple[float, dict]]. That callable can hit any API, parse any CLI output, or query any system. The framework doesn't care where the score comes from — it just needs a number between 0 and 1, and a detail dict.

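import json
import subprocess

import requests

# SONAR_URL, PROJECT, DD_URL, SLO_ID, and KEY are placeholders for your own configuration.

# Wrap SonarQube's quality gate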
def sonarqube_gate():
    resp = requests.get(f"{SONAR_URL}/api/qualitygates/project_status",
                        params={"projectKey": PROJECT}, timeout=10)
    data = resp.json()["projectStatus"]
    return (1.0 if data["status"] == "OK" else 0.4, data)

# Wrap Datadog SLO
def datadog_slo():
    resp = requests.get(f"{DD_URL}/api/v1/slo/{SLO_ID}",
                        headers={"DD-API-KEY": KEY}, timeout=10)
    sli = resp.json()["data"]["overall_status"][0]["sli_value"]
    return (sli / 100.0, {"sli": sli})

# Wrap any CLI tool
def pylint_score():
    result = subprocess.run(["pylint", "src/", "--output-format=json"],
                            capture_output=True, text=True, timeout=120)
    data = json.loads(result.stdout)
    score = max(0.0, (10 - len(data)) / 10)  # normalize to 0-1
    return (score, {"issues": len(data)})

engine.register(Dimension(name="sonarqube", check=sonarqube_gate, confidence=0.9, tier=Tier.DAILY))
engine.register(Dimension(name="slo_compliance", check=datadog_slo, confidence=0.95, tier=Tier.LIGHT))
engine.register(Dimension(name="pylint", check=pylint_score, confidence=0.85, tier=Tier.MEDIUM))

This reframes the entire project: not "alternative to SonarQube" but "the layer that watches whether SonarQube and your team's manual assessment still agree." Use the presets to get started, then wire in whatever tools your team already runs.
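
When wiring in third-party tools, it can help to enforce the check contract (a score between 0 and 1 plus a detail dict) before registering a callable. The decorator below is a hypothetical helper, not something ScoreRift is known to ship.

```python
from typing import Callable, Tuple

Check = Callable[[], Tuple[float, dict]]


def validated(check: Check) -> Check:
    """Wrap a check and fail loudly if it breaks the (score, detail) contract."""
    def wrapped() -> Tuple[float, dict]:
        score, detail = check()
        if not isinstance(detail, dict):
            raise TypeError("detail must be a dict")
        score = float(score)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside [0, 1]")
        return score, detail
    return wrapped


# A percentage that was never divided by 100 is caught before it is stored.
good = validated(lambda: (0.93, {"passed": 93, "total": 100}))
bad = validated(lambda: (93, {"passed": 93, "total": 100}))
print(good())
try:
    bad()
except ValueError as exc:
    print("rejected:", exc)
```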

LLM-Powered Reviews (The Right Brain On Demand)

The manual grade doesn't have to come from you. Point an LLM at any dimension and get a structured review with cross-validated findings.

Three review modes:

Mode	What it does	API calls	Best for
Single	One provider, one pass	1	Quick sanity check
Swarm	One provider, 4 specialized lenses (security auditor, performance engineer, architect, compliance auditor) — findings cross-validated	4	Deep single-dimension review
Consensus	Same prompt to Claude + Gemini + OpenAI, compare scores	1 per provider	When you want multiple opinions

Cost scales with context size. The system prompt is ~200 tokens. Your cost is driven by how much code/context you pass in:

Context size	Sonnet (single)	Sonnet (swarm, 4 lenses)	Flash + 4o-mini (consensus)
~1K tokens (one file)	~$0.005	~$0.02	~$0.002
~10K tokens (module)	~$0.05	~$0.20	~$0.01
~50K tokens (small repo)	~$0.20	~$0.80	~$0.05

Every review logs exact input/output tokens and cost in the DB. Use cache.stats() to see cumulative spend. Cached reviews (7-day TTL) cost nothing on repeat runs.

# Trigger a swarm review — 4 lenses review independently, then cross-validate
result = engine.review_dimension("security", context=source_code, mode="swarm")
# result["cross_validated"] = findings 2+ lenses agree on (high confidence)
# result["single_source"] = findings from only 1 lens (investigate)

# Multi-provider consensus — do Claude and Gemini agree?
result = engine.review_dimension("architecture", context=source_code, mode="consensus")
# result["agreement"] = 0.92 (they mostly agree)
# result["provider_results"]["claude"]["grade"] = "A"
# result["provider_results"]["gemini"]["grade"] = "A-"

The review result automatically updates the sidecar with source: "llm_review". Now you have three layers of gap detection:

- Auto score vs manual grade (original divergence)
- LLM review vs auto score (does the model see something automation missed?)
- LLM review vs LLM review (do Claude and Gemini disagree? That's a signal too)

Swarm mode is where cross-validation pays off. Four specialized reviewers look at the same code from different angles; a security auditor catches different things than a performance engineer. Findings that appear in 2+ lenses are cross-validated (high confidence). Findings from only one lens are flagged as single-source (investigate, but lower confidence).
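
The 2+ lens rule can be sketched as a simple grouping step. This is an illustration only: `cross_validate` is a hypothetical function, and findings are matched by normalized text, which is an assumption (the text does not say how ScoreRift decides that two lenses reported the same finding).

```python
from collections import defaultdict
from typing import Dict, List


def cross_validate(findings_by_lens: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Split findings into those seen by 2+ lenses and those seen by one."""
    lenses_per_finding = defaultdict(set)
    for lens, findings in findings_by_lens.items():
        for finding in findings:
            key = finding.strip().lower()  # naive matching, assumed for the sketch
            lenses_per_finding[key].add(lens)
    return {
        "cross_validated": sorted(f for f, seen in lenses_per_finding.items() if len(seen) >= 2),
        "single_source": sorted(f for f, seen in lenses_per_finding.items() if len(seen) == 1),
    }


result = cross_validate({
    "security_auditor": ["os.chdir is not thread safe", "Sidecar path traversal"],
    "architect": ["OS.chdir is not thread safe ", "Orphaned functions"],
    "performance_engineer": [],
})
print(result["cross_validated"])
print(result["single_source"])
```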

Built-in cost safety:

- Local result cache (7-day TTL): same context + same provider = cached, no API call
- All costs tracked per review in the DB
- Providers that aren't configured (no API key) are silently skipped
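
A cache keyed on provider and context with a 7-day TTL could be sketched as follows. `ReviewCache` is a hypothetical in-memory stand-in written for this example; ScoreRift's own cache and its storage format are not shown in the text.

```python
import hashlib
import time
from typing import Any, Optional

TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day TTL from the text


class ReviewCache:
    """In-memory sketch: same provider + same context returns the stored result."""

    def __init__(self) -> None:
        self._store = {}

    @staticmethod
    def key(provider: str, context: str) -> str:
        digest = hashlib.sha256()
        digest.update(provider.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
        digest.update(context.encode("utf-8"))
        return digest.hexdigest()

    def put(self, provider: str, context: str, result: Any,
            now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._store[self.key(provider, context)] = (stored_at, result)

    def get(self, provider: str, context: str,
            now: Optional[float] = None) -> Optional[Any]:
        current = time.time() if now is None else now
        entry = self._store.get(self.key(provider, context))
        if entry is None:
            return None
        stored_at, result = entry
        if current - stored_at > TTL_SECONDS:
            return None  # expired: the caller should make a fresh API call
        return result


cache = ReviewCache()
cache.put("claude", "def f(): pass", {"grade": "A"}, now=0)
print(cache.get("claude", "def f(): pass", now=3600))             # hit
print(cache.get("gemini", "def f(): pass", now=3600))             # other provider
print(cache.get("claude", "def f(): pass", now=TTL_SECONDS + 1))  # expired
```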

# Configure providers via environment variables
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AI...
export OPENAI_API_KEY=sk-...
# Any combination works — uses whichever keys are available

Web Dashboard

pip install scorerift[dashboard]
scorerift dashboard                 # http://localhost:8484/audit/
scorerift dashboard --native        # PyWebView native window

Grade ring, score bars, divergence alerts, tier triggers, feedback widget, review tracking. Self-contained HTML, zero CDN dependencies.

Full walkthrough with examples → docs/QUICKSTART.md

Python API

from scorerift import AuditEngine, Dimension, Tier

engine = AuditEngine(db_path="audit.db", baseline_path="audit_baseline.json")

# passed and total are placeholders for counts from your own test run
engine.register(Dimension(
    name="test_coverage",
    check=lambda: (passed / total, {"passed": passed, "total": total}),
    confidence=0.95,
    tier=Tier.LIGHT,
))

results = engine.run_tier("daily")
health = engine.health_check()  # {"ok": True, "grade": "A", "divergences": 0, ...}

# The interesting part: what do the brains disagree on?
for d in engine.get_divergences():
    print(f"{d.name}: auto={d.auto_score:.2f} vs manual={d.manual_grade}")

CI Integration

- name: Audit health check
  run: |
    pip install scorerift
    scorerift health

Exit code 0 = aligned. Exit code 1 = divergences or failing dimensions. The JSON output tells you exactly what disagrees and by how much.
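
If you drive the engine from Python instead of the CLI, the same exit-code convention can be reproduced by hand. The sketch below assumes a health dict shaped like the `engine.health_check()` example in the Python API section (`ok` and `divergences` keys); `exit_code_for` is a hypothetical helper, and any other keys in the real output are not relied on.

```python
def exit_code_for(health: dict) -> int:
    """Map a health dict to the CI convention: 0 = aligned, 1 = anything else."""
    aligned = bool(health.get("ok")) and health.get("divergences", 0) == 0
    return 0 if aligned else 1


print(exit_code_for({"ok": True, "grade": "A", "divergences": 0}))
print(exit_code_for({"ok": True, "grade": "A-", "divergences": 1}))
print(exit_code_for({"ok": False, "grade": "D", "divergences": 0}))
```

A script can pass the result to `sys.exit()` so the CI step fails the same way `scorerift health` would.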

More Presets

Preset	Dimensions	Best for
`python`	8 real checks	Python repos
`api`	8 dimensions	REST APIs
`database`	7 dimensions	Databases
`infrastructure`	8 dimensions	DevOps
`ml_pipeline`	7 dimensions	ML workflows

Dogfooding: ScoreRift Auditing Itself

We ran scorerift on its own codebase. Here's what happened.

Step 1: Auto-scorer says everything is perfect.

Dimension              Auto  Grade   Status
  testing             1.000      S   ok
  lint                1.000      S   ok
  security            1.000      S   ok
  test_depth          1.000      S   ok
  packaging           1.000      S   ok
  ci                  1.000      S   ok
  documentation       0.920     A+   ok
  type_coverage       0.748      B   ok

Overall: A+ (0.959)

Six S-tier scores. Auto says ship it.

Step 2: Human reviewer finds real issues.

A deep code review graded the same dimensions differently:

Dimension	Auto	Human	Gap
testing	S (1.0)	A-	Auto counts pass rate. Human found missing DB roundtrip tests, untested reviewer modules.
security	S (1.0)	B+	Auto's ruff S rules found 0 issues. Human found os.chdir thread safety hazard, sidecar path traversal weakness.
test_depth	S (1.0)	A-	Auto counts test files per module (8/8). Human noted reviewer modules have zero coverage.
lint	S (1.0)	A-	Auto: ruff clean. Human found orphaned functions, inconsistent fallback behavior.

Step 3: Divergence detector fires.

Divergences: 4
  testing:    auto=1.000 vs manual=A- (gap=0.150)
  lint:       auto=1.000 vs manual=A- (gap=0.150)
  security:   auto=1.000 vs manual=B+ (gap=0.200)
  test_depth: auto=1.000 vs manual=A- (gap=0.150)

Step 4: Fix the real issues.

The divergences pointed to 10 concrete fixes, including a thread safety lock on CWD changes, atomic sidecar writes, consistent error fallbacks, missing test coverage, and orphaned reconciler functions. All were fixed in v0.4.0, and the test count went from 99 to 117.

The auto-scorer was wrong. Not because it's broken — it correctly measured what it measures (pass rate, lint errors, file counts). But those measurements missed real issues that only a reviewer could see. Without the divergence detection, we would have shipped with a thread-safety bug.

This is the entire product thesis in one example: neither brain alone is sufficient.

Docs

- Quickstart Guide: step-by-step with examples
- Standards Reference: what we measure and why (OWASP, SRE, DORA, Clean Code, etc.)
- Architecture: design decisions and data flow
- examples/biged/: 12-dimension reference implementation

Desktop GUI

ScoreRift Studio — native desktop app for configuring, running, and reviewing audits without the CLI. Open any folder, pick a preset, run audits, edit manual grades, export reports.

Origin

Extracted from BigEd CC after production use on a 125-skill AI fleet with 12 audit dimensions, 4 tiers, and automated daily/weekly scheduling. The divergence detection pattern caught real issues that neither automated tests nor manual reviews caught alone.

License

MIT

This version

2.0.0

Apr 3, 2026

Download files

Source Distribution

scorerift-2.0.0.tar.gz (81.8 kB)

Uploaded Apr 3, 2026 Source

Built Distribution

scorerift-2.0.0-py3-none-any.whl (69.2 kB)

Uploaded Apr 3, 2026 Python 3

File details

Details for the file scorerift-2.0.0.tar.gz.

File metadata

Download URL: scorerift-2.0.0.tar.gz
Upload date: Apr 3, 2026
Size: 81.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scorerift-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`b31425619f6d64df20f33fe08900b833a623507cf7adbf1be2c18ef19f8d3858`
MD5	`8233ce11506ad241d8a051861e1bfb87`
BLAKE2b-256	`961cafaa9047f0e6b6bbf4554e2078a4254d4137c27c063d66c1d6c76fa829c5`

Provenance

The following attestation bundles were made for scorerift-2.0.0.tar.gz:

Publisher: publish.yml on SwiftWing21/scorerift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scorerift-2.0.0.tar.gz
- Subject digest: b31425619f6d64df20f33fe08900b833a623507cf7adbf1be2c18ef19f8d3858
- Sigstore transparency entry: 1228223787
- Sigstore integration time: Apr 3, 2026
Source repository:
- Permalink: SwiftWing21/scorerift@dc100857de5c9f3b1a4c204738fe01b9b9b729fd
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/SwiftWing21
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@dc100857de5c9f3b1a4c204738fe01b9b9b729fd
- Trigger Event: push

File details

Details for the file scorerift-2.0.0-py3-none-any.whl.

File metadata

Download URL: scorerift-2.0.0-py3-none-any.whl
Upload date: Apr 3, 2026
Size: 69.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scorerift-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cd5c8f763d39a967cf27336d2c14238ec8ded19ee363049b810b34e728cf294d`
MD5	`ff694274ea70a84cf8616067fd12fa90`
BLAKE2b-256	`63d55fcfe0f3cddd086d035efdc431b61b7bb68db811a3fc4084ff39ea60fb02`

Provenance

The following attestation bundles were made for scorerift-2.0.0-py3-none-any.whl:

Publisher: publish.yml on SwiftWing21/scorerift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scorerift-2.0.0-py3-none-any.whl
- Subject digest: cd5c8f763d39a967cf27336d2c14238ec8ded19ee363049b810b34e728cf294d
- Sigstore transparency entry: 1228223851
- Sigstore integration time: Apr 3, 2026
Source repository:
- Permalink: SwiftWing21/scorerift@dc100857de5c9f3b1a4c204738fe01b9b9b729fd
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/SwiftWing21
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@dc100857de5c9f3b1a4c204738fe01b9b9b729fd
- Trigger Event: push
