Arbiter

Agent-aware code quality system for multi-agent codebases.

In 2026, code is written by fleets of AI agents. Arbiter knows who wrote each line — human or AI — and scores quality accordingly.

What Makes Arbiter Different

Feature               Traditional Tools   Arbiter
Agent attribution     None                First-class: tracks Claude, Codex, Gemini, Copilot, humans
Per-commit scoring    Repo-wide only      Scores each commit's changed files individually
Diff analysis         N/A                 Score only what changed in a PR/branch
Transparency          Opaque score        Every score decomposes into lint + security + complexity
Agent-specific gates  N/A                 Different quality thresholds per agent trust tier
Tool integration      Proprietary         Wraps tools you already trust: ruff, Bandit, radon, vulture
Dashboard             SaaS login          Single HTML file with per-agent timelines, commit feed, fleet view
Dependencies          Heavy               Analysis tools only; core is stdlib Python

Quick Start

git clone https://github.com/hummbl-dev/arbiter.git
cd arbiter

# Install (makes `arbiter` command available)
pip install ".[analyzers]"

# Quick score (no persistence)
arbiter score /path/to/your/repo

# Full analysis with per-commit agent attribution
arbiter analyze /path/to/your/repo

# Score only files changed since main
arbiter diff /path/to/your/repo --base main

# Agent leaderboard
arbiter agents

# Start dashboard
arbiter serve --port 8080
# Open http://localhost:8080

Without install (PYTHONPATH)

PYTHONPATH=src python -m arbiter score /path/to/your/repo

With Docker

docker build -t arbiter .
docker run -p 8080:8080 -v /path/to/repo:/repo:ro arbiter

Architecture

Git Repo ──→ [Git Historian] ──→ [Analyzer Runner] ──→ [Scoring Engine] ──→ [SQLite Store]
                  │                      │                     │                    │
           agent attribution      tool invocation        weighted rubric       trend data
           (Co-Authored-By,       (ruff, radon,          (lint 35%,             │
            email matching)        vulture, bandit)        security 30%,        ├──→ REST API
                                                           complexity 35%)     └──→ Dashboard
             ┌────────────┐
             │Diff Analyzer│ ←── v0.2: scores only changed files per commit/branch
             └────────────┘

Per-Commit Scoring (v0.2)

Every commit is scored against only the files it changed, not the entire repo. This makes the agent leaderboard meaningful — a commit that touches 1 clean file scores differently than one that touches 10 messy files.
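As a rough illustration of what scoping a score to one commit involves, the sketch below lists the files a commit touched with `git diff-tree` and keeps only the Python sources. The helper names `python_paths` and `changed_files` are hypothetical, and this is not necessarily how Arbiter's historian does it.

```python
import subprocess


def python_paths(name_only_output: str) -> list[str]:
    """Keep the .py paths from `git ... --name-only` output (one path per line)."""
    lines = (line.strip() for line in name_only_output.splitlines())
    return [path for path in lines if path.endswith(".py")]


def changed_files(repo: str, commit: str) -> list[str]:
    """List Python files touched by a single commit (hypothetical helper)."""
    result = subprocess.run(
        # --root makes the very first commit report its files as well.
        ["git", "-C", repo, "diff-tree", "--no-commit-id", "--name-only",
         "-r", "--root", commit],
        capture_output=True, text=True, check=True,
    )
    return python_paths(result.stdout)
```

Files deleted by the commit also appear in this listing, so a real scorer would need to skip paths that no longer exist.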

Diff Mode (v0.2)

arbiter diff scores only files changed since a base branch. Ideal for CI/PR quality gates — fast, scoped, actionable.
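A CI gate on top of diff mode could be sketched as follows. The `--json` flag is documented, but the shape of its output is not, so the `overall` field name is an assumption, as are the function names and the default threshold. Adjust them to what the command actually prints.

```python
import json
import subprocess


def passes_gate(report: dict, threshold: float, field: str = "overall") -> bool:
    """Return True when the report's score meets the threshold.

    `field` is an assumed key name; the real JSON schema is not documented here.
    """
    return float(report[field]) >= threshold


def run_gate(repo: str, base: str = "main", threshold: float = 70.0) -> int:
    """Run `arbiter diff --json` and turn the result into a CI exit code."""
    proc = subprocess.run(
        ["arbiter", "diff", repo, "--base", base, "--json"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(proc.stdout)
    return 0 if passes_gate(report, threshold) else 1
```

In a CI job, calling `sys.exit(run_gate("."))` from a small script would fail the build whenever the changed files score below the threshold.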

Agent Attribution

Arbiter identifies which agent authored each commit:

  1. Co-Authored-By trailer, e.g. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  2. Author email — maps noreply@anthropic.com → claude, codex@openai.com → codex
  3. Default — "human" if no agent pattern matches

Configure in agents.yml:

agents:
  - name: claude
    emails: [noreply@anthropic.com]
    co_author_patterns: ["Claude\\s+(Opus|Sonnet|Haiku)"]
    trust_tier: verified
    quality_threshold: 70.0
  - name: gemini
    trust_tier: probation
    quality_threshold: 80.0  # Higher bar for probationary agents
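The resolution order described above (trailer, then author email, then "human") can be sketched in a few lines. This is a minimal illustration, not Arbiter's implementation: the `AgentRule` class and `attribute` function are hypothetical names, and the rules are built in code rather than loaded from agents.yml.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AgentRule:
    """One agents.yml entry, reduced to the fields used for attribution."""
    name: str
    emails: list[str] = field(default_factory=list)
    co_author_patterns: list[str] = field(default_factory=list)


# Matches trailer lines such as "Co-Authored-By: Claude Opus 4.6 <...>".
TRAILER = re.compile(r"^Co-Authored-By:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def attribute(message: str, author_email: str, rules: list[AgentRule]) -> str:
    """Resolve a commit to an agent name: trailer first, then email, then human."""
    trailers = TRAILER.findall(message)
    for rule in rules:
        for pattern in rule.co_author_patterns:
            if any(re.search(pattern, trailer) for trailer in trailers):
                return rule.name
    for rule in rules:
        if author_email.lower() in (email.lower() for email in rule.emails):
            return rule.name
    return "human"


rules = [
    AgentRule("claude", ["noreply@anthropic.com"], [r"Claude\s+(Opus|Sonnet|Haiku)"]),
    AgentRule("codex", ["codex@openai.com"]),
]
msg = "Fix parser\n\nCo-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>"
print(attribute(msg, "dev@example.com", rules))         # claude
print(attribute("Tidy up", "codex@openai.com", rules))  # codex
print(attribute("Tidy up", "dev@example.com", rules))   # human
```

Checking trailers before emails means a human-authored commit that an agent co-wrote is still credited to the agent, which matches the ordering of the list above.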

Analyzers (pluggable)

Analyzer     Tool      What It Finds
Lint         ruff      Style violations, import errors, bugbear patterns
Complexity   radon     Cyclomatic complexity (grade A-F per function)
Security     bandit    Hardcoded secrets, shell injection, dangerous patterns
Dead Code    vulture   Unused functions, imports, variables
Duplication  AST hash  Near-duplicate function bodies
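The "AST hash" idea behind the duplication analyzer can be illustrated with the standard library alone. The sketch below hashes each function body's syntax tree and groups functions with identical hashes. It only catches structurally identical bodies (formatting and comments are ignored); detecting true near-duplicates would need further normalization, such as renaming identifiers. The function names are hypothetical and Arbiter's analyzer may work differently.

```python
import ast
import hashlib
from collections import defaultdict


def body_fingerprint(func: ast.AST) -> str:
    """Hash a function body's AST, ignoring formatting, comments, and positions."""
    # ast.dump omits line/column attributes by default, so layout doesn't matter.
    dumped = "".join(ast.dump(stmt) for stmt in func.body)
    return hashlib.sha256(dumped.encode("utf-8")).hexdigest()


def duplicate_functions(source: str) -> list[list[str]]:
    """Group function names whose bodies share the same AST fingerprint."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            groups[body_fingerprint(node)].append(node.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


sample = '''
def area(w, h):
    return w * h

def size(w, h):
    # same body, different name and spacing
    return w   *   h

def perimeter(w, h):
    return 2 * (w + h)
'''
print(duplicate_functions(sample))  # [['area', 'size']]
```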

Scoring

Deterministic. Same code → same score. Always.

Overall = 0.35 × Lint + 0.30 × Security + 0.35 × Complexity

Penalty points by severity:
  CRITICAL: 50 | HIGH: 20 | MEDIUM: 5 | LOW: 1

Score = 100 - (total_penalty / LOC) * normalization_factor

Grades: A (90+) | B (80+) | C (70+) | D (60+) | F (<60)
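To make the arithmetic concrete, here is a small worked example of the rubric above. Several details are assumptions made for illustration only: that the penalty formula is applied per category, that `normalization_factor` is 100, and that scores are clamped to the 0 to 100 range. The function names are hypothetical.

```python
SEVERITY_PENALTY = {"CRITICAL": 50, "HIGH": 20, "MEDIUM": 5, "LOW": 1}
WEIGHTS = {"lint": 0.35, "security": 0.30, "complexity": 0.35}
GRADE_FLOORS = (("A", 90), ("B", 80), ("C", 70), ("D", 60))


def category_score(severities, loc, normalization_factor=100.0):
    """Score = 100 - (total_penalty / LOC) * normalization_factor.

    The factor of 100 and the clamping to 0..100 are assumptions.
    """
    total_penalty = sum(SEVERITY_PENALTY[s] for s in severities)
    raw = 100.0 - (total_penalty / max(loc, 1)) * normalization_factor
    return max(0.0, min(100.0, raw))


def overall(scores):
    """Weighted sum of the three category scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def grade(score):
    """Map a numeric score to a letter using the documented floors."""
    for letter, floor in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


# 200 lines of code with a handful of findings in each category.
scores = {
    "lint": category_score(["MEDIUM", "LOW", "LOW"], loc=200),   # 100 - (7/200)*100
    "security": category_score(["CRITICAL"], loc=200),           # 100 - (50/200)*100
    "complexity": category_score(["MEDIUM", "MEDIUM"], loc=200), # 100 - (10/200)*100
}
total = overall(scores)
print(scores, round(total, 3), grade(total))
```

Under these assumptions a single CRITICAL security finding in 200 lines pulls the security score to 75 and the overall grade from A to B, which shows how heavily the rubric weights severity.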

Dashboard (v2)

Single HTML file with Chart.js. No build step, no React, no npm.

  • Score Card — Big number + breakdown bars
  • Agent Leaderboard — Who writes the best code? Color-coded by agent
  • Per-Agent Quality Timeline — Score over time per agent (not just repo-wide)
  • Commit Feed — Recent commits with agent, score, changes, timestamp
  • Hotspot Files — Ranked by finding count
  • Fleet View — Multi-repo quality grid with color-coded scores
  • Tabbed UI — Overview, Commits, Fleet tabs

API

GET /api/score                  Current repo score
GET /api/agents                 Agent leaderboard
GET /api/agents/{name}/trend    Per-agent quality over time
GET /api/trend?days=30          Quality over time
GET /api/worst?limit=20         Worst files
GET /api/commits                Recent commits with scores
GET /api/commits/{hash}         Detail for one commit
GET /api/fleet                  Fleet report (multi-repo)
GET /api/health                 System health
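The endpoints above can be queried with the standard library. The sketch below assumes the server started by `arbiter serve --port 8080` is reachable on localhost and that responses are JSON; the source does not document the response schema, so the helpers return the decoded body as-is. `api_url` and `fetch` are hypothetical names.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # matches `arbiter serve --port 8080`


def api_url(path: str, **params) -> str:
    """Build a URL such as http://localhost:8080/api/trend?days=30."""
    url = f"{BASE_URL}/api/{quote(path.strip('/'))}"
    return f"{url}?{urlencode(params)}" if params else url


def fetch(path: str, **params):
    """GET an endpoint and decode the body, assuming the server returns JSON."""
    with urlopen(api_url(path, **params), timeout=10) as response:
        return json.load(response)
```

For example, `fetch("agents/claude/trend")` would request the per-agent trend and `fetch("worst", limit=20)` the worst files, provided the server is running.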

CLI Commands

arbiter analyze <repo>                     # Full analysis + per-commit scoring + persist
arbiter score <repo> [--json] [--exclude]  # Quick score (no persist)
arbiter diff <repo> [--base main] [--json] # Score only changed files vs base branch
arbiter agents                             # Agent leaderboard
arbiter trend [--days 30]                  # Quality trend
arbiter worst [--limit 20]                 # Worst files
arbiter commits [--agent claude]           # Recent commits
arbiter audit-fleet <directory>            # Audit all repos in a directory
arbiter fleet-report                       # Fleet quality summary
arbiter triage                             # Auto-classify repos: green/yellow/red/archive
arbiter fix <repo> [--dry-run]             # Auto-fix ruff findings + before/after score
arbiter serve [--port 8080]                # API + dashboard

Tests

pip install ".[test]"
PYTHONPATH=src python -m pytest tests/ -v
# 78 tests, <7 seconds

Requirements

  • Python 3.11+
  • git (for historian)
  • Optional: ruff, radon, vulture, bandit (for full analysis)
  • Optional: Docker (for containerized deployment)

License

MIT — see LICENSE.


Built by HUMMBL LLC from production experience coordinating Claude, Codex, Gemini, and human engineers on a 6,000+ test codebase.
