
QualBench — CI for AI-Generated Code

CI for AI-generated code: measures production readiness, not just correctness.

AI Cost Tracking

This project uses AI-generated code: 19 AI commits, total cost $2.85, generated on 2026-04-09 with openrouter/qwen/qwen3-coder-next.


Correct code is not the same as mergeable code. Think eslint plus code review, but for AI-generated changes. Add it to your pipeline in 2 minutes.

License: Apache-2.0 · Dataset: v0


60 seconds to your first score

pip install qualbench
qualbench quickstart

No config, no API keys. QualBench evaluates your current diff and prints a Quality Score.

Add to CI in 2 minutes

# .github/workflows/qualbench.yml
name: QualBench
on: [pull_request]
jobs:
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: semcod/qualbench-action@v1
        with:
          tool: prollama
          fail_on_score: 70

Every AI-generated PR gets a quality review comment. Set fail_on_score and the pipeline fails when the Quality Score falls below your threshold.

🧠 QualBench Review

Quality Score: 78/100

  ❌ Complexity increased (+12%)
  ⚠ Security: 1 new medium-severity finding
  ✔ Tests pass, no regressions

Verdict: needs_review
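The fail_on_score gating above can be sketched as a small script. This is a minimal illustration, not QualBench's implementation: the score value is taken from the sample review comment, and the function name is hypothetical.

```python
import sys

def gate(score: int, fail_on_score: int) -> int:
    """Return a CI exit code: 0 if the score meets the threshold, 1 otherwise."""
    if score < fail_on_score:
        print(f"Quality Score {score} is below threshold {fail_on_score}: failing the pipeline")
        return 1
    print(f"Quality Score {score} meets threshold {fail_on_score}")
    return 0

if __name__ == "__main__":
    sys.exit(gate(score=78, fail_on_score=70))
```

With fail_on_score: 70, the sample PR above (78/100) passes the pipeline even though its verdict is needs_review; raising the threshold to 80 would block it.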

CI/CD Examples

GitHub Action (recommended)

# .github/workflows/qualbench.yml
name: QualBench
on: [pull_request]
jobs:
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: semcod/qualbench-action@v1
        with:
          tool: prollama
          fail_on_score: 70

GitLab CI

# .gitlab-ci.yml
qualbench:
  stage: test
  image: python:3.12-slim
  before_script:
    - pip install qualbench
  script:
    - qualbench run --tool prollama --json --fail-on-score 70
  only:
    - merge_requests

Azure DevOps

# azure-pipelines.yml
steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.12'
  - script: |
      pip install qualbench
      qualbench run --tool prollama --json --fail-on-score 70
    displayName: 'QualBench Quality Check'

Jenkins

// Jenkinsfile
stage('Quality Check') {
    steps {
        sh '''
            pip install qualbench
            qualbench run --tool prollama --fail-on-score 70
        '''
    }
}

CircleCI

# .circleci/config.yml
version: 2.1
jobs:
  quality:
    docker:
      - image: python:3.12-slim
    steps:
      - checkout
      - run: pip install qualbench
      - run: qualbench run --tool prollama --fail-on-score 70
workflows:
  quality-check:
    jobs:
      - quality

The problem

AI coding tools resolve 70–80% of benchmark tasks. But most AI-generated PRs are not mergeable without human fixes. Every existing benchmark asks "do tests pass?" — nobody asks "would a senior developer approve this PR?"

Six dimensions of production readiness

Dimension        What it measures                        Weight
Correctness      All tests pass, no regressions          25%
Mergeability     Would a senior dev merge this? (1–5)    25%
Security         New vulnerabilities introduced          15%
Code quality     Complexity delta, dead code             15%
Iterations       Attempts to reach acceptable output     10%
Cost efficiency  USD per successful patch                10%

Verdicts: ready_to_merge (≥85), needs_review (65–84), not_merge_ready (<65).
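The weights and verdict bands above imply a scoring model along these lines. The weights and thresholds come from the docs; treating each dimension as a 0–100 subscore combined by a simple weighted sum is an assumption for illustration, and the subscore values are made up.

```python
# Weights from the dimensions table (they sum to 100%).
WEIGHTS = {
    "correctness": 0.25,
    "mergeability": 0.25,
    "security": 0.15,
    "code_quality": 0.15,
    "iterations": 0.10,
    "cost_efficiency": 0.10,
}

def quality_score(subscores: dict) -> float:
    """Combine per-dimension subscores (0-100) into one weighted score."""
    return sum(WEIGHTS[dim] * subscores[dim] for dim in WEIGHTS)

def verdict(score: float) -> str:
    """Map a score to the documented verdict bands."""
    if score >= 85:
        return "ready_to_merge"
    if score >= 65:
        return "needs_review"
    return "not_merge_ready"

# Illustrative subscores for one hypothetical PR.
subscores = {
    "correctness": 90, "mergeability": 70, "security": 80,
    "code_quality": 75, "iterations": 60, "cost_efficiency": 85,
}
score = quality_score(subscores)
print(score, verdict(score))  # → 77.75 needs_review
```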

CLI

qualbench run --tool prollama          # score current diff
qualbench run --tool prollama --json   # portable JSON output
qualbench run --mode cheap             # lowest-cost models
qualbench quickstart                   # first score in 60 seconds
qualbench compare my_tool              # vs leaderboard
qualbench info                         # dataset summary
qualbench doctor                       # check dependencies

One portable format everywhere

CLI, API, GitHub Action — same JSON schema. See docs/schema.md.
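A downstream script can consume that JSON directly. The field names below are illustrative assumptions, not the documented schema (which lives in docs/schema.md):

```python
import json

# Hypothetical payload shaped like a `qualbench run --json` result;
# consult docs/schema.md for the real field names.
payload = json.loads("""
{
  "tool": "prollama",
  "quality_score": 78,
  "verdict": "needs_review",
  "dimensions": {"correctness": 90, "mergeability": 70}
}
""")

# A custom CI step could branch on the parsed fields.
assert 0 <= payload["quality_score"] <= 100
print(payload["tool"], payload["verdict"])
```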

Adding your tool

cp runners/template.py runners/my_tool.py
# Implement run() → return portable schema
qualbench run --tool my_tool
# Submit PR with results
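A runner implementation might look roughly like this. The run() signature and the returned fields are assumptions sketched from the steps above; runners/template.py and docs/schema.md are authoritative.

```python
# Hypothetical sketch of runners/my_tool.py; field names are illustrative.
def run() -> dict:
    """Evaluate the tool on the benchmark and return a portable-schema result."""
    return {
        "tool": "my_tool",
        "quality_score": 0,            # filled in after evaluation
        "verdict": "not_merge_ready",  # derived from quality_score
        "dimensions": {},              # per-dimension subscores
    }

result = run()
print(result["tool"])
```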

License

Licensed under Apache-2.0.
