Turn real GitHub issues into small, reproducible coding-agent benchmark tasks.

IssueBenchKit

Turn a real GitHub issue, pull request, or local bug into a small coding-agent benchmark task.

SWE-bench is great when you want a public leaderboard. Most teams need something smaller: a repeatable task built from the bugs they actually care about, with a clear test command and a report that says whether a candidate patch really fixed it.

IssueBenchKit is that local builder. It does not try to invent tests for you. It packages the issue context, base commit, reproduction command, and scoring result so you can evaluate coding agents on your own repositories.

Quick Start

pip install issuebenchkit

Create a benchmark task:

issuebench init tasks/qwen-copy \
  --repo ./qwen-code \
  --issue https://github.com/QwenLM/qwen-code/issues/4716 \
  --base 8b4f3b2 \
  --test "npm test -- copyCommand.test.ts"

Run the task against a candidate checkout:

issuebench run tasks/qwen-copy --repo ./candidate-qwen-code --out after.json

Compare before and after (before.json being the result of the same run on the unpatched checkout):

issuebench score tasks/qwen-copy --before before.json --after after.json

Export a report:

issuebench export tasks/qwen-copy --format html --out report.html

What It Stores

Each task directory contains one issuebench.json manifest, which records:

  • source repo path and optional GitHub issue URL
  • base commit or version marker
  • reproduction / validation command
  • expected signal, notes, and tags

Run results are plain JSON files with exit code, duration, command, stdout tail, stderr tail, and the pass/fail verdict. They are easy to archive, diff, or attach to a PR.
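
To make the shape of a run result concrete, here is a minimal Python sketch that runs a shell command and collects the fields listed above. It is not IssueBenchKit's implementation: the function names, the JSON key names (exit_code, duration_seconds, stdout_tail, stderr_tail, passed), the 20-line tail, and the rule "exit code 0 means pass" are all assumptions made for illustration.

```python
import json
import subprocess
import time


def tail(text, max_lines=20):
    """Return the last max_lines lines of text (the 20-line limit is an assumption)."""
    return "\n".join(text.splitlines()[-max_lines:])


def run_command(command, cwd=None):
    """Run a shell command and collect the kinds of fields a run result holds.

    All key names below are hypothetical, not IssueBenchKit's actual schema.
    """
    start = time.monotonic()
    proc = subprocess.run(
        command, shell=True, cwd=cwd, capture_output=True, text=True
    )
    duration = time.monotonic() - start
    return {
        "command": command,
        "exit_code": proc.returncode,
        "duration_seconds": round(duration, 3),
        "stdout_tail": tail(proc.stdout),
        "stderr_tail": tail(proc.stderr),
        # Assumed verdict rule: the command passes when it exits with 0.
        "passed": proc.returncode == 0,
    }


if __name__ == "__main__":
    # Any shell command works here; "echo ok" stands in for a real test command.
    print(json.dumps(run_command("echo ok"), indent=2))
```

Because the result is a plain dictionary serialized with json.dumps, it can be written to a file, archived, and diffed like any other text artifact.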

Why Not Just Use SWE-bench?

Use SWE-bench for public comparison. Use IssueBenchKit when you need:

  • a benchmark task for a private or small repo
  • a tiny task that can run in CI
  • a before/after report for one real bug
  • a dataset of issues that reflects your own engineering workflow

Current Scope

The first version is intentionally small:

  • generic shell test commands
  • JSON manifest files
  • before/after scoring
  • JSONL and single-file HTML export

It does not generate tests automatically, mutate repositories, or claim that one command can evaluate every language ecosystem.
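
As an illustration of what before/after scoring can mean, the sketch below classifies a pair of run results. It is a sketch under stated assumptions, not the tool's scoring code: the "passed" key and the four verdict labels are hypothetical names chosen for this example.

```python
import json
from pathlib import Path


def load_result(path):
    """Read one run-result JSON file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def classify(before_passed, after_passed):
    """Map a before/after pass pair to a verdict label (labels are hypothetical)."""
    if not before_passed and after_passed:
        return "fixed"
    if before_passed and not after_passed:
        return "regressed"
    if before_passed and after_passed:
        return "already-passing"
    return "still-failing"


def score(before, after):
    """Compare two run results; assumes each dict has a boolean 'passed' key."""
    return {
        "before_passed": before["passed"],
        "after_passed": after["passed"],
        "verdict": classify(before["passed"], after["passed"]),
    }
```

A call such as score(load_result("before.json"), load_result("after.json")) yields "fixed" only when the command failed before the patch and passes after it, which is the signal a benchmark task built from a real bug is meant to capture.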

License

MIT

Source Distribution

issuebenchkit-0.1.0.tar.gz (9.7 kB)

Uploaded Source

Built Distribution

issuebenchkit-0.1.0-py3-none-any.whl (10.0 kB)

Uploaded Python 3

File details

Details for the file issuebenchkit-0.1.0.tar.gz.

File metadata

  • Download URL: issuebenchkit-0.1.0.tar.gz
  • Upload date:
  • Size: 9.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for issuebenchkit-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       e16718584f69f1b8256ec75030d669b5f65626dc7b2413d15c21fb1ac5bb1de9
MD5          4d20f41d73ce587ee115c3ca8d12430a
BLAKE2b-256  a33e1c26e39d7ed611a5c1dc8b9fe00a33a835483e9c06ff00b2410702033493

Provenance

The following attestation bundles were made for issuebenchkit-0.1.0.tar.gz:

Publisher: publish.yml on he-yufeng/IssueBenchKit

File details

Details for the file issuebenchkit-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: issuebenchkit-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 10.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for issuebenchkit-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       2f5eea95d19b9b7f0887f620b42f19f3ecad75cdde2bf11e17df905038fc128a
MD5          4edc525fe13a49fa42100ac823cdd640
BLAKE2b-256  876b5d7ef923ebd6efa38c6bbe7b54d823d7bcf01879011815d2e3ba66e047eb

Provenance

The following attestation bundles were made for issuebenchkit-0.1.0-py3-none-any.whl:

Publisher: publish.yml on he-yufeng/IssueBenchKit
