Detect duplicate PRs in GitHub repos

bepo

Detect duplicate pull requests in GitHub repos.

No ML, no embeddings, no API keys. Just static analysis of diffs.

A maintainer with 100 open PRs can run bepo check --repo foo/bar and in 5 minutes get a ranked list of "you should look at these pairs." That saves hours of manual review.

The Problem

Large repos waste engineering time on duplicate PRs. When multiple contributors fix the same bug independently, only one PR gets merged — the rest is wasted effort.

This actually happens. We analyzed 100 PRs from OpenClaw and found:

| Cluster | PRs | What happened |
|---|---|---|
| Matrix startup bug | 4 PRs | 4 engineers independently changed startupGraceMs from 0 to 5000 |
| Media token regex | 2 PRs | Identical fix submitted twice |
| Feishu bitable config | 2 PRs | Same multi-account config fix |

8 duplicate PRs across 3 bug fixes. That's real engineering time wasted.

Proof: OpenClaw Analysis

We ran bepo on OpenClaw's open PRs. Here's what it found:

$ bepo check --repo openclaw/openclaw --limit 100

#20025 <-> #19973
  Similarity: 86%
  Reason: Both fix #19843      ← Same issue!

#19868 <-> #19855
  Similarity: 81%
  Reason: Same files: parse.ts, pi-embedded-subscribe.tools.ts

#19871 <-> #19853
  Similarity: 100%
  Reason: Same files: bitable.ts, config-schema.ts, tools-config.ts

Verified manually:

| PR pair | Similarity | Verdict |
|---|---|---|
| #20025 ↔ #19973 (Matrix) | 86% | ✅ TRUE DUPLICATE: both change startupGraceMs from 0 to 5000 |
| #19868 ↔ #19855 (regex) | 81% | ✅ TRUE DUPLICATE: identical PR titles |
| #19871 ↔ #19853 (Feishu) | 100% | ✅ TRUE DUPLICATE: same files, same fix |
| #19996 ↔ #19993 (unrelated) | 20% | ✅ Correctly NOT flagged |

Precision: 80% (4/5 flagged clusters were true duplicates)

More Examples

VSCode — Found PRs touching same files for same feature:

#295823 <-> #295822
  Similarity: 77%
  Reason: Same files: chatModel.ts, chatForkActions.ts

  Both: "Use metadata flag for fork detection"

Next.js — Found related test updates:

#90121 <-> #90120
  Similarity: 86%
  Reason: Same files: test/

Install

pip install bepo

Requires GitHub CLI (gh) to be installed and authenticated.

Usage

# Check a repo for duplicate PRs
bepo check --repo owner/repo

# Adjust sensitivity (default: 0.4, higher = stricter)
bepo check --repo owner/repo --threshold 0.5

# Check more PRs
bepo check --repo owner/repo --limit 100

# JSON output for CI
bepo check --repo owner/repo --json

How It Works

bepo fingerprints each PR by extracting:

| Signal | Weight | What it catches |
|---|---|---|
| Same issue ref (#123) | 10.0 | Definite duplicate |
| Same code changes (IDF-weighted) | 8.0 | Rare lines weighted more than common boilerplate |
| Same files touched | 6.0 | PRs modifying same code |
| Same feature domain | 3.0 | auth, messaging, database, etc. |
| Same imports | 1.0 | Similar dependencies |

Then computes pairwise Jaccard similarity.

That's it. No embeddings, no LLM calls. Just:

  • Parse +++ b/path from diffs
  • Regex for #\d+ issue refs
  • Compare actual code changes
  • Set intersection for similarity
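The first two steps above can be sketched in a few lines. This is an illustration only, not bepo's source: `extract_signals` is a hypothetical helper, and the sample diff and its file path are made up around the `startupGraceMs` example.

```python
import re

# "+++ b/path" header lines name the files a diff touches.
FILE_RE = re.compile(r"^\+\+\+ b/(.+)$", re.MULTILINE)
# "#123"-style issue references in the PR title or body.
ISSUE_RE = re.compile(r"#(\d+)")


def extract_signals(diff: str, title: str = "", body: str = "") -> dict:
    """Hypothetical helper: pull simple duplicate signals out of one PR."""
    files = set(FILE_RE.findall(diff))
    issues = {int(n) for n in ISSUE_RE.findall(f"{title}\n{body}")}
    # Added lines start with '+'. Skipping anything that starts with '+++'
    # drops the file headers (a simplification: an added line whose own
    # content begins with '++' would be skipped too).
    added = {
        line[1:].strip()
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    }
    added.discard("")
    return {"files": files, "issues": issues, "added_lines": added}


# Made-up diff for illustration; the path is not taken from a real PR.
sample_diff = """\
diff --git a/src/matrix.ts b/src/matrix.ts
--- a/src/matrix.ts
+++ b/src/matrix.ts
@@ -10,1 +10,1 @@
-const startupGraceMs = 0;
+const startupGraceMs = 5000;
"""

signals = extract_signals(sample_diff, title="Fix Matrix startup", body="Fixes #19843")
print(signals)
```

Keeping each signal as a set is what makes the later comparison a plain set intersection.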

~300 lines of Python.
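How the per-signal weights combine with Jaccard similarity is not spelled out here, so the sketch below assumes one plausible rule: a weighted mean of per-signal Jaccard scores, ignoring signals that are empty in both PRs. The weights come from the table above; the dictionary keys, the function names, and the sample fingerprints are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Plain Jaccard similarity: |A & B| / |A | B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Weights from the signal table; the key names are illustrative.
WEIGHTS = {"issues": 10.0, "code": 8.0, "files": 6.0, "domains": 3.0, "imports": 1.0}


def weighted_similarity(fp_a: dict, fp_b: dict) -> float:
    """Assumed combining rule: weighted mean of per-signal Jaccard scores."""
    score = total = 0.0
    for name, weight in WEIGHTS.items():
        a, b = fp_a.get(name, set()), fp_b.get(name, set())
        if not a and not b:
            continue  # a signal absent from both PRs says nothing either way
        score += weight * jaccard(a, b)
        total += weight
    return score / total if total else 0.0


# Two made-up fingerprints that share an issue ref and a code change.
pr_a = {"issues": {19843}, "files": {"matrix.ts"}, "code": {"startupGraceMs = 5000"}}
pr_b = {"issues": {19843}, "files": {"matrix.ts", "config.ts"}, "code": {"startupGraceMs = 5000"}}

sim = weighted_similarity(pr_a, pr_b)
print(f"{sim:.1%}")  # (10*1.0 + 8*1.0 + 6*0.5) / 24 = 0.875
```

Under this rule a shared issue ref dominates the score, which matches the table's "definite duplicate" label for that signal.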

As a Library

from bepo import fingerprint_pr, find_duplicates

# Fingerprint PRs
fp1 = fingerprint_pr("#123", diff1, title="Fix auth", body="Fixes #456")
fp2 = fingerprint_pr("#124", diff2, title="Auth fix", body="Fixes #456")

# Find duplicates
dups = find_duplicates([fp1, fp2], threshold=0.4)
for d in dups:
    print(f"{d.pr_a} <-> {d.pr_b}: {d.similarity:.0%}")
    print(f"  Shared issues: {d.shared_issues}")
    print(f"  Shared files: {d.shared_files}")
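The example above leaves `diff1` and `diff2` undefined. One way to obtain them, assuming the GitHub CLI is installed and authenticated, is to shell out to `gh pr diff`. `fetch_pr_diff` is a hypothetical helper, not part of bepo, and this is not necessarily how bepo's own CLI fetches diffs.

```python
import subprocess


def fetch_pr_diff(number: int, repo: str, run=subprocess.run) -> str:
    """Hypothetical helper: return one PR's unified diff via `gh pr diff`.

    `run` is injectable so the function can be exercised without gh installed.
    """
    result = run(
        ["gh", "pr", "diff", str(number), "--repo", repo],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if gh exits non-zero
    )
    return result.stdout


# Usage (needs network access and an authenticated gh):
# diff1 = fetch_pr_diff(123, "owner/repo")
# diff2 = fetch_pr_diff(124, "owner/repo")
```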

GitHub Action

name: PR Duplicate Check
on: [pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install bepo
      - run: bepo check --repo ${{ github.repository }} --json
        env:
          GH_TOKEN: ${{ github.token }}

Why This Works

Duplicates share obvious signals:

  • Same code = Identical changes (639 shared lines caught SoundChain duplicates)
  • Same issue ref = Same bug report (#19843 appeared in 4 Matrix PRs)
  • Same files = Same bug location (100% overlap for Feishu cluster)

IDF weighting makes rare lines matter more than common boilerplate. A shared startupGraceMs = 5000 is a stronger signal than a shared return null.
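To make the IDF idea concrete, here is a minimal sketch that weights each changed line by how rare it is across a set of PRs and then scores overlap by weight instead of by count. The smoothed formula, the function names, and the sample lines other than `startupGraceMs = 5000` and `return null` are assumptions for illustration; bepo's exact formula may differ.

```python
import math
from collections import Counter


def idf_weights(corpus: list[set[str]]) -> dict[str, float]:
    """Smoothed inverse document frequency per line (assumed formula)."""
    n = len(corpus)
    df = Counter(line for lines in corpus for line in lines)
    # Lines seen in fewer PRs get larger weights; the +1 terms keep every
    # weight positive, even for a line that appears in all PRs.
    return {line: math.log((1 + n) / (1 + count)) + 1.0 for line, count in df.items()}


def idf_jaccard(a: set[str], b: set[str], weights: dict[str, float]) -> float:
    """Jaccard similarity with each line counted by its IDF weight."""
    shared = sum(weights[line] for line in a & b)
    total = sum(weights[line] for line in a | b)
    return shared / total if total else 0.0


# Made-up changed lines for four PRs; "return null" is boilerplate in all of them.
pr_a = {"return null", "startupGraceMs = 5000"}
pr_b = {"return null", "startupGraceMs = 5000"}
pr_c = {"return null", "retries = 3"}
pr_d = {"return null", "timeout = 30"}

weights = idf_weights([pr_a, pr_b, pr_c, pr_d])

print(idf_jaccard(pr_a, pr_b, weights))  # identical changes score 1.0
print(idf_jaccard(pr_a, pr_c, weights))  # roughly 0.23, below the plain Jaccard of 1/3
```

Sharing only the boilerplate line pulls the pair's score down relative to unweighted Jaccard, which is the effect described above.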

Code overlap and issue refs catch most duplicates. Simple works.

Origin Story

This tool was vibe-coded in a single session with Claude.

We tried a few approaches and kept finding that simpler signals outperformed fancier ones. File overlap and issue refs catch most duplicates. Sometimes the obvious solution is the right one.

License

MIT

Download files

Source Distribution: bepo-0.4.0.tar.gz (11.2 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2
  • SHA256: 710aaf52b8eff3c0ee44c1f388369499a7abd501f5d76508485bfe8b000550d9
  • MD5: da0bb639e5ff48cc7ab2a013f443111e
  • BLAKE2b-256: 7470a8cac188962f43af4d6b6e69beb4a5afd81a913c958b04bfefbe57a66654

Built Distribution: bepo-0.4.0-py3-none-any.whl (12.1 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2
  • SHA256: 6959c47d0a5ec2bf4f1730b3c8885058855064cc463cee9e548545be8067b6d8
  • MD5: abf47e78429939e9ffed93eab2807d84
  • BLAKE2b-256: 303c54ecbd9fd2d45b23a517329e1342bd2db94f928c97fe36fe47cc0b4fc1dc
