Detect duplicate PRs in GitHub repos
Project description
bepo
Detect duplicate pull requests in GitHub repos.
No ML, no embeddings, no API keys. Just static analysis of diffs.
A maintainer with 100 open PRs can run bepo check --repo foo/bar and in 5 minutes get a ranked list of "you should look at these pairs." That saves hours of manual review.
The Problem
Large repos waste engineering time on duplicate PRs. When multiple contributors fix the same bug independently, only one PR gets merged — the rest is wasted effort.
This actually happens. We analyzed 100 PRs from OpenClaw and found:
| Cluster | PRs | What happened |
|---|---|---|
| Matrix startup bug | 4 PRs | 4 engineers independently fixed startupGraceMs = 0 → 5000 |
| Media token regex | 2 PRs | Identical fix submitted twice |
| Feishu bitable config | 2 PRs | Same multi-account config fix |
8 duplicate PRs across 3 bug fixes. That's real engineering time wasted.
Proof: OpenClaw Analysis
30-day window (recommended)
$ bepo check --repo openclaw/openclaw --since 30d
Found 52 potential duplicates:
#20472 <-> #20491
Similarity: 100%
Reason: Both fix #20468 ← Same Nextcloud Talk restart bug, two fixes
#19595 <-> #19624
Similarity: 90%
Reason: Both fix #19574 ← Identical PR titles, same elevatedDefault bug
#20419 <-> #20441
Similarity: 83%
Reason: Both fix #20410 ← Same WebChat markdown rendering fix
#19865 <-> #19945
Similarity: 97%
Reason: Same code: 100 lines overlap ← Two embedding provider PRs duplicating core logic
#19770 <-> #20317
Similarity: 87%
Reason: Same code: 9 lines overlap ← Same "hide tool calls" UI toggle, added twice
Precision: ~88% — verified by manual classification of all 52 pairs.
The remaining ~12% are concurrent PRs touching the same structural code — two provider integrations sharing schema boilerplate, two locale additions hitting the same type. The kind of overlap a reviewer would want to know about.
Full backlog (3,000 PRs)
$ bepo check --repo openclaw/openclaw --limit 3000
Analyzed 3000 PRs in 9.8s
Found 1022 potential duplicates:
#17518 <-> #17653
Similarity: 100%
Reason: Both fix #17499 ← Identical browser dialog fix submitted twice
#12936 <-> #19050
Similarity: 80%
Reason: Same code: 10 lines overlap ← Same Telegram thread_id fix, one is literally "v2"
#15512 <-> #18994
Similarity: 67%
Reason: Same code: 10 lines overlap ← Both normalize Brave search language codes
#14182 <-> #15051
Similarity: 75%
Reason: Same code: 2403 lines overlap ← Two Zulip implementations duplicating core logic
Precision by similarity band, verified by manual sampling:
| Band | Pairs | Precision | Notes |
|---|---|---|---|
| 100% | 221 | ~75% | High code overlap but tiny sets — watch for short boilerplate |
| 80–90% | 318 | ~95% | Best signal — issue refs + code together |
| 70–79% | 283 | ~90% | Strong structural duplicates |
| 65–69% | 200 | ~70% | Noisier; raise --threshold 0.75 to cut this band |
| Overall | 1022 | ~84% |
For large backlogs, --threshold 0.75 drops to ~550 pairs at ~92% precision. --since 30d gives 52 actionable pairs at ~88% — the recommended default.
More Examples
VSCode — Found PRs touching same files for same feature:
#295823 <-> #295822
Similarity: 77%
Reason: Same files: chatModel.ts, chatForkActions.ts
Both: "Use metadata flag for fork detection"
Next.js — Found related test updates:
#90121 <-> #90120
Similarity: 86%
Reason: Same files: test/
Install
pip install bepo
Requires GitHub CLI (gh) to be installed and authenticated.
Usage
# Check a repo for duplicate PRs
bepo check --repo owner/repo
# Check recent PRs (recommended — avoids stale noise)
bepo check --repo owner/repo --since 30d
# Adjust sensitivity (default: 0.65, higher = stricter)
bepo check --repo owner/repo --threshold 0.7
# JSON output for CI
bepo check --repo owner/repo --json
How It Works
bepo fingerprints each PR by extracting:
| Signal | Weight | What it catches |
|---|---|---|
| Same issue ref (#123) | 10.0 | Definite duplicate |
| Same code changes (IDF-weighted) | 8.0 | Rare lines weighted more than common boilerplate |
| Same files touched | 6.0 | PRs modifying same code |
| Same feature domain | 3.0 | auth, messaging, database, etc. |
| Same imports | 1.0 | Similar dependencies |
Then computes pairwise Jaccard similarity.
That's it. No embeddings, no LLM calls. Just:
- Parse
+++ b/pathfrom diffs - Regex for
#\d+issue refs - Compare actual code changes
- Set intersection for similarity
Cross-component filtering suppresses boilerplate FPs in integration/plugin monorepos (e.g. Home Assistant, VSCode extensions). Two unrelated integrations sharing scaffold code (config_flow.py, manifest.json) are filtered out when each PR is concentrated in a different component subtree. Pairs sharing a GitHub issue ref always bypass this filter.
~2,000 lines of Python.
As a Library
from bepo import fingerprint_pr, find_duplicates
# Fingerprint PRs
fp1 = fingerprint_pr("#123", diff1, title="Fix auth", body="Fixes #456")
fp2 = fingerprint_pr("#124", diff2, title="Auth fix", body="Fixes #456")
# Find duplicates
dups = find_duplicates([fp1, fp2], threshold=0.65)
for d in dups:
print(f"{d.pr_a} ↔ {d.pr_b}: {d.similarity:.0%}")
print(f" Shared issues: {d.shared_issues}")
print(f" Shared files: {d.shared_files}")
GitHub Action
Add to your repo to automatically detect duplicate PRs and post a warning comment:
name: PR Duplicate Check
on: [pull_request]
jobs:
check-duplicates:
runs-on: ubuntu-latest
steps:
- uses: aardpark/bepo@v1
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
threshold: '0.65' # optional, default 0.65
When a PR is opened that looks like a duplicate, bepo posts a comment:
⚠️ Potential Duplicate PRs Detected
This PR may be similar to existing open PRs:
PR Similarity Reason #123 85% Both fix #456 #124 71% Same code: 10 lines overlap
Detected by bepo
Action Inputs
| Input | Description | Default |
|---|---|---|
github-token |
GitHub token for API access | ${{ github.token }} |
threshold |
Similarity threshold (0.0-1.0) | 0.65 |
limit |
Max PRs to compare against | 50 |
comment |
Post comment on PR | true |
Action Outputs
| Output | Description |
|---|---|
has_duplicates |
true if duplicates found |
match_count |
Number of matches |
matches |
JSON array of matches |
Why This Works
Duplicates share obvious signals:
- Same code = Identical changes (639 shared lines caught SoundChain duplicates)
- Same issue ref = Same bug report (#19843 appeared in 4 Matrix PRs)
- Same files = Same bug location (100% overlap for Feishu cluster)
IDF weighting makes rare lines matter more than common boilerplate. A shared startupGraceMs = 5000 is a stronger signal than a shared return null.
Code overlap and issue refs catch most duplicates. Simple works.
Origin Story
This tool was vibe-coded in a single session with Claude.
We tried a few approaches and kept finding that simpler signals outperformed fancier ones. File overlap and issue refs catch most duplicates. Sometimes the obvious solution is the right one.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file bepo-1.0.0.tar.gz.
File metadata
- Download URL: bepo-1.0.0.tar.gz
- Upload date:
- Size: 29.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6c3bff486d79ae28e5767ef9ea8f71aeb5d47c6b24ca4b89cf88eeaacceb6de1
|
|
| MD5 |
352e908ac9ab0ee11896e42e195a38c4
|
|
| BLAKE2b-256 |
3ba05a44097d4b88484fe1a355b8515474f671919d3d828f634d2c1ff29d73ae
|
File details
Details for the file bepo-1.0.0-py3-none-any.whl.
File metadata
- Download URL: bepo-1.0.0-py3-none-any.whl
- Upload date:
- Size: 23.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c7336b827a921a4b7076ef149dcbf5f9e2e7252863e6335ff63a7e6915513c43
|
|
| MD5 |
b7ba56efc609cb75342106a6d6daa6aa
|
|
| BLAKE2b-256 |
6a5ff640732a5ef4c267e78aa6511040015888fbef85ce69e2c033a2787202b5
|