Detect duplicate PRs in GitHub repos
bepo
Detect duplicate pull requests in GitHub repos.
No ML, no embeddings, no API keys. Just static analysis of diffs.
A maintainer with 100 open PRs can run bepo check --repo foo/bar and in 5 minutes get a ranked list of "you should look at these pairs." That saves hours of manual review.
The Problem
Large repos waste engineering time on duplicate PRs. When multiple contributors fix the same bug independently, only one PR gets merged — the rest is wasted effort.
This actually happens. We analyzed 100 PRs from OpenClaw and found:
| Cluster | PRs | What happened |
|---|---|---|
| Matrix startup bug | 4 PRs | 4 engineers independently fixed startupGraceMs = 0 → 5000 |
| Media token regex | 2 PRs | Identical fix submitted twice |
| Feishu bitable config | 2 PRs | Same multi-account config fix |
8 duplicate PRs across 3 bug fixes. That's real engineering time wasted.
Proof: OpenClaw Analysis
We ran bepo on OpenClaw's open PRs. Here's what it found:
$ bepo check --repo openclaw/openclaw --limit 100
#20025 <-> #19973
Similarity: 86%
Reason: Both fix #19843 ← Same issue!
#19868 <-> #19855
Similarity: 81%
Reason: Same files: parse.ts, pi-embedded-subscribe.tools.ts
#19871 <-> #19853
Similarity: 100%
Reason: Same files: bitable.ts, config-schema.ts, tools-config.ts
Verified manually:
| PR Pair | Similarity | Verdict |
|---|---|---|
| #20025 ↔ #19973 (Matrix) | 86% | ✅ TRUE DUPLICATE — both change startupGraceMs from 0 to 5000 |
| #19868 ↔ #19855 (regex) | 81% | ✅ TRUE DUPLICATE — identical PR titles |
| #19871 ↔ #19853 (Feishu) | 100% | ✅ TRUE DUPLICATE — same files, same fix |
| #19996 ↔ #19993 (unrelated) | 20% | ✅ Correctly NOT flagged |
Precision: 80% (4/5 flagged clusters were true duplicates)
More Examples
VSCode — Found PRs touching same files for same feature:
#295823 <-> #295822
Similarity: 77%
Reason: Same files: chatModel.ts, chatForkActions.ts
Both: "Use metadata flag for fork detection"
Next.js — Found related test updates:
#90121 <-> #90120
Similarity: 86%
Reason: Same files: test/
Install
pip install bepo
Requires GitHub CLI (gh) to be installed and authenticated.
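Since bepo depends on an authenticated gh, a quick pre-flight check can save a confusing failure later. The helper below is a minimal sketch and not part of bepo: the function name gh_is_ready is hypothetical, and it assumes that `gh auth status` exits with a non-zero code when no account is logged in.

```python
import shutil
import subprocess


def gh_is_ready(executable: str = "gh") -> bool:
    """Return True if the GitHub CLI is on PATH and reports a logged-in account.

    Hypothetical helper for illustration; bepo itself may check this differently.
    """
    path = shutil.which(executable)
    if path is None:
        # The CLI is not installed (or not on PATH).
        return False
    # Assumption: `gh auth status` exits non-zero when not authenticated.
    result = subprocess.run(
        [path, "auth", "status"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling gh_is_ready() before bepo check lets a script print a clear "run gh auth login first" message instead of failing partway through.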
Usage
# Check a repo for duplicate PRs
bepo check --repo owner/repo
# Adjust sensitivity (default: 0.4, higher = stricter)
bepo check --repo owner/repo --threshold 0.5
# Check more PRs
bepo check --repo owner/repo --limit 100
# JSON output for CI
bepo check --repo owner/repo --json
How It Works
bepo fingerprints each PR by extracting:
| Signal | Weight | What it catches |
|---|---|---|
| Same issue ref (#123) | 10.0 | Definite duplicate |
| Same code changes | 8.0 | Identical lines added/removed |
| Same files touched | 6.0 | PRs modifying same code |
| Same feature domain | 3.0 | auth, messaging, database, etc. |
| Same imports | 1.0 | Similar dependencies |
Then computes pairwise Jaccard similarity.
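The README does not spell out how the weights combine with the Jaccard scores, so the following is only one plausible reading: a weighted mean of per-signal Jaccard similarities, skipping signals that are empty on both sides. The weights come from the table above; the dictionary keys and the function names jaccard and weighted_similarity are assumptions made for this sketch, not bepo's actual internals.

```python
from typing import Dict, Set

# Weights copied from the signal table; the key names are hypothetical.
WEIGHTS: Dict[str, float] = {
    "issues": 10.0,
    "code": 8.0,
    "files": 6.0,
    "domains": 3.0,
    "imports": 1.0,
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def weighted_similarity(fp_a: Dict[str, Set[str]], fp_b: Dict[str, Set[str]]) -> float:
    """Weighted mean of per-signal Jaccard scores.

    Assumption: a signal that is empty in both fingerprints carries no
    information, so it is left out of the average entirely.
    """
    total = 0.0
    weight_sum = 0.0
    for signal, weight in WEIGHTS.items():
        a = fp_a.get(signal, set())
        b = fp_b.get(signal, set())
        if not a and not b:
            continue
        total += weight * jaccard(a, b)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

Under this scheme, two PRs that reference the same issue and share one of three touched files score (10 × 1 + 6 × 1/3) / 16 = 0.75, which illustrates why a shared issue ref dominates the result.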
That's it. No embeddings, no LLM calls. Just:
- Parse +++ b/path from diffs
- Regex for #\d+ issue refs
- Compare actual code changes
- Set intersection for similarity
~300 lines of Python.
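The extraction steps listed above can be sketched in a few lines. This is an illustration under stated assumptions rather than bepo's source: the function names touched_files, issue_refs, and changed_lines are hypothetical, the sample diff and its path src/matrix.ts are made up, and the line filter is deliberately simple (a removed line whose content itself begins with "--" would be mistaken for a file header).

```python
import re
from typing import Set

# "+++ b/<path>" marks the new-side file path in a unified diff.
FILE_HEADER = re.compile(r"^\+\+\+ b/(.+)$", re.MULTILINE)
# Issue references such as "#123".
ISSUE_REF = re.compile(r"#\d+")

# Made-up diff for illustration only.
SAMPLE_DIFF = """\
diff --git a/src/matrix.ts b/src/matrix.ts
--- a/src/matrix.ts
+++ b/src/matrix.ts
@@ -1,1 +1,1 @@
-const startupGraceMs = 0;
+const startupGraceMs = 5000;
"""


def touched_files(diff: str) -> Set[str]:
    """Collect every path named in a '+++ b/...' header."""
    return set(FILE_HEADER.findall(diff))


def issue_refs(text: str) -> Set[str]:
    """Collect issue references from a PR title or body."""
    return set(ISSUE_REF.findall(text))


def changed_lines(diff: str) -> Set[str]:
    """Collect added and removed lines, keeping the +/- sign and trimming whitespace."""
    lines: Set[str] = set()
    for line in diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not content (simplification, see lead-in)
        if line.startswith(("+", "-")):
            content = line[1:].strip()
            if content:
                lines.add(line[0] + content)
    return lines
```

Each function returns a plain set, so comparing two PRs reduces to set intersection and union on the results.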
As a Library
from bepo import fingerprint_pr, find_duplicates
# Fingerprint PRs
fp1 = fingerprint_pr("#123", diff1, title="Fix auth", body="Fixes #456")
fp2 = fingerprint_pr("#124", diff2, title="Auth fix", body="Fixes #456")
# Find duplicates
dups = find_duplicates([fp1, fp2], threshold=0.4)
for d in dups:
    print(f"{d.pr_a} ↔ {d.pr_b}: {d.similarity:.0%}")
    print(f"  Shared issues: {d.shared_issues}")
    print(f"  Shared files: {d.shared_files}")
GitHub Action
name: PR Duplicate Check
on: [pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install bepo
      - run: bepo check --repo ${{ github.repository }} --json
        env:
          GH_TOKEN: ${{ github.token }}
Why This Works
Duplicates share obvious signals:
- Same code = Identical changes (639 shared lines caught SoundChain duplicates)
- Same issue ref = Same bug report (#19843 appeared in 4 Matrix PRs)
- Same files = Same bug location (100% overlap for Feishu cluster)
Code overlap and issue refs catch most duplicates. Simple works.
Origin Story
This tool was vibe-coded in a single session with Claude.
We tried a few approaches and kept finding that simpler signals outperformed fancier ones. File overlap and issue refs catch most duplicates. Sometimes the obvious solution is the right one.
License
MIT