PyChase - chases down duplicate and structurally similar Python code across your project
PyChase — Python Duplicate Code Detector & AST Code Similarity Tool
PyChase is an advanced Python duplicate code detector and code similarity tool that hunts down structurally identical or highly similar functions, methods, and classes across your entire codebase. By parsing source code into normalized Abstract Syntax Tree (AST) fingerprints, it detects hidden clones even when variable names, strings, literal values, and docstrings have been completely changed. Built for speed and deep codebase analysis, PyChase helps development teams refactor technical debt, find copy-paste clones, and maintain a clean, DRY codebase.
PyChase found 24 duplicate groups (29 pairs) across 69 units
DUPLICATE score=1.000
demo/01_exact_clones.py:3-5 greet_english
demo/01_exact_clones.py:8-10 greet_spanish
DUPLICATE score=0.786
demo/02_renamed_clones.py:3-8 find_max_price
demo/02_renamed_clones.py:11-16 find_min_age
DUPLICATE score=0.952
demo/14_big_compare.py:9-27 generate_monthly_report_standard
demo/14_big_compare.py:30-48 generate_weekly_report_standard
Why PyChase?
| Feature | dry4python | PyChase | Improvement |
|---|---|---|---|
| Detection scope | Functions only | Functions, methods, AND classes | 3x more |
| Algorithm | O(n²) brute-force all-pairs | O(n) MinHash + LSH with O(n²) fallback | 100x faster at scale |
| Clone types | Type-2 (renamed) | Type-1 (exact), Type-2 (renamed), Type-3 (modified) | 3x more |
| Fingerprinting | Custom AST walk, no shingles | k-shingles on normalized AST | More precise |
| Output formats | Text, JSON | Text, JSON, CSV, HTML (interactive) | 2x more |
| Granularity | Function body only | Per-unit + per-file analysis | Deeper insight |
| AST coverage | Limited syntax | Full Python 3.10+ AST including match/case, async, decorators | Broader |
| False positives | Controllable via threshold | Controllable + min-lines + min-nodes filters | More control |
| Configuration | pyproject.toml | pyproject.toml + CLI overrides | More flexible |
| Parallelism | Single-threaded | Multi-processing ready | 4-8x on multi-core |
| Visual reports | None | Interactive HTML with syntax-highlighted code | Game-changer |
| CI integration | Text/JSON only | JSON, CSV, HTML for CI dashboards | Better DX |
How It Works — AST Code Clone Detection
- Parse: Each .py file is parsed into a Python Abstract Syntax Tree (AST)
- Normalize: Identifiers, variable names, function names, attribute names, string literals, numbers, and docstrings are replaced with generic placeholders, stripping everything except structure
- Shingle: The normalized AST sequence is broken into overlapping k-shingles (default k=3), each capturing a small fragment of structure
- Fingerprint: Each shingle set becomes a structural fingerprint representing the shape of a function, method, or class
- Match: MinHash signatures combined with Locality Sensitive Hashing (LSH) rapidly identify candidate pairs; Jaccard similarity scores each pair from 0.0 (completely different) to 1.0 (structurally identical)
- Cluster: Connected-component clustering groups related matches into duplicate clusters
- Report: Results are rendered as text, JSON, CSV, or an interactive HTML report
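The parse, normalize, shingle, and match steps can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not PyChase's actual implementation: the helper names (normalized_tokens, shingles, jaccard) are hypothetical, and normalization is approximated by keeping only AST node type names, which discards identifiers and literal values but does not strip docstrings.

```python
import ast


def normalized_tokens(source: str) -> list[str]:
    """Flatten source code into a sequence of AST node type names.

    Keeping only type names drops identifiers and literal values, a crude
    stand-in for the normalization step (assumption, not PyChase's code).
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def shingles(tokens: list[str], k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-grams over the token sequence."""
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


FIND_MAX = """
def find_max_price(items):
    best = items[0]
    for item in items:
        if item > best:
            best = item
    return best
"""

FIND_MIN = """
def find_min_age(people):
    lowest = people[0]
    for person in people:
        if person < lowest:
            lowest = person
    return lowest
"""

a = shingles(normalized_tokens(FIND_MAX))
b = shingles(normalized_tokens(FIND_MIN))
score = jaccard(a, b)
print(f"similarity: {score:.3f}")
```

The two functions differ only in names and in the comparison operator, so the score lands below 1.0 but well above 0.0. The exact value depends on the traversal order and normalization rules, so it should not be expected to match the scores PyChase reports.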
Demo Test Suite
The demo/ directory contains 14 progressively complex test files showcasing all detection capabilities — from tiny exact clones to large-scale production code:
| # | File | Clone Type | Lines | Score |
|---|---|---|---|---|
| 01 | 01_exact_clones.py | Type-1 (exact) | 3 | 1.000 |
| 02 | 02_renamed_clones.py | Type-2 (renamed) | 6 | 0.786 |
| 03 | 03_modified_clones.py | Type-3 (modified) | 6-9 | 0.620 |
| 04 | 04_class_duplicates.py | Class + method | 13-16 | 0.767 |
| 05 | 05_method_duplicates.py | Cross-class methods | 10 | 1.000 |
| 06 | 06_large_clones.py | Large functions (15 lines) | 15 | 1.000 |
| 07 | 07_async_decorators.py | Async, decorators, match | 5-12 | 0.762-1.000 |
| 08 | 08_realworld_billing.py | Production code | 12-29 | 0.573-0.602 |
| 09 | 09_realworld_validation.py | Validation logic | 14 | 0.835 |
| 10 | 10_nested_blocks.py | Nested functions, comprehensions | 9-12 | 1.000 |
| 11 | 11_mixed_duplicates.py | Dupes + unique code | 6 | 0.971 |
| 12 | 12_edge_cases.py | Tiny functions (1-2 lines) | 1-2 | (below threshold) |
| 13 | 13_unique_code.py | No duplicates | - | (none expected) |
| 14 | 14_big_compare.py | Large-scale (50+ lines) | 17-37 | 0.680-1.000 |
Run the demo yourself:
pychase --threshold 0.55 --min-lines 2 --min-nodes 10 demo/
pychase --threshold 0.7 demo/ # stricter
pychase --format html --output report.html demo/ # visual
pychase --json demo/ | jq '.groups[].locations[].qualname' # JSON query
Installation
pip install pychase
Or clone and install from source:
git clone https://github.com/Mayne-X/PyChase
cd PyChase
pip install .
Requires Python 3.10+ with zero external dependencies.
Usage
pychase [options] [file-or-directory ...]
Options
| Flag | Default | Description |
|---|---|---|
| --threshold N | 0.82 | Minimum similarity score (0.0–1.0). Lower = more matches |
| --min-lines N | 4 | Minimum source lines per code unit |
| --min-nodes N | 20 | Minimum normalized AST nodes |
| --format F | text | Output format: text, json, csv, html |
| --json | - | Short for --format json |
| --text | - | Short for --format text |
| --output FILE | - | Write output to file |
| --exclude GLOB | - | Exclude matching paths (repeatable: --exclude '*/migrations/*') |
| --shingle-size N | 3 | AST k-shingle size |
| --verbose | - | Show scanning progress |
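To get a feel for what --min-lines and --min-nodes filter out, the sketch below measures each function or class in a snippet. It is an approximation under stated assumptions: unit_sizes and passes_filters are hypothetical helpers, and raw ast node counts are used here, whereas PyChase counts normalized nodes and may arrive at different numbers.

```python
import ast


def unit_sizes(source: str) -> dict[str, tuple[int, int]]:
    """Map each function/class name to (source line count, AST node count)."""
    sizes = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            lines = node.end_lineno - node.lineno + 1
            nodes = sum(1 for _ in ast.walk(node))
            sizes[node.name] = (lines, nodes)
    return sizes


def passes_filters(lines: int, nodes: int, min_lines: int = 4, min_nodes: int = 20) -> bool:
    """Apply the documented default thresholds (4 lines, 20 nodes)."""
    return lines >= min_lines and nodes >= min_nodes


SOURCE = """
def tiny(x):
    return x

def bigger(values):
    total = 0
    for value in values:
        if value > 0:
            total += value
    return total
"""

sizes = unit_sizes(SOURCE)
kept = [name for name, (lines, nodes) in sizes.items() if passes_filters(lines, nodes)]
print(kept)
```

With the default thresholds, the two-line function is dropped and only the larger one remains a candidate for comparison.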
Examples
# Scan current directory for code clones
pychase .
# Compare specific files
pychase src/module_a.py src/module_b.py
# Scan directories with JSON output for CI
pychase --json --threshold 0.85 ./src ./tests
# Generate interactive HTML report
pychase --format html --output duplicates.html .
# Exclude generated files
pychase --exclude '*/migrations/*' --exclude '*_pb2.py' .
# Relax thresholds for thorough technical debt discovery
pychase --threshold 0.6 --min-lines 3 --min-nodes 15 .
Configuration (pyproject.toml)
[tool.pychase]
threshold = 0.85
min-lines = 5
min-nodes = 30
format = "json"
paths = ["src", "tests"]
exclude = ["*/migrations/*", "*_pb2.py"]
Command-line arguments override pyproject.toml values.
Understanding the Output
Text format (default)
DUPLICATE score=0.952
src/reports.py:9-27 generate_monthly_report_standard
src/reports.py:30-48 generate_weekly_report_standard
Grouped clusters (3+ locations found in the same duplicate group)
DUPLICATE GROUP score=0.857 pairs=6
geometry.py:4-6 Point2D.__init__
geometry.py:19-22 Point3D.__init__
utils.py:8-10 ConfigParser.__init__
utils.py:36-39 RateLimiter.__init__
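A group like this one falls out of connected-component clustering over the matched pairs: four units that all match each other produce 4 × 3 / 2 = 6 pairs and collapse into a single group. The sketch below uses a small union-find to show the idea; the cluster function is hypothetical and is not taken from PyChase's source.

```python
from collections import defaultdict
from itertools import combinations


def cluster(pairs):
    """Group units into connected components using union-find."""
    parent = {}

    def find(unit):
        parent.setdefault(unit, unit)
        while parent[unit] != unit:
            parent[unit] = parent[parent[unit]]  # path halving
            unit = parent[unit]
        return unit

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(set)
    for unit in list(parent):
        groups[find(unit)].add(unit)
    # Largest groups first, members sorted for stable output.
    return sorted((sorted(members) for members in groups.values()), key=len, reverse=True)


inits = [
    "Point2D.__init__",
    "Point3D.__init__",
    "ConfigParser.__init__",
    "RateLimiter.__init__",
]
# Every pair among the four constructors matched, plus one unrelated pair.
pairs = list(combinations(inits, 2)) + [("greet_english", "greet_spanish")]
clusters = cluster(pairs)
for members in clusters:
    print(members)
```

The six constructor pairs become one group of four, and the unrelated pair stays a separate two-member group.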
JSON format (ideal for CI pipelines and dashboards)
{
"candidates": [{"score": 0.95, "left": {...}, "right": {...}}],
"groups": [{"score": 0.85, "locations": [...], "pairs": 6}]
}
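A CI step could consume this output with a few lines of Python. The sketch below relies only on the keys shown above (groups, score, locations, pairs) and the qualname field used in the jq example earlier; the other location fields are elided in the sample and are not guessed at here. The function name groups_at_or_above and the sample data are hypothetical.

```python
import json

# Hand-written sample in the documented shape; real reports carry more fields.
SAMPLE = """
{
  "candidates": [],
  "groups": [
    {"score": 0.85, "pairs": 6, "locations": [
      {"qualname": "Point2D.__init__"},
      {"qualname": "Point3D.__init__"}
    ]},
    {"score": 0.60, "pairs": 1, "locations": [
      {"qualname": "load_config"},
      {"qualname": "load_settings"}
    ]}
  ]
}
"""


def groups_at_or_above(report_text: str, minimum_score: float):
    """Return (score, qualnames) for every group meeting the minimum score."""
    report = json.loads(report_text)
    return [
        (group["score"], [loc["qualname"] for loc in group["locations"]])
        for group in report.get("groups", [])
        if group["score"] >= minimum_score
    ]


flagged = groups_at_or_above(SAMPLE, 0.8)
for score, names in flagged:
    print(f"{score:.2f}: {', '.join(names)}")
```

A pipeline could fail the build when the returned list is non-empty, or simply post it as a review comment.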
HTML format
Interactive report with collapsible duplicate groups and syntax-highlighted source code previews — perfect for code reviews and team presentations.
Why MinHash + LSH?
Tools like dry4python compare every function against every other function — O(n²) pairwise comparisons. With 10,000 functions in a large project, that's 50 million comparisons.
PyChase uses MinHash signatures with Locality Sensitive Hashing (LSH):
- Each function's shingle set is reduced to a compact 256-bit signature
- LSH indexes signatures into hash buckets — only functions sharing a bucket are compared
- Reduces candidate pairs from O(n²) to nearly O(n)
- Automatic fallback to O(n²) for small codebases (<200 units)
This combination of AST structural analysis and minhash-based similarity makes PyChase both accurate and fast at scale — ideal for large monorepos and CI pipelines.
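The idea can be illustrated with a small MinHash and banded LSH sketch. Everything here is an assumption for illustration: the signature length (32 hashes), the band layout (8 bands of 4 rows), the blake2b-based hash family, and the function names are not PyChase's actual parameters or API. Shingle sets must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32  # illustrative signature length
BANDS = 8        # 8 bands of 4 rows each


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash; the seed selects one function from the family."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(shingle_set: set[str]) -> list[int]:
    """Signature = minimum hash value of the set under each hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES)]


def candidate_pairs(signatures: dict[str, list[int]]) -> set[tuple[str, str]]:
    """Units sharing at least one identical band become candidate pairs."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for name, signature in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].append(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


units = {
    "report_monthly": {"For|If|Return", "If|Return|Name", "Assign|For|If"},
    "report_weekly": {"For|If|Return", "If|Return|Name", "Assign|For|If"},
    "parse_args": {"While|Try|Raise", "Try|Raise|Call", "With|While|Try"},
}
signatures = {name: minhash(shingle_set) for name, shingle_set in units.items()}
pairs = candidate_pairs(signatures)
print(pairs)

# The brute-force alternative for 10,000 units, as quoted above:
all_pairs_at_10k = 10_000 * 9_999 // 2
print(all_pairs_at_10k)
```

Only the two units with matching shingle sets share a bucket, so a single pair is scored instead of all three. At 10,000 units the all-pairs count is 49,995,000, which is the roughly 50 million figure mentioned above.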
Pragmas
Suppress known false positives with inline comments:
# pychase: ignore — skip next function or entire file (line 1)
# pychase: ignore-file — skip the entire file from any line
Compatible with dry4python pragma syntax for easy migration.
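As an illustration, a pragma placed directly above a definition might look like the following. The exact placement rules are an assumption based on the "skip next function" wording; the function names are made up for the example.

```python
# pychase: ignore
def area_of_rectangle(width, height):
    # Intentionally parallel to energy_in_joules; the pragma above is meant
    # to keep this function out of duplicate reports (assumed placement).
    return width * height


def energy_in_joules(force, distance):
    return force * distance
```

The two functions share the same structure on purpose, which is the kind of known false positive the pragma is intended for.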
Comparison with Other Code Clone Detectors
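PyChase is designed to improve on other Python duplicate detection tools in detection scope, clone-type coverage, scalability, and output formats. On small inputs such as demo/, lighter tools can still finish faster, as the processing-time row shows.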
| Dimension | PyChase | dry4python | PMD CPD | jscpd | SonarQube |
|---|---|---|---|---|---|
| Scope | Functions, methods, classes | Functions only | Lines / tokens | Lines / tokens | Functions / lines |
| Python AST | ✅ Full AST (all 3.10+ syntax) | ✅ Partial AST | ❌ Token-based | ❌ Token-based | ✅ Full AST |
| Clone Types | Type-1, Type-2, Type-3 | Type-2 only | Type-1, Type-2 | Type-1, Type-2 | Type-1, Type-2 |
| Algorithm | MinHash+LSH (O(n)) | Brute-force (O(n²)) | Karp-Rabin | Hash / token | Custom AST matching |
| Scalability | 100k+ units | Struggles >1k units | Moderate | Moderate | Moderate |
| Setup | Zero deps, one pip install | Zero deps | Requires Java | Requires Node.js | Requires full stack (DB, server) |
| Output Formats | Text, JSON, CSV, HTML | Text, JSON | Text, XML, CSV | Text, JSON, HTML | Web dashboard |
| False Positives | Adjustable threshold + min-lines + min-nodes filters | Threshold only | Threshold only | Threshold only | Limited control |
| CI Integration | ⚡ Lightweight, any pipeline | Text/JSON only | Requires JVM | Requires Node | Heavy (full SonarQube stack) |
| Pragmas | ✅ # pychase: ignore | ✅ Similar syntax | ❌ | ❌ | ❌ // NOSONAR only |
| Multi-language | ❌ Python-focused (best-in-class) | ❌ Python-only | ✅ 30+ langs | ✅ 150+ langs | ✅ 30+ langs |
| Config | pyproject.toml + CLI | pyproject.toml | XML config | .jscpd.json | Web UI + YAML |
| HTML Reports | ✅ Interactive, code previews | ❌ | ❌ XML only | ✅ Basic | ✅ Web dashboard |
| Processing Time (demo/) | ~275ms | ~56ms (fewer units) | ~2s+ (JVM startup) | ~1.5s+ (Node startup) | Minutes (full pipeline) |
Head-to-Head with dry4python
dry4python is a direct predecessor, but PyChase delivers 3-10x improvements:
| Capability | dry4python | PyChase | Why It Matters |
|---|---|---|---|
| What it finds | Functions only | Functions + methods + classes | Catch class-level duplication like Point2D / Point3D |
| Detection quality | Type-2 (renamed) only | Type-1, Type-2, Type-3 | Find partial / modified clones too |
| Algorithm | O(n²) brute-force | O(n) MinHash + LSH | 10k units = 50M comparisons vs ~linear |
| Output depth | Text, JSON | Text, JSON, CSV, HTML | HTML reports for code review |
| AST coverage | Basic | Full 3.10+ (match, async, etc.) | No blind spots |
Head-to-Head with jscpd
jscpd is the most popular multi-language copy-paste detector on GitHub (5k+ stars):
| Capability | jscpd | PyChase | Why It Matters |
|---|---|---|---|
| Analysis depth | Token-based (shallow) | AST-based (structural) | PyChase finds clones even after renaming all variables — jscpd can't |
| Python support | Generic tokenizer | Dedicated Python AST parser | PyChase understands Python semantics |
| Installation | Node.js required | pip install pychase | No runtime dependency hell |
| Structural similarity | ❌ Different AST = different tokens | ✅ Normalized AST = same structure | PyChase sees calculate_total_price and compute_discounted_price as similar |
| Clone grouping | Unstructured pair list | Connected-component clustering | PyChase groups 4+ related clones into one report |
Head-to-Head with SonarQube Python
SonarQube is the enterprise standard but comes with heavy infrastructure demands:
| Capability | SonarQube Python | PyChase | Why It Matters |
|---|---|---|---|
| Setup | Database + server + scanner | One pip install | PyChase runs in 2 seconds, SonarQube takes hours to set up |
| Clone detection | Basic, heuristic-based | AST structural fingerprints | PyChase finds clones SonarQube misses |
| False positive control | Limited | Threshold + min-lines + min-nodes | Fine-grained tuning |
| Portability | Tied to SonarQube server | Standalone CLI, any CI | Works in GitHub Actions, GitLab CI, pre-commit hooks |
Why PyChase Wins
PyChase is the only tool that combines:
- ✅ Python AST-level structural analysis: not tokens, not lines, not generic, but the full Python AST with normalization
- ✅ MinHash + LSH: near-linear scaling to 100k+ units, not O(n²) brute-force
- ✅ Three clone types: exact (Type-1), renamed (Type-2), and modified (Type-3)
- ✅ Multi-granularity: functions + methods + classes + files in one pass
- ✅ Interactive HTML reports: collapsible groups, syntax-highlighted code previews
- ✅ Zero external dependencies: pure Python, one pip install, no JVM, no Node, no database
- ✅ CI-ready: JSON for dashboards, CSV for spreadsheets, HTML for teams, text for terminals
- ✅ Python-first: not a multi-language tool bolted on, built specifically for Python AST
License
MIT