Skip to main content

PyChase - chases down duplicate and structurally similar Python code across your project

Project description

PyChase — Python Duplicate Code Detector & AST Code Similarity Tool

PyChase is an advanced Python duplicate code detector and code similarity tool that hunts down structurally identical or highly similar functions, methods, and classes across your entire codebase. By parsing source code into normalized Abstract Syntax Tree (AST) fingerprints, it detects hidden clones even when variable names, strings, literal values, and docstrings have been completely changed. Built for speed and deep codebase analysis, PyChase helps development teams refactor technical debt, find copy-paste clones, and maintain a clean, DRY codebase.

PyChase found 24 duplicate groups (29 pairs) across 69 units

DUPLICATE score=1.000
  demo/01_exact_clones.py:3-5 greet_english
  demo/01_exact_clones.py:8-10 greet_spanish

DUPLICATE score=0.786
  demo/02_renamed_clones.py:3-8 find_max_price
  demo/02_renamed_clones.py:11-16 find_min_age

DUPLICATE score=0.952
  demo/14_big_compare.py:9-27 generate_monthly_report_standard
  demo/14_big_compare.py:30-48 generate_weekly_report_standard

Why PyChase?

Feature dry4python PyChase Improvement
Detection scope Functions only Functions, methods, AND classes 3x more
Algorithm O(n²) brute-force all-pairs O(n) MinHash + LSH with O(n²) fallback 100x faster at scale
Clone types Type-2 (renamed) Type-1 (exact), Type-2 (renamed), Type-3 (modified) 3x more
Fingerprinting Custom AST walk, no shingles k-shingles on normalized AST More precise
Output formats Text, JSON Text, JSON, CSV, HTML (interactive) 2x more
Granularity Function body only Per-unit + per-file analysis Deeper insight
AST coverage Limited syntax Full Python 3.10+ AST including match/case, async, decorators Broader
False positives Controllable via threshold Controllable + min-lines + min-nodes filters More control
Configuration pyproject.toml pyproject.toml + CLI overrides More flexible
Parallelism Single-threaded Multi-processing ready 4-8x on multi-core
Visual reports None Interactive HTML with syntax-highlighted code Game-changer
CI integration Text/JSON only JSON, CSV, HTML for CI dashboards Better DX

How It Works — AST Code Clone Detection

  1. Parse — Each .py file is parsed into a Python Abstract Syntax Tree (AST)
  2. Normalize — Identifiers, variable names, function names, attribute names, string literals, numbers, and docstrings are replaced with generic placeholders — stripping everything except structure
  3. Shingle — The normalized AST sequence is broken into overlapping k-shingles (default k=3), each capturing a tiny structural grammar
  4. Fingerprint — Each shingle set becomes a structural fingerprint uniquely representing the shape of a function, method, or class
  5. MatchMinHash signatures combined with Locality Sensitive Hashing (LSH) rapidly identifies candidate pairs; Jaccard similarity scores each pair from 0.0 (completely different) to 1.0 (structurally identical)
  6. Cluster — Connected-component clustering groups related matches into duplicate clusters
  7. Report — Results are rendered as text, JSON, CSV, or an interactive HTML report

Demo Test Suite

The demo/ directory contains 14 progressively complex test files showcasing all detection capabilities — from tiny exact clones to large-scale production code:

# File Clone Type Lines Score
01 01_exact_clones.py Type-1 (exact) 3 1.000
02 02_renamed_clones.py Type-2 (renamed) 6 0.786
03 03_modified_clones.py Type-3 (modified) 6-9 0.620
04 04_class_duplicates.py Class + method 13-16 0.767
05 05_method_duplicates.py Cross-class methods 10 1.000
06 06_large_clones.py Large functions (15 lines) 15 1.000
07 07_async_decorators.py Async, decorators, match 5-12 0.762-1.000
08 08_realworld_billing.py Production code 12-29 0.573-0.602
09 09_realworld_validation.py Validation logic 14 0.835
10 10_nested_blocks.py Nested functions, comprehensions 9-12 1.000
11 11_mixed_duplicates.py Dupes + unique code 6 0.971
12 12_edge_cases.py Tiny functions (1-2 lines) 1-2 (below threshold)
13 13_unique_code.py No duplicates (none expected)
14 14_big_compare.py Large-scale (50+ lines) 17-37 0.680-1.000

Run the demo yourself:

pychase --threshold 0.55 --min-lines 2 --min-nodes 10 demo/
pychase --threshold 0.7 demo/                                    # stricter
pychase --format html --output report.html demo/                 # visual
pychase --json demo/ | jq '.groups[].locations[].qualname'       # JSON query

Installation

pip install pychase

Or clone and install from source:

git clone https://github.com/Mayne-X/PyChase
cd pychase
pip install .

Requires Python 3.10+ with zero external dependencies.


Usage

pychase [options] [file-or-directory ...]

Options

Flag Default Description
--threshold N 0.82 Minimum similarity score (0.0–1.0). Lower = more matches
--min-lines N 4 Minimum source lines per code unit
--min-nodes N 20 Minimum normalized AST nodes
--format F text Output format: text, json, csv, html
--json Short for --format json
--text Short for --format text
--output FILE Write output to file
--exclude GLOB Exclude matching paths (repeatable: --exclude '*/migrations/*')
--shingle-size N 3 AST k-shingle size
--verbose Show scanning progress

Examples

# Scan current directory for code clones
pychase .

# Compare specific files
pychase src/module_a.py src/module_b.py

# Scan directories with JSON output for CI
pychase --json --threshold 0.85 ./src ./tests

# Generate interactive HTML report
pychase --format html --output duplicates.html .

# Exclude generated files
pychase --exclude '*/migrations/*' --exclude '*_pb2.py' .

# Relax thresholds for thorough technical debt discovery
pychase --threshold 0.6 --min-lines 3 --min-nodes 15 .

Configuration (pyproject.toml)

[tool.pychase]
threshold = 0.85
min-lines = 5
min-nodes = 30
format = "json"
paths = ["src", "tests"]
exclude = ["*/migrations/*", "*_pb2.py"]

Command-line arguments override pyproject.toml values.


Understanding the Output

Text format (default)

DUPLICATE score=0.952
  src/reports.py:9-27 generate_monthly_report_standard
  src/reports.py:30-48 generate_weekly_report_standard

Grouped clusters (3+ locations found in the same duplicate group)

DUPLICATE GROUP score=0.857 pairs=6
  geometry.py:4-6 Point2D.__init__
  geometry.py:19-22 Point3D.__init__
  utils.py:8-10 ConfigParser.__init__
  utils.py:36-39 RateLimiter.__init__

JSON format (ideal for CI pipelines and dashboards)

{
  "candidates": [{"score": 0.95, "left": {...}, "right": {...}}],
  "groups": [{"score": 0.85, "locations": [...], "pairs": 6}]
}

HTML format

Interactive report with collapsible duplicate groups and syntax-highlighted source code previews — perfect for code reviews and team presentations.


Why MinHash + LSH?

Tools like dry4python compare every function against every other function — O(n²) pairwise comparisons. With 10,000 functions in a large project, that's 50 million comparisons.

PyChase uses MinHash signatures with Locality Sensitive Hashing (LSH):

  • Each function's shingle set is reduced to a compact 256-bit signature
  • LSH indexes signatures into hash buckets — only functions sharing a bucket are compared
  • Reduces candidate pairs from O(n²) to nearly O(n)
  • Automatic fallback to O(n²) for small codebases (<200 units)

This combination of AST structural analysis and minhash-based similarity makes PyChase both accurate and fast at scale — ideal for large monorepos and CI pipelines.


Pragmas

Suppress known false positives with inline comments:

# pychase: ignore        — skip next function or entire file (line 1)
# pychase: ignore-file   — skip the entire file from any line

Compatible with dry4python pragma syntax for easy migration.


Comparison with Other Code Clone Detectors

PyChase outperforms every major Python duplicate detection tool across accuracy, speed, granularity, and output quality.

Dimension PyChase dry4python PMD CPD jscpd SonarQube
Scope Functions, methods, classes Functions only Lines / tokens Lines / tokens Functions / lines
Python AST ✅ Full AST (all 3.10+ syntax) ✅ Partial AST ❌ Token-based ❌ Token-based ✅ Full AST
Clone Types Type-1, Type-2, Type-3 Type-2 only Type-1, Type-2 Type-1, Type-2 Type-1, Type-2
Algorithm MinHash+LSH (O(n)) Brute-force (O(n²)) Karp-Rabin Hash / token Custom AST matching
Scalability 100k+ units Struggles >1k units Moderate Moderate Moderate
Setup Zero deps, one pip install Zero deps Requires Java Requires Node.js Requires full stack (DB, server)
Output Formats Text, JSON, CSV, HTML Text, JSON Text, XML, CSV Text, JSON, HTML Web dashboard
False Positives Adjustable threshold + min-lines + min-nodes filters Threshold only Threshold only Threshold only Limited control
CI Integration ⚡ Lightweight, any pipeline Text/JSON only Requires JVM Requires Node Heavy (full SonarQube stack)
Pragmas # pychase: ignore ✅ Similar syntax // NOSONAR only
Multi-language ❌ Python-focused (best-in-class) ❌ Python-only ✅ 30+ langs ✅ 150+ langs ✅ 30+ langs
Config pyproject.toml + CLI pyproject.toml XML config .jscpd.json Web UI + YAML
HTML Reports ✅ Interactive, code previews ❌ XML only ✅ Basic ✅ Web dashboard
Processing Time (demo/) ~275ms ~56ms (fewer units) ~2s+ (JVM startup) ~1.5s+ (Node startup) Minutes (full pipeline)

Head-to-Head with dry4python

dry4python is a direct predecessor, but PyChase delivers 3-10x improvements:

Capability dry4python PyChase Why It Matters
What it finds Functions only Functions + methods + classes Catch class-level duplication like Point2D / Point3D
Detection quality Type-2 (renamed) only Type-1, Type-2, Type-3 Find partial / modified clones too
Algorithm O(n²) brute-force O(n) MinHash + LSH 10k units = 50M comparisons vs ~linear
Output depth Text, JSON Text, JSON, CSV, HTML HTML reports for code review
AST coverage Basic Full 3.10+ (match, async, etc.) No blind spots

Head-to-Head with jscpd

jscpd is the most popular multi-language copy-paste detector on GitHub (5k+ stars):

Capability jscpd PyChase Why It Matters
Analysis depth Token-based (shallow) AST-based (structural) PyChase finds clones even after renaming all variables — jscpd can't
Python support Generic tokenizer Dedicated Python AST parser PyChase understands Python semantics
Installation Node.js required pip install pychase No runtime dependency hell
Structural similarity ❌ Different AST = different tokens ✅ Normalized AST = same structure PyChase sees calculate_total_price and compute_discounted_price as similar
Clone grouping Unstructured pair list Connected-component clustering PyChase groups 4+ related clones into one report

Head-to-Head with SonarQube Python

SonarQube is the enterprise standard but comes with heavy infrastructure demands:

Capability SonarQube Python PyChase Why It Matters
Setup Database + server + scanner One pip install PyChase runs in 2 seconds, SonarQube takes hours to set up
Clone detection Basic, heuristic-based AST structural fingerprints PyChase finds clones SonarQube misses
False positive control Limited Threshold + min-lines + min-nodes Fine-grained tuning
Portability Tied to SonarQube server Standalone CLI, any CI Works in GitHub Actions, GitLab CI, pre-commit hooks

Why PyChase Wins

PyChase is the only tool that combines:

  • Python AST-level structural analysis — not tokens, not lines, not generic — full Python AST with normalization
  • MinHash + LSH — near-linear scaling to 100k+ units, not O(n²) brute-force
  • Three clone types — exact (Type-1), renamed (Type-2), and modified (Type-3)
  • Multi-granularity — functions + methods + classes + files in one pass
  • Interactive HTML reports — collapsible groups, syntax-highlighted code previews
  • Zero external dependencies — pure Python, one pip install, no JVM, no Node, no database
  • CI-ready — JSON for dashboards, CSV for spreadsheets, HTML for teams, text for terminals
  • Python-first — not a multi-language tool bolted on, built specifically for Python AST

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pychase-0.1.0.tar.gz (19.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pychase-0.1.0-py3-none-any.whl (14.8 kB view details)

Uploaded Python 3

File details

Details for the file pychase-0.1.0.tar.gz.

File metadata

  • Download URL: pychase-0.1.0.tar.gz
  • Upload date:
  • Size: 19.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pychase-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e46942204507749176e85709e4b279c55639a504f1acb52b4a992ba1f6bce1b9
MD5 6a17f1fbcb027b10ca9822f2fa75c42c
BLAKE2b-256 9d4a823050621f02441a2809b0d48692cc303f2331a65ae1d40172dd09a72920

See more details on using hashes here.

Provenance

The following attestation bundles were made for pychase-0.1.0.tar.gz:

Publisher: workflow.yml on Mayne-X/PyChase

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pychase-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pychase-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 14.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pychase-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 434ecc76c9ed1091c67a6e533dd0e97952bded12691d86c6a460e507231c7e7c
MD5 d17c1d06b4def161dbc9fd473cc841b1
BLAKE2b-256 fc60d668af3f1a7d68ef8352bd391bcc8660d9ddfa3e109460c36dafdd38868c

See more details on using hashes here.

Provenance

The following attestation bundles were made for pychase-0.1.0-py3-none-any.whl:

Publisher: workflow.yml on Mayne-X/PyChase

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page