Cross-language duplicate code detector

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead
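To make "normalized identifiers/literals" concrete, here is a minimal sketch of Type-2 normalization. It is not PolyDup's implementation: PolyDup parses with Tree-sitter, while this sketch uses Python's standard `tokenize` module so it runs without extra dependencies. The `normalize` helper and the `$$NUM` and `$$STR` placeholder names are assumptions made for illustration.

```python
import io
import keyword
import tokenize


def normalize(source: str) -> list:
    """Return the token stream of `source` with names and literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keywords carry structure, so keep them; every other name becomes a placeholder.
            out.append(tok.string if keyword.iskeyword(tok.string) else "$$ID")
        elif tok.type == tokenize.NUMBER:
            out.append("$$NUM")  # hypothetical placeholder name
        elif tok.type == tokenize.STRING:
            out.append("$$STR")  # hypothetical placeholder name
        elif tok.type == tokenize.OP:
            out.append(tok.string)
        # Comments, newlines, and indentation tokens are dropped.
    return out


a = normalize("total = price * 2 + tax")
b = normalize("sum_ = cost * 10 + fee")
print(a == b)  # True: same structure, different names and literals
```

Two snippets that differ only in identifier names and literal values produce the same normalized token list, which is what lets a hash-based detector treat them as clones.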

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))

Concurrent Execution

PolyDup releases the GIL while scanning, so other Python threads keep running and several scans can proceed in parallel:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")
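With `as_completed`, results arrive in completion order, so the loop above cannot tell which project a report belongs to. One way to keep that association is to map each future back to its path. This is a sketch: `scan_many` is a hypothetical helper, and the scan function is passed in as a parameter (for example `polydup.find_duplicates`) so the helper itself has no dependency on the package.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_many(scan, paths, max_workers=4):
    """Run `scan([path])` for each path in a thread pool; return {path: result or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which path each future was submitted for.
        future_to_path = {executor.submit(scan, [p]): p for p in paths}
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except RuntimeError as exc:
                # Keep going if one project fails; store the error for the caller.
                results[path] = exc
    return results
```

Usage would look like `scan_many(polydup.find_duplicates, ['./project1', './project2'])`, after which each value is either a report or the `RuntimeError` raised for that path.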

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
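Since a failed scan raises `RuntimeError`, callers that scan many inputs may want to log the failure and continue. A minimal sketch follows; `scan_or_none` and the logger name are hypothetical, and the scan function is injected so the snippet runs without PolyDup installed.

```python
import logging
from typing import Any, Callable, Optional, Sequence

log = logging.getLogger("dupscan")  # hypothetical logger name


def scan_or_none(scan: Callable[..., Any], paths: Sequence[str], **options: Any) -> Optional[Any]:
    """Call `scan(paths, **options)`; log and return None if it raises RuntimeError."""
    try:
        return scan(list(paths), **options)
    except RuntimeError as exc:
        log.error("scan of %s failed: %s", list(paths), exc)
        return None
```

For example, `scan_or_none(polydup.find_duplicates, ['./src'], threshold=0.9)` returns the report on success and `None` on failure.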


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)
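The dictionary form is convenient for quick aggregation. The sketch below counts duplicates per file pair. It assumes each entry in `duplicates` carries the same field names as the `DuplicateMatch` attributes (`file1`, `file2`), which this README does not state explicitly. The `sample` value is a hand-written stand-in, not real PolyDup output.

```python
from collections import Counter


def count_by_file_pair(report_dict):
    """Return [((file_a, file_b), count), ...] sorted by count, highest first."""
    pairs = Counter()
    for dup in report_dict["duplicates"]:
        # Sort the two paths so (a, b) and (b, a) count as the same pair.
        pair = tuple(sorted((dup["file1"], dup["file2"])))
        pairs[pair] += 1
    return pairs.most_common()


# Hand-written stand-in shaped like the documented keys (assumed entry fields).
sample = {
    "files_scanned": 3,
    "functions_analyzed": 12,
    "duplicates": [
        {"file1": "a.py", "file2": "b.py"},
        {"file1": "b.py", "file2": "a.py"},
        {"file1": "c.py", "file2": "a.py"},
    ],
    "stats": {},
}
print(count_by_file_pair(sample))
```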

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration

Benchmark Example

import polydup
import time

# perf_counter() is a monotonic, high-resolution clock intended for timing
start = time.perf_counter()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
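As a rough illustration of the rolling-hash step, the sketch below hashes every fixed-size window of a normalized token list and reports windows that occur more than once. It is a pure-Python approximation under stated assumptions, not PolyDup's Rust code: the `BASE` and `MOD` constants, the use of `zlib.crc32` to turn tokens into integers, and the function names are all choices made for this example. The demo uses a window of 5 rather than 50 to keep the input short.

```python
import zlib
from collections import defaultdict

BASE = 257            # illustrative polynomial base
MOD = (1 << 61) - 1   # illustrative large prime modulus


def window_hashes(tokens, window):
    """Yield (start_index, hash) for every run of `window` consecutive tokens."""
    if len(tokens) < window:
        return
    # Map each token to a stable non-zero integer code.
    codes = [zlib.crc32(t.encode()) + 1 for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for c in codes[:window]:
        h = (h * BASE + c) % MOD
    yield 0, h
    for i in range(window, len(codes)):
        h = (h - codes[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + codes[i]) % MOD           # append the newest token
        yield i - window + 1, h


def find_repeats(tokens, window):
    """Return (first_start, later_start) pairs of identical token windows."""
    seen = defaultdict(list)
    for start, h in window_hashes(tokens, window):
        seen[h].append(start)
    repeats = []
    for starts in seen.values():
        first = starts[0]
        for other in starts[1:]:
            # Compare the tokens themselves to rule out hash collisions.
            if tokens[first:first + window] == tokens[other:other + window]:
                repeats.append((first, other))
    return repeats


tokens = ["$$ID", "=", "$$ID", "+", "$$NUM", ";",
          "$$ID", "=", "$$ID", "+", "$$NUM"]
print(find_repeats(tokens, window=5))  # [(0, 6)]
```

Each window hash is derived from the previous one in constant time, so the pass over a file is linear in the number of tokens regardless of window size.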

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0
