
Cross-language duplicate code detector


PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))

Concurrent Execution

Critical: PolyDup releases the GIL during scanning, so other Python threads keep running and several scans can proceed in parallel:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
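
Because find_duplicates() raises RuntimeError on failure, a batch job may want to isolate failures per path rather than abort on the first one. The following is a minimal sketch that assumes nothing beyond the documented exception type. The names scan_each and scan are hypothetical, and the scanner is passed in as an argument so the helper can be exercised without PolyDup installed:

```python
from typing import Any, Callable, Dict, List, Tuple


def scan_each(
    scan: Callable[[List[str]], Any],
    paths: List[str],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Scan each path separately, collecting reports and error messages.

    `scan` is expected to behave like polydup.find_duplicates: it takes a
    list of paths and raises RuntimeError when scanning fails.
    """
    reports: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            reports[path] = scan([path])
        except RuntimeError as exc:
            # Record the failure and continue with the remaining paths.
            errors[path] = str(exc)
    return reports, errors
```

With PolyDup installed, this could be called as scan_each(polydup.find_duplicates, ['./src', './lib']).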


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)
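
As an illustration of working with the dictionary form, here is a small sketch that reads only the four top-level keys listed above. The summarize helper and the sample values are hypothetical, and the nested contents of duplicates and stats are left empty because their layout is not specified in this section:

```python
from typing import Any, Dict


def summarize(report: Dict[str, Any]) -> str:
    """Build a one-line summary from the documented top-level keys."""
    return (
        f"{report['files_scanned']} files, "
        f"{report['functions_analyzed']} functions, "
        f"{len(report['duplicates'])} duplicates"
    )


# Hand-written sample using the documented keys (values are made up).
sample = {
    "files_scanned": 12,
    "functions_analyzed": 80,
    "duplicates": [],
    "stats": {},
}
print(summarize(sample))
```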

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary
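
The attributes above are enough to post-process results, for example to see which pairs of files share the most code. A sketch under stated assumptions: FakeMatch is a hypothetical stand-in carrying a subset of the documented attributes so the example runs without PolyDup, and group_by_file_pair is an illustrative helper rather than part of the library:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class FakeMatch:
    """Stand-in exposing a subset of the documented DuplicateMatch attributes."""
    file1: str
    file2: str
    start_line1: int
    start_line2: int
    length: int
    similarity: float


def group_by_file_pair(matches: Iterable) -> Dict[Tuple[str, str], List]:
    """Group matches by unordered file pair, longest matches first."""
    groups = defaultdict(list)
    for m in matches:
        # Sort the two paths so (a, b) and (b, a) land in the same group.
        key = tuple(sorted((m.file1, m.file2)))
        groups[key].append(m)
    for items in groups.values():
        items.sort(key=lambda m: m.length, reverse=True)
    return dict(groups)
```

Any object with file1, file2, and length attributes works here, so report.duplicates could be passed directly.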

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary
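
The total_tokens and duration_ms fields can be combined into a throughput figure based on the scan time the library itself reports. A minimal sketch; tokens_per_second is a hypothetical helper, and the zero guard assumes that an integer millisecond count could be 0 for a very short scan:

```python
def tokens_per_second(total_tokens: int, duration_ms: int) -> float:
    """Convert the documented ScanStats fields into tokens per second."""
    if duration_ms <= 0:
        # Assumed edge case: a very short scan may round down to 0 ms.
        return 0.0
    return total_tokens / (duration_ms / 1000)
```

For a real report this would be called as tokens_per_second(report.stats.total_tokens, report.stats.duration_ms).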

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration

Benchmark Example

import polydup
import time

start = time.perf_counter()  # Monotonic clock suited to measuring intervals
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
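
To make the normalization and rolling-hash steps concrete, here is a simplified pure-Python sketch. It is not PolyDup's implementation: a regex tokenizer stands in for Tree-sitter, the $$NUM placeholder, keyword set, BASE, and MOD are assumptions chosen for illustration, the demo uses a window of 8 rather than 50, and it reports only exact matches of normalized windows instead of applying a similarity threshold.

```python
import re
import zlib
from collections import defaultdict
from typing import List, Tuple

KEYWORDS = {"def", "return", "if", "else", "for", "while", "in"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
BASE = 257             # illustrative polynomial base
MOD = (1 << 61) - 1    # illustrative large prime modulus


def normalize(source: str) -> List[str]:
    """Tokenize with a regex and replace identifiers and integer literals."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("$$ID")
        elif tok.isdigit():
            tokens.append("$$NUM")
        else:
            tokens.append(tok)
    return tokens


def window_hashes(tokens: List[str], window: int) -> List[int]:
    """Rabin-Karp hash of every `window`-token slice, in one linear pass."""
    if window <= 0 or len(tokens) < window:
        return []
    # Deterministic integer per token; +1 keeps every value non-zero.
    values = [zlib.crc32(t.encode("utf-8")) + 1 for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    hashes = [h]
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        hashes.append(h)
    return hashes


def shared_windows(a: str, b: str, window: int) -> List[Tuple[int, int]]:
    """Return (index_in_a, index_in_b) pairs whose normalized windows match."""
    ta, tb = normalize(a), normalize(b)
    index = defaultdict(list)
    for i, h in enumerate(window_hashes(ta, window)):
        index[h].append(i)
    pairs = []
    for j, h in enumerate(window_hashes(tb, window)):
        for i in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if ta[i:i + window] == tb[j:j + window]:
                pairs.append((i, j))
    return pairs


a = "def total(userId, count): return userId + count * 2"
b = "def area(width, height): return width + height * 7"
print(normalize(a) == normalize(b))
print(shared_windows(a, b, window=8))
```

The two functions differ only in identifier names and one literal, so after normalization their token streams are identical, which is what makes them Type-2 clones.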

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0


