Cross-language duplicate code detector

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
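A dict report is easy to post-process with plain Python. The sketch below counts how many duplicate pairs touch each file. It assumes each entry in `duplicates` carries `file1` and `file2` keys mirroring the `DuplicateMatch` attributes; that key layout is an assumption, and `count_duplicates_per_file` is a hypothetical helper, not part of PolyDup. The sample data is hand-written so the snippet runs without PolyDup installed.

```python
from collections import Counter


def count_duplicates_per_file(report_dict):
    """Count how many duplicate pairs involve each file.

    Assumes each item in report_dict["duplicates"] has "file1" and "file2"
    keys (mirroring DuplicateMatch); this layout is an assumption.
    """
    counts = Counter()
    for dup in report_dict.get("duplicates", []):
        # A pair inside a single file should count once, not twice.
        for path in {dup["file1"], dup["file2"]}:
            counts[path] += 1
    return counts


# Hand-written sample shaped like the documented top-level keys.
sample = {
    "files_scanned": 3,
    "functions_analyzed": 12,
    "duplicates": [
        {"file1": "src/a.py", "file2": "src/b.py"},
        {"file1": "src/a.py", "file2": "lib/c.js"},
        {"file1": "src/b.py", "file2": "src/b.py"},
    ],
    "stats": {},
}

for path, n in count_duplicates_per_file(sample).most_common():
    print(f"{path}: {n}")
```

In real use, the result of `polydup.find_duplicates_dict(...)` would take the place of `sample`.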

Concurrent Execution

PolyDup releases the GIL during scanning, so other Python threads can keep running while a scan is in progress:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
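Because failures surface as RuntimeError, a caller that scans several roots may prefer to record the failure and carry on. A minimal sketch of that pattern follows. `scan_all` is a hypothetical helper, and the scan function is passed in as an argument so the pattern can be tried without PolyDup installed; `fake_scan` is a stand-in used only for illustration.

```python
def scan_all(scan_fn, roots, **options):
    """Run scan_fn([root], **options) per root, collecting errors instead of stopping.

    Returns (reports, errors), both dicts keyed by root.
    """
    reports, errors = {}, {}
    for root in roots:
        try:
            reports[root] = scan_fn([root], **options)
        except RuntimeError as exc:
            # The API reference documents RuntimeError for failed scans.
            errors[root] = str(exc)
    return reports, errors


def fake_scan(paths, min_block_size=50, threshold=0.85):
    """Stand-in for polydup.find_duplicates, used only for illustration."""
    if paths[0] == "./missing":
        raise RuntimeError(f"cannot scan {paths[0]}")
    return {"paths": paths, "threshold": threshold}


reports, errors = scan_all(fake_scan, ["./src", "./missing"], threshold=0.9)
print(sorted(reports))  # roots that scanned successfully
print(errors)           # roots that failed, with the error message
```

With PolyDup installed, the call would presumably be `scan_all(polydup.find_duplicates, roots, threshold=0.9)`.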


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary
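The attributes above are enough to rank and filter matches on the Python side. The following sketch uses a stand-in dataclass, `Match`, with the same attribute names so that it runs without PolyDup; `cross_file_matches` is a hypothetical helper, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Stand-in with the attribute names documented for DuplicateMatch."""
    file1: str
    file2: str
    start_line1: int
    start_line2: int
    length: int
    similarity: float


def cross_file_matches(matches, min_similarity=0.9):
    """Keep matches that span two different files, strongest first."""
    kept = [
        m for m in matches
        if m.file1 != m.file2 and m.similarity >= min_similarity
    ]
    # Sort by similarity, then by length, both descending.
    return sorted(kept, key=lambda m: (m.similarity, m.length), reverse=True)


matches = [
    Match("a.py", "b.rs", 10, 40, 60, 0.92),
    Match("a.py", "a.py", 5, 90, 80, 1.0),
    Match("c.ts", "d.js", 1, 1, 120, 0.97),
    Match("c.ts", "e.py", 7, 3, 55, 0.86),
]

for m in cross_file_matches(matches):
    print(f"{m.file1}:{m.start_line1} <-> {m.file2}:{m.start_line2} ({m.similarity:.0%})")
```

The same function should work on `report.duplicates` directly, since it only reads the documented attributes.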

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary
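These counters can be combined into derived metrics. The sketch below works on a plain dict and assumes that `to_dict()` uses the attribute names listed above as keys, which is an assumption. `summarize_stats` is a hypothetical helper and the sample numbers are invented.

```python
def summarize_stats(stats):
    """Derive throughput figures from a ScanStats-like dict.

    Expects "total_lines", "total_tokens" and "duration_ms" keys and
    guards against division by zero for empty or very fast scans.
    """
    duration_s = stats["duration_ms"] / 1000
    tokens = stats["total_tokens"]
    lines = stats["total_lines"]
    return {
        "tokens_per_second": tokens / duration_s if duration_s > 0 else 0.0,
        "tokens_per_line": tokens / lines if lines > 0 else 0.0,
    }


# Invented sample values, shaped like the documented attributes.
sample_stats = {
    "total_lines": 2000,
    "total_tokens": 18000,
    "unique_hashes": 950,
    "duration_ms": 250,
}

summary = summarize_stats(sample_stats)
print(f"{summary['tokens_per_second']:.0f} tokens/sec")
print(f"{summary['tokens_per_line']:.1f} tokens/line")
```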

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration
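The first point can be observed from Python: while a GIL-releasing call runs in a worker thread, the main thread keeps making progress. The sketch below is a generic pattern rather than part of PolyDup. `run_with_heartbeat` is a hypothetical helper, and `fake_scan` stands in for a scan by calling `time.sleep`, which also releases the GIL.

```python
import threading
import time


def run_with_heartbeat(task, interval=0.05):
    """Run task() in a worker thread while the main thread keeps ticking.

    Returns (result, ticks). ticks > 0 shows that the main thread kept
    running while the task was in progress.
    """
    result = {}

    def worker():
        result["value"] = task()

    thread = threading.Thread(target=worker)
    thread.start()
    ticks = 0
    while thread.is_alive():
        thread.join(timeout=interval)  # wait briefly, then check again
        if thread.is_alive():
            ticks += 1  # the main thread got to run during the task
    return result["value"], ticks


def fake_scan():
    """Stand-in for a GIL-releasing scan; time.sleep also releases the GIL."""
    time.sleep(0.3)
    return "done"


value, ticks = run_with_heartbeat(fake_scan)
print(value, ticks)
```

With PolyDup installed, `task` could be something like `lambda: polydup.find_duplicates(['./src'])`. If the GIL were held for the whole call, the tick counter would stay at or near zero.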

Benchmark Example

import polydup
import time

start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
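To make the normalization and rolling-hash steps concrete, here is a small pure-Python illustration. It is a sketch of the general technique, not PolyDup's implementation. It swaps Tree-sitter for a crude regex tokenizer that only knows Python keywords, uses an arbitrary base and modulus, invents a `$$LIT` placeholder for literals, and defaults to a window of 8 tokens so the example stays short. All function names are hypothetical.

```python
import keyword
import re
import zlib
from collections import defaultdict

# Identifiers, numbers, quoted strings, then any other single character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"]*\"|'[^']*'|\S")
BASE = 257             # arbitrary choice for this sketch
MOD = (1 << 61) - 1    # arbitrary large prime modulus


def normalize(source):
    """Replace identifiers with $$ID and literals with $$LIT; keep the rest."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if keyword.iskeyword(tok) else "$$ID")
        elif tok[0].isdigit() or tok[0] in "\"'":
            tokens.append("$$LIT")
        else:
            tokens.append(tok)
    return tokens


def window_hashes(tokens, window):
    """Yield (start, hash) for each window, updating the hash in O(1) per step."""
    if len(tokens) < window:
        return
    values = [zlib.crc32(t.encode()) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the oldest token
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        yield i - window + 1, h


def shared_windows(src_a, src_b, window=8):
    """Return (start_a, start_b) pairs whose normalized windows are identical."""
    toks_a, toks_b = normalize(src_a), normalize(src_b)
    index = defaultdict(list)
    for start, h in window_hashes(toks_a, window):
        index[h].append(start)
    pairs = []
    for start_b, h in window_hashes(toks_b, window):
        for start_a in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if toks_a[start_a:start_a + window] == toks_b[start_b:start_b + window]:
                pairs.append((start_a, start_b))
    return pairs


a = "def total(user_id, amount): return user_id + amount * 2"
b = "def combine(account, price): return account + price * 10"
print(normalize(a) == normalize(b))  # same structure after normalization
print(shared_windows(a, b))
```

The two snippets differ only in names and literal values, so they normalize to the same token stream and share every window. That is the property Type-2 clone detection relies on.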

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.

Built Distributions

  • polydup-0.5.5-cp313-cp313-win_amd64.whl (1.3 MB): CPython 3.13, Windows x86-64
  • polydup-0.5.5-cp313-cp313-macosx_11_0_arm64.whl (1.5 MB): CPython 3.13, macOS 11.0+ ARM64
  • polydup-0.5.5-cp313-cp313-macosx_10_12_x86_64.whl (1.5 MB): CPython 3.13, macOS 10.12+ x86-64
  • polydup-0.5.5-cp312-cp312-win_amd64.whl (1.3 MB): CPython 3.12, Windows x86-64
  • polydup-0.5.5-cp312-cp312-macosx_11_0_arm64.whl (1.5 MB): CPython 3.12, macOS 11.0+ ARM64
  • polydup-0.5.5-cp312-cp312-macosx_10_12_x86_64.whl (1.5 MB): CPython 3.12, macOS 10.12+ x86-64
  • polydup-0.5.5-cp311-cp311-win_amd64.whl (1.3 MB): CPython 3.11, Windows x86-64
  • polydup-0.5.5-cp311-cp311-macosx_11_0_arm64.whl (1.5 MB): CPython 3.11, macOS 11.0+ ARM64
  • polydup-0.5.5-cp311-cp311-macosx_10_12_x86_64.whl (1.5 MB): CPython 3.11, macOS 10.12+ x86-64
  • polydup-0.5.5-cp310-cp310-win_amd64.whl (1.3 MB): CPython 3.10, Windows x86-64
  • polydup-0.5.5-cp310-cp310-macosx_11_0_arm64.whl (1.5 MB): CPython 3.10, macOS 11.0+ ARM64
  • polydup-0.5.5-cp310-cp310-macosx_10_12_x86_64.whl (1.5 MB): CPython 3.10, macOS 10.12+ x86-64
  • polydup-0.5.5-cp39-cp39-win_amd64.whl (1.3 MB): CPython 3.9, Windows x86-64
  • polydup-0.5.5-cp39-cp39-macosx_11_0_arm64.whl (1.5 MB): CPython 3.9, macOS 11.0+ ARM64
  • polydup-0.5.5-cp39-cp39-macosx_10_12_x86_64.whl (1.5 MB): CPython 3.9, macOS 10.12+ x86-64
  • polydup-0.5.5-cp38-cp38-win_amd64.whl (1.3 MB): CPython 3.8, Windows x86-64
  • polydup-0.5.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB): CPython 3.8, manylinux (glibc 2.17+) x86-64
  • polydup-0.5.5-cp38-cp38-macosx_11_0_arm64.whl (1.5 MB): CPython 3.8, macOS 11.0+ ARM64
  • polydup-0.5.5-cp38-cp38-macosx_10_12_x86_64.whl (1.6 MB): CPython 3.8, macOS 10.12+ x86-64
