Cross-language duplicate code detector

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
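
Because the dictionary form contains only built-in types, it can be written to disk and checked later, for example in a CI job. The following is a minimal sketch that relies only on the four top-level keys documented in the API Reference. The helper names `save_report`, `load_report`, and `exceeds_budget` are hypothetical and not part of PolyDup, and `sample` is a hand-written stand-in for a real `find_duplicates_dict()` result:

```python
import json
from pathlib import Path


def save_report(report_dict, path):
    """Write a report dictionary to a JSON file."""
    Path(path).write_text(json.dumps(report_dict, indent=2), encoding="utf-8")


def load_report(path):
    """Read a report dictionary back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def exceeds_budget(report_dict, max_duplicates):
    """Return True if the report lists more duplicates than allowed."""
    return len(report_dict["duplicates"]) > max_duplicates


# Hand-written stand-in for polydup.find_duplicates_dict(...) output.
# Only the documented top-level keys are used; the contents of each
# duplicate entry and of "stats" are deliberately left empty.
sample = {
    "files_scanned": 12,
    "functions_analyzed": 40,
    "duplicates": [{}, {}, {}],
    "stats": {},
}
```

In a CI script, `exceeds_budget(report_dict, max_duplicates=0)` could decide whether to exit with a non-zero status.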

Concurrent Execution

PolyDup releases the GIL while scanning, so other Python threads keep running and several scans can proceed in parallel from a thread pool:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")
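
`as_completed` yields futures in completion order, so a loop like the one above cannot tell which project a report belongs to. One way to keep that association is to map each future back to its path. The sketch below takes the scan function as a parameter (in real use this would be `polydup.find_duplicates`); `scan_many` is a hypothetical helper, not part of PolyDup:

```python
import concurrent.futures


def scan_many(scan, paths, max_workers=4):
    """Run scan([path]) for each path in a thread pool.

    Returns a dict mapping each path to its result, or to the
    RuntimeError it raised.
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which path each future was submitted for.
        future_to_path = {executor.submit(scan, [path]): path for path in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except RuntimeError as exc:  # documented failure mode of find_duplicates
                results[path] = exc
    return results
```

A call such as `scan_many(polydup.find_duplicates, ['./project1', './project2'])` would then return one entry per project, with failed scans stored as exceptions instead of aborting the whole batch.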

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
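
The parameter list above implies a few constraints that are easy to violate by accident, such as passing a single string instead of a list. How PolyDup itself reacts to such input is not documented here, so the following is only a sketch of a caller-side check; `validate_scan_args` is a hypothetical helper, and the specific rules are assumptions derived from the documented types and ranges:

```python
def validate_scan_args(paths, min_block_size=50, threshold=0.85):
    """Check arguments against the documented types and ranges.

    Raises ValueError on the first problem found.
    """
    if isinstance(paths, str):
        # A bare string is an easy mistake; the documented type is list[str].
        raise ValueError("paths must be a list of strings, not a single string")
    if min_block_size < 1:
        raise ValueError("min_block_size must be a positive number of tokens")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
```

Calling `validate_scan_args(paths, min_block_size, threshold)` before `find_duplicates(...)` turns these mistakes into an immediate `ValueError` with a specific message.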


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns the number of duplicates, so len(report) is equivalent to len(report.duplicates)

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary
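
The attributes above are enough to post-process results in plain Python, for example to group matches by file pair or to keep only matches that span two different files. The sketch below uses a stand-in dataclass (`FakeMatch`, hypothetical) with a subset of the documented attribute names, so that it runs without a scan; the helpers `group_by_file_pair` and `cross_file_only` are illustrative and not part of PolyDup:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeMatch:
    """Stand-in with a subset of the documented DuplicateMatch attributes."""
    file1: str
    file2: str
    start_line1: int
    start_line2: int
    length: int
    similarity: float


def group_by_file_pair(duplicates):
    """Group matches by unordered (file, file) pair."""
    groups = defaultdict(list)
    for dup in duplicates:
        # Sort so that (a, b) and (b, a) land in the same group.
        key = tuple(sorted((dup.file1, dup.file2)))
        groups[key].append(dup)
    return dict(groups)


def cross_file_only(duplicates):
    """Keep only matches whose two locations are in different files."""
    return [dup for dup in duplicates if dup.file1 != dup.file2]


matches = [
    FakeMatch("a.py", "b.rs", 10, 42, 60, 0.91),
    FakeMatch("b.rs", "a.py", 80, 5, 55, 0.88),
    FakeMatch("a.py", "a.py", 1, 100, 70, 1.0),
]
groups = group_by_file_pair(matches)
```

Since the helpers only read attributes, they should work the same way on `report.duplicates`.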

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary
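
These counters can be combined into derived metrics. As a rough sketch, the helpers below (hypothetical names) compute throughput from the scan's own `duration_ms` and the average token density per line; a `SimpleNamespace` with made-up numbers stands in for a real `ScanStats` object:

```python
from types import SimpleNamespace


def tokens_per_second(stats):
    """Throughput based on the scan's own duration; 0.0 if duration is zero."""
    if stats.duration_ms <= 0:
        return 0.0
    return stats.total_tokens / (stats.duration_ms / 1000)


def tokens_per_line(stats):
    """Average number of tokens per line; 0.0 if no lines were processed."""
    if stats.total_lines <= 0:
        return 0.0
    return stats.total_tokens / stats.total_lines


# Made-up values, for illustration only.
stats = SimpleNamespace(
    total_lines=2_000,
    total_tokens=15_000,
    unique_hashes=900,
    duration_ms=250,
)
```

The zero checks matter for very small inputs, where `duration_ms` may round down to 0.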

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration

Benchmark Example

import polydup
import time

start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
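
To make the combination of token normalization and rolling hashing concrete, here is a small pure-Python sketch of the general technique. It is not PolyDup's implementation: a regular expression stands in for Tree-sitter, the `$$NUM` placeholder, the hash base and modulus, and all function names are assumptions chosen for illustration, and the similarity threshold is not modeled.

```python
import keyword
import re
import zlib

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")
BASE = 257
MOD = (1 << 61) - 1


def normalize(source):
    """Tokenize, replacing identifiers with $$ID and numbers with $$NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            # Keywords carry structure, so they are kept as-is.
            tokens.append(tok if keyword.iskeyword(tok) else "$$ID")
        elif tok[0].isdigit():
            tokens.append("$$NUM")
        else:
            tokens.append(tok)
    return tokens


def rolling_hashes(tokens, window):
    """Yield (start_index, hash) for every run of `window` tokens."""
    if window <= 0 or len(tokens) < window:
        return
    codes = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for code in codes[:window]:
        h = (h * BASE + code) % MOD
    yield 0, h
    for i in range(window, len(codes)):
        h = (h - codes[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + codes[i]) % MOD           # append the newest token
        yield i - window + 1, h


def shared_windows(tokens_a, tokens_b, window):
    """Return (start_a, start_b) pairs whose windows match token-for-token."""
    index = {}
    for start, h in rolling_hashes(tokens_a, window):
        index.setdefault(h, []).append(start)
    pairs = []
    for start_b, h in rolling_hashes(tokens_b, window):
        for start_a in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if tokens_a[start_a:start_a + window] == tokens_b[start_b:start_b + window]:
                pairs.append((start_a, start_b))
    return pairs


a = normalize("total = price * 2 + tax")
b = normalize("amount = cost * 10 + fee")
pairs = shared_windows(a, b, window=5)
```

After normalization both statements become the same token sequence, so every 5-token window of `a` has a match in `b` even though no identifier or literal is shared. The window of 5 is only for the toy input; the list above cites a window size of 50.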

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0
