Cross-language duplicate code detector

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detects duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code by normalizing identifiers and literals
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")
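
As a follow-up to the loop above, the sketch below sums duplicated tokens per file to show where clones concentrate. It relies only on the documented file1, file2, and length attributes. The helper name duplicated_tokens_per_file is hypothetical, and the sample objects are stand-ins so the snippet runs without a scan; in real use, pass report.duplicates instead.

```python
from collections import Counter
from types import SimpleNamespace


def duplicated_tokens_per_file(duplicates):
    """Sum the duplicated token count attributed to each file."""
    totals = Counter()
    for dup in duplicates:
        # Each match has two locations, so both files are credited.
        # A match inside a single file credits that file twice.
        totals[dup.file1] += dup.length
        totals[dup.file2] += dup.length
    return totals


# Stand-in objects carrying the documented DuplicateMatch attributes.
sample = [
    SimpleNamespace(file1="a.py", file2="b.py", length=60),
    SimpleNamespace(file1="a.py", file2="c.rs", length=80),
]

hotspots = duplicated_tokens_per_file(sample)
for path, tokens in hotspots.most_common():
    print(f"{path}: {tokens} duplicated tokens")
```

Sorting with most_common() puts the files with the most duplicated tokens first, which is usually where refactoring pays off soonest.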

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
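
A dictionary report can be post-processed with plain Python before it is serialized. The sketch below keeps only matches above a stricter similarity cutoff. It assumes each entry in duplicates carries the same keys as the DuplicateMatch attributes (this README does not spell that out), and filter_report is a hypothetical helper, not part of polydup. The sample dictionary is hand-written so the snippet runs on its own.

```python
import json


def filter_report(report_dict, min_similarity):
    """Return a copy of the report keeping matches at or above min_similarity."""
    kept = [
        dup for dup in report_dict["duplicates"]
        if dup["similarity"] >= min_similarity
    ]
    # Build a new dict so the caller's report is left untouched.
    return {**report_dict, "duplicates": kept}


# Hand-written stand-in for the result of polydup.find_duplicates_dict(...).
sample_report = {
    "files_scanned": 3,
    "functions_analyzed": 12,
    "duplicates": [
        {"file1": "a.py", "file2": "b.py", "similarity": 0.97},
        {"file1": "a.py", "file2": "c.js", "similarity": 0.88},
    ],
    "stats": {"duration_ms": 5},
}

strict = filter_report(sample_report, min_similarity=0.95)
print(json.dumps(strict, indent=2))
```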

Concurrent Execution

PolyDup releases the GIL during scanning, so other Python threads keep running while a scan is in progress and several scans can run in parallel:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")
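
With as_completed, results arrive in completion order, so it helps to remember which path each future belongs to. The sketch below also collects the documented RuntimeError per project instead of aborting the whole batch. scan_all and fake_scan are hypothetical names; the scan function is passed in so the snippet runs without polydup installed. In real use, pass something like lambda p: polydup.find_duplicates([p]).

```python
import concurrent.futures


def scan_all(paths, scan, max_workers=4):
    """Run scan(path) for every path in a thread pool.

    Returns (results, errors), two dicts keyed by path.
    """
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for.
        future_to_path = {pool.submit(scan, path): path for path in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except RuntimeError as exc:  # documented failure mode of find_duplicates
                errors[path] = str(exc)
    return results, errors


def fake_scan(path):
    """Stand-in for a real scan; fails for one path to exercise error handling."""
    if path == "./missing":
        raise RuntimeError(f"cannot scan {path}")
    return f"report for {path}"


results, errors = scan_all(["./project1", "./project2", "./missing"], fake_scan)
print(results)
print(errors)
```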

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
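
Arguments can be checked against the documented types and ranges before a long scan starts. The helper below is a hypothetical sketch, not part of polydup. The rules that paths must be non-empty and min_block_size positive are assumptions; the README only documents the types and the 0.0-1.0 threshold range.

```python
def check_scan_args(paths, min_block_size=50, threshold=0.85):
    """Validate arguments against the documented parameter types and ranges."""
    if isinstance(paths, str):
        # A bare string is easy to pass by mistake; the API documents list[str].
        raise TypeError("paths must be a list of strings, e.g. ['./src']")
    paths = list(paths)
    if not paths or not all(isinstance(p, str) for p in paths):
        raise TypeError("paths must be a non-empty list of strings")
    if not isinstance(min_block_size, int) or min_block_size < 1:
        # Assumption: a block of zero or fewer tokens is never meaningful.
        raise ValueError("min_block_size must be a positive integer")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return paths, min_block_size, threshold


args = check_scan_args(["./src", "./lib"], min_block_size=30, threshold=0.9)
print(args)
```

The validated tuple can then be unpacked into the real call, for example polydup.find_duplicates(*args).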


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary
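
For code that is type-checked with mypy, the documented attributes can be mirrored in a structural type. The sketch below is an assumption-laden illustration: DuplicateLike, describe, and FakeMatch are hypothetical names, and whether polydup ships its own type stubs is not stated here. FakeMatch exists only so the snippet runs without a scan.

```python
from dataclasses import dataclass
from typing import Protocol


class DuplicateLike(Protocol):
    """Structural type listing the documented DuplicateMatch attributes."""

    file1: str
    file2: str
    start_line1: int
    start_line2: int
    length: int
    similarity: float
    hash: str


def describe(dup: DuplicateLike) -> str:
    """Format one match as a single report line."""
    return (
        f"{dup.file1}:{dup.start_line1} <-> {dup.file2}:{dup.start_line2} "
        f"({dup.similarity:.0%}, {dup.length} tokens)"
    )


@dataclass
class FakeMatch:
    """Stand-in used only for this example; real objects come from polydup."""

    file1: str
    file2: str
    start_line1: int
    start_line2: int
    length: int
    similarity: float
    hash: str


line = describe(FakeMatch("a.py", "b.js", 10, 42, 64, 0.9, "deadbeef"))
print(line)
```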

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary

Performance

PolyDup's Python bindings use PyO3's py.allow_threads() to release the Global Interpreter Lock for the duration of a scan. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration
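
The first point can be observed from Python: while a call that releases the GIL is running in a worker thread, the calling thread still gets to execute. The sketch below is only an illustration. run_with_heartbeat is a hypothetical helper, and time.sleep (which also releases the GIL) stands in for a scan so the snippet runs without polydup; in real use the task would be something like lambda: polydup.find_duplicates(['./src']).

```python
import threading
import time


def run_with_heartbeat(task, interval=0.05):
    """Run task() in a worker thread and count main-thread ticks meanwhile."""
    outcome = {}

    def worker():
        outcome["result"] = task()

    thread = threading.Thread(target=worker)
    thread.start()
    ticks = 0
    while thread.is_alive():
        # Wait up to `interval` seconds for the worker, then record a tick.
        thread.join(timeout=interval)
        ticks += 1
    return outcome.get("result"), ticks


def slow_task():
    """Stand-in for a scan: blocks for a while without holding the GIL."""
    time.sleep(0.3)
    return "done"


result, ticks = run_with_heartbeat(slow_task)
print(f"task returned {result!r}; main thread ticked {ticks} times meanwhile")
```

A progress indicator or a cancellation check could go where the tick counter is incremented.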

Benchmark Example

import polydup
import time

start = time.perf_counter()  # monotonic, high-resolution clock intended for timing
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
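
The normalization and rolling-hash steps listed above can be sketched in a few lines of pure Python. This is a simplified illustration, not PolyDup's implementation: a regex tokenizer that only knows Python keywords stands in for Tree-sitter, the $$LIT placeholder for literals is an assumption (only $$ID appears in this README), and BASE, MOD, and the function names are arbitrary choices.

```python
import keyword
import re
import zlib
from collections import defaultdict

# Identifiers, numbers, quoted strings, or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"]*\"|'[^']*'|\S")
BASE = 257
MOD = (1 << 61) - 1


def normalize(source):
    """Tokenize and replace identifiers and literals with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if keyword.iskeyword(tok) else "$$ID")
        elif tok[0].isdigit() or tok[0] in "\"'":
            tokens.append("$$LIT")
        else:
            tokens.append(tok)
    return tokens


def window_hashes(tokens, window):
    """Yield (start, hash) for each run of `window` tokens (Rabin-Karp)."""
    if len(tokens) < window:
        return
    values = [zlib.crc32(t.encode()) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        yield i - window + 1, h


def shared_windows(tokens_a, tokens_b, window):
    """Return (start_a, start_b) pairs whose token windows are identical."""
    index = defaultdict(list)
    for start, h in window_hashes(tokens_a, window):
        index[h].append(start)
    matches = []
    for start_b, h in window_hashes(tokens_b, window):
        for start_a in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if tokens_a[start_a:start_a + window] == tokens_b[start_b:start_b + window]:
                matches.append((start_a, start_b))
    return matches


a = normalize("def total(items):\n    return sum(items) + 1")
b = normalize("def count(values):\n    return sum(values) + 42")
matches = shared_windows(a, b, window=8)
print(a)
print(matches)
```

The two functions differ only in names and a literal, so they normalize to the same token sequence and every 8-token window matches. A small window is used here because the inputs are tiny; the list above mentions a window size of 50.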

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0
