Cross-language duplicate code detector: Python library bindings (not a CLI; use 'cargo install polydup' for the CLI)

Project description

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
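
For dict-based workflows it can help to condense the report before serializing it. The sketch below is a minimal illustration, assuming each entry in `duplicates` carries the same fields as `DuplicateMatch` (such as `file1`, `file2`, and `similarity`); that key layout is an assumption, and the `summarize` helper and the sample data are hypothetical stand-ins for the output of `find_duplicates_dict()`.

```python
import json

def summarize(report_dict, min_similarity=0.9):
    """Condense a report dict to its strongest matches (hypothetical helper)."""
    # Assumption: each duplicate dict mirrors the DuplicateMatch attributes.
    strong = [d for d in report_dict["duplicates"] if d["similarity"] >= min_similarity]
    strong.sort(key=lambda d: d["similarity"], reverse=True)
    return {
        "files_scanned": report_dict["files_scanned"],
        "duplicate_count": len(report_dict["duplicates"]),
        "strong_matches": strong,
    }

# Hand-written sample standing in for polydup.find_duplicates_dict(...)
sample = {
    "files_scanned": 3,
    "functions_analyzed": 12,
    "duplicates": [
        {"file1": "a.py", "file2": "b.py", "similarity": 0.97},
        {"file1": "a.py", "file2": "c.js", "similarity": 0.86},
    ],
    "stats": {"duration_ms": 5},
}

summary = summarize(sample)
print(json.dumps(summary, indent=2))
```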

Concurrent Execution

PolyDup releases the GIL during scanning, so other Python threads keep running and several scans can proceed in parallel from a thread pool:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")
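
Because `as_completed` yields futures in completion order, it is often useful to remember which path each future belongs to. The sketch below shows one way to do that with a future-to-path mapping. `scan_all` and `fake_scan` are hypothetical names; in real use the `scan` argument might be `lambda p: polydup.find_duplicates([p])`.

```python
import concurrent.futures

def scan_all(scan, paths, max_workers=4):
    """Run scan(path) for each path in a thread pool; return {path: result or error}."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for.
        future_to_path = {executor.submit(scan, p): p for p in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except RuntimeError as exc:  # documented failure mode of find_duplicates
                results[path] = exc
    return results

def fake_scan(path):
    # Stand-in for polydup.find_duplicates([path]) so the sketch runs anywhere.
    if path == "./missing":
        raise RuntimeError(f"cannot scan {path}")
    return f"report for {path}"

results = scan_all(fake_scan, ["./project1", "./project2", "./missing"])
for path, outcome in sorted(results.items()):
    print(path, "->", outcome)
```

Storing the exception instead of re-raising it lets one failed project be reported without discarding the results of the others.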

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
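
The documented types and ranges can be checked before a scan starts, which gives clearer errors than a late failure. This is a sketch only: whether PolyDup validates these arguments itself is not stated here, and `check_scan_args` is a hypothetical helper, not part of the library.

```python
def check_scan_args(paths, min_block_size=50, threshold=0.85):
    """Validate arguments against the documented types and ranges (hypothetical helper)."""
    if isinstance(paths, str):
        # A bare string is a common mistake; the API expects a list of paths.
        raise TypeError("paths must be a list of strings, not a single string")
    paths = list(paths)
    if not paths or not all(isinstance(p, str) for p in paths):
        raise TypeError("paths must be a non-empty list of strings")
    if not isinstance(min_block_size, int) or min_block_size < 1:
        raise ValueError("min_block_size must be a positive integer")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return paths, min_block_size, threshold

args = check_scan_args(["./src"], min_block_size=30, threshold=0.9)
print(args)

try:
    check_scan_args("./src")
except TypeError as exc:
    print("rejected:", exc)
```

The validated tuple could then be unpacked into a call such as `polydup.find_duplicates(*args)`, wrapped in `try`/`except RuntimeError` to handle scan failures.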


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates
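
Since `len(report)` gives the number of duplicates, a report can drive a simple pass/fail check, for example in CI. The sketch below is illustrative: `duplicate_gate` and `FakeReport` are hypothetical names, and `FakeReport` only stands in for a real `Report` so the example runs without the library installed.

```python
def duplicate_gate(report, max_duplicates=0):
    """Return a process exit code: 0 if within budget, 1 otherwise."""
    count = len(report)  # Report.__len__ is documented to return the duplicate count
    if count > max_duplicates:
        print(f"FAIL: {count} duplicates (budget {max_duplicates})")
        return 1
    print(f"OK: {count} duplicates (budget {max_duplicates})")
    return 0

class FakeReport:
    """Stand-in for polydup's Report so the sketch runs without the library."""

    def __init__(self, duplicates):
        self.duplicates = duplicates

    def __len__(self):
        return len(self.duplicates)

exit_code = duplicate_gate(FakeReport(["match-1", "match-2"]), max_duplicates=1)
```

In a script, the returned value could be passed to `sys.exit()` so the build fails when the budget is exceeded.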

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary
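
The attributes above are enough to organize matches by the pair of files involved. The following sketch groups matches by unordered file pair; `group_by_file_pair` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for `DuplicateMatch` instances with made-up values.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_file_pair(duplicates):
    """Group matches by unordered (file1, file2) pair."""
    groups = defaultdict(list)
    for dup in duplicates:
        # Sort the pair so (a, b) and (b, a) land in the same group.
        key = tuple(sorted((dup.file1, dup.file2)))
        groups[key].append(dup)
    return dict(groups)

# Stand-ins for DuplicateMatch objects (values are invented for illustration).
matches = [
    SimpleNamespace(file1="a.py", file2="b.rs", start_line1=10, start_line2=4,
                    length=60, similarity=0.95),
    SimpleNamespace(file1="b.rs", file2="a.py", start_line1=40, start_line2=80,
                    length=55, similarity=0.88),
    SimpleNamespace(file1="a.py", file2="a.py", start_line1=1, start_line2=120,
                    length=70, similarity=1.0),
]

groups = group_by_file_pair(matches)
for (f1, f2), dups in sorted(groups.items()):
    best = max(d.similarity for d in dups)
    print(f"{f1} / {f2}: {len(dups)} match(es), best {best * 100:.1f}%")
```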

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration

Benchmark Example

import polydup
import time

start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
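
To make the normalization and rolling-hash steps concrete, here is a small pure-Python sketch of the general technique. It is not PolyDup's implementation: it tokenizes with a regular expression instead of Tree-sitter, preserves only Python keywords, uses a small window instead of 50, and the `$$NUM` and `$$STR` placeholders, the hash constants, and all function names are assumptions made for illustration.

```python
import keyword
import re
import zlib
from collections import defaultdict

# Identifiers, numbers, quoted strings, or any single non-space character.
TOKEN_RE = re.compile(r"""[A-Za-z_]\w*|\d+(?:\.\d+)?|"[^"]*"|'[^']*'|\S""")
BASE = 257
MOD = (1 << 61) - 1

def normalize(source):
    """Tokenize and replace identifiers and literals with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if keyword.iskeyword(tok) else "$$ID")
        elif tok[0].isdigit():
            tokens.append("$$NUM")
        elif tok[0] in ('"', "'"):
            tokens.append("$$STR")
        else:
            tokens.append(tok)
    return tokens

def rolling_hashes(tokens, window):
    """Yield (start_index, hash) for every run of `window` consecutive tokens."""
    if len(tokens) < window:
        return
    codes = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for c in codes[:window]:
        h = (h * BASE + c) % MOD
    yield 0, h
    for i in range(window, len(codes)):
        h = (h - codes[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + codes[i]) % MOD           # append the newest token
        yield i - window + 1, h

def find_clones(sources, window=5):
    """Return {hash: [(name, token_index), ...]} for windows seen in 2+ sources."""
    index = defaultdict(list)
    for name, text in sources.items():
        for start, h in rolling_hashes(normalize(text), window):
            index[h].append((name, start))
    return {h: locs for h, locs in index.items()
            if len({name for name, _ in locs}) > 1}

sources = {
    "a.py": "total = price * count + 1",
    "b.js": "sum = cost * qty + 2",
}
clones = find_clones(sources, window=5)
for locations in clones.values():
    print(locations)
```

The two snippets differ in every identifier and literal, yet normalize to the same token sequence, so every window hashes identically. A fuller implementation would also compare the tokens of matching windows to rule out hash collisions.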

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0
