Cross-language duplicate code detector

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")
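When a scan reports many matches, it can help to group them by the pair of files they connect. The following is a minimal sketch, not part of the PolyDup API: group_by_file_pair is a hypothetical helper, and the SimpleNamespace objects are stand-ins that carry only the DuplicateMatch attributes listed in the API reference. In real use you would pass report.duplicates instead.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_file_pair(duplicates):
    """Group matches by the unordered pair of files they connect."""
    groups = defaultdict(list)
    for dup in duplicates:
        # Sort the two paths so (a, b) and (b, a) land in the same group.
        key = tuple(sorted((dup.file1, dup.file2)))
        groups[key].append(dup)
    return dict(groups)


# Stand-in objects (hypothetical data), shaped like DuplicateMatch.
sample = [
    SimpleNamespace(file1="a.py", file2="b.rs", similarity=0.91),
    SimpleNamespace(file1="b.rs", file2="a.py", similarity=0.88),
    SimpleNamespace(file1="a.py", file2="c.ts", similarity=0.97),
]

for (first, second), matches in group_by_file_pair(sample).items():
    best = max(m.similarity for m in matches)
    print(f"{first} <-> {second}: {len(matches)} match(es), best {best * 100:.1f}%")
```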

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
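Because the result is a plain dictionary, it can be post-processed before serialization. The sketch below filters matches by a stricter similarity cut-off. filter_report is a hypothetical helper, the sample dictionary is hand-written, and it assumes that each entry in "duplicates" uses the DuplicateMatch attribute names as keys, which this README does not state explicitly.

```python
import json


def filter_report(report_dict, min_similarity):
    """Return a shallow copy of the report keeping only matches at or above min_similarity."""
    kept = [d for d in report_dict["duplicates"] if d["similarity"] >= min_similarity]
    return {**report_dict, "duplicates": kept}


# Hand-written stand-in for the value returned by find_duplicates_dict().
# ASSUMPTION: per-match keys mirror the DuplicateMatch attribute names.
sample = {
    "files_scanned": 2,
    "functions_analyzed": 5,
    "duplicates": [
        {"file1": "a.py", "file2": "b.py", "similarity": 0.95},
        {"file1": "a.py", "file2": "c.py", "similarity": 0.86},
    ],
    "stats": {"duration_ms": 12},
}

strict = filter_report(sample, 0.9)
print(json.dumps(strict, indent=2, sort_keys=True))
```

The original dictionary is left untouched, so the same scan result can be filtered at several cut-offs without rescanning.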

Concurrent Execution

Important: PolyDup releases the GIL while scanning, so other Python threads keep running and several scans can proceed in parallel:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")
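With as_completed, results arrive in completion order, so it is often useful to remember which path each future belongs to and to keep one failed scan from discarding the others. The following sketch shows one way to do that. scan_many and fake_scan are hypothetical names; fake_scan stands in for polydup.find_duplicates so the example runs without the extension installed.

```python
import concurrent.futures


def scan_many(scan, paths, max_workers=4):
    """Run scan([path]) per path on a thread pool; return {path: result or exception}."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the path it was submitted for.
        future_to_path = {executor.submit(scan, [p]): p for p in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except RuntimeError as exc:  # the failure mode documented for find_duplicates
                results[path] = exc
    return results


# Stand-in scanner (hypothetical); pass polydup.find_duplicates in real use.
def fake_scan(paths):
    if "missing" in paths[0]:
        raise RuntimeError(f"cannot scan {paths[0]}")
    return {"paths": paths}


outcome = scan_many(fake_scan, ["./project1", "./missing", "./project3"])
for path in sorted(outcome):
    print(path, "->", outcome[path])
```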

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
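A thin wrapper can catch argument mistakes before the scan starts and attach the scanned paths to any RuntimeError. This is a sketch under assumptions: checked_scan and fake_scan are hypothetical, and whether PolyDup itself validates these arguments is not documented here. The checks simply follow the parameter descriptions above.

```python
def checked_scan(scan, paths, min_block_size=50, threshold=0.85):
    """Validate arguments, call scan, and re-raise failures with the paths attached."""
    if isinstance(paths, (str, bytes)):
        # A common slip: passing './src' instead of ['./src'].
        raise TypeError("paths must be a list of path strings, not a single string")
    if min_block_size < 1:
        raise ValueError("min_block_size must be a positive number of tokens")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    try:
        return scan(list(paths), min_block_size=min_block_size, threshold=threshold)
    except RuntimeError as exc:
        raise RuntimeError(f"scan of {list(paths)} failed: {exc}") from exc


# Stand-in scanner (hypothetical) that echoes its arguments;
# pass polydup.find_duplicates in real use.
def fake_scan(paths, min_block_size=50, threshold=0.85):
    return {"paths": paths, "min_block_size": min_block_size, "threshold": threshold}


print(checked_scan(fake_scan, ["./src"], threshold=0.9))
try:
    checked_scan(fake_scan, "./src")
except TypeError as exc:
    print("rejected:", exc)
```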


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary
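The raw counters combine into a few derived figures that are easier to compare between runs. The sketch below works on a plain dictionary. derived_metrics is a hypothetical helper, the sample values are made up, and it assumes that ScanStats.to_dict() uses the attribute names above as keys.

```python
def derived_metrics(stats):
    """Compute ratios from a ScanStats-shaped dict, guarding against zero denominators."""
    seconds = stats["duration_ms"] / 1000
    lines = stats["total_lines"]
    tokens = stats["total_tokens"]
    return {
        "tokens_per_line": tokens / lines if lines else 0.0,
        # None signals that the scan was too fast to measure at millisecond resolution.
        "tokens_per_second": tokens / seconds if seconds > 0 else None,
    }


# Made-up sample values.
# ASSUMPTION: to_dict() keys mirror the ScanStats attribute names.
sample_stats = {
    "total_lines": 200,
    "total_tokens": 1500,
    "unique_hashes": 40,
    "duration_ms": 250,
}
print(derived_metrics(sample_stats))
```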

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration
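One way to observe the first point is to run a heartbeat thread alongside a blocking call and count how often it wakes up. If the call held the GIL for its whole duration, the counter would barely move. The sketch below is an illustration only: run_with_heartbeat is a hypothetical helper, and time.sleep (which also releases the GIL) stands in for a scan so the example runs without the extension.

```python
import threading
import time


def run_with_heartbeat(blocking_call, interval=0.01):
    """Run blocking_call while a background thread counts wake-ups; return (result, ticks)."""
    ticks = 0
    done = threading.Event()

    def heartbeat():
        nonlocal ticks
        # Event.wait returns False on timeout, so each timeout is one tick.
        while not done.wait(interval):
            ticks += 1

    worker = threading.Thread(target=heartbeat)
    worker.start()
    try:
        result = blocking_call()
    finally:
        done.set()
        worker.join()
    return result, ticks


# Stand-in for a scan; in real use, pass
# lambda: polydup.find_duplicates(['./src']) instead.
result, ticks = run_with_heartbeat(lambda: time.sleep(0.2) or "done")
print(f"result={result!r}, heartbeat ticks while blocked: {ticks}")
```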

Benchmark Example

import polydup
import time

start = time.perf_counter()  # Monotonic clock, better suited to timing than time.time()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores
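The normalization and hashing steps can be illustrated in pure Python. This is a simplified sketch of the general technique, not PolyDup's implementation: a regular expression stands in for Tree-sitter, the keyword set, the $$LIT placeholder, the hash base and modulus, and the function names are all assumptions, and a window of 5 is used instead of 50 to keep the example small.

```python
import re
import zlib

KEYWORDS = {"def", "return", "if", "else", "for", "while", "in"}
# Identifiers, numbers, quoted strings, then any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"]*\"|'[^']*'|\S")
BASE, MOD = 257, (1 << 61) - 1


def normalize(source):
    """Replace identifiers with $$ID and literals with $$LIT; keep keywords and punctuation."""
    out = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("$$ID")
        elif tok[0].isdigit() or tok[0] in "\"'":
            out.append("$$LIT")
        else:
            out.append(tok)
    return out


def window_hashes(tokens, window):
    """Rabin-Karp: hash every run of `window` tokens, updating in O(1) per step."""
    values = [zlib.crc32(t.encode()) for t in tokens]
    if window <= 0 or len(values) < window:
        return []
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    hashes = [h]
    for i in range(window, len(values)):
        # Drop the oldest token, shift the rest, append the newest.
        h = ((h - values[i - window] * high) * BASE + values[i]) % MOD
        hashes.append(h)
    return hashes


a = normalize("def total(xs): return sum(xs) + 1")
b = normalize("def count(items): return sum(items) + 42")
print(a == b)  # the two functions differ only in names and literals
shared = set(window_hashes(a, 5)) & set(window_hashes(b, 5))
print(f"{len(shared)} shared window hash(es)")
```

Matching window hashes only nominate candidate blocks. A real detector would still compare the underlying tokens to rule out hash collisions.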

See architecture-research.md for detailed algorithm analysis.

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0
