Cross-language duplicate code detector

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
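Once the report is a plain dict, ordinary Python tooling applies. The following is a minimal sketch that counts matches per file pair. It assumes each entry in duplicates carries the same keys as the DuplicateMatch attributes (file1, file2, and so on); pairs_by_frequency is a hypothetical helper, and the sample data is hand-written rather than real PolyDup output.

```python
from collections import Counter


def pairs_by_frequency(report_dict):
    """Count duplicate matches per unordered (file, file) pair, most frequent first."""
    counts = Counter(
        tuple(sorted((dup["file1"], dup["file2"])))
        for dup in report_dict["duplicates"]
    )
    return counts.most_common()


# Hand-written sample in the documented shape; not real PolyDup output.
sample = {
    "files_scanned": 3,
    "functions_analyzed": 12,
    "duplicates": [
        {"file1": "src/a.py", "file2": "lib/b.js", "similarity": 0.91},
        {"file1": "lib/b.js", "file2": "src/a.py", "similarity": 0.88},
        {"file1": "src/a.py", "file2": "src/c.rs", "similarity": 0.95},
    ],
    "stats": {"duration_ms": 42},
}

for (first, second), count in pairs_by_frequency(sample):
    print(f"{first} <-> {second}: {count} match(es)")
```

In real use, the result of polydup.find_duplicates_dict(...) would be passed in place of sample.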

Concurrent Execution

Important: PolyDup releases the GIL during scanning, so other Python threads can keep running while a scan is in progress:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
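Because a failed scan raises RuntimeError, batch jobs may want to log the failure and move on. The sketch below shows one way to do that; scan_or_none is a hypothetical helper, not part of PolyDup, and it takes the scan function as an argument so it can be exercised without the extension installed.

```python
def scan_or_none(scan, paths, **options):
    """Call scan(paths, **options); return None instead of raising RuntimeError."""
    try:
        return scan(paths, **options)
    except RuntimeError as exc:
        print(f"Scan failed for {paths}: {exc}")
        return None


# Intended use (requires polydup to be installed):
#   import polydup
#   report = scan_or_none(polydup.find_duplicates, ["./src"], threshold=0.9)
#   if report is not None:
#       print(len(report.duplicates))
```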


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.5.4")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary
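The attributes above are enough to filter a report down to cross-language matches. The sketch below compares file extensions, which is only a rough heuristic (for example, .js and .ts would count as different languages). Match is a stand-in namedtuple with a subset of the DuplicateMatch attribute names, used here so the example runs without PolyDup; cross_language is a hypothetical helper.

```python
from collections import namedtuple
from pathlib import PurePath

# Stand-in mirroring a subset of DuplicateMatch attributes, for illustration only.
Match = namedtuple("Match", "file1 file2 start_line1 start_line2 length similarity")


def cross_language(duplicates):
    """Keep matches whose two files have different extensions."""
    return [
        d for d in duplicates
        if PurePath(d.file1).suffix.lower() != PurePath(d.file2).suffix.lower()
    ]


# Made-up matches; real ones would come from report.duplicates.
matches = [
    Match("src/auth.py", "web/auth.ts", 10, 42, 80, 0.93),
    Match("src/auth.py", "src/users.py", 10, 7, 64, 0.88),
]

for d in sorted(cross_language(matches), key=lambda d: d.similarity, reverse=True):
    print(f"{d.file1}:{d.start_line1} <-> {d.file2}:{d.start_line2} ({d.similarity:.0%})")
```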

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration

Benchmark Example

import polydup
import time

start = time.perf_counter()  # Monotonic clock, better suited to timing than time.time()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
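To make the normalization and rolling-hash steps concrete, here is a small pure-Python sketch. It is not PolyDup's implementation: it tokenizes with a regular expression instead of Tree-sitter, recognizes only Python keywords, and uses a window of 5 tokens so the toy input fits. The $$NUM and $$STR placeholders, the BASE and MOD constants, and all function names are assumptions made for illustration; only $$ID appears in the description above.

```python
import keyword
import re
import zlib

# Crude tokenizer: identifiers, numbers, quoted strings, then any other symbol.
TOKEN_RE = re.compile(r"""[A-Za-z_]\w*|\d+(?:\.\d+)?|"[^"]*"|'[^']*'|\S""")

BASE = 257
MOD = (1 << 61) - 1  # Large Mersenne prime keeps accidental collisions rare


def normalize(source):
    """Replace identifiers and literals with placeholders (Type-2 normalization)."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if keyword.iskeyword(tok) else "$$ID")
        elif tok[0].isdigit():
            tokens.append("$$NUM")
        elif tok[0] in "\"'":
            tokens.append("$$STR")
        else:
            tokens.append(tok)
    return tokens


def rolling_hashes(tokens, window):
    """Yield (start_index, hash) for every run of `window` consecutive tokens."""
    if window <= 0 or len(tokens) < window:
        return
    codes = [zlib.crc32(t.encode()) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # Weight of the token leaving the window
    h = 0
    for c in codes[:window]:
        h = (h * BASE + c) % MOD
    yield 0, h
    for i in range(window, len(codes)):
        h = (h - codes[i - window] * high) % MOD  # Drop the oldest token
        h = (h * BASE + codes[i]) % MOD           # Append the newest token
        yield i - window + 1, h


def shared_windows(src_a, src_b, window=5):
    """Return (start_a, start_b) pairs whose normalized windows hash equally."""
    seen = {}
    for start, h in rolling_hashes(normalize(src_a), window):
        seen.setdefault(h, []).append(start)
    return [
        (a, b)
        for b, h in rolling_hashes(normalize(src_b), window)
        for a in seen.get(h, [])
    ]


a = "total = price * qty + 1"
b = "sum = cost * n + 42"
print(normalize(a))
print(shared_windows(a, b))
```

The two snippets differ only in names and literals, so they normalize to the same token sequence and every window matches. Equal hashes only suggest a match; a production detector would typically compare the underlying tokens to rule out collisions.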

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0
