PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
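
Once the report is a plain dictionary, ordinary Python tooling applies. The sketch below counts duplicate blocks per file pair. It assumes each entry in `duplicates` uses the same field names as `DuplicateMatch` (`file1`, `file2`, and so on); `summarize_by_file_pair` and the sample data are illustrative and not part of PolyDup.

```python
import json
from collections import defaultdict


def summarize_by_file_pair(report_dict):
    """Count duplicate blocks per (file1, file2) pair in a report dictionary."""
    counts = defaultdict(int)
    for dup in report_dict.get("duplicates", []):
        # Sort the pair so (a, b) and (b, a) are counted together.
        pair = tuple(sorted((dup["file1"], dup["file2"])))
        counts[pair] += 1
    # JSON object keys must be strings, so join each pair into one key.
    return {f"{a} <-> {b}": n for (a, b), n in sorted(counts.items())}


# Hand-written sample in the documented shape (field names inside each
# duplicate entry are assumed); real data would come from
# polydup.find_duplicates_dict(...).
sample_report = {
    "files_scanned": 3,
    "functions_analyzed": 12,
    "duplicates": [
        {"file1": "src/a.py", "file2": "src/b.py", "similarity": 0.95},
        {"file1": "src/b.py", "file2": "src/a.py", "similarity": 0.91},
        {"file1": "src/a.py", "file2": "lib/c.js", "similarity": 0.88},
    ],
    "stats": {"duration_ms": 42},
}

summary = summarize_by_file_pair(sample_report)
print(json.dumps(summary, indent=2))
```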

Concurrent Execution

PolyDup releases the GIL during scanning, so scans submitted from separate Python threads can run in parallel:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")
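
When scanning several projects, it can help to keep track of which result belongs to which path and to keep one failing scan from aborting the rest. The following is a minimal sketch of that pattern; `scan_many` and `fake_scan` are hypothetical helpers, and the scanner is passed in as a callable so the example runs without PolyDup installed.

```python
import concurrent.futures


def scan_many(scan, paths, max_workers=4):
    """Run scan(path) for each path in a thread pool.

    Returns (results, errors): two dicts keyed by path.
    """
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its path so results can be labelled.
        future_to_path = {executor.submit(scan, p): p for p in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except RuntimeError as exc:
                # find_duplicates is documented to raise RuntimeError on failure.
                errors[path] = str(exc)
    return results, errors


# Stand-in scanner for illustration only. In real use, pass something like:
#   lambda p: polydup.find_duplicates([p])
def fake_scan(path):
    if "missing" in path:
        raise RuntimeError(f"cannot read {path}")
    return {"path": path, "duplicates": []}


results, errors = scan_many(fake_scan, ["./project1", "./project2", "./missing"])
print(sorted(results), errors)
```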

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
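
The parameter ranges above can be checked on the Python side before a scan starts. This is only a sketch: `check_scan_args` is a hypothetical helper, and whether PolyDup performs similar validation itself is not stated here.

```python
def check_scan_args(paths, min_block_size=50, threshold=0.85):
    """Validate arguments against the documented ranges; raise ValueError otherwise."""
    if isinstance(paths, str):
        # Passing a bare string where a list is expected is a common mistake.
        raise ValueError("paths must be a list of strings, not a single string")
    paths = list(paths)
    if not paths or not all(isinstance(p, str) for p in paths):
        raise ValueError("paths must be a non-empty list of strings")
    if (
        isinstance(min_block_size, bool)
        or not isinstance(min_block_size, int)
        or min_block_size < 1
    ):
        raise ValueError("min_block_size must be a positive integer")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return {
        "paths": paths,
        "min_block_size": min_block_size,
        "threshold": float(threshold),
    }


kwargs = check_scan_args(["./src", "./lib"], min_block_size=30, threshold=0.9)
print(kwargs)
# In real use: report = polydup.find_duplicates(**kwargs)
```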


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary
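
The attributes above are enough to rank and filter matches in plain Python. The sketch below keeps only matches that span two different files and orders them by similarity, then length. `FakeMatch` is a stand-in dataclass that mirrors the documented attribute names so the example runs without PolyDup; `top_cross_file` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class FakeMatch:
    """Stand-in with the documented DuplicateMatch attribute names."""
    file1: str
    file2: str
    start_line1: int
    start_line2: int
    length: int
    similarity: float


def top_cross_file(duplicates, limit=10):
    """Return the most similar, then longest, matches spanning two different files."""
    cross = [d for d in duplicates if d.file1 != d.file2]
    cross.sort(key=lambda d: (d.similarity, d.length), reverse=True)
    return cross[:limit]


matches = [
    FakeMatch("a.py", "a.py", 10, 80, 60, 1.0),   # same file: filtered out
    FakeMatch("a.py", "b.js", 5, 12, 55, 0.90),
    FakeMatch("a.py", "c.rs", 30, 7, 120, 0.97),
]

for m in top_cross_file(matches, limit=2):
    print(f"{m.file1}:{m.start_line1} <-> {m.file2}:{m.start_line2}  {m.similarity:.0%}")
```

With a real report, `top_cross_file(report.duplicates)` should work the same way, since it only reads the documented attributes.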

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration
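
The first point can be observed with a small heartbeat thread that keeps counting while a scan runs. This is a sketch under assumptions: `run_with_heartbeat` is a hypothetical helper, and `time.sleep` (which also releases the GIL) stands in for the scan so the example runs without PolyDup installed.

```python
import threading
import time


def run_with_heartbeat(scan, interval=0.01):
    """Run scan() in the calling thread while a background thread keeps ticking."""
    ticks = 0
    done = threading.Event()

    def heartbeat():
        nonlocal ticks
        # Event.wait returns False on timeout and True once the scan has finished.
        while not done.wait(interval):
            ticks += 1

    worker = threading.Thread(target=heartbeat)
    worker.start()
    try:
        result = scan()
    finally:
        done.set()
        worker.join()
    return result, ticks


# time.sleep releases the GIL, so it stands in for a GIL-releasing scan here.
# In real use: run_with_heartbeat(lambda: polydup.find_duplicates(['./src']))
result, ticks = run_with_heartbeat(lambda: time.sleep(0.2) or "done")
print(f"scan returned {result!r}; heartbeat ticked {ticks} times meanwhile")
```

If a native call held the GIL for its whole duration, the heartbeat count would stay at or near zero until the call returned.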

Benchmark Example

import polydup
import time

# perf_counter() is a monotonic, high-resolution clock intended for timing
start = time.perf_counter()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
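
To make the normalization and hashing steps concrete, here is a small pure-Python sketch of the general technique. It is not PolyDup's implementation: a regex tokenizer stands in for Tree-sitter, the `$$NUM` and `$$STR` placeholders, the hash base and modulus, and all function names are assumptions chosen for illustration, and the window is shortened to 5 tokens to fit the tiny inputs.

```python
import keyword
import re
import zlib
from collections import defaultdict

# Identifiers, numbers, quoted strings, or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"]*\"|'[^']*'|\S")
BASE = 257
MOD = (1 << 61) - 1


def normalize(source):
    """Tokenize source, replacing identifiers and literals with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            # Keep language keywords; collapse every other name to one placeholder.
            tokens.append(tok if keyword.iskeyword(tok) else "$$ID")
        elif tok[0].isdigit():
            tokens.append("$$NUM")
        elif tok[0] in "\"'":
            tokens.append("$$STR")
        else:
            tokens.append(tok)
    return tokens


def window_hashes(tokens, window):
    """Yield (start_index, hash) for each run of `window` tokens (Rabin-Karp)."""
    if len(tokens) < window:
        return
    values = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(window, len(values)):
        # Drop the oldest token, shift the rest, then append the newest token.
        h = ((h - values[i - window] * high) * BASE + values[i]) % MOD
        yield i - window + 1, h


def find_clone_candidates(source_a, source_b, window=5):
    """Return (start_a, start_b) token offsets whose normalized windows match."""
    tokens_a, tokens_b = normalize(source_a), normalize(source_b)
    index = defaultdict(list)
    for start, h in window_hashes(tokens_a, window):
        index[h].append(start)
    pairs = []
    for start_b, h in window_hashes(tokens_b, window):
        for start_a in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if tokens_a[start_a:start_a + window] == tokens_b[start_b:start_b + window]:
                pairs.append((start_a, start_b))
    return pairs


a = "total = price * count + 1"
b = "sum_ = cost * qty + 42"
print(normalize(a))
print(find_clone_candidates(a, b, window=5))
```

The two snippets differ in every name and literal, yet they normalize to the same token sequence, which is what makes them Type-2 clones. Because each window hash is derived from the previous one in constant time, hashing a file costs time linear in its token count regardless of the window size.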

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0
