Cross-language duplicate code detector

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))

Concurrent Execution

Critical: PolyDup releases the GIL during scanning, so other Python threads keep running while a scan is in progress and several scans can execute in parallel:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
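
A minimal sketch of handling the documented RuntimeError. The helper name scan_or_none is hypothetical and not part of PolyDup. The scanner is passed in as a callable, an assumption made so the snippet runs without the compiled extension; in practice you would pass polydup.find_duplicates.

```python
import sys
from typing import Any, Callable, Optional, Sequence


def scan_or_none(
    scan: Callable[..., Any],
    paths: Sequence[str],
    min_block_size: int = 50,
    threshold: float = 0.85,
) -> Optional[Any]:
    """Run a scan and return its report, or None if the scan raised RuntimeError.

    `scan` is expected to behave like polydup.find_duplicates. This helper
    is illustrative only and is not part of the PolyDup API.
    """
    # Reject thresholds outside the documented 0.0-1.0 range before scanning.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be within 0.0-1.0, got {threshold}")
    try:
        return scan(list(paths), min_block_size=min_block_size, threshold=threshold)
    except RuntimeError as exc:
        # Documented failure mode: RuntimeError is raised if scanning fails.
        print(f"scan failed for {list(paths)}: {exc}", file=sys.stderr)
        return None
```

Usage would look like report = scan_or_none(polydup.find_duplicates, ['./src']), followed by a check for None before reading the report.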


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)
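
The dictionary form lends itself to plain-Python aggregation. The sketch below counts how many matches touch each file. It assumes that every entry in duplicates carries file1 and file2 keys mirroring the DuplicateMatch attributes, which this reference does not state explicitly; the helper name duplicates_per_file is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def duplicates_per_file(report_dict: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Count how many duplicate matches touch each file, most affected first."""
    counts: Counter = Counter()
    for dup in report_dict.get("duplicates", []):
        # Assumed keys: "file1" and "file2", mirroring DuplicateMatch attributes.
        # A match inside a single file is counted once for that file.
        for path in {dup["file1"], dup["file2"]}:
            counts[path] += 1
    # Sort by count (descending), then by path, so the output is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```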

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary
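
Because each match exposes both file paths, cross-language pairs can be picked out after a scan. The following is a sketch only: cross_language_matches is a hypothetical helper, and it uses the file extension as a rough proxy for language (so .js and .ts count as different), which is an assumption of this example rather than how PolyDup identifies languages.

```python
from pathlib import PurePath
from typing import Any, Iterable, List


def cross_language_matches(duplicates: Iterable[Any], min_similarity: float = 0.0) -> List[Any]:
    """Keep matches whose two files have different extensions.

    Works on any objects exposing file1, file2 and similarity attributes,
    such as the DuplicateMatch objects in report.duplicates.
    """
    selected = [
        dup
        for dup in duplicates
        if PurePath(dup.file1).suffix.lower() != PurePath(dup.file2).suffix.lower()
        and dup.similarity >= min_similarity
    ]
    # Highest similarity first; equal scores keep their original order
    # because Python's sort is stable.
    return sorted(selected, key=lambda dup: dup.similarity, reverse=True)
```

For example, cross_language_matches(report.duplicates, min_similarity=0.9) would return only the strongest pairs that span two file types.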

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary
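
The ScanStats fields can be combined into simple rates. This is a sketch with a hypothetical helper name, derived_metrics; it relies only on the attributes listed above and guards against a zero duration or an empty scan.

```python
from typing import Any, Dict


def derived_metrics(stats: Any) -> Dict[str, float]:
    """Compute rates from an object exposing ScanStats-like attributes."""
    seconds = stats.duration_ms / 1000.0
    return {
        # Throughput based on the duration reported by the scanner itself.
        "tokens_per_second": stats.total_tokens / seconds if seconds > 0 else 0.0,
        # Average token density of the scanned code.
        "tokens_per_line": stats.total_tokens / stats.total_lines if stats.total_lines else 0.0,
    }
```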

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration

Benchmark Example

import polydup
import time

start = time.perf_counter()  # monotonic, high-resolution clock suited to timing
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId → $$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.
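
To illustrate how normalization and a rolling hash combine, here is a small pure-Python sketch. It is not PolyDup's implementation: a regex tokenizer stands in for Tree-sitter, the $$LIT placeholder for literals and the keyword list are assumptions, and the default window is 5 tokens rather than 50 so that short snippets produce output.

```python
import re
import zlib
from collections import defaultdict
from typing import List, Tuple

# Illustrative keyword list; a real tool would take this from the grammar.
KEYWORDS = {"def", "return", "if", "else", "for", "while", "function", "fn", "let", "const"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"]*\"|'[^']*'|\S")
BASE = 257
MOD = (1 << 61) - 1  # large prime modulus keeps collisions rare


def normalize(source: str) -> List[str]:
    """Tokenize with a regex and replace identifiers and literals with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("$$ID")   # e.g. userId -> $$ID
        elif tok[0].isdigit() or tok[0] in "\"'":
            tokens.append("$$LIT")  # assumed placeholder for literals
        else:
            tokens.append(tok)      # punctuation and operators stay as-is
    return tokens


def window_hashes(tokens: List[str], window: int) -> List[int]:
    """Return the Rabin-Karp hash of every `window`-token slice, in order."""
    if window <= 0 or len(tokens) < window:
        return []
    # crc32 gives a value that is stable across runs, unlike built-in hash() on str.
    values = [zlib.crc32(tok.encode("utf-8")) + 1 for tok in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    hashes = [h]
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        hashes.append(h)
    return hashes


def shared_windows(a: str, b: str, window: int = 5) -> List[Tuple[int, int]]:
    """Return (offset_in_a, offset_in_b) token offsets whose windows match."""
    ta, tb = normalize(a), normalize(b)
    index = defaultdict(list)
    for i, h in enumerate(window_hashes(ta, window)):
        index[h].append(i)
    matches = []
    for j, h in enumerate(window_hashes(tb, window)):
        for i in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if ta[i:i + window] == tb[j:j + window]:
                matches.append((i, j))
    return matches
```

Two functions that differ only in names and constants, such as "def total(userId, count): return userId + count * 2" and "def sum_up(accountId, n): return accountId + n * 10", normalize to the same token stream, so shared_windows reports matching offsets for them.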

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0
