Skip to main content

Cross-language duplicate code detector

Project description

PolyDup Python Bindings

Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
  • Zero-copy architecture: Direct FFI to Rust core for minimal overhead

Installation

From Source (Development)

cd crates/polydup-py
maturin develop --release

From PyPI (Future)

pip install polydup

Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))

Concurrent Execution

Critical: PolyDup releases the GIL during scanning, allowing concurrent Python code:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails


find_duplicates_dict(paths, min_block_size=50, threshold=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.1.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)

Methods:

  • to_dict(): Convert to Python dictionary

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration

Benchmark Example

import polydup
import time

start = time.time()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.time() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Algorithm

PolyDup uses:

  • Tree-sitter for language-agnostic AST parsing
  • Token normalization for Type-2 clone detection (e.g., userId$$ID)
  • Rabin-Karp rolling hash with window size 50 for efficient similarity detection
  • Rayon for parallel processing across CPU cores

See architecture-research.md for detailed algorithm analysis.

Development

Build

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

Type Checking

pip install mypy
mypy test.py

License

MIT OR Apache-2.0

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

polydup-0.1.3.tar.gz (52.8 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polydup-0.1.3-cp311-cp311-win_amd64.whl (1.3 MB view details)

Uploaded CPython 3.11Windows x86-64

polydup-0.1.3-cp311-cp311-macosx_11_0_arm64.whl (1.5 MB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

polydup-0.1.3-cp311-cp311-macosx_10_12_x86_64.whl (1.6 MB view details)

Uploaded CPython 3.11macOS 10.12+ x86-64

polydup-0.1.3-cp310-cp310-manylinux_2_34_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

polydup-0.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

File details

Details for the file polydup-0.1.3.tar.gz.

File metadata

  • Download URL: polydup-0.1.3.tar.gz
  • Upload date:
  • Size: 52.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.10.2

File hashes

Hashes for polydup-0.1.3.tar.gz
Algorithm Hash digest
SHA256 5b2573919020b2a71cf9f430727693ffe28ea10d2a396dd78e83d3ee9415fed8
MD5 e79137d765bb9ac2482172438fb2a2c0
BLAKE2b-256 9997937097aa18b7773b89fab336e3b483da5b4c487eb733014403c5986f4f3f

See more details on using hashes here.

File details

Details for the file polydup-0.1.3-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: polydup-0.1.3-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 1.3 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.10.2

File hashes

Hashes for polydup-0.1.3-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 78f7ccc5d49dbe1b31ad16b43530084c6cde2327c7521a668387121464a408ef
MD5 10bcf39d1d6b3245481d63c7fbaa274b
BLAKE2b-256 2fa81289a41fd0a7f09dee7908411a1c01851b3710081e5fa7f1782dc143a1d4

See more details on using hashes here.

File details

Details for the file polydup-0.1.3-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polydup-0.1.3-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7de597c62c51ede154c87a7a202c3e2ae24bf600ab44db2bb0e0c9e5ee991983
MD5 b3e873567b4b10a979656c9ec3b44837
BLAKE2b-256 74174d39e70407c35111c311181093e83c215cae11bded5640721e19818b2048

See more details on using hashes here.

File details

Details for the file polydup-0.1.3-cp311-cp311-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for polydup-0.1.3-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 7c24fbf9f19e35e0f5e23f91cc78b9080175ae977e91aa71f57f4d76f7189cb0
MD5 0236efbab9442cd2d38d2e77d204afce
BLAKE2b-256 49e8bb8e7bee5b1c39cdc7007e0fd6382f82eee2ceca0abc693dba9b74416fce

See more details on using hashes here.

File details

Details for the file polydup-0.1.3-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for polydup-0.1.3-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 49d2b00915b1083dc0c1ee17cab1f11dd31edfc5f25a2fdc560a7ab8d062ebe1
MD5 47cd4e038943924f05ccd07e422f2f53
BLAKE2b-256 6630a997fe41e152cb866da2a095f3cc841fca9537432bb8e87ca4f6d910b07b

See more details on using hashes here.

File details

Details for the file polydup-0.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for polydup-0.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 8987a031c7e5ce12053e4de3190a31513f657beb24a372582578b602c329dafb
MD5 fa283f96ff34780a9840a7036ce4b87b
BLAKE2b-256 97021f9e062113fb91ad6a870ed4d186ee808890b7674c7b02dc3845d6d8322c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page