Cross-language duplicate code detector - Python library and CLI

PolyDup Python Package

Cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing. This package provides both a CLI tool and Python library bindings.

Features

  • Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
  • Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
  • Type-3 clone detection: Gap-tolerant matching for near-duplicates
  • CLI included: Full command-line interface bundled in the package
  • GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
  • Parallel processing: Built on Rayon for multi-core performance
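
To illustrate how Type-2 detection with Rabin-Karp hashing can work in principle, here is a minimal sketch. It is not PolyDup's implementation: the regex tokenizer stands in for Tree-sitter, and the names `normalize`, `window_hashes`, and `find_clones` are hypothetical. Identifiers and numeric literals are replaced with placeholders, then a rolling hash is computed over fixed-size token windows so that matching windows can be found by hash lookup.

```python
import re
import zlib
from collections import defaultdict

BASE = 257
MOD = (1 << 61) - 1
KEYWORDS = {"def", "return", "if", "else", "for", "while", "in"}


def normalize(source: str) -> list[str]:
    """Type-2 style normalization: identifiers become ID, numbers become NUM."""
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out


def window_hashes(tokens, window):
    """Yield (start_index, hash) for every run of `window` consecutive tokens."""
    if len(tokens) < window:
        return
    codes = [zlib.crc32(t.encode()) for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for c in codes[:window]:
        h = (h * BASE + c) % MOD
    yield 0, h
    for i in range(window, len(codes)):
        h = (h - codes[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + codes[i]) % MOD           # append the newest token
        yield i - window + 1, h


def find_clones(a: str, b: str, window: int = 8):
    """Return (start_a, start_b) token offsets where normalized windows match."""
    ta, tb = normalize(a), normalize(b)
    index = defaultdict(list)
    for start, h in window_hashes(ta, window):
        index[h].append(start)
    matches = []
    for start_b, h in window_hashes(tb, window):
        for start_a in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if ta[start_a:start_a + window] == tb[start_b:start_b + window]:
                matches.append((start_a, start_b))
    return matches


first = "def add(x, y): return x + y * 2"
second = "def sum_two(a, b): return a + b * 10"
print(find_clones(first, second))
```

The two snippets differ only in names and literals, so they normalize to the same token sequence and their windows hash identically.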

Installation

pip install polydup

CLI Usage

After installation, the polydup command is available:

# Scan a directory for duplicates
polydup scan ./src

# Scan with custom settings
polydup scan ./src --min-block-size 30 --similarity 0.9

# Output as JSON
polydup scan ./src --format json

# Enable Type-3 detection for near-duplicates
polydup scan ./src --enable-type3

# Scan only changed files in a git repository
polydup scan --git-diff origin/main..HEAD

# Get help
polydup --help
polydup scan --help

Exit Codes

  • 0: No duplicates found (clean)
  • 1: Duplicates found
  • 2: Error occurred
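
The exit codes above make the CLI easy to wrap in a CI step. The following is a minimal sketch, assuming `polydup` is on the `PATH`; the helper names `describe_exit_code` and `run_scan` are hypothetical and not part of the package.

```python
import subprocess

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {
    0: "clean",
    1: "duplicates found",
    2: "error",
}


def describe_exit_code(code: int) -> str:
    """Translate a polydup exit code into a short label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_scan(path: str) -> str:
    """Run `polydup scan <path>` and return a label for its exit code."""
    result = subprocess.run(["polydup", "scan", path])
    return describe_exit_code(result.returncode)
```

A CI script could call `run_scan("./src")` and fail the build only on `"error"`, or on `"duplicates found"` as well, depending on how strict the project wants to be.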

Library Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} <-> {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")
    print(f"  Clone type: {dup.clone_type}")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
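
The dictionary form is also convenient for filtering and sorting with plain Python. The sketch below assumes each entry in `duplicates` carries the same fields as a `DuplicateMatch` (`file1`, `file2`, `similarity`); that mapping is an assumption, and the `sample` data is hand-written for illustration, not real PolyDup output. The helper `top_duplicates` is hypothetical.

```python
def top_duplicates(report_dict, min_similarity=0.95):
    """Return duplicate entries at or above min_similarity, best first."""
    hits = [
        d for d in report_dict["duplicates"]
        if d["similarity"] >= min_similarity
    ]
    return sorted(hits, key=lambda d: d["similarity"], reverse=True)


# Hand-written stand-in for a report dictionary (assumed entry keys).
sample = {
    "files_scanned": 4,
    "functions_analyzed": 9,
    "duplicates": [
        {"file1": "a.py", "file2": "b.rs", "similarity": 0.97},
        {"file1": "a.py", "file2": "c.js", "similarity": 1.0},
        {"file1": "a.py", "file2": "d.py", "similarity": 0.90},
    ],
    "stats": {},
}

for dup in top_duplicates(sample):
    print(f'{dup["file1"]} <-> {dup["file2"]}: {dup["similarity"]:.2f}')
```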

Type-3 Clone Detection

Enable fuzzy matching for near-duplicates:

import polydup

report = polydup.find_duplicates(
    paths=['./src'],
    min_block_size=50,
    threshold=0.85,
    enable_type3=True,      # Enable Type-3 detection
    type3_tolerance=0.85    # Similarity tolerance for Type-3 matches
)

for dup in report.duplicates:
    if dup.clone_type == "type-3":
        print(f"Near-duplicate: {dup.file1} <-> {dup.file2}")
        print(f"  Edit distance: {dup.edit_distance}")
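
With Type-3 detection enabled, a report can mix clone types, so a quick tally helps when reading the results. This is a minimal sketch that relies only on the documented `clone_type` attribute; the helper `count_by_clone_type` is hypothetical, and `SimpleNamespace` objects stand in for real `DuplicateMatch` instances so the example runs without scanning anything.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_clone_type(duplicates):
    """Count matches per clone type ("type-1", "type-2", "type-3")."""
    return Counter(dup.clone_type for dup in duplicates)


# Stand-ins for DuplicateMatch objects; only clone_type is modeled here.
fake_duplicates = [
    SimpleNamespace(clone_type="type-2"),
    SimpleNamespace(clone_type="type-3"),
    SimpleNamespace(clone_type="type-3"),
]

counts = count_by_clone_type(fake_duplicates)
for clone_type, n in sorted(counts.items()):
    print(f"{clone_type}: {n}")
```

With a real scan, `count_by_clone_type(report.duplicates)` would be called the same way.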

Concurrent Execution

PolyDup releases the GIL during scanning, so scans started from separate Python threads can run at the same time:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

projects = ['./project1', './project2', './project3']

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release.
    # Map each future back to its path so results can be labeled.
    futures = {executor.submit(scan_project, path): path for path in projects}

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"{futures[future]}: found {len(report.duplicates)} duplicates")

API Reference

find_duplicates(paths, min_block_size=50, threshold=0.85, enable_type3=False, type3_tolerance=0.85)

Scan files for duplicate code and return a Report object.

Parameters:

  • paths (list[str]): List of file or directory paths to scan
  • min_block_size (int, optional): Minimum code block size in tokens. Default: 50
  • threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85
  • enable_type3 (bool, optional): Enable Type-3 clone detection. Default: False
  • type3_tolerance (float, optional): Type-3 similarity tolerance. Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
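
Because a failed scan raises `RuntimeError`, callers that scan many paths may want to catch it and continue. The wrapper below is a sketch under that assumption; `scan_or_none` is a hypothetical helper that takes the scan function as an argument, so it can wrap `polydup.find_duplicates` or a stub in tests.

```python
import sys


def scan_or_none(scan, paths, **options):
    """Call scan(paths, **options); return None if it raises RuntimeError."""
    try:
        return scan(paths, **options)
    except RuntimeError as exc:
        print(f"scan failed for {paths}: {exc}", file=sys.stderr)
        return None
```

With the package installed, a call would look like `scan_or_none(polydup.find_duplicates, ['./src'], min_block_size=30)`.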


find_duplicates_dict(paths, min_block_size=50, threshold=0.85, enable_type3=False, type3_tolerance=0.85)

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

  • files_scanned (int)
  • functions_analyzed (int)
  • duplicates (list[dict])
  • stats (dict)

version()

Get the PolyDup library version.

Returns: str (e.g., "0.9.0")


Class: Report

Attributes:

  • files_scanned (int): Number of files processed
  • functions_analyzed (int): Number of functions extracted
  • duplicates (list[DuplicateMatch]): List of detected duplicates
  • stats (ScanStats): Performance metrics

Methods:

  • to_dict(): Convert to Python dictionary
  • __len__(): Returns number of duplicates

Class: DuplicateMatch

Attributes:

  • file1 (str): First file path
  • file2 (str): Second file path
  • start_line1 (int): Starting line in first file
  • start_line2 (int): Starting line in second file
  • length (int): Length in tokens
  • similarity (float): Similarity score (0.0-1.0)
  • hash (str): Rolling hash value (hex string)
  • clone_type (str): "type-1", "type-2", or "type-3"
  • edit_distance (int | None): Edit distance for Type-3 clones

Methods:

  • to_dict(): Convert to Python dictionary

Class: ScanStats

Attributes:

  • total_lines (int): Total lines of code processed
  • total_tokens (int): Total tokens analyzed
  • unique_hashes (int): Number of unique code blocks
  • duration_ms (int): Scan duration in milliseconds

Methods:

  • to_dict(): Convert to Python dictionary

Performance

PolyDup's Python bindings use PyO3's py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

  1. Concurrent Python execution: Other Python threads continue running
  2. True parallelism: Rust's Rayon uses all CPU cores
  3. Minimal overhead: Zero-copy FFI with direct Rust integration
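
One way to observe the first point is to run a heartbeat thread alongside a blocking call and count how often it ticks. If a native call held the GIL for its whole duration, the heartbeat thread could not run until the call returned. This is a minimal sketch: `count_ticks_during` is a hypothetical helper, and `time.sleep` (which also releases the GIL) stands in for a scan so the example runs without PolyDup installed.

```python
import threading
import time


def count_ticks_during(work, interval=0.01):
    """Run work() while a background thread ticks; return (result, ticks)."""
    ticks = 0
    done = threading.Event()

    def heartbeat():
        nonlocal ticks
        # Event.wait returns False on timeout, True once done is set.
        while not done.wait(interval):
            ticks += 1

    thread = threading.Thread(target=heartbeat)
    thread.start()
    try:
        result = work()
    finally:
        done.set()
        thread.join()
    return result, ticks


# time.sleep stands in for a scan; it returns None, so "finished" is the result.
result, ticks = count_ticks_during(lambda: time.sleep(0.2) or "finished")
print(f"{result}: heartbeat ticked {ticks} times")
```

To try it against a real scan, pass `lambda: polydup.find_duplicates(['./src'])` as `work`.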

Benchmark Example

import polydup
import time

# perf_counter() is a monotonic, high-resolution clock suited to timing
start = time.perf_counter()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Development

Build from Source

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

License

MIT OR Apache-2.0
