Cross-language duplicate code detector - Python library and CLI

These details have not been verified by PyPI

Project description

PolyDup Python Package

Cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing. This package provides both a CLI tool and Python library bindings.

Features

Multi-language support: Detects duplicates across Rust, Python, and JavaScript/TypeScript
Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
Type-3 clone detection: Gap-tolerant matching for near-duplicates
CLI included: Full command-line interface bundled in the package
GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
Parallel processing: Built on Rayon for multi-core performance
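
To make the Type-2 and Rabin-Karp terms above concrete, here is a minimal sketch of the general technique. It is not PolyDup's implementation (which is written in Rust on top of Tree-sitter). It assumes a toy regex tokenizer and a tiny keyword set, and the names `normalize`, `window_hashes`, `candidate_clones`, `BASE`, and `MOD` are illustrative only.

```python
import re
import zlib
from collections import defaultdict

# Assumption: a toy keyword set and tokenizer; a real tool would use a parser.
KEYWORDS = {"def", "return", "if", "else", "for", "while", "in"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"]*\"|'[^']*'|\S")

BASE = 257
MOD = (1 << 61) - 1


def normalize(source):
    """Type-2 normalization: identifiers become ID, literals become LIT."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit() or tok[0] in "\"'":
            tokens.append("LIT")
        else:
            tokens.append(tok)  # punctuation and operators stay as-is
    return tokens


def window_hashes(tokens, size):
    """Yield (start, hash) for each window of `size` tokens (Rabin-Karp)."""
    if size <= 0 or len(tokens) < size:
        return
    codes = [zlib.crc32(t.encode()) + 1 for t in tokens]
    high = pow(BASE, size - 1, MOD)  # weight of the token leaving the window
    h = 0
    for code in codes[:size]:
        h = (h * BASE + code) % MOD
    yield 0, h
    for i in range(size, len(codes)):
        # Drop the oldest token, shift, and append the newest in O(1).
        h = ((h - codes[i - size] * high) * BASE + codes[i]) % MOD
        yield i - size + 1, h


def candidate_clones(sources, size):
    """Group window positions by hash; shared hashes are clone candidates."""
    buckets = defaultdict(list)
    for name, src in sources.items():
        for start, h in window_hashes(normalize(src), size):
            buckets[h].append((name, start))
    return [locs for locs in buckets.values() if len(locs) > 1]
```

Calling `candidate_clones({"a.py": "def add(a, b): return a + 1", "b.py": "def plus(x, y): return x + 2"}, 12)` pairs the two functions, because they differ only in identifiers and literals. A production detector would also re-check each candidate token by token, since distinct windows can share a hash.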

Installation

pip install polydup

CLI Usage

After installation, the polydup command is available:

# Scan a directory for duplicates
polydup scan ./src

# Scan with custom settings
polydup scan ./src --min-block-size 30 --similarity 0.9

# Output as JSON
polydup scan ./src --format json

# Enable Type-3 detection for near-duplicates
polydup scan ./src --enable-type3

# Scan only changed files in a git repository
polydup scan --git-diff origin/main..HEAD

# Get help
polydup --help
polydup scan --help

Exit Codes

0: No duplicates found (clean)
1: Duplicates found
2: Error occurred
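
The exit codes above make the CLI easy to drive from a script or CI step. The following is a minimal sketch that uses only the documented `polydup scan <path>` invocation. The helper names `describe_exit` and `run_scan` are hypothetical and are not part of PolyDup.

```python
import subprocess

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {
    0: "clean",
    1: "duplicates found",
    2: "error",
}


def describe_exit(code):
    """Translate a polydup exit code into a short label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_scan(path):
    """Run `polydup scan <path>` and return (exit_code, label, stdout)."""
    proc = subprocess.run(
        ["polydup", "scan", path],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 is a normal result, not a failure
    )
    return proc.returncode, describe_exit(proc.returncode), proc.stdout
```

A CI job could call `run_scan("./src")` and fail the build only when the code is 2, treating 1 as a warning. Whether that policy is appropriate depends on the project.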

Library Usage

Basic Example

import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,    # Minimum tokens per code block
    threshold=0.85        # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} <-> {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")
    print(f"  Clone type: {dup.clone_type}")

Dictionary Output

For JSON serialization or dict-based workflows:

import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
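
A dictionary report can also be reduced to a few summary numbers before it is logged or stored. The sketch below relies on the top-level keys listed in the API reference (`files_scanned`, `duplicates`). It assumes, without confirmation from this document, that each entry in `duplicates` carries a `clone_type` key mirroring the `DuplicateMatch` attribute. `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(report_dict):
    """Reduce a report dictionary to a few summary numbers."""
    duplicates = report_dict["duplicates"]
    # Assumption: entries mirror DuplicateMatch and expose "clone_type".
    by_type = Counter(d.get("clone_type", "unknown") for d in duplicates)
    return {
        "files_scanned": report_dict["files_scanned"],
        "duplicate_count": len(duplicates),
        "by_clone_type": dict(by_type),
    }
```

With a real report this would be called as `summarize(polydup.find_duplicates_dict(paths=['./src']))`.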

Type-3 Clone Detection

Enable fuzzy matching for near-duplicates:

import polydup

report = polydup.find_duplicates(
    paths=['./src'],
    min_block_size=50,
    threshold=0.85,
    enable_type3=True,      # Enable Type-3 detection
    type3_tolerance=0.85    # Similarity tolerance for Type-3 matches
)

for dup in report.duplicates:
    if dup.clone_type == "type-3":
        print(f"Near-duplicate: {dup.file1} <-> {dup.file2}")
        print(f"  Edit distance: {dup.edit_distance}")
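
For intuition about gap-tolerant matching, the sketch below computes a token-level Levenshtein distance and derives a similarity score from it. This document does not specify how PolyDup computes `edit_distance` or applies `type3_tolerance`, so treat this as one common formulation rather than the library's definition. Both function names are hypothetical.

```python
def token_edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(
                prev[j] + 1,         # delete tok_a
                curr[j - 1] + 1,     # insert tok_b
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def token_similarity(a, b):
    """Similarity in [0.0, 1.0]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - token_edit_distance(a, b) / longest
```

Under this formulation, a block of 10 tokens with one extra token inserted has distance 1 and similarity 10/11 (about 0.91), which would pass a 0.85 cutoff.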

Concurrent Execution

PolyDup releases the GIL during scanning, so several scans can run concurrently from Python threads:

import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]

    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")

API Reference

`find_duplicates(paths, min_block_size=50, threshold=0.85, enable_type3=False, type3_tolerance=0.85)`

Scan files for duplicate code and return a Report object.

Parameters:

paths (list[str]): List of file or directory paths to scan
min_block_size (int, optional): Minimum code block size in tokens. Default: 50
threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85
enable_type3 (bool, optional): Enable Type-3 clone detection. Default: False
type3_tolerance (float, optional): Type-3 similarity tolerance. Default: 0.85

Returns: Report object with scan results

Raises: RuntimeError if scanning fails
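
Because a failed scan raises `RuntimeError`, callers that scan many paths may want to log the failure and continue. The wrapper below is a sketch with a hypothetical name (`scan_or_none`). It takes the scan function as an argument so that it works with either `find_duplicates` or `find_duplicates_dict`.

```python
import sys


def scan_or_none(scan, paths, **options):
    """Call scan(paths=..., **options); return None if it raises RuntimeError."""
    try:
        return scan(paths=paths, **options)
    except RuntimeError as exc:
        # The error message format is not documented here; print it verbatim.
        print(f"scan failed for {paths}: {exc}", file=sys.stderr)
        return None
```

Typical use would be `scan_or_none(polydup.find_duplicates, ['./src'], threshold=0.9)`, followed by a check for `None` before reading the report.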

`find_duplicates_dict(paths, min_block_size=50, threshold=0.85, enable_type3=False, type3_tolerance=0.85)`

Same as find_duplicates() but returns a Python dictionary.

Returns: dict with keys:

files_scanned (int)
functions_analyzed (int)
duplicates (list[dict])
stats (dict)

`version()`

Get the PolyDup library version.

Returns: str (e.g., "0.9.0")

Class: `Report`

Attributes:

files_scanned (int): Number of files processed
functions_analyzed (int): Number of functions extracted
duplicates (list[DuplicateMatch]): List of detected duplicates
stats (ScanStats): Performance metrics

Methods:

to_dict(): Convert to Python dictionary
__len__(): Returns number of duplicates

Class: `DuplicateMatch`

Attributes:

file1 (str): First file path
file2 (str): Second file path
start_line1 (int): Starting line in first file
start_line2 (int): Starting line in second file
length (int): Length in tokens
similarity (float): Similarity score (0.0-1.0)
hash (str): Rolling hash value (hex string)
clone_type (str): "type-1", "type-2", or "type-3"
edit_distance (int | None): Edit distance for Type-3 clones

Methods:

to_dict(): Convert to Python dictionary
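
The attributes above are enough to rank file pairs by how much code they share. The sketch below uses only `file1`, `file2`, and `length`. `FakeMatch` is a stand-in used for illustration so that the example runs without PolyDup installed, and `group_by_file_pair` and `worst_pairs` are hypothetical helpers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeMatch:
    """Stand-in exposing a subset of the DuplicateMatch attributes."""
    file1: str
    file2: str
    length: int


def group_by_file_pair(duplicates):
    """Group matches by unordered (file, file) pair."""
    groups = defaultdict(list)
    for dup in duplicates:
        key = tuple(sorted((dup.file1, dup.file2)))
        groups[key].append(dup)
    return dict(groups)


def worst_pairs(duplicates, top=3):
    """Return up to `top` file pairs ranked by total duplicated tokens."""
    totals = {
        pair: sum(d.length for d in matches)
        for pair, matches in group_by_file_pair(duplicates).items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]
```

With a real report, `worst_pairs(report.duplicates)` would give a short list of the file pairs most worth refactoring first.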

Class: `ScanStats`

Attributes:

total_lines (int): Total lines of code processed
total_tokens (int): Total tokens analyzed
unique_hashes (int): Number of unique code blocks
duration_ms (int): Scan duration in milliseconds

Methods:

to_dict(): Convert to Python dictionary

Performance

PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:

Concurrent Python execution: Other Python threads continue running
True parallelism: Rust's Rayon uses all CPU cores
Minimal overhead: Zero-copy FFI with direct Rust integration

Benchmark Example

import polydup
import time

# perf_counter() is monotonic and better suited to timing than time.time()
start = time.perf_counter()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")

Development

Build from Source

cd crates/polydup-py
maturin develop  # Debug build
maturin develop --release  # Optimized build

Test

python test.py

License

MIT OR Apache-2.0

Release history

0.9.3: Jan 25, 2026 (this version)
0.9.1: Jan 18, 2026
0.9.0: Jan 2, 2026
0.8.2: Jan 1, 2026
0.8.1: Dec 31, 2025
0.8.0: Dec 31, 2025
0.7.0: Dec 31, 2025
0.6.2: Dec 30, 2025
0.6.1: Dec 30, 2025
0.5.5: Dec 25, 2025
0.5.4: Dec 25, 2025
0.5.3: Dec 25, 2025
0.5.2: Dec 25, 2025
0.5.1: Dec 25, 2025
0.5.0: Dec 25, 2025
0.4.1: Dec 25, 2025
0.4.0: Dec 25, 2025
0.3.3: Dec 25, 2025
0.3.2: Dec 25, 2025
0.3.1: Dec 25, 2025
0.3.0: Dec 25, 2025
0.2.7: Dec 23, 2025
0.2.6: Dec 23, 2025
0.2.5: Dec 23, 2025
0.2.4: Dec 23, 2025
0.2.3: Dec 23, 2025
0.1.3: Dec 22, 2025
0.1.2: Dec 22, 2025
0.1.1: Dec 22, 2025
0.1.0: Dec 22, 2025

Download files

Download the file for your platform.

Source Distribution

polydup-0.9.3.tar.gz (98.1 kB)

Uploaded Jan 25, 2026 Source

Built Distribution

polydup-0.9.3-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB)

Uploaded Jan 25, 2026 CPython 3.9, macOS 11.0+ ARM64

File details

Details for the file polydup-0.9.3.tar.gz.

File metadata

Download URL: polydup-0.9.3.tar.gz
Upload date: Jan 25, 2026
Size: 98.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.10.2

File hashes

Hashes for polydup-0.9.3.tar.gz
Algorithm	Hash digest
SHA256	`2978d4ee1d03f4c268f9f0e430502529073c4e3fbf72376d2571b8ce96e1f240`
MD5	`7d1e263f1c0f648de4896adb653554f2`
BLAKE2b-256	`a9ea1b8edb4e93bcb3b71dffdbbcae20d27e8390763c9e7b2c8b36ebab89e7b8`
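
A downloaded file can be checked against the SHA256 digest above with the standard library. This is a generic sketch. `sha256_of` and `matches_published` are hypothetical helper names, and the file path passed in is whatever local path the archive was saved to.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches_published("polydup-0.9.3.tar.gz", "2978d4ee1d03f4c268f9f0e430502529073c4e3fbf72376d2571b8ce96e1f240")` should return `True` for an intact copy of the source distribution.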

File details

Details for the file polydup-0.9.3-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

Download URL: polydup-0.9.3-cp39-cp39-macosx_11_0_arm64.whl
Upload date: Jan 25, 2026
Size: 1.4 MB
Tags: CPython 3.9, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.10.2

File hashes

Hashes for polydup-0.9.3-cp39-cp39-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`c509df4d9bb446c47f413abbc94e77a30c52fd1935de54f7248b91e4372fd3a5`
MD5	`c688f2f413c9bc2286e8f697bcc0fd2a`
BLAKE2b-256	`8a9a557ab5cd5bc62eb0831f0497b0a4e8ed94e836c4903fb5ab6132e2c3ef1c`
