Cross-language duplicate code detector - Python library and CLI
PolyDup Python Package
Cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing. This package provides both a CLI tool and Python library bindings.
Features
- Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
- Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
- Type-3 clone detection: Gap-tolerant matching for near-duplicates
- CLI included: Full command-line interface bundled in the package
- GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
- Parallel processing: Built on Rayon for multi-core performance
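To make the Type-2 idea concrete, here is a toy sketch in pure Python. It is not PolyDup's implementation: it uses the standard-library `tokenize` module in place of Tree-sitter, handles Python source only, and the function names (`normalize`, `window_hashes`, `shared_windows`) are made up for this illustration. Identifiers and literals are replaced with placeholders, then a Rabin-Karp rolling hash is computed over every window of `k` tokens so that matching windows can be looked up by hash.

```python
import io
import keyword
import tokenize
import zlib

BASE = 257
MOD = (1 << 61) - 1  # large prime modulus keeps collisions rare


def normalize(source: str) -> list[str]:
    """Replace identifiers and literals with placeholders (Type-2 style)."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keywords carry structure, so keep them; rename everything else.
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
        # Comments, newlines, and indentation tokens are dropped.
    return out


def window_hashes(tokens: list[str], k: int):
    """Yield (start_index, hash) for each k-token window via a rolling hash."""
    if len(tokens) < k:
        return
    vals = [zlib.crc32(t.encode()) for t in tokens]  # deterministic token ids
    high = pow(BASE, k - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in vals[:k]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(k, len(vals)):
        # Drop the oldest token, shift, and append the newest one.
        h = ((h - vals[i - k] * high) * BASE + vals[i]) % MOD
        yield i - k + 1, h


def shared_windows(src_a: str, src_b: str, k: int = 8) -> list[tuple[int, int]]:
    """Return (index_in_a, index_in_b) for every k-token window both share."""
    ta, tb = normalize(src_a), normalize(src_b)
    index = {}
    for i, h in window_hashes(ta, k):
        index.setdefault(h, []).append(i)
    matches = []
    for j, h in window_hashes(tb, k):
        for i in index.get(h, []):
            if ta[i:i + k] == tb[j:j + k]:  # confirm, in case of a collision
                matches.append((i, j))
    return matches


a = "def add(a, b):\n    return a + b * 2\n"
b = "def plus(x, y):\n    return x + y * 10\n"
print(normalize(a) == normalize(b))
print(shared_windows(a, b, k=8)[:1])
```

The two sample functions differ only in names and a literal, so they normalize to the same token stream and share every window. Type-3 matching, which tolerates inserted or deleted tokens, needs more than exact window hashes and is not covered by this sketch.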
Installation
pip install polydup
CLI Usage
After installation, the polydup command is available:
# Scan a directory for duplicates
polydup scan ./src
# Scan with custom settings
polydup scan ./src --min-block-size 30 --similarity 0.9
# Output as JSON
polydup scan ./src --format json
# Enable Type-3 detection for near-duplicates
polydup scan ./src --enable-type3
# Scan only changed files in a git repository
polydup scan --git-diff origin/main..HEAD
# Get help
polydup --help
polydup scan --help
Exit Codes
- 0: No duplicates found (clean)
- 1: Duplicates found
- 2: Error occurred
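These exit codes make the CLI straightforward to call from a script or CI job. A minimal sketch, assuming the `polydup` command is on `PATH`; the helper names `describe_exit_code` and `run_scan` are illustrative and not part of the package:

```python
import subprocess

# Exit codes as documented in this section.
EXIT_MEANINGS = {
    0: "clean",
    1: "duplicates found",
    2: "error",
}


def describe_exit_code(code: int) -> str:
    """Map a polydup exit code to a short label; undocumented codes are 'unknown'."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_scan(path: str) -> tuple[int, str]:
    """Run `polydup scan <path> --format json` and return (exit_code, stdout)."""
    proc = subprocess.run(
        ["polydup", "scan", path, "--format", "json"],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means "duplicates found", not a crash
    )
    return proc.returncode, proc.stdout
```

A caller would typically run `code, output = run_scan("./src")` and fail the build only when `describe_exit_code(code)` is not `"clean"`. Passing `check=False` matters here: with `check=True`, `subprocess.run` would raise on exit code 1 even though that code reports a normal result.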
Library Usage
Basic Example
import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,  # Minimum tokens per code block
    threshold=0.85      # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} <-> {dup.file2}:{dup.start_line2}")
    print(f" Similarity: {dup.similarity * 100:.1f}%")
    print(f" Length: {dup.length} tokens")
    print(f" Clone type: {dup.clone_type}")
Dictionary Output
For JSON serialization or dict-based workflows:
import polydup
import json

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
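The dictionary form is also convenient for quick aggregation. The sketch below counts how many duplicate pairs each file takes part in. It assumes each entry in `duplicates` uses the same keys as the `DuplicateMatch` attributes (`file1`, `file2`), which this README does not state explicitly; `count_by_file` is a hypothetical helper, and the sample data is hand-written rather than real output.

```python
from collections import Counter


def count_by_file(report_dict: dict) -> Counter:
    """Count how many duplicate pairs each file participates in."""
    counts = Counter()
    for dup in report_dict.get("duplicates", []):
        # Assumed keys: "file1" and "file2", mirroring DuplicateMatch.
        # The set makes a clone within a single file count once.
        for path in {dup["file1"], dup["file2"]}:
            counts[path] += 1
    return counts


# Hand-written sample shaped like the documented dictionary (not real output).
sample = {
    "files_scanned": 3,
    "functions_analyzed": 10,
    "duplicates": [
        {"file1": "a.py", "file2": "b.py", "similarity": 0.95},
        {"file1": "a.py", "file2": "c.py", "similarity": 0.90},
        {"file1": "c.py", "file2": "c.py", "similarity": 1.0},
    ],
    "stats": {},
}

for path, n in count_by_file(sample).most_common():
    print(f"{path}: {n}")
```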
Type-3 Clone Detection
Enable fuzzy matching for near-duplicates:
import polydup

report = polydup.find_duplicates(
    paths=['./src'],
    min_block_size=50,
    threshold=0.85,
    enable_type3=True,    # Enable Type-3 detection
    type3_tolerance=0.85  # Type-3 similarity tolerance
)

for dup in report.duplicates:
    if dup.clone_type == "type-3":
        print(f"Near-duplicate: {dup.file1} <-> {dup.file2}")
        print(f" Edit distance: {dup.edit_distance}")
Concurrent Execution
PolyDup releases the GIL during scanning, so several scans can run in parallel from separate Python threads:
import polydup
import concurrent.futures

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]
    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")
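Because `as_completed` yields futures in completion order, it helps to keep a mapping from each future back to its path, and to catch the `RuntimeError` that `find_duplicates` is documented to raise. A sketch of that pattern follows; `scan_many` is a hypothetical helper that takes the scan function as a parameter, so it can be exercised without PolyDup installed.

```python
import concurrent.futures


def scan_many(paths, scan_fn, max_workers=4):
    """Run scan_fn on each path in a thread pool.

    Returns (reports, errors): two dicts keyed by path.
    """
    reports, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which path each future belongs to.
        future_to_path = {executor.submit(scan_fn, p): p for p in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                reports[path] = future.result()
            except RuntimeError as exc:  # documented failure mode of a scan
                errors[path] = str(exc)
    return reports, errors
```

With PolyDup installed, a call would look like `scan_many(['./project1', './project2'], lambda p: polydup.find_duplicates([p]))`. One failing project then no longer hides the results of the others.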
API Reference
find_duplicates(paths, min_block_size=50, threshold=0.85, enable_type3=False, type3_tolerance=0.85)
Scan files for duplicate code and return a Report object.
Parameters:
- paths (list[str]): List of file or directory paths to scan
- min_block_size (int, optional): Minimum code block size in tokens. Default: 50
- threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85
- enable_type3 (bool, optional): Enable Type-3 clone detection. Default: False
- type3_tolerance (float, optional): Type-3 similarity tolerance. Default: 0.85
Returns: Report object with scan results
Raises: RuntimeError if scanning fails
find_duplicates_dict(paths, min_block_size=50, threshold=0.85, enable_type3=False, type3_tolerance=0.85)
Same as find_duplicates() but returns a Python dictionary.
Returns: dict with keys:
- files_scanned (int)
- functions_analyzed (int)
- duplicates (list[dict])
- stats (dict)
version()
Get the PolyDup library version.
Returns: str (e.g., "0.9.0")
Class: Report
Attributes:
- files_scanned (int): Number of files processed
- functions_analyzed (int): Number of functions extracted
- duplicates (list[DuplicateMatch]): List of detected duplicates
- stats (ScanStats): Performance metrics
Methods:
- to_dict(): Convert to Python dictionary
- __len__(): Returns number of duplicates
Class: DuplicateMatch
Attributes:
- file1 (str): First file path
- file2 (str): Second file path
- start_line1 (int): Starting line in first file
- start_line2 (int): Starting line in second file
- length (int): Length in tokens
- similarity (float): Similarity score (0.0-1.0)
- hash (str): Rolling hash value (hex string)
- clone_type (str): "type-1", "type-2", or "type-3"
- edit_distance (int | None): Edit distance for Type-3 clones
Methods:
to_dict(): Convert to Python dictionary
Class: ScanStats
Attributes:
- total_lines (int): Total lines of code processed
- total_tokens (int): Total tokens analyzed
- unique_hashes (int): Number of unique code blocks
- duration_ms (int): Scan duration in milliseconds
Methods:
to_dict(): Convert to Python dictionary
Performance
PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:
- Concurrent Python execution: Other Python threads continue running
- True parallelism: Rust's Rayon uses all CPU cores
- Minimal overhead: Zero-copy FFI with direct Rust integration
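The first point can be observed with a background "heartbeat" thread that counts ticks while a blocking call runs in the main thread. The sketch below is an illustration only: `count_ticks_during` is a hypothetical helper, and `time.sleep` (which releases the GIL) stands in for a scan so the example runs without PolyDup installed.

```python
import threading
import time


def count_ticks_during(blocking_call, interval=0.01):
    """Run blocking_call in this thread while a background thread counts ticks."""
    ticks = 0
    stop = threading.Event()

    def heartbeat():
        nonlocal ticks
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval):
            ticks += 1

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    try:
        blocking_call()
    finally:
        stop.set()
        worker.join()
    return ticks


# time.sleep releases the GIL, so it stands in for a GIL-free scan here.
ticks = count_ticks_during(lambda: time.sleep(0.2))
print(f"background thread ticked {ticks} times")
```

To try it against a real scan, replace the lambda with something like `lambda: polydup.find_duplicates(['./src'])`. A native call that held the GIL for its whole duration would leave the counter near zero, whereas a GIL-releasing call lets the heartbeat keep ticking.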
Benchmark Example
import polydup
import time
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start
print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")
Development
Build from Source
cd crates/polydup-py
maturin develop # Debug build
maturin develop --release # Optimized build
Test
python test.py
License
MIT OR Apache-2.0
Links
- GitHub: https://github.com/wiesnerbernard/polydup
- PyPI: https://pypi.org/project/polydup/
- CLI Tool: polydup-cli
- Core Library: polydup-core
- Node.js Bindings: polydup-node
File details
Details for the file polydup-0.9.3.tar.gz.
File metadata
- Download URL: polydup-0.9.3.tar.gz
- Upload date:
- Size: 98.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2978d4ee1d03f4c268f9f0e430502529073c4e3fbf72376d2571b8ce96e1f240 |
| MD5 | 7d1e263f1c0f648de4896adb653554f2 |
| BLAKE2b-256 | a9ea1b8edb4e93bcb3b71dffdbbcae20d27e8390763c9e7b2c8b36ebab89e7b8 |
File details
Details for the file polydup-0.9.3-cp39-cp39-macosx_11_0_arm64.whl.
File metadata
- Download URL: polydup-0.9.3-cp39-cp39-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.4 MB
- Tags: CPython 3.9, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c509df4d9bb446c47f413abbc94e77a30c52fd1935de54f7248b91e4372fd3a5 |
| MD5 | c688f2f413c9bc2286e8f697bcc0fd2a |
| BLAKE2b-256 | 8a9a557ab5cd5bc62eb0831f0497b0a4e8ed94e836c4903fb5ab6132e2c3ef1c |