Cross-language duplicate code detector
PolyDup Python Bindings
Python bindings for PolyDup, a cross-language duplicate code detector powered by Tree-sitter and Rabin-Karp hashing.
Features
- Multi-language support: Detect duplicates across Rust, Python, and JavaScript/TypeScript
- Type-2 clone detection: Finds structurally similar code (normalized identifiers/literals)
- GIL-free scanning: Releases Python's Global Interpreter Lock during CPU-intensive operations
- Parallel processing: Built on Rayon for multi-core performance
- Zero-copy architecture: Direct FFI to Rust core for minimal overhead
Installation
From Source (Development)
cd crates/polydup-py
maturin develop --release
From PyPI
pip install polydup
Usage
Basic Example
import polydup

# Scan a directory for duplicates
report = polydup.find_duplicates(
    paths=['./src', './lib'],
    min_block_size=50,  # Minimum tokens per code block
    threshold=0.85      # 85% similarity threshold
)

print(f"Scanned {report.files_scanned} files")
print(f"Analyzed {report.functions_analyzed} functions")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Took {report.stats.duration_ms}ms")

# Iterate through duplicates
for dup in report.duplicates:
    print(f"\n{dup.file1}:{dup.start_line1} ↔️ {dup.file2}:{dup.start_line2}")
    print(f"  Similarity: {dup.similarity * 100:.1f}%")
    print(f"  Length: {dup.length} tokens")
Dictionary Output
For JSON serialization or dict-based workflows:
import json

import polydup

report_dict = polydup.find_duplicates_dict(
    paths=['./src'],
    min_block_size=30,
    threshold=0.9
)

# Serialize to JSON
print(json.dumps(report_dict, indent=2))
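As a follow-up to the dictionary output, the sketch below groups matches by file pair before serializing. It assumes each entry in duplicates carries the same field names as the DuplicateMatch attributes (file1, file2, similarity, and so on); that mapping is an assumption, not something confirmed here. The helper group_by_file_pair and the sample data are hypothetical and exist only for illustration.

```python
import json
from collections import defaultdict


def group_by_file_pair(report_dict, min_similarity=0.0):
    """Group duplicate entries by file pair, keeping those at or above min_similarity.

    Assumes each entry in report_dict["duplicates"] is a dict with "file1",
    "file2", and "similarity" keys (an assumption based on the DuplicateMatch
    attributes, not verified against the library).
    """
    groups = defaultdict(list)
    for dup in report_dict.get("duplicates", []):
        if dup["similarity"] >= min_similarity:
            # Sort the pair so (a, b) and (b, a) land in the same group.
            key = " <-> ".join(sorted((dup["file1"], dup["file2"])))
            groups[key].append(dup)
    return dict(groups)


# Hand-written sample shaped like the documented dictionary output.
sample_report = {
    "files_scanned": 3,
    "functions_analyzed": 12,
    "duplicates": [
        {"file1": "b.py", "file2": "a.py", "start_line1": 10, "start_line2": 4,
         "length": 60, "similarity": 0.95},
        {"file1": "a.py", "file2": "b.py", "start_line1": 40, "start_line2": 80,
         "length": 55, "similarity": 0.88},
        {"file1": "a.py", "file2": "c.js", "start_line1": 1, "start_line2": 1,
         "length": 70, "similarity": 0.91},
    ],
    "stats": {},
}

grouped = group_by_file_pair(sample_report, min_similarity=0.9)
print(json.dumps(grouped, indent=2))
```

In real use, the dictionary returned by find_duplicates_dict() would take the place of sample_report.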
Concurrent Execution
Important: PolyDup releases the GIL during scanning, so other Python threads keep running while a scan is in progress:
import concurrent.futures

import polydup

def scan_project(path):
    return polydup.find_duplicates([path])

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # These scans run in parallel thanks to GIL release
    futures = [
        executor.submit(scan_project, './project1'),
        executor.submit(scan_project, './project2'),
        executor.submit(scan_project, './project3'),
    ]
    for future in concurrent.futures.as_completed(futures):
        report = future.result()
        print(f"Found {len(report.duplicates)} duplicates")
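When scanning several projects concurrently, it is often useful to know which result belongs to which path and to keep one failed scan from aborting the rest. A minimal sketch of that pattern follows. The helper scan_many is hypothetical, and the scan function is passed in as a parameter so the pattern can be tried with a stand-in; in real use you might pass something like lambda p: polydup.find_duplicates([p]).

```python
import concurrent.futures


def scan_many(paths, scan_fn, max_workers=4):
    """Run scan_fn(path) for each path in a thread pool.

    Returns (results, errors): two dicts keyed by path. scan_fn is any
    callable that takes a single path.
    """
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which path each future was submitted for.
        future_to_path = {executor.submit(scan_fn, path): path for path in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except RuntimeError as exc:
                # The API reference lists RuntimeError as the failure signal.
                errors[path] = str(exc)
    return results, errors


def fake_scan(path):
    """Stand-in for a real scan so the sketch runs without polydup installed."""
    if path == "./missing":
        raise RuntimeError(f"cannot scan {path}")
    return {"path": path, "duplicates": []}


results, errors = scan_many(["./project1", "./project2", "./missing"], fake_scan)
print(sorted(results))
print(errors)
```

Keying the futures by path costs one dictionary and makes the completion order irrelevant to how results are reported.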
API Reference
find_duplicates(paths, min_block_size=50, threshold=0.85)
Scan files for duplicate code and return a Report object.
Parameters:
- paths (list[str]): List of file or directory paths to scan
- min_block_size (int, optional): Minimum code block size in tokens. Default: 50
- threshold (float, optional): Similarity threshold (0.0-1.0). Default: 0.85
Returns: Report object with scan results
Raises: RuntimeError if scanning fails
find_duplicates_dict(paths, min_block_size=50, threshold=0.85)
Same as find_duplicates() but returns a Python dictionary.
Returns: dict with keys:
- files_scanned (int)
- functions_analyzed (int)
- duplicates (list[dict])
- stats (dict)
version()
Get the PolyDup library version.
Returns: str (e.g., "0.1.0")
Class: Report
Attributes:
- files_scanned (int): Number of files processed
- functions_analyzed (int): Number of functions extracted
- duplicates (list[DuplicateMatch]): List of detected duplicates
- stats (ScanStats): Performance metrics
Methods:
- to_dict(): Convert to Python dictionary
- __len__(): Returns number of duplicates
Class: DuplicateMatch
Attributes:
- file1 (str): First file path
- file2 (str): Second file path
- start_line1 (int): Starting line in first file
- start_line2 (int): Starting line in second file
- length (int): Length in tokens
- similarity (float): Similarity score (0.0-1.0)
- hash (str): Rolling hash value (hex string)
Methods:
to_dict(): Convert to Python dictionary
Class: ScanStats
Attributes:
- total_lines (int): Total lines of code processed
- total_tokens (int): Total tokens analyzed
- unique_hashes (int): Number of unique code blocks
- duration_ms (int): Scan duration in milliseconds
Methods:
to_dict(): Convert to Python dictionary
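The attributes listed above are enough to rank and print matches. The sketch below shows one way to do that; top_matches and format_match are hypothetical helpers, and the sample objects are SimpleNamespace stand-ins that only mimic the documented DuplicateMatch attribute names so the code runs without polydup installed.

```python
from types import SimpleNamespace


def top_matches(duplicates, n=5):
    """Return the n matches with the highest similarity, longest first on ties."""
    return sorted(duplicates, key=lambda d: (d.similarity, d.length), reverse=True)[:n]


def format_match(dup):
    """Render one match as a single line using the documented attributes."""
    return (f"{dup.file1}:{dup.start_line1} <-> {dup.file2}:{dup.start_line2} "
            f"({dup.similarity * 100:.1f}%, {dup.length} tokens)")


# Stand-ins with the same attribute names as DuplicateMatch (illustrative data).
sample = [
    SimpleNamespace(file1="a.py", file2="b.py", start_line1=10, start_line2=4,
                    length=60, similarity=0.90),
    SimpleNamespace(file1="a.py", file2="c.js", start_line1=1, start_line2=1,
                    length=80, similarity=0.90),
    SimpleNamespace(file1="b.py", file2="c.js", start_line1=7, start_line2=30,
                    length=50, similarity=0.97),
]

for dup in top_matches(sample, n=2):
    print(format_match(dup))
```

With a real scan, report.duplicates would be passed to top_matches in place of sample.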
Performance
PolyDup's Python bindings use py.allow_threads() to release the Global Interpreter Lock during scanning. This enables:
- Concurrent Python execution: Other Python threads continue running
- True parallelism: Rust's Rayon uses all CPU cores
- Minimal overhead: Zero-copy FFI with direct Rust integration
Benchmark Example
import time

import polydup

# perf_counter() is a monotonic, high-resolution clock suited to timing
start = time.perf_counter()
report = polydup.find_duplicates(['./large-project'], min_block_size=30)
elapsed = time.perf_counter() - start

print(f"Scanned {report.files_scanned} files in {elapsed:.2f}s")
print(f"Found {len(report.duplicates)} duplicates")
print(f"Throughput: {report.stats.total_tokens / elapsed:.0f} tokens/sec")
Algorithm
PolyDup uses:
- Tree-sitter for language-agnostic AST parsing
- Token normalization for Type-2 clone detection (e.g., userId → $$ID)
- Rabin-Karp rolling hash with window size 50 for efficient similarity detection
- Rayon for parallel processing across CPU cores
See architecture-research.md for detailed algorithm analysis.
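To make the normalization and rolling-hash steps concrete, here is a small pure-Python sketch of the general technique. It is not PolyDup's implementation: a regex tokenizer stands in for Tree-sitter, the $$NUM and $$STR placeholders and the keyword list are assumptions added for illustration (only $$ID appears above), and the window is 6 tokens rather than 50 so the tiny inputs can match.

```python
import re
import zlib
from collections import defaultdict

# Assumed keyword list; a real tool would take this from the grammar.
KEYWORDS = {"def", "return", "if", "else", "for", "in", "while",
            "function", "let", "const", "var"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\"[^\"]*\"|'[^']*'|\S")
BASE = 257
MOD = (1 << 61) - 1


def normalize(source):
    """Tokenize source and replace identifiers and literals with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("$$ID")
        elif tok[0].isdigit():
            tokens.append("$$NUM")
        elif tok[0] in "\"'":
            tokens.append("$$STR")
        else:
            tokens.append(tok)  # punctuation and operators stay as-is
    return tokens


def rolling_hashes(tokens, window):
    """Yield (start_index, hash) for every run of `window` consecutive tokens."""
    if len(tokens) < window:
        return
    # crc32 gives a per-token value that is stable across processes.
    values = [zlib.crc32(t.encode("utf-8")) + 1 for t in tokens]
    high = pow(BASE, window - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:window]:
        h = (h * BASE + v) % MOD
    yield 0, h
    for i in range(window, len(values)):
        h = (h - values[i - window] * high) % MOD  # drop the oldest token
        h = (h * BASE + values[i]) % MOD           # append the newest token
        yield i - window + 1, h


def shared_windows(tokens_a, tokens_b, window):
    """Return (start_a, start_b) pairs whose token windows are identical."""
    index = defaultdict(list)
    for start, h in rolling_hashes(tokens_a, window):
        index[h].append(start)
    matches = []
    for start_b, h in rolling_hashes(tokens_b, window):
        for start_a in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if tokens_a[start_a:start_a + window] == tokens_b[start_b:start_b + window]:
                matches.append((start_a, start_b))
    return matches


py_tokens = normalize("def total(userId, count):\n    return userId + count * 2")
js_tokens = normalize("function sum(accountId, n) {\n    return accountId + n * 2;\n}")
matches = shared_windows(py_tokens, js_tokens, window=6)
print(matches)
```

The two snippets differ in language, names, and punctuation, yet the normalized parameter list and return expression still line up, which is the property Type-2 detection relies on.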
Development
Build
cd crates/polydup-py
maturin develop # Debug build
maturin develop --release # Optimized build
Test
python test.py
Type Checking
pip install mypy
mypy test.py
License
MIT OR Apache-2.0
Links
- GitHub: https://github.com/wiesnerbernard/polydup
- Core Library: polydup-core
- CLI Tool: polydup-cli
- Node.js Bindings: polydup-node