Code duplication detector (Rabin-Karp, language-agnostic) — single native binary
open-harness-dupelens
Code duplication detector. Uses Rabin-Karp rolling-hash fingerprinting over tokenized source — strings and comments are stripped before hashing to reduce false positives. Language-agnostic (Go, TS, JS, Python, Rust, Java, etc.). Single native binary, zero runtime dependencies.
Part of the open-harness monorepo.
Same tool, other ecosystems: also available on npm (@open_harness/dupelens) and on Packagist (open-harness/dupelens). Identical binary, identical config; pick the registry that matches your stack.
Install
pip install open-harness-dupelens
pip picks the right native wheel for your platform automatically (Linux x86_64, macOS arm64, macOS x86_64, Windows x86_64). Each wheel embeds the Go binary — no runtime deps.
Usage
dupelens check # scan current directory with defaults
dupelens check --fail # exit 1 if duplicates found (CI / git hooks)
dupelens check --min-tokens 30 # override the rolling window size
dupelens check --format=json # JSON output for tooling integrations
dupelens check --dir ./src # scan a specific directory
dupelens check --verbose # print timings to stderr
dupelens check --no-color # plain console output
dupelens init # generate a default dupelens.json
dupelens version # print version
Configuration
Place a dupelens.json at the repo root:
{
"default": {
"minTokens": 50,
"minLines": 5
},
"rules": [
{ "pattern": "**/*_test.go", "skip": true },
{ "pattern": "**/migrations/**", "skip": true }
],
"exclude": ["node_modules", "vendor", ".git", "dist", "build"]
}
- minTokens: window size of the rolling hash. Higher values catch only larger duplications.
- minLines: filters short matches (e.g. back-to-back identical imports).
- rules: per-pattern skip. The first matching entry wins.
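The "first matching entry wins" behaviour of rules can be sketched as follows. This is only an illustration: is_skipped is a hypothetical helper, and Python's fnmatchcase stands in for whatever glob matcher dupelens actually uses (here `*` also matches `/`, which happens to be enough for the two sample patterns).

```python
from fnmatch import fnmatchcase

# Rules copied from the example dupelens.json above.
RULES = [
    {"pattern": "**/*_test.go", "skip": True},
    {"pattern": "**/migrations/**", "skip": True},
]


def is_skipped(path, rules=RULES):
    """Return the `skip` value of the first rule whose pattern matches `path`.

    Hypothetical helper: fnmatchcase only approximates the real glob
    semantics, so treat the matching itself as an assumption.
    """
    for rule in rules:
        if fnmatchcase(path, rule["pattern"]):
            # First match wins: later rules are never consulted.
            return bool(rule.get("skip", False))
    return False  # no rule matched, so the file is scanned


if __name__ == "__main__":
    for p in ["pkg/auth_test.go", "db/migrations/001_init.sql", "src/auth.go"]:
        print(p, "->", "skipped" if is_skipped(p) else "scanned")
```

Because evaluation stops at the first hit, more specific patterns would need to come before broader ones.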
Alternative: configure inside package.json instead of a dedicated dupelens.json
If you prefer not to keep a separate dupelens.json, add a dupelens key in your package.json with the same shape:
{
"name": "my-project",
"dupelens": {
"default": { "minTokens": 50, "minLines": 5 },
"rules": [{ "pattern": "**/*_test.go", "skip": true }],
"exclude": ["node_modules", "dist"]
}
}
Precedence: --config <path> > dupelens.json > package.json key > built-in defaults. CLI flags (--min-tokens, --format, etc.) always win.
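The precedence chain can be pictured with a small sketch. Several things here are assumptions rather than documented behaviour: resolve_config and cli_overrides are hypothetical names, the first source found is used whole (no key-by-key merging between files), and the fallback values simply mirror the sample config.

```python
import json
from pathlib import Path

# Placeholder defaults: the values mirror the sample dupelens.json, but whether
# they equal the tool's real built-in defaults is an assumption.
ASSUMED_DEFAULTS = {
    "default": {"minTokens": 50, "minLines": 5},
    "rules": [],
    "exclude": [],
}


def _load(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


def resolve_config(root, config_path=None, cli_overrides=None):
    """Pick one config source by precedence, then apply CLI overrides.

    `cli_overrides` is a hypothetical dict such as {"minTokens": 30},
    standing in for `--min-tokens 30`.
    """
    root = Path(root)
    config = None
    if config_path is not None:                   # 1. --config <path>
        config = _load(config_path)
    elif (root / "dupelens.json").is_file():      # 2. dupelens.json
        config = _load(root / "dupelens.json")
    elif (root / "package.json").is_file():       # 3. "dupelens" key in package.json
        config = _load(root / "package.json").get("dupelens")
    if config is None:                            # 4. fall back to defaults
        config = ASSUMED_DEFAULTS
    resolved = dict(config)
    # CLI flags always win over whatever the chosen file says.
    resolved["default"] = {**config.get("default", {}), **(cli_overrides or {})}
    return resolved
```

For example, resolve_config(".", cli_overrides={"minTokens": 30}) would return a config whose default.minTokens is 30 regardless of which file was picked.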
Output (console)
DUPLICATES (2 match(es) found in 87 files):
src/auth.go:42-58 <-> src/users.go:12-28 (35 tokens)
| func validate(input string) error {
| ...
src/db.go:1-10 <-> src/cache.go:1-10 (15 tokens)
SUMMARY: 2 match(es) across 87 files
Top duplicated files:
- src/auth.go (1 match(es))
Output (JSON)
{
"scannedFiles": 87,
"matchCount": 2,
"matches": [
{
"fileA": "src/auth.go", "startLineA": 42, "endLineA": 58,
"fileB": "src/users.go", "startLineB": 12, "endLineB": 28,
"tokens": 35
}
],
"summary": {
"topDuplicatedFiles": [{ "file": "src/auth.go", "count": 1 }]
}
}
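For tooling integrations, the report can be consumed with any JSON parser. A minimal sketch in Python, using the sample report above as input; format_matches is a hypothetical helper, and only the field names shown in the sample are relied on.

```python
import json

# The sample report from the section above, verbatim.
SAMPLE_REPORT = """
{
  "scannedFiles": 87,
  "matchCount": 2,
  "matches": [
    {
      "fileA": "src/auth.go", "startLineA": 42, "endLineA": 58,
      "fileB": "src/users.go", "startLineB": 12, "endLineB": 28,
      "tokens": 35
    }
  ],
  "summary": {
    "topDuplicatedFiles": [{ "file": "src/auth.go", "count": 1 }]
  }
}
"""


def format_matches(report_text):
    """Render each entry of `matches` in the same shape as the console output."""
    report = json.loads(report_text)
    template = (
        "{fileA}:{startLineA}-{endLineA} <-> "
        "{fileB}:{startLineB}-{endLineB} ({tokens} tokens)"
    )
    return [template.format(**match) for match in report.get("matches", [])]


if __name__ == "__main__":
    report = json.loads(SAMPLE_REPORT)
    print(f'{report["matchCount"]} match(es) in {report["scannedFiles"]} files')
    for line in format_matches(SAMPLE_REPORT):
        print(line)
```

In practice the text would come from dupelens check --format=json, assuming the report is written to standard output.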
Integrations
# Husky pre-commit
dupelens check --fail
# GitHub Actions
- name: Run dupelens
run: npx @open_harness/dupelens check --fail
Why Rabin-Karp over AST?
- Zero dependencies: no language-specific parsers to ship per language.
- Language-agnostic: the same binary scans Go, TypeScript, Python, Rust, Java, etc.
- Fast: rolling hash detects matches in O(n) over the token stream.
The trade-off is documented in ADR-012.
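To make the O(n) claim concrete, here is a minimal sketch of Rabin-Karp fingerprinting over a token stream. It is not the dupelens implementation: BASE, MOD, token_id, and the whitespace split used as a tokenizer are all illustrative assumptions, and the parameter k plays the role of minTokens.

```python
import zlib
from collections import defaultdict

BASE = 1_000_003          # arbitrary multiplier chosen for this sketch
MOD = (1 << 61) - 1       # large prime modulus to keep collisions rare


def token_id(token):
    """Map a token to a stable non-zero integer."""
    return zlib.crc32(token.encode("utf-8")) + 1


def window_hashes(tokens, k):
    """Yield (start, hash) for every window of k tokens in one pass."""
    if k <= 0 or len(tokens) < k:
        return
    ids = [token_id(t) for t in tokens]
    top = pow(BASE, k - 1, MOD)  # weight of the token that leaves the window
    h = 0
    for x in ids[:k]:
        h = (h * BASE + x) % MOD
    yield 0, h
    for i in range(k, len(ids)):
        # Drop the oldest token, shift, and add the newest: O(1) per step.
        h = ((h - ids[i - k] * top) * BASE + ids[i]) % MOD
        yield i - k + 1, h


def find_duplicates(tokens_a, tokens_b, k):
    """Return (start_a, start_b) pairs where k consecutive tokens are equal."""
    index = defaultdict(list)
    for start, h in window_hashes(tokens_a, k):
        index[h].append(start)
    matches = []
    for start_b, h in window_hashes(tokens_b, k):
        for start_a in index.get(h, []):
            # Compare the tokens themselves to rule out hash collisions.
            if tokens_a[start_a:start_a + k] == tokens_b[start_b:start_b + k]:
                matches.append((start_a, start_b))
    return matches


if __name__ == "__main__":
    a = "if err != nil { return err } x := 1".split()
    b = "y := 2 if err != nil { return err }".split()
    print(find_duplicates(a, b, 8))  # [(0, 3)]
```

Hashing the windows is linear in the number of tokens; the explicit comparison only runs when two fingerprints collide or genuinely match. Raising k makes short shared runs disappear, which is the effect a higher minTokens is described as having.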
Limitations (v0.2.0)
- Detects only literal or near-literal duplication (token-by-token). Refactors with renamed variables are not flagged — that requires AST analysis.
- The algorithm is binary (match or no match); there is no similarity threshold flag.
- Per-rule minTokens override does not work cross-file because window sizes must be uniform. Use rules.skip to exclude patterns entirely.
Exit codes
| Code | Meaning |
|---|---|
| 0 | No duplicates (or --fail not passed) |
| 1 | Duplicates found and --fail was passed, or config error |
Summary
Code duplication detector. Uses Rabin-Karp (rolling hash) fingerprinting over tokenized source; strings and comments are stripped before hashing to reduce false positives. Language-agnostic (Go, TS, JS, Python, Rust, Java, etc.). Single native binary, zero dependencies.
Part of the open-harness monorepo.
Installation
pip install open-harness-dupelens
pip automatically downloads the correct native wheel for your platform.
Usage
dupelens check # scan with defaults
dupelens check --fail # exit 1 if there are duplicates (CI / git hooks)
dupelens check --min-tokens 30 # change the rolling-hash window size
dupelens check --format=json # JSON output for integrations
dupelens check --dir ./src # scan a specific directory
dupelens check --verbose # print timings to stderr
dupelens check --no-color # console output without colors
dupelens init # generate a default dupelens.json
dupelens version # print the version
Configuration
Place a dupelens.json at the repo root (see the example above).
- minTokens: window size of the rolling hash. Higher values detect only larger duplications.
- minLines: filters short matches (e.g. consecutive identical imports).
- rules: per-pattern skip. The first matching rule wins.
Alternative: configure inside package.json instead of a dedicated dupelens.json
If you prefer not to keep a separate dupelens.json, add a dupelens key to your package.json with the same shape as the dedicated file. Precedence: --config <path> > dupelens.json > package.json key > defaults. CLI flags (--min-tokens, --format, etc.) always win.
Output
Supports colored console output and structured JSON. See the examples above.
Integrations
Works with Husky, lefthook, or GitHub Actions using the same snippets as the Integrations section above.
Why Rabin-Karp instead of AST
- Zero dependencies: no per-language parsers to ship.
- Language-agnostic: the same binary scans Go, TypeScript, Python, Rust, Java, etc.
- Fast: the rolling hash detects matches in O(n) over the token stream.
The trade-off is documented in ADR-012.
Limitations (v0.2.0)
- Detects only literal or near-literal duplication (token by token). Refactors with renamed variables are not detected; that requires AST analysis.
- The algorithm is binary (match or no match); there is no similarity threshold flag.
- The per-rule minTokens override does not work across files because the window must be uniform. Use rules.skip to exclude patterns entirely.
Exit codes
| Code | Meaning |
|---|---|
| 0 | No duplicates (or --fail was not passed) |
| 1 | Duplicates found with --fail, or a configuration error |
License
MIT — see the main repository.