Code duplication detector (Rabin-Karp, language-agnostic) — single native binary

open-harness-dupelens

Code duplication detector. Uses Rabin-Karp rolling-hash fingerprinting over tokenized source — strings and comments are stripped before hashing to reduce false positives. Language-agnostic (Go, TS, JS, Python, Rust, Java, etc.). Single native binary, zero runtime dependencies.
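To make the "strings and comments are stripped" step concrete, here is a minimal Python sketch of such a tokenizer. It is an illustration only, not the code embedded in the binary: the regular expression, the `tokenize` name, and the choice to drop literals entirely (rather than replace them with a placeholder) are all assumptions. Being language-agnostic, it is also deliberately naive; for example it would treat Python's `//` operator as a comment start.

```python
import re

# Hypothetical token pattern (an assumption, not dupelens's real grammar).
# Comments and string literals are matched first so they can be dropped.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>//[^\n]*|\#[^\n]*|/\*.*?\*/)            # line and block comments
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')     # quoted literals
  | (?P<word>[A-Za-z_]\w*|\d+)                          # identifiers, keywords, numbers
  | (?P<punct>[^\sA-Za-z_0-9])                          # any other single symbol
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(source: str) -> list[str]:
    """Return the token stream with comments and string literals removed."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup in ("comment", "string"):
            continue  # stripped before hashing
        tokens.append(match.group())
    return tokens
```

Dropping literals and comments means two functions that differ only in a log message or a docstring still produce the same token stream.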

Part of the open-harness monorepo. Español abajo.

Same tool, other ecosystems: also available on npm (@open_harness/dupelens) and on Packagist (open-harness/dupelens). Identical binary, identical config; pick the registry that matches your stack.

Install

pip install open-harness-dupelens

pip picks the right native wheel for your platform automatically (Linux x86_64, macOS arm64, macOS x86_64, Windows x86_64). Each wheel embeds the Go binary — no runtime deps.

Usage

dupelens check                  # scan current directory with defaults
dupelens check --fail           # exit 1 if duplicates found (CI / git hooks)
dupelens check --min-tokens 30  # override the rolling window size
dupelens check --format=json    # JSON output for tooling integrations
dupelens check --dir ./src      # scan a specific directory
dupelens check --verbose        # print timings to stderr
dupelens check --no-color       # plain console output
dupelens init                   # generate a default dupelens.json
dupelens version                # print version

Configuration

Place a dupelens.json at the repo root:

{
  "default": {
    "minTokens": 50,
    "minLines": 5
  },
  "rules": [
    { "pattern": "**/*_test.go",     "skip": true },
    { "pattern": "**/migrations/**", "skip": true }
  ],
  "exclude": ["node_modules", "vendor", ".git", "dist", "build"]
}
  • minTokens — window size of the rolling hash. Higher values catch only larger duplications.
  • minLines — filters short matches (e.g. back-to-back identical imports).
  • rules — per-pattern skip. The first matching entry wins.
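The "first matching entry wins" behaviour of rules can be sketched in Python as follows. This is a hedged illustration: `should_skip` is a hypothetical helper, the glob semantics are approximated with the standard library's `fnmatchcase` (where `*` also matches `/`), and whether dupelens accepts `"skip": false` entries is an assumption made only to show the ordering.

```python
from fnmatch import fnmatchcase


def should_skip(path: str, rules: list[dict]) -> bool:
    """Return the skip value of the first rule whose pattern matches path."""
    for rule in rules:
        if fnmatchcase(path, rule["pattern"]):
            return bool(rule.get("skip", False))
    return False  # no rule matched: the file is scanned


rules = [
    {"pattern": "**/keep_test.go", "skip": False},  # assumed: an earlier entry shadows later ones
    {"pattern": "**/*_test.go", "skip": True},
    {"pattern": "**/migrations/**", "skip": True},
]
```

With these rules, `pkg/keep_test.go` is scanned because the first entry matches before the broader `**/*_test.go` entry is reached.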

Alternative: configure inside package.json

If you prefer not to keep a separate dupelens.json, add a dupelens key in your package.json with the same shape:

{
  "name": "my-project",
  "dupelens": {
    "default": { "minTokens": 50, "minLines": 5 },
    "rules": [{ "pattern": "**/*_test.go", "skip": true }],
    "exclude": ["node_modules", "dist"]
  }
}

Precedence: --config <path> > dupelens.json > package.json key > built-in defaults. CLI flags (--min-tokens, --format, etc.) always win.
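The precedence chain can be read as a simple fall-through. The Python sketch below models it with already-parsed dictionaries; `resolve_config`, `BUILTIN_DEFAULTS`, and the idea that `--min-tokens` maps onto `default.minTokens` are assumptions for illustration, and the placeholder default values are not confirmed by this document.

```python
# Placeholder values; the real built-in defaults are not stated here (assumption).
BUILTIN_DEFAULTS = {"default": {"minTokens": 50, "minLines": 5}}


def resolve_config(explicit=None, dedicated=None, package_json=None, cli_overrides=None):
    """Pick a config source by precedence, then apply CLI flag overrides.

    explicit      -- parsed file passed via --config <path>, or None
    dedicated     -- parsed dupelens.json, or None
    package_json  -- parsed package.json, or None
    cli_overrides -- e.g. {"minTokens": 30} for --min-tokens 30
    """
    if explicit is not None:
        source, config = "--config", explicit
    elif dedicated is not None:
        source, config = "dupelens.json", dedicated
    elif package_json is not None and "dupelens" in package_json:
        source, config = "package.json", package_json["dupelens"]
    else:
        source, config = "defaults", BUILTIN_DEFAULTS

    # Copy so the caller's dictionaries are left untouched.
    merged = {**config, "default": dict(config.get("default", {}))}
    for key, value in (cli_overrides or {}).items():
        if value is not None:
            merged["default"][key] = value  # CLI flags always win
    return source, merged
```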

Output (console)

DUPLICATES (2 match(es) found in 87 files):

  src/auth.go:42-58  <->  src/users.go:12-28  (35 tokens)
  | func validate(input string) error {
  | ...
  src/db.go:1-10  <->  src/cache.go:1-10  (15 tokens)

SUMMARY: 2 match(es) across 87 files
Top duplicated files:
  - src/auth.go  (1 match(es))

Output (JSON)

{
  "scannedFiles": 87,
  "matchCount": 2,
  "matches": [
    {
      "fileA": "src/auth.go", "startLineA": 42, "endLineA": 58,
      "fileB": "src/users.go", "startLineB": 12, "endLineB": 28,
      "tokens": 35
    }
  ],
  "summary": {
    "topDuplicatedFiles": [{ "file": "src/auth.go", "count": 1 }]
  }
}
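A tooling integration might consume this report along the following lines. The sketch uses only the field names shown in the JSON above; `format_matches` is a hypothetical helper, not part of dupelens.

```python
import json


def format_matches(report_text: str) -> list[str]:
    """Turn a dupelens JSON report into one human-readable line per match."""
    report = json.loads(report_text)
    lines = []
    for m in report["matches"]:
        lines.append(
            f'{m["fileA"]}:{m["startLineA"]}-{m["endLineA"]} <-> '
            f'{m["fileB"]}:{m["startLineB"]}-{m["endLineB"]} '
            f'({m["tokens"]} tokens)'
        )
    return lines
```

The report text could come from running `dupelens check --format=json` through `subprocess.run(..., capture_output=True, text=True)`, assuming the JSON is written to stdout.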

Integrations

# Husky pre-commit
dupelens check --fail
# GitHub Actions
- name: Run dupelens
  run: npx @open_harness/dupelens check --fail

Why Rabin-Karp over AST?

  • Zero dependencies: no language-specific parsers to ship per language.
  • Language-agnostic: the same binary scans Go, TypeScript, Python, Rust, Java, etc.
  • Fast: rolling hash detects matches in O(n) over the token stream.

The trade-off is documented in ADR-012.
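For readers unfamiliar with the technique, the following Python sketch shows a generic Rabin-Karp rolling hash over a token stream. It is a textbook version under stated assumptions, not the implementation shipped in the binary: the base, the modulus, the use of `zlib.crc32` to turn tokens into integers, and the function names are all illustrative choices.

```python
import zlib
from collections import defaultdict

BASE = 257            # illustrative polynomial base (assumption)
MOD = (1 << 61) - 1   # large prime modulus (assumption)


def window_hashes(tokens, k):
    """Yield (start_index, hash) for every window of k tokens in one pass."""
    if k <= 0 or len(tokens) < k:
        return
    ids = [zlib.crc32(t.encode("utf-8")) + 1 for t in tokens]
    high = pow(BASE, k - 1, MOD)  # weight of the token leaving the window
    h = 0
    for x in ids[:k]:
        h = (h * BASE + x) % MOD
    yield 0, h
    for i in range(k, len(ids)):
        # Drop the oldest token, shift, and append the newest one.
        h = ((h - ids[i - k] * high) * BASE + ids[i]) % MOD
        yield i - k + 1, h


def find_repeats(tokens, k):
    """Return (first_start, later_start) pairs of identical k-token windows."""
    seen = defaultdict(list)
    pairs = []
    for start, h in window_hashes(tokens, k):
        for prev in seen[h]:
            # Compare the tokens themselves to rule out hash collisions.
            if tokens[prev:prev + k] == tokens[start:start + k]:
                pairs.append((prev, start))
        seen[h].append(start)
    return pairs
```

Each window hash is derived from the previous one in constant time, which is where the linear cost over the token stream comes from; here `k` plays the role of minTokens.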

Limitations (v0.2.0)

  • Detects only literal or near-literal duplication (token-by-token). Refactors with renamed variables are not flagged — that requires AST analysis.
  • The algorithm is binary (match or no match); there is no similarity threshold flag.
  • Per-rule minTokens override does not work cross-file because window sizes must be uniform. Use rules.skip to exclude patterns entirely.
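The first limitation follows directly from comparing tokens literally. The short Python sketch below compares windows of whitespace-separated tokens for a copied snippet and a renamed one; the helper names and sample snippets are made up for illustration.

```python
def windows(tokens, k):
    """Return the set of all k-token windows in a token list."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def shared_windows(a, b, k):
    """Return the k-token windows that appear in both token lists."""
    return windows(a, k) & windows(b, k)


original = "total = price * qty ; return total".split()
copied = "total = price * qty ; return total".split()
renamed = "amount = cost * count ; return amount".split()

print(len(shared_windows(original, copied, 5)))   # every window matches
print(len(shared_windows(original, renamed, 5)))  # no window matches
```

Because every 5-token window in the renamed snippet contains at least one changed identifier, none of them equals a window in the original, so a literal token comparison reports nothing.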

Exit codes

Code  Meaning
0     No duplicates found, or duplicates found but --fail was not passed
1     Duplicates found and --fail was passed, or a configuration error occurred
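A CI script written in Python could act on these codes roughly as shown below. The `run_gate` helper is hypothetical; note that exit code 1 covers both duplicates and configuration errors, so the sketch cannot tell them apart from the code alone.

```python
import subprocess
import sys


def run_gate(command=("dupelens", "check", "--fail")):
    """Run the check and return (passed, message) based on the exit code."""
    completed = subprocess.run(list(command))
    if completed.returncode == 0:
        passed, message = True, "no duplicates found"
    elif completed.returncode == 1:
        passed, message = False, "duplicates found or configuration error"
    else:
        passed, message = False, f"unexpected exit code {completed.returncode}"
    print(message, file=sys.stderr)
    return passed, message
```

A caller would typically end with `sys.exit(0 if run_gate()[0] else 1)` so the pipeline fails when the gate does.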

Español

Detector de duplicación de código. Usa fingerprinting Rabin-Karp (hash rodante) sobre el código tokenizado — los strings y comentarios se eliminan antes del hashing para reducir falsos positivos. Agnóstico al lenguaje (Go, TS, JS, Python, Rust, Java, etc.). Un solo binario nativo, cero dependencias.

Parte del monorepo open-harness.

Instalación

pip install open-harness-dupelens

pip descarga automáticamente la wheel nativa correcta para tu plataforma.

Uso

dupelens check                  # escanea con defaults
dupelens check --fail           # exit 1 si hay duplicados (CI / git hooks)
dupelens check --min-tokens 30  # cambia el tamaño de ventana del hash rodante
dupelens check --format=json    # salida JSON para integraciones
dupelens check --dir ./src      # escanea un directorio específico
dupelens check --verbose        # imprime timings en stderr
dupelens check --no-color       # consola sin colores
dupelens init                   # genera un dupelens.json por defecto
dupelens version                # imprime la versión

Configuración

Colocá un dupelens.json en la raíz del repo (ver ejemplo arriba).

  • minTokens — tamaño de la ventana del hash rodante. Valores más altos detectan solo duplicaciones más grandes.
  • minLines — filtra matches cortos (ej. imports idénticos consecutivos).
  • rules: skip por patrón. Gana la primera regla coincidente.

Alternativa: configurar dentro de package.json

Si preferís no tener un dupelens.json separado, agregá una key dupelens en tu package.json con la misma forma del archivo dedicado. Precedencia: --config <path> > dupelens.json > key en package.json > defaults. Los flags CLI (--min-tokens, --format, etc.) siempre ganan.

Salida

Soporta consola coloreada y JSON estructurado. Ver ejemplos arriba.

Integraciones

Sirve con Husky, lefthook o GitHub Actions usando los mismos snippets de la sección en inglés.

Por qué Rabin-Karp en vez de AST

  • Cero dependencias: no hay que enviar parsers por lenguaje.
  • Agnóstico: el mismo binario escanea Go, TypeScript, Python, Rust, Java, etc.
  • Rápido: el hash rodante detecta matches en O(n) sobre el stream de tokens.

El trade-off está documentado en ADR-012.

Limitaciones (v0.2.0)

  • Solo detecta duplicación literal o cuasi-literal (token a token). Refactors con variables renombradas no se detectan — eso requiere análisis AST.
  • El algoritmo es binario (hay match o no hay); no existe un flag de umbral de similitud.
  • El override de minTokens por regla no funciona entre archivos porque la ventana debe ser uniforme. Usá rules.skip para excluir patrones por completo.

Códigos de salida

Código Significado
0 Sin duplicados (o no se pasó --fail)
1 Hay duplicados con --fail, o error de configuración

License

MIT — see the main repository.
