Detect duplicate Python definitions, text patterns, and token similarities for codebase maintainability.

Duplifinder

The "Batteries Included" duplicate code detector. Detect and refactor duplicate Python classes, functions, and async defs—plus text and tokens across other languages—to keep your codebase lean and mean.


⚡ Quick Start (The "5-Minute Rule")

Prerequisites

  • Python 3.12+
  • pip (or uv/poetry)

Installation

pip install duplifinder

Usage Example

Get instant feedback on your current directory:

# Standard scan (AST + Token)
duplifinder .

# Watch mode for live feedback (Best for dev loop)
duplifinder . --watch --preview

# Scan with parallel processing and detailed audit logs
duplifinder src/ --parallel --audit --verbose
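
For scripted use, such as a CI job, the same commands can be driven from Python. The sketch below assembles only flags listed in the CLI Reference (--parallel, --json, --fail). The helper names build_command and run_scan are hypothetical, and it assumes the duplifinder executable is on PATH.

```python
import subprocess
from typing import Sequence


def build_command(root: str = ".", *, fail: bool = False,
                  json_output: bool = False, parallel: bool = False) -> list[str]:
    """Assemble a duplifinder command line from documented flags only."""
    cmd = ["duplifinder", root]
    if parallel:
        cmd.append("--parallel")
    if json_output:
        cmd.append("--json")
    if fail:
        cmd.append("--fail")
    return cmd


def run_scan(cmd: Sequence[str]) -> int:
    """Run the scan, echo its output, and return the process exit code."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
    print(completed.stdout)
    return completed.returncode


if __name__ == "__main__":
    # With --fail, the README documents exit code 1 when duplicates are found.
    raise SystemExit(run_scan(build_command("src/", fail=True, parallel=True)))
```

Passing the command as a list, rather than a shell string, avoids quoting problems when the root path contains spaces.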

Pre-commit Hook

Add to your .pre-commit-config.yaml to block duplicates before they merge:

-   repo: https://github.com/dhruv13x/duplifinder
    rev: v11.0.0  # Use latest version
    hooks:
    -   id: duplifinder
        args: ["--fail", "--dup-threshold=0.05"]

✨ Features (The "Why")

Core Capabilities

  • AST-Powered Detection: Precision finding for ClassDef, FunctionDef, and AsyncFunctionDef (Python). It sees through variable name changes.
  • Multi-Language Support: Token and text-based similarity checks for Python, JavaScript, TypeScript, and Java.
  • Smart Watch Mode: "Live" scanning that updates results instantly as you modify files.
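
To illustrate how an AST-based comparison can see through variable renames, here is a minimal sketch using only the standard library. It is not Duplifinder's actual implementation: the _RenameLocals class and fingerprint function are hypothetical names, and the normalization is deliberately crude (it renames every identifier, including builtins, by order of first appearance).

```python
import ast
import hashlib


class _RenameLocals(ast.NodeTransformer):
    """Replace identifiers with positional placeholders (v0, v1, ...)."""

    def __init__(self) -> None:
        self._names: dict[str, str] = {}

    def _placeholder(self, name: str) -> str:
        # The first distinct name seen becomes v0, the second v1, and so on.
        return self._names.setdefault(name, f"v{len(self._names)}")

    def visit_Name(self, node: ast.Name) -> ast.Name:
        renamed = ast.Name(id=self._placeholder(node.id), ctx=node.ctx)
        return ast.copy_location(renamed, node)

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._placeholder(node.arg)
        return node


def fingerprint(func_source: str) -> str:
    """Hash the structure of the first definition in func_source."""
    func = ast.parse(func_source).body[0]
    func.name = "_"  # ignore the definition's own name
    normalized = _RenameLocals().visit(func)
    # ast.dump omits line and column attributes by default, so layout is ignored.
    return hashlib.sha256(ast.dump(normalized).encode()).hexdigest()


if __name__ == "__main__":
    a = "def add(a, b):\n    total = a + b\n    return total\n"
    b = "def plus(x, y):\n    s = x + y\n    return s\n"
    print(fingerprint(a) == fingerprint(b))
```

Two definitions with equal fingerprints share the same structure up to renaming; any change to an operator or statement produces a different hash.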

Performance & Security

  • Parallel Processing: Scans files concurrently with threads (--parallel) or, to get past the GIL on CPU-bound work, with separate processes (--use-multiprocessing).
  • Smart Caching: Skips unchanged files to dramatically speed up re-scans.
  • Audit Logging: Enterprise-grade JSONL trails for file access and scan operations.
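
One common way to skip unchanged files is to remember a content hash per path and rescan only when the hash differs. The sketch below shows that idea under stated assumptions: the function names and the in-memory cache dictionary are hypothetical, and Duplifinder's real cache may use a different key (for example, modification time) or storage format.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def files_to_rescan(paths: list[Path], cache: dict[str, str]) -> list[Path]:
    """Return paths whose content changed since the last call.

    cache maps a path string to the digest seen last time and is
    updated in place, so a second call with no edits returns [].
    """
    changed = []
    for path in paths:
        digest = file_digest(path)
        if cache.get(str(path)) != digest:
            changed.append(path)
            cache[str(path)] = digest
    return changed
```

Hashing still reads every file, but it is typically much cheaper than parsing and comparing each one again.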

Developer Experience

  • Automated Refactoring Suggestions: "God Level" advice—tells you how to fix the duplication (e.g., "Extract to shared utility").
  • Rich Reporting: Beautiful console tables, JSON output for CI/CD, and formatted previews.

🛠️ Configuration (The "How")

Customize behavior via CLI flags or a .duplifinder.yaml file.

CLI Reference

| Flag | Description | Default |
| --- | --- | --- |
| <root> | Positional argument: Root directory to scan. | . |
| --config | Path to a YAML configuration file. | None |
| --watch | Live scanning on file changes. | False |
| --parallel | Enable parallel file scanning (threading). | False |
| --use-multiprocessing | Use CPU cores (true parallelism) instead of threads. | False |
| --max-workers | Limit the number of parallel workers. | Auto |
| --fail | Exit with code 1 if duplicates found (CI mode). | False |
| --json | Output results in JSON format. | False |
| -p, --preview | Show the actual code snippets in the output. | False |
| --audit | Enable audit logging to file. | False |
| --audit-log | Path for the audit log file. | .duplifinder_audit.jsonl |
| --token-mode | Enable token-based fuzzy matching. | False |
| --similarity-threshold | Sensitivity for token matching (0.0 - 1.0). | 0.8 |
| --dup-threshold | Alert if duplication rate exceeds this ratio. | 0.1 |
| -f, --find | Specific types to find (class, def, async_def). | All |
| --exclude-patterns | Glob patterns to exclude (e.g., */migrations/*). | None |
| --exclude-names | Regex patterns for definition names to exclude. | None |
| --no-gitignore | Do NOT respect .gitignore files. | False |
| --version | Show version information. | - |
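
To give a feel for what a --similarity-threshold of 0.8 means, the sketch below scores two snippets by comparing their token sequences with the standard library's difflib. This is an illustration only: Duplifinder's own scoring function is not documented here and may differ, and token_stream and similarity are hypothetical names.

```python
import difflib
import io
import tokenize

# Token types that carry layout or comments rather than code.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def token_stream(source: str) -> list[str]:
    """Tokenize Python source, dropping comments and layout-only tokens."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in _SKIP]


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0.0, 1.0]; 1.0 means identical token sequences."""
    matcher = difflib.SequenceMatcher(None, token_stream(a), token_stream(b),
                                      autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    first = "def area(w, h):\n    return w * h\n"
    second = "def area(w, h):\n    return w + h  # changed operator\n"
    score = similarity(first, second)
    print(f"{score:.2f}", "duplicate" if score >= 0.8 else "distinct")
```

Under this kind of scoring, lowering the threshold admits looser matches, and raising it toward 1.0 keeps only near-identical code.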

Configuration File (.duplifinder.yaml)

Settings can also live in a .duplifinder.yaml file. CLI arguments override values from the file.

| Key | Description | Default |
| --- | --- | --- |
| root | Root directory to scan | . |
| ignore | Comma-separated directory names to ignore | .git, venv, etc. |
| exclude_patterns | List of glob patterns to exclude | [] |
| token_mode | Enable token-based fuzzy matching | false |
| similarity_threshold | Sensitivity for token matching | 0.8 |
| dup_threshold | Duplication rate threshold for alerts | 0.1 |
| audit | Enable audit logging | false |
| parallel | Enable parallel scanning | false |
| watch | Enable watch mode | false |

Note: Environment variables are not currently supported for configuration, so that every run can be reproduced from the checked-in config file and CLI flags alone.

# Example .duplifinder.yaml
root: src
ignore: "tests,legacy"
exclude_patterns:
  - "*/migrations/*"
token_mode: true
similarity_threshold: 0.85
audit: true
parallel: true
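
The precedence described above (CLI arguments override the file, which overrides built-in defaults) can be sketched as a layered dictionary merge. The DEFAULTS values are taken from the table above. The resolve_settings helper and the convention that unset CLI flags arrive as None are assumptions for illustration, not Duplifinder's actual code.

```python
from typing import Any

# Defaults copied from the configuration table in this README.
DEFAULTS: dict[str, Any] = {
    "root": ".",
    "token_mode": False,
    "similarity_threshold": 0.8,
    "dup_threshold": 0.1,
    "audit": False,
    "parallel": False,
    "watch": False,
}


def resolve_settings(file_cfg: dict[str, Any],
                     cli_args: dict[str, Any]) -> dict[str, Any]:
    """Layer settings: defaults < config file < CLI flags actually passed."""
    merged = dict(DEFAULTS)
    merged.update(file_cfg)
    # Assumption: a CLI flag the user did not pass is represented as None.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


if __name__ == "__main__":
    from_file = {"root": "src", "similarity_threshold": 0.85}
    from_cli = {"root": None, "similarity_threshold": 0.6}
    print(resolve_settings(from_file, from_cli))
```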

🏗️ Architecture

Duplifinder uses a Strategy pattern to dispatch scanners based on file type and mode.

Directory Tree

src/duplifinder/
├── application.py       # Workflow orchestration
├── cli.py               # Argument parsing
├── config.py            # Pydantic configuration & validation
├── finder.py            # Strategy Dispatcher
├── definition_finder.py # AST-based Logic (Python)
├── token_finder.py      # Token-based Similarity (Multi-lang)
├── text_finder.py       # Regex Pattern Matcher
├── refactoring.py       # Refactoring Suggestion Engine
├── processors.py        # File I/O & Parallel Processing
├── output.py            # Rich Console & JSON Renderers
├── utils.py             # File discovery & Audit logging
└── watcher.py           # Watchdog event handling

Data Flow

  1. Discovery: utils.py recursively finds files, respecting .gitignore.
  2. Dispatch: finder.py selects the right strategy (AST, Token, or Text) based on file extension.
  3. Analysis: processors.py runs in parallel to extract definitions or tokens.
  4. Comparison: Hashes or token vectors are compared to find duplicates.
  5. Refactoring: refactoring.py analyzes results to generate actionable fixes.
  6. Reporting: Results are streamed to Console (using Rich), JSON, or HTML.
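
The dispatch step (step 2) can be pictured as a lookup from file extension to a scanner callable. The sketch below is a simplified illustration of the Strategy pattern, not the contents of finder.py: the function names, the extension map, and the token_mode switch are assumptions.

```python
from pathlib import Path
from typing import Callable

Strategy = Callable[[Path], str]


def ast_strategy(path: Path) -> str:
    return "ast"      # placeholder for AST-based definition matching


def token_strategy(path: Path) -> str:
    return "token"    # placeholder for token-similarity matching


def text_strategy(path: Path) -> str:
    return "text"     # placeholder for regex/text pattern matching


# Assumed extensions for the languages named in the feature list.
TOKEN_EXTENSIONS = {".py", ".js", ".ts", ".java"}


def select_strategy(path: Path, token_mode: bool = False) -> Strategy:
    """Pick a scanner for one file based on its extension and the mode."""
    suffix = path.suffix.lower()
    if token_mode and suffix in TOKEN_EXTENSIONS:
        return token_strategy
    if suffix == ".py":
        return ast_strategy
    if suffix in TOKEN_EXTENSIONS:
        return token_strategy
    return text_strategy


if __name__ == "__main__":
    for name in ("app.py", "index.ts", "notes.md"):
        path = Path(name)
        print(name, "->", select_strategy(path)(path))
```

Keeping each scanner behind the same callable signature means a new language or mode only needs a new function and a new entry in the lookup.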

🐞 Troubleshooting

| Issue | Likely Cause | Solution |
| --- | --- | --- |
| No duplicates found | Thresholds too high or wrong path. | Lower --similarity-threshold (e.g., 0.6) or check <root>. |
| Scanning is slow | Large vendor directories. | Add folders to --ignore or .gitignore (e.g., node_modules, venv). |
| Memory usage high | Very large files or too many threads. | Reduce --max-workers or use --exclude-patterns for large generated files. |
| "Config validation failed" | Invalid .yaml or args. | Check error message and compare with CLI Reference. |

Debug Mode: Run with --verbose to see detailed logs and performance metrics.


🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details on how to get started.

Dev Setup

  1. Clone the repo.
  2. Install dependencies: pip install -e ".[dev]"
  3. Run tests: pytest
  4. Linting: ruff check .

🗺️ Roadmap

See ROADMAP.md for the full vision.

  • Foundation: AST Detection, Parallelism, Rich Output.
  • Standard: Watch Mode, Refactoring Suggestions, Multi-language.
  • 🚧 Ecosystem (Next): IDE Plugins, GitHub Action, Webhooks.
  • 🔮 Vision: AI-Powered Refactoring, Cross-Repo Analysis.

Built with 💙 by Dhruv & the Open Source Community.
