
Detect duplicate Python definitions, text patterns, and token similarities for codebase maintainability.

Duplifinder

The "Batteries Included" duplicate code detector. It finds duplicate Python classes, functions, and async defs, plus text and token similarities in other languages, and suggests refactorings to keep your codebase lean and mean.


⚡ Quick Start (The "5-Minute Rule")

Prerequisites

  • Python 3.12+
  • pip (or uv/poetry)

Installation

pip install duplifinder

Usage Example

Get instant feedback on your current directory:

# Standard scan (AST + Token)
duplifinder .

# Watch mode for live feedback (Best for dev loop)
duplifinder . --watch --preview

# Scan with parallel processing and detailed audit logs
duplifinder src/ --parallel --audit --verbose

Pre-commit Hook

Add to your .pre-commit-config.yaml to block duplicates before they merge:

-   repo: https://github.com/dhruv13x/duplifinder
    rev: v11.0.0  # Use latest version
    hooks:
    -   id: duplifinder
        args: ["--fail", "--dup-threshold=0.05"]
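 
For CI systems that do not use pre-commit, the same gate can be driven from a small script. The sketch below is only an illustration: build_command and run_gate are hypothetical helper names, not part of Duplifinder, and the only behavior assumed is what the CLI reference states, namely that --fail makes the process exit with code 1 when duplicates are found.

```python
import subprocess
import sys
from typing import Callable, Sequence


def build_command(root: str = ".", dup_threshold: float = 0.05) -> list[str]:
    """Assemble a duplifinder invocation from flags listed in the CLI reference."""
    return ["duplifinder", root, "--fail", f"--dup-threshold={dup_threshold}"]


def run_gate(
    command: Sequence[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when the scan exits with code 0, False otherwise.

    The runner parameter exists only so the gate can be exercised without
    the real CLI installed (hypothetical convenience, not a Duplifinder API).
    """
    result = runner(list(command), capture_output=True, text=True)
    if result.returncode != 0:
        # With --fail, a non-zero exit code signals that duplicates were found.
        print(result.stdout, file=sys.stderr)
        return False
    return True
```

A CI step could then call sys.exit(0 if run_gate(build_command("src")) else 1).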

✨ Features (The "Why")

Core Capabilities

  • AST-Powered Detection: Precision finding for ClassDef, FunctionDef, and AsyncFunctionDef (Python). It sees through variable name changes.
  • Multi-Language Support: Token and text-based similarity checks for Python, JavaScript, TypeScript, and Java.
  • Smart Watch Mode: "Live" scanning that updates results instantly as you modify files.

Performance & Security

  • Parallel Processing: Scans run on multiple threads with --parallel, or on multiple processes with --use-multiprocessing to sidestep the GIL for CPU-bound work.
  • Smart Caching: Skips unchanged files, so re-scans only process what has changed.
  • Audit Logging: JSONL audit trails of file access and scan operations.
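 
The caching idea can be illustrated with a content-hash cache. This is a sketch under assumptions: the README does not say what Duplifinder keys its cache on or where it stores it, so file_digest, changed_files, and the JSON cache file are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return a SHA-256 digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths: list[Path], cache_file: Path) -> list[Path]:
    """Return the paths whose content differs from the previous run.

    The cache is a JSON mapping of path -> digest and is rewritten on
    every call so the next run compares against the current state.
    """
    previous = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    current = {str(p): file_digest(p) for p in paths}
    cache_file.write_text(json.dumps(current, indent=2))
    return [p for p in paths if previous.get(str(p)) != current[str(p)]]
```

On a first run every file counts as changed; on later runs only edited files are returned and need to be re-scanned.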

Developer Experience

  • Automated Refactoring Suggestions: "God Level" advice—tells you how to fix the duplication (e.g., "Extract to shared utility").
  • Rich Reporting: Beautiful console tables, JSON output for CI/CD, and formatted previews.

🛠️ Configuration (The "How")

Customize behavior via CLI flags or a .duplifinder.yaml file.

CLI Reference

Flag                    Description                                           Default
<root>                  Positional argument: Root directory to scan.          .
--config                Path to a YAML configuration file.                    None
--watch                 Live scanning on file changes.                        False
--parallel              Enable parallel file scanning (threading).            False
--use-multiprocessing   Use CPU cores (true parallelism) instead of threads.  False
--max-workers           Limit the number of parallel workers.                 Auto
--fail                  Exit with code 1 if duplicates found (CI mode).       False
--json                  Output results in JSON format.                        False
-p, --preview           Show the actual code snippets in the output.          False
--audit                 Enable audit logging to file.                         False
--audit-log             Path for the audit log file.                          .duplifinder_audit.jsonl
--token-mode            Enable token-based fuzzy matching.                    False
--similarity-threshold  Sensitivity for token matching (0.0 - 1.0).           0.8
--dup-threshold         Alert if duplication rate exceeds this ratio.         0.1
-f, --find              Specific types to find (class, def, async_def).       All
--exclude-patterns      Glob patterns to exclude (e.g., */migrations/*).      None
--exclude-names         Regex patterns for definition names to exclude.       None
--no-gitignore          Do NOT respect .gitignore files.                      False
--version               Show version information.                             -
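 
To give a feel for what --similarity-threshold controls, the sketch below scores two Python snippets by comparing their token streams. The metric is an assumption chosen for illustration (difflib.SequenceMatcher over standard-library tokens); the README does not document the metric Duplifinder uses, and token_stream, similarity, and is_duplicate are hypothetical names.

```python
import io
import tokenize
from difflib import SequenceMatcher

# Token types that carry layout or commentary rather than code.
_SKIP = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def token_stream(source: str) -> list[str]:
    """Tokenize Python source, dropping comments and layout tokens."""
    reader = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in _SKIP
    ]


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0.0, 1.0]; 1.0 means identical token streams."""
    matcher = SequenceMatcher(None, token_stream(a), token_stream(b), autojunk=False)
    return matcher.ratio()


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag a pair when its similarity reaches the threshold (0.8 mirrors the default)."""
    return similarity(a, b) >= threshold
```

Lowering the threshold makes the check more permissive, which is why the troubleshooting advice for "No duplicates found" is to try a lower value.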

Configuration File (.duplifinder.yaml)

The same settings can live in a .duplifinder.yaml file. CLI arguments override values from the file.

Key                   Description                                Default
root                  Root directory to scan                     .
ignore                Comma-separated directory names to ignore  .git, venv, etc.
exclude_patterns      List of glob patterns to exclude           []
token_mode            Enable token-based fuzzy matching          false
similarity_threshold  Sensitivity for token matching             0.8
dup_threshold         Duplication rate threshold for alerts      0.1
audit                 Enable audit logging                       false
parallel              Enable parallel scanning                   false
watch                 Enable watch mode                          false

Note: Environment variables are not currently supported for configuration, so every setting stays reproducible from the config file and explicit CLI flags.

# Example .duplifinder.yaml
root: src
ignore: "tests,legacy"
exclude_patterns:
  - "*/migrations/*"
token_mode: true
similarity_threshold: 0.85
audit: true
parallel: true
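 
The precedence rule (CLI arguments override file settings) can be sketched as a simple merge. The README says config.py uses Pydantic; the version below uses a plain dataclass to stay dependency-free, models only a subset of the documented keys, and assumes both thresholds must lie in 0.0 to 1.0. Settings and resolve are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    """A subset of the documented keys, with the documented defaults."""

    root: str = "."
    exclude_patterns: tuple[str, ...] = ()
    token_mode: bool = False
    similarity_threshold: float = 0.8
    dup_threshold: float = 0.1
    audit: bool = False
    parallel: bool = False
    watch: bool = False

    def __post_init__(self) -> None:
        # Assumption: both thresholds are ratios and must stay within [0, 1].
        for name in ("similarity_threshold", "dup_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")


def resolve(file_values: dict, cli_values: dict) -> Settings:
    """Merge file settings with CLI values; CLI wins when a flag was given."""
    known = {f.name for f in fields(Settings)}
    merged = {k: v for k, v in file_values.items() if k in known}
    # None stands for "flag not passed", so it must not clobber the file value.
    merged.update({k: v for k, v in cli_values.items() if k in known and v is not None})
    if "exclude_patterns" in merged:
        merged["exclude_patterns"] = tuple(merged["exclude_patterns"])
    return Settings(**merged)
```

For example, a file that sets similarity_threshold to 0.85 combined with a CLI value of 0.6 resolves to 0.6, while keys absent from both fall back to the defaults.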

🏗️ Architecture

Duplifinder uses a Strategy pattern to dispatch scanners based on file type and mode.
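 
As an illustration of the Strategy pattern in this setting, the sketch below picks a scanner object by file extension and mode. The class names, the extension mapping, and the toy scan methods are assumptions made for the example; they are not taken from finder.py.

```python
import ast
import re
from pathlib import Path
from typing import Protocol


class Scanner(Protocol):
    """Common interface every strategy implements."""

    name: str

    def scan(self, source: str) -> list[str]: ...


class AstScanner:
    name = "ast"

    def scan(self, source: str) -> list[str]:
        """Return the names of classes and functions defined in Python source."""
        kinds = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
        return [n.name for n in ast.walk(ast.parse(source)) if isinstance(n, kinds)]


class TokenScanner:
    name = "token"

    def scan(self, source: str) -> list[str]:
        """Split source into words and punctuation, independent of language."""
        return re.findall(r"\w+|[^\w\s]", source)


class TextScanner:
    name = "text"

    def scan(self, source: str) -> list[str]:
        """Return non-empty, stripped lines."""
        return [line.strip() for line in source.splitlines() if line.strip()]


# Assumed mapping, based on the languages named under Multi-Language Support.
TOKEN_SUFFIXES = {".py", ".js", ".ts", ".java"}


def pick_scanner(path: str, token_mode: bool = False) -> Scanner:
    """Select a strategy from the file extension and the token-mode switch."""
    suffix = Path(path).suffix.lower()
    if suffix == ".py" and not token_mode:
        return AstScanner()
    if suffix in TOKEN_SUFFIXES:
        return TokenScanner()
    return TextScanner()
```

The caller only ever uses pick_scanner(path).scan(source), so adding a language means adding a strategy class and a mapping entry rather than touching the call sites.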

Directory Tree

src/duplifinder/
├── application.py       # Workflow orchestration
├── cli.py               # Argument parsing
├── config.py            # Pydantic configuration & validation
├── finder.py            # Strategy Dispatcher
├── definition_finder.py # AST-based Logic (Python)
├── token_finder.py      # Token-based Similarity (Multi-lang)
├── text_finder.py       # Regex Pattern Matcher
├── refactoring.py       # Refactoring Suggestion Engine
├── processors.py        # File I/O & Parallel Processing
├── output.py            # Rich Console & JSON Renderers
├── utils.py             # File discovery & Audit logging
└── watcher.py           # Watchdog event handling

Data Flow

  1. Discovery: utils.py recursively finds files, respecting .gitignore.
  2. Dispatch: finder.py selects the right strategy (AST, Token, or Text) based on file extension.
  3. Analysis: processors.py runs in parallel to extract definitions or tokens.
  4. Comparison: Hashes or token vectors are compared to find duplicates.
  5. Refactoring: refactoring.py analyzes results to generate actionable fixes.
  6. Reporting: Results are streamed to Console (using Rich), JSON, or HTML.
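 
Step 4 (Comparison) can be pictured as grouping definitions by fingerprint. The sketch below is a simplified illustration: Definition, find_duplicates, and duplication_rate are hypothetical names, and the rate formula (duplicated definitions divided by all definitions) is an assumption, since the README does not define how the duplication rate behind --dup-threshold is computed.

```python
from collections import defaultdict
from typing import NamedTuple, Sequence


class Definition(NamedTuple):
    path: str
    name: str
    fingerprint: str  # e.g. a hash of the normalized AST or token stream


def find_duplicates(definitions: Sequence[Definition]) -> dict[str, list[Definition]]:
    """Group definitions by fingerprint and keep groups with more than one member."""
    groups: dict[str, list[Definition]] = defaultdict(list)
    for definition in definitions:
        groups[definition.fingerprint].append(definition)
    return {fp: members for fp, members in groups.items() if len(members) > 1}


def duplication_rate(
    definitions: Sequence[Definition],
    duplicates: dict[str, list[Definition]],
) -> float:
    """Assumed formula: share of definitions that belong to a duplicate group."""
    total = len(definitions)
    duplicated = sum(len(members) for members in duplicates.values())
    return duplicated / total if total else 0.0
```

A gate in the spirit of --dup-threshold would then compare duplication_rate(...) against the configured ratio.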

🐞 Troubleshooting

Issue                       Likely Cause                           Solution
No duplicates found         Thresholds too high or wrong path.     Lower --similarity-threshold (e.g., 0.6) or check <root>.
Scanning is slow            Large vendor directories.              Add folders to --ignore or .gitignore (e.g., node_modules, venv).
Memory usage high           Very large files or too many threads.  Reduce --max-workers or use --exclude-patterns for large generated files.
"Config validation failed"  Invalid .yaml or args.                 Check error message and compare with CLI Reference.

Debug Mode: Run with --verbose to see detailed logs and performance metrics.
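 
The advice to reduce --max-workers is easier to reason about with a picture of how a worker pool is sized. The sketch below uses concurrent.futures from the standard library; it is an assumption about the general approach, not a description of processors.py, and make_executor, count_lines, and scan_all are hypothetical names.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Optional


def make_executor(use_multiprocessing: bool = False, max_workers: Optional[int] = None) -> Executor:
    """Threads by default; processes when true CPU parallelism is requested.

    max_workers=None lets the standard library choose a default pool size.
    """
    pool = ProcessPoolExecutor if use_multiprocessing else ThreadPoolExecutor
    return pool(max_workers=max_workers)


def count_lines(text: str) -> int:
    """Stand-in for per-file analysis; top-level so process pools can pickle it."""
    return len(text.splitlines())


def scan_all(
    sources: list[str],
    use_multiprocessing: bool = False,
    max_workers: Optional[int] = None,
) -> list[int]:
    """Run the per-file work across the pool, preserving input order."""
    with make_executor(use_multiprocessing, max_workers) as executor:
        return list(executor.map(count_lines, sources))
```

Each worker holds its own file contents in memory while it runs, so a smaller pool trades speed for a lower peak footprint.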


🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details on how to get started.

Dev Setup

  1. Clone the repo.
  2. Install dependencies: pip install -e ".[dev]"
  3. Run tests: pytest
  4. Linting: ruff check .

🗺️ Roadmap

See ROADMAP.md for the full vision.

  • Foundation: AST Detection, Parallelism, Rich Output.
  • Standard: Watch Mode, Refactoring Suggestions, Multi-language.
  • 🚧 Ecosystem (Next): IDE Plugins, GitHub Action, Webhooks.
  • 🔮 Vision: AI-Powered Refactoring, Cross-Repo Analysis.

Built with 💙 by Dhruv & the Open Source Community.
