Detect duplicate Python definitions, text patterns, and token similarities for codebase maintainability.
Duplifinder
The "Batteries Included" duplicate code detector. Detect and refactor duplicate Python classes, functions, and async defs—plus text and tokens across other languages—to keep your codebase lean and mean.
⚡ Quick Start (The "5-Minute Rule")
Prerequisites
- Python 3.12+
- pip (or uv/poetry)
Installation
pip install duplifinder
Usage Example
Get instant feedback on your current directory:
# Standard scan (AST + Token)
duplifinder .
# Watch mode for live feedback (Best for dev loop)
duplifinder . --watch --preview
# Scan with parallel processing and detailed audit logs
duplifinder src/ --parallel --audit --verbose
Pre-commit Hook
Add to your .pre-commit-config.yaml to block duplicates before they merge:
- repo: https://github.com/dhruv13x/duplifinder
  rev: v11.0.0  # Use latest version
  hooks:
    - id: duplifinder
      args: ["--fail", "--dup-threshold=0.05"]
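The threshold passed to the hook above can be pictured as a simple ratio check. This is a minimal sketch only: the function name and the definition of the duplication rate (duplicated definitions divided by all definitions) are assumptions for illustration, since the README does not state how Duplifinder computes the rate.

```python
def exceeds_dup_threshold(duplicated: int, total: int, dup_threshold: float = 0.1) -> bool:
    """Hypothetical check: is the duplication rate above the allowed ratio?"""
    # Assumed rate definition; the real tool may count lines or tokens instead.
    rate = duplicated / total if total else 0.0
    return rate > dup_threshold


if __name__ == "__main__":
    # 6 duplicated definitions out of 100 is a 6% rate, above a 5% threshold.
    print(exceeds_dup_threshold(6, 100, dup_threshold=0.05))
```

With `--dup-threshold=0.05`, a codebase under this assumed definition would trip the alert once more than 5% of its definitions are duplicates.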
✨ Features (The "Why")
Core Capabilities
- AST-Powered Detection: Precision finding for ClassDef, FunctionDef, and AsyncFunctionDef (Python). It sees through variable name changes.
- Multi-Language Support: Token and text-based similarity checks for Python, JavaScript, TypeScript, and Java.
- Smart Watch Mode: "Live" scanning that updates results instantly as you modify files.
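One way a detector can "see through" variable name changes is to normalize local names in the AST before hashing. The sketch below uses only the standard `ast` and `hashlib` modules; the `fingerprint` helper and the renaming scheme are illustrative assumptions, not Duplifinder's actual implementation.

```python
import ast
import hashlib


class _Renamer(ast.NodeTransformer):
    """Rewrite locally bound names to positional placeholders (v0, v1, ...)."""

    def __init__(self, mapping: dict[str, str]) -> None:
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def fingerprint(source: str) -> str:
    """Hash the first definition in `source`, ignoring its name and local names."""
    func = ast.parse(source).body[0]
    # Collect parameters and assigned names; globals and builtins stay untouched.
    bound = [n.arg for n in ast.walk(func) if isinstance(n, ast.arg)]
    bound += [
        n.id
        for n in ast.walk(func)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    ]
    mapping: dict[str, str] = {}
    for name in bound:
        mapping.setdefault(name, f"v{len(mapping)}")
    func.name = "_"  # the definition's own name should not affect the hash
    normalized = _Renamer(mapping).visit(func)
    # ast.dump omits line and column attributes by default.
    return hashlib.sha256(ast.dump(normalized).encode()).hexdigest()
```

Two functions that differ only in their own name and in local variable names then produce the same fingerprint, while a change to an operator or constant produces a different one.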
Performance & Security
- Parallel Processing: Blazing fast scans using multi-threading or multi-processing (GIL-aware) with --parallel and --use-multiprocessing.
- Smart Caching: Skips unchanged files to dramatically speed up re-scans.
- Audit Logging: Enterprise-grade JSONL trails for file access and scan operations.
Developer Experience
- Automated Refactoring Suggestions: "God Level" advice—tells you how to fix the duplication (e.g., "Extract to shared utility").
- Rich Reporting: Beautiful console tables, JSON output for CI/CD, and formatted previews.
🛠️ Configuration (The "How")
Customize behavior via CLI flags or a .duplifinder.yaml file.
CLI Reference
| Flag | Description | Default |
|---|---|---|
| <root> | Positional argument: Root directory to scan. | . |
| --config | Path to a YAML configuration file. | None |
| --watch | Live scanning on file changes. | False |
| --parallel | Enable parallel file scanning (threading). | False |
| --use-multiprocessing | Use CPU cores (true parallelism) instead of threads. | False |
| --max-workers | Limit the number of parallel workers. | Auto |
| --fail | Exit with code 1 if duplicates found (CI mode). | False |
| --json | Output results in JSON format. | False |
| -p, --preview | Show the actual code snippets in the output. | False |
| --audit | Enable audit logging to file. | False |
| --audit-log | Path for the audit log file. | .duplifinder_audit.jsonl |
| --token-mode | Enable token-based fuzzy matching. | False |
| --similarity-threshold | Sensitivity for token matching (0.0 - 1.0). | 0.8 |
| --dup-threshold | Alert if duplication rate exceeds this ratio. | 0.1 |
| -f, --find | Specific types to find (class, def, async_def). | All |
| --exclude-patterns | Glob patterns to exclude (e.g., */migrations/*). | None |
| --exclude-names | Regex patterns for definition names to exclude. | None |
| --no-gitignore | Do NOT respect .gitignore files. | False |
| --version | Show version information. | - |
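To give a feel for what a similarity threshold between 0.0 and 1.0 means in token mode, here is a rough sketch built on the standard `re` and `difflib` modules. The regex tokenizer, the use of `SequenceMatcher`, and the `>=` comparison are all assumptions for illustration; Duplifinder's real tokenizer and similarity metric may differ.

```python
import re
from difflib import SequenceMatcher


def tokens(source: str) -> list[str]:
    """Naive language-agnostic tokenizer: words, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", source)


def similarity(a: str, b: str) -> float:
    """Ratio in [0.0, 1.0] of matching tokens between two snippets."""
    return SequenceMatcher(None, tokens(a), tokens(b), autojunk=False).ratio()


def is_duplicate(a: str, b: str, similarity_threshold: float = 0.8) -> bool:
    """Flag a pair as duplicates when similarity reaches the threshold."""
    return similarity(a, b) >= similarity_threshold


if __name__ == "__main__":
    first = "def area(w, h):\n    return w * h\n"
    second = "def area(width, h):\n    return width * h\n"
    print(round(similarity(first, second), 2), is_duplicate(first, second))
```

Under this sketch, lowering the threshold (for example to 0.6) reports more loosely related pairs, and raising it toward 1.0 reports only near-identical ones.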
Configuration File (.duplifinder.yaml)
Settings can also be kept in a .duplifinder.yaml file. CLI arguments override values from the file.
| Key | Description | Default |
|---|---|---|
| root | Root directory to scan | . |
| ignore | Comma-separated directory names to ignore | .git, venv, etc. |
| exclude_patterns | List of glob patterns to exclude | [] |
| token_mode | Enable token-based fuzzy matching | false |
| similarity_threshold | Sensitivity for token matching | 0.8 |
| dup_threshold | Duplication rate threshold for alerts | 0.1 |
| audit | Enable audit logging | false |
| parallel | Enable parallel scanning | false |
| watch | Enable watch mode | false |
Note: Configuration through environment variables is not currently supported, so that settings stay reproducible from files kept in the repository.
# Example .duplifinder.yaml
root: src
ignore: "tests,legacy"
exclude_patterns: ["*/migrations/*"]
token_mode: true
similarity_threshold: 0.85
audit: true
parallel: true
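The precedence rule (CLI arguments override the file, which overrides defaults) can be sketched as a plain dictionary merge. The `resolve_settings` function is hypothetical and the defaults are copied from the table above; the real project validates configuration with Pydantic, which this sketch does not attempt to reproduce.

```python
# Defaults taken from the configuration table in this README.
DEFAULTS = {
    "root": ".",
    "exclude_patterns": [],
    "token_mode": False,
    "similarity_threshold": 0.8,
    "dup_threshold": 0.1,
    "audit": False,
    "parallel": False,
    "watch": False,
}


def resolve_settings(file_config: dict, cli_args: dict) -> dict:
    """Merge settings with precedence: CLI > config file > defaults."""
    settings = dict(DEFAULTS)
    settings.update(file_config)
    # Assumption: a CLI value of None means the flag was not passed.
    settings.update({key: value for key, value in cli_args.items() if value is not None})
    return settings
```

For example, a file that sets `similarity_threshold: 0.85` combined with a CLI value of 0.9 would resolve to 0.9, while `root` would still come from the file if no positional argument was given.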
🏗️ Architecture
Duplifinder uses a Strategy pattern to dispatch scanners based on file type and mode.
Directory Tree
src/duplifinder/
├── application.py # Workflow orchestration
├── cli.py # Argument parsing
├── config.py # Pydantic configuration & validation
├── finder.py # Strategy Dispatcher
├── definition_finder.py # AST-based Logic (Python)
├── token_finder.py # Token-based Similarity (Multi-lang)
├── text_finder.py # Regex Pattern Matcher
├── refactoring.py # Refactoring Suggestion Engine
├── processors.py # File I/O & Parallel Processing
├── output.py # Rich Console & JSON Renderers
├── utils.py # File discovery & Audit logging
└── watcher.py # Watchdog event handling
Data Flow
- Discovery: utils.py recursively finds files, respecting .gitignore.
- Dispatch: finder.py selects the right strategy (AST, Token, or Text) based on file extension.
- Analysis: processors.py runs in parallel to extract definitions or tokens.
- Comparison: Hashes or token vectors are compared to find duplicates.
- Refactoring: refactoring.py analyzes results to generate actionable fixes.
- Reporting: Results are streamed to Console (using Rich), JSON, or HTML.
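The dispatch step in the flow above can be sketched as a small Strategy-style selector. Every name here (`ast_scan`, `token_scan`, `text_scan`, `select_scanner`) and the extension set are hypothetical placeholders; the actual logic lives in finder.py and may choose strategies differently.

```python
from pathlib import Path
from typing import Callable

Scanner = Callable[[str], str]


def ast_scan(path: str) -> str:
    return f"ast:{path}"  # placeholder for AST-based definition extraction


def token_scan(path: str) -> str:
    return f"token:{path}"  # placeholder for token similarity analysis


def text_scan(path: str) -> str:
    return f"text:{path}"  # placeholder for regex pattern matching


# Assumed extensions for the languages listed under Multi-Language Support.
TOKEN_EXTENSIONS = {".py", ".js", ".ts", ".java"}


def select_scanner(path: str, token_mode: bool = False) -> Scanner:
    """Pick a scanning strategy from the file extension and mode."""
    suffix = Path(path).suffix.lower()
    if suffix == ".py" and not token_mode:
        return ast_scan
    if suffix in TOKEN_EXTENSIONS:
        return token_scan
    return text_scan
```

Keeping each strategy behind the same call signature means a new language or mode only needs a new function and one extra branch in the selector.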
🐞 Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| No duplicates found | Thresholds too high or wrong path. | Lower --similarity-threshold (e.g., 0.6) or check <root>. |
| Scanning is slow | Large vendor directories. | Add folders to --ignore or .gitignore (e.g., node_modules, venv). |
| Memory usage high | Very large files or too many threads. | Reduce --max-workers or use --exclude-patterns for large generated files. |
| "Config validation failed" | Invalid .yaml or args. | Check error message and compare with CLI Reference. |
Debug Mode: Run with --verbose to see detailed logs and performance metrics.
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for details on how to get started.
Dev Setup
- Clone the repo.
- Install dependencies: pip install -e ".[dev]"
- Run tests: pytest
- Linting: ruff check .
🗺️ Roadmap
See ROADMAP.md for the full vision.
- ✅ Foundation: AST Detection, Parallelism, Rich Output.
- ✅ Standard: Watch Mode, Refactoring Suggestions, Multi-language.
- 🚧 Ecosystem (Next): IDE Plugins, GitHub Action, Webhooks.
- 🔮 Vision: AI-Powered Refactoring, Cross-Repo Analysis.
Built with 💙 by Dhruv & the Open Source Community.