# reDUP
Code duplication analyzer and refactoring planner for LLMs.
reDUP scans codebases for duplicated functions, blocks, and structural patterns — then builds a prioritized refactoring map that LLMs can consume to eliminate redundancy systematically.
## Features

- Exact duplicate detection via SHA-256 block hashing
- Structural clone detection: same AST shape, different variable names
- Fuzzy near-duplicate matching via `SequenceMatcher` / `rapidfuzz`
- Function-level analysis using Python AST extraction
- Impact scoring: prioritizes duplicates by `saved_lines × similarity`
- Refactoring planner: generates concrete extract/inline suggestions
- Three output formats: JSON (tooling), YAML (humans), TOON (LLMs)
- CLI with `typer` + `rich` for interactive use
- Clean output: no syntax warnings from external libraries
- Optimized performance: reduced complexity and improved maintainability
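Structural clone detection ("same AST shape, different variable names") can be sketched with the stdlib `ast` module alone: blank out identifiers, then compare the remaining tree shape. This is a minimal illustration of the technique, not reDUP's actual hasher; `structural_fingerprint` is a hypothetical name:

```python
import ast

def structural_fingerprint(source: str) -> str:
    """Normalize identifiers so two functions that differ only in
    naming produce the same fingerprint."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Blank out names so only the AST shape remains.
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.FunctionDef):
            node.name = "_"
    return ast.dump(tree)

a = "def add(x, y):\n    return x + y"
b = "def plus(left, right):\n    return left + right"
print(structural_fingerprint(a) == structural_fingerprint(b))  # True
```

Exact hashing would treat `a` and `b` as different; normalizing first is what lets a structural detector group them.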
## Installation

```bash
pip install redup
```

With optional dependencies (quoted so the brackets survive shells like zsh):

```bash
pip install "redup[all]"    # Everything
pip install "redup[fuzzy]"  # rapidfuzz for better similarity matching
pip install "redup[ast]"    # tree-sitter for multi-language AST
pip install "redup[lsh]"    # datasketch for LSH near-duplicate detection
```
## Quick Start

### CLI

```bash
# Scan current directory, output TOON to stdout
redup scan .

# Scan with JSON output saved to file
redup scan ./src --format json --output ./reports/

# Scan with all formats
redup scan . --format all --output ./redup_output/

# Only function-level duplicates (faster)
redup scan . --functions-only

# Custom thresholds
redup scan . --min-lines 5 --min-sim 0.9

# Show installed optional dependencies
redup info
```
### Python API

```python
from pathlib import Path

from redup import ScanConfig, analyze
from redup.reporters.json_reporter import to_json
from redup.reporters.toon_reporter import to_toon

config = ScanConfig(
    root=Path("./my_project"),
    extensions=[".py"],
    min_block_lines=3,
    min_similarity=0.85,
)

result = analyze(config=config, function_level_only=True)
print(f"Found {result.total_groups} duplicate groups")
print(f"Lines recoverable: {result.total_saved_lines}")

# For LLM consumption
print(to_toon(result))

# For tooling / CI
Path("duplication.json").write_text(to_json(result))
```
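The `min_similarity` threshold governs the fuzzy matching step. What that ratio means can be sketched with the stdlib `difflib.SequenceMatcher` (when installed, `rapidfuzz` plays the same role, faster); the block contents below are made up for illustration:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the two blocks are textually identical.
    return SequenceMatcher(None, a, b).ratio()

block_a = "total = price * qty\nreturn round(total, 2)"
block_b = "total = price * qty\nreturn round(total, 2)"
block_c = "subtotal = cost * qty\nreturn round(subtotal, 2)"

print(similarity(block_a, block_b))        # 1.0: exact duplicate
print(similarity(block_a, block_c) > 0.7)  # True: near-duplicate candidate
```

Raising `min_similarity` toward 1.0 keeps only near-verbatim copies; lowering it admits looser matches at the cost of more false positives.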
## Output Formats

### TOON (LLM-optimized)

```text
# redup/duplication | 3 groups | 12f 4200L | 2026-03-22
SUMMARY:
  files_scanned: 12
  total_lines: 4200
  dup_groups: 3
  saved_lines: 84

DUPLICATES[3] (ranked by impact):
  [E0001] !! EXAC calculate_tax L=8 N=3 saved=16 sim=1.00
    billing.py:1-8 (calculate_tax)
    shipping.py:1-8 (calculate_tax)
    returns.py:1-8 (calculate_tax)

REFACTOR[1] (ranked by priority):
  [1] ○ extract_function → utils/calculate_tax.py
      WHY: 3 occurrences of 8-line block across 3 files — saves 16 lines
      FILES: billing.py, shipping.py, returns.py
```
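Downstream tooling can pull the key fields out of a group header line with a regular expression. A sketch inferred from the sample above; the layout is read off the example, not a formal spec:

```python
import re

# Group header line from the sample report above.
line = "[E0001] !! EXAC calculate_tax L=8 N=3 saved=16 sim=1.00"

pattern = (
    r"\[(?P<id>\w+)\] \S+ (?P<type>\w+) (?P<name>\w+) "
    r"L=(?P<lines>\d+) N=(?P<occurrences>\d+) "
    r"saved=(?P<saved>\d+) sim=(?P<sim>[\d.]+)"
)
m = re.match(pattern, line)
print(m.group("id"), m.group("name"))                # E0001 calculate_tax
print(int(m.group("saved")), float(m.group("sim")))  # 16 1.0
```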
### JSON (machine-readable)

```json
{
  "summary": {
    "total_groups": 3,
    "total_saved_lines": 84
  },
  "groups": [
    {
      "id": "E0001",
      "type": "exact",
      "normalized_name": "calculate_tax",
      "fragments": [
        {"file": "billing.py", "line_start": 1, "line_end": 8},
        {"file": "shipping.py", "line_start": 1, "line_end": 8}
      ],
      "saved_lines_potential": 16
    }
  ],
  "refactor_suggestions": [
    {
      "priority": 1,
      "action": "extract_function",
      "new_module": "utils/calculate_tax.py",
      "risk_level": "low"
    }
  ]
}
```
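Because the schema is machine-readable, a CI step can gate on the report with only the stdlib. A sketch using the sample report embedded inline; the budget threshold is an illustrative value, not a reDUP option:

```python
import json

# Sample report from above, embedded for the example.
report = json.loads("""
{
  "summary": {"total_groups": 3, "total_saved_lines": 84},
  "groups": [
    {"id": "E0001", "type": "exact", "normalized_name": "calculate_tax",
     "fragments": [
       {"file": "billing.py", "line_start": 1, "line_end": 8},
       {"file": "shipping.py", "line_start": 1, "line_end": 8}
     ],
     "saved_lines_potential": 16}
  ],
  "refactor_suggestions": [
    {"priority": 1, "action": "extract_function",
     "new_module": "utils/calculate_tax.py", "risk_level": "low"}
  ]
}
""")

# Fail the build if too many recoverable lines accumulate.
SAVED_LINES_BUDGET = 100  # illustrative threshold
assert report["summary"]["total_saved_lines"] <= SAVED_LINES_BUDGET

for group in report["groups"]:
    files = sorted({f["file"] for f in group["fragments"]})
    print(group["id"], group["saved_lines_potential"], files)
```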
## Architecture

```text
src/redup/
├── __init__.py            # Public API
├── __main__.py            # python -m redup
├── core/
│   ├── models.py          # Pydantic data models
│   ├── scanner.py         # File discovery + block extraction
│   ├── hasher.py          # SHA-256 / structural fingerprinting
│   ├── matcher.py         # Fuzzy similarity comparison
│   ├── planner.py         # Refactoring suggestion generator
│   └── pipeline.py        # Orchestrator: scan → hash → match → plan
├── reporters/
│   ├── json_reporter.py   # JSON output
│   ├── yaml_reporter.py   # YAML output
│   └── toon_reporter.py   # TOON output (LLM-optimized)
└── cli_app/
    └── main.py            # Typer CLI
```
## Analysis Pipeline

1. **SCAN**: walk the project, read files, extract function-level and sliding-window blocks
2. **HASH**: generate exact (SHA-256) and structural (normalized AST) fingerprints
3. **GROUP**: bucket by hash, keep only groups with 2+ blocks from different locations
4. **MATCH**: verify candidates with fuzzy similarity (SequenceMatcher / rapidfuzz)
5. **DEDUP**: remove overlapping groups, keeping the highest-impact one
6. **PLAN**: generate prioritized refactoring suggestions with risk assessment
7. **REPORT**: export to JSON / YAML / TOON
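The SCAN → HASH → GROUP stages can be sketched in a few lines of stdlib Python. This mirrors the idea, not reDUP's internals; `exact_duplicate_groups` and the sample files are hypothetical:

```python
import hashlib
from collections import defaultdict

def exact_duplicate_groups(files: dict[str, str], window: int = 3):
    """Slide a fixed-size window over each file, fingerprint blocks
    with SHA-256, and keep fingerprints seen at 2+ locations."""
    groups = defaultdict(list)
    for path, text in files.items():
        lines = [ln.strip() for ln in text.splitlines()]
        for i in range(len(lines) - window + 1):
            block = "\n".join(lines[i:i + window])
            digest = hashlib.sha256(block.encode()).hexdigest()
            groups[digest].append((path, i + 1))  # (file, 1-based start line)
    return {d: locs for d, locs in groups.items() if len(locs) >= 2}

files = {
    "billing.py":  "def calculate_tax(x):\n    rate = 0.23\n    return x * rate\n",
    "shipping.py": "def calculate_tax(x):\n    rate = 0.23\n    return x * rate\n",
}
dups = exact_duplicate_groups(files)
print(len(dups))  # 1: the 3-line function appears in both files
```

Stripping each line before hashing is a crude normalization; the real pipeline layers structural fingerprints and fuzzy verification on top of this kind of exact bucketing.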
## Recent Improvements (v0.1.8)

### 🎯 Complexity Reduction

- Reduced mean cyclomatic complexity from CC̄=4.8 to CC̄=4.4
- Eliminated high-complexity functions (CC > 15)
- Modularized the `analyze()` function into 7 focused helpers
- Refactored `_ast_to_normalized_string()` into 3 specialized functions
- Improved code maintainability and testability

### 🚀 Performance & UX

- Clean output: no syntax warnings from external libraries
- Optimized imports and code organization
- Enhanced error handling for edge cases
- Better type hints with `Callable[[str], str]` patterns
- Streamlined path operations using `os.path.commonpath`
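`os.path.commonpath` condenses what would otherwise be a manual prefix-trimming loop when reporting fragment paths relative to their shared root. A small sketch with made-up paths:

```python
import os.path

fragments = [
    "src/redup/core/hasher.py",
    "src/redup/core/matcher.py",
    "src/redup/reporters/json_reporter.py",
]

# Shared root of all fragment paths, in a single call.
root = os.path.commonpath(fragments)
print(root)  # src/redup (on POSIX)
print([os.path.relpath(p, root) for p in fragments])
```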
### 📊 Quality Metrics

- Health status: ✅ HEALTHY (no critical issues)
- Tests: 64/64 passing
- Code quality: 0 high-complexity functions
- Duplication: minimal (2 groups, 6 lines)
## Integration with the wronai Toolchain

reDUP is part of the wronai developer toolchain:

- **code2llm**: static analysis engine (health diagnostics, complexity)
- **reDUP**: deep duplication analysis and refactoring planning
- **code2docs**: automatic documentation generation
- **vallm**: validation of LLM-generated code proposals

📈 Typical workflow:

1. `code2llm` analyzes the project → `.toon` diagnostics
2. `redup` finds duplicates → `duplication.toon`
3. Feed both to an LLM for targeted refactoring
4. `vallm` validates the LLM's proposals before merging
## 🎯 Why reDUP?

- **LLM-ready**: TOON format optimized for LLM consumption
- **Actionable**: generates concrete refactoring suggestions
- **Prioritized**: ranks duplicates by impact and risk
- **Integrated**: works seamlessly with the wronai toolchain
- **Fast**: scans 1000+ lines in under 1 second
- **Clean**: no syntax warnings, professional output
## Development

```bash
git clone https://github.com/semcod/redup.git
cd redup
pip install -e ".[dev]"
pytest
```
## License

Apache License 2.0. See LICENSE for details.

## Author

Created by Tom Sapletta (tom@sapletta.com)
## Download files
File details
Details for the file redup-0.1.10.tar.gz.
File metadata
- Download URL: redup-0.1.10.tar.gz
- Upload date:
- Size: 31.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d48bc2a32010d89e97d215e3427208ceb4bd35338987a146c1053de86ffa0e74
|
|
| MD5 |
93d183dfffc74820379d1eac06df0134
|
|
| BLAKE2b-256 |
a04989095ea919bf8baf221009ecc872aada3170b4093990f912fa00470570d6
|
### redup-0.1.10-py3-none-any.whl (built distribution)

- Size: 26.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `e54d0c397c91936519faee2fed35958ab5bac4aaaf3f1b8385d853aafa305f77` |
| MD5 | `29fb4fbe1694fee02ed4513aa8993233` |
| BLAKE2b-256 | `cef65550db785a3bb834556871e32898a918c3b740f921a35cc828842b96b6c1` |