reDUP
Code duplication analyzer and refactoring planner for LLMs.
AI Cost Tracking
- LLM usage: $7.5000 (63 commits)
- Human dev: ~$1568 (15.7h @ $100/h, 30min dedup)
Generated on 2026-04-16 using openrouter/qwen/qwen3-coder-next
reDUP scans codebases for duplicated functions, blocks, and structural patterns, then builds a prioritized refactoring map that LLMs can consume to eliminate redundancy systematically.
Features
- Exact duplicate detection via SHA-256 block hashing
- Structural clone detection – same AST shape, different variable names
- LSH near-duplicate detection for large code blocks (>50 lines)
- Multi-language support – 35+ languages via tree-sitter (Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, C#, Ruby, PHP, Bash, SQL, HTML, CSS, Lua, Scala, Kotlin, Swift, Objective-C, JSON, YAML, TOML, XML, Markdown, GraphQL, Dockerfile, Makefile, Nginx, Vim, Svelte, Vue, and more)
- Parallel scanning for large projects (2x+ performance improvement)
- Fuzzy near-duplicate matching via SequenceMatcher / rapidfuzz
- Function-level analysis using Python AST and tree-sitter extraction
- Impact scoring – prioritizes duplicates by saved_lines × similarity
- Refactoring planner – generates concrete extract/inline suggestions
- Multiple output formats: JSON, YAML, TOON, Markdown
- Configuration system – TOML files and environment variables
- CLI commands: scan, compare, diff, check, config, info
- Cross-project comparison – detect shared code between projects with merge/extract recommendations
- CI integration with configurable quality gates
- Clean output – no syntax warnings from external libraries
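To illustrate the exact-duplicate idea (a minimal sketch of the technique, not reDUP's actual implementation), code blocks can be normalized for whitespace and bucketed by their SHA-256 digest:

```python
import hashlib
from collections import defaultdict

def block_hash(lines):
    """Hash a block after collapsing whitespace, so formatting noise is ignored."""
    normalized = "\n".join(" ".join(line.split()) for line in lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_exact_duplicates(blocks):
    """blocks: (file, start_line, lines) tuples. Returns hash -> locations seen 2+ times."""
    buckets = defaultdict(list)
    for file, start, lines in blocks:
        buckets[block_hash(lines)].append((file, start))
    return {h: locs for h, locs in buckets.items() if len(locs) >= 2}

dups = find_exact_duplicates([
    ("billing.py", 1, ["total = price * qty", "return total"]),
    ("shipping.py", 9, ["total  =  price * qty", "return total"]),  # extra spaces, same code
])
```

Here the two blocks hash identically despite different spacing, so they land in one duplicate group.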
New Features (v0.4.20)
MCP Server
Full MCP (Model Context Protocol) server for AI assistant integration:
# Start MCP server
redup-mcp
# Or HTTP mode
redup-mcp --transport http --port 8000
Available Tools:
- analyze_project – Full duplication analysis
- find_duplicates – Quick duplicate detection
- check_project – Quality gate check
- compare_projects – Cross-project comparison
- suggest_refactoring – AI-powered refactoring suggestions
- project_info – Project metadata
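As a sketch of how an HTTP client might call one of these tools: the `tools/call` method follows the MCP JSON-RPC convention, but the argument names (`path`, `min_lines`) are illustrative assumptions, not a verified schema.

```python
import json

# Hypothetical JSON-RPC payload for the HTTP transport started above.
# Tool name comes from the list in this README; arguments are assumptions.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "find_duplicates",
        "arguments": {"path": ".", "min_lines": 3},
    },
}
payload = json.dumps(request)
```

The resulting `payload` would be POSTed to the server running on port 8000.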
Universal Fuzzy Similarity Detection
Cross-language duplicate detection across all 35+ supported languages:
# Detect similar code across languages
redup scan . --fuzzy --fuzzy-threshold 0.65
Cross-Language Matching:
- JavaScript ↔ Python functions: ~65% similarity
- Docker ↔ YAML configs: ~40% similarity
- Auth patterns across languages: ~70% similarity
Supported Patterns:
- Functions, classes, API endpoints
- Database queries, web components
- Auth/validation, error handling, logging
- Configuration, infrastructure code
Modular Tree-Sitter Extractor
Refactored tree-sitter extraction with clean, modular architecture:
ts_extractor/
├── extractors/       # Modular per-language extractors
│   ├── c_family.py   # C, C++, C#, Objective-C
│   ├── go.py         # Go
│   ├── java.py       # Java, Scala, Kotlin
│   ├── markup.py     # HTML, XML, Svelte, Vue
│   ├── web.py        # JavaScript, TypeScript
│   └── ...
├── dispatcher.py     # Smart language routing
├── config.py         # Language registry
└── main.py           # Unified API
Benefits:
- Easier to add new languages
- Better testability
- Cleaner separation of concerns
- 35+ languages supported
New Features (v0.5.0+)
Universal Fuzzy Similarity Detection
Cross-language fuzzy matching for detecting similar code patterns across all 35+ supported languages:
# Detect similar patterns across different languages
redup scan . --fuzzy --ext .py,.js,.ts
# Cross-project comparison with fuzzy matching
redup compare ./project-a ./project-b --fuzzy --threshold 0.65
Features:
- Detects similar functions, API endpoints, validation logic across languages (e.g., JS ↔ Python)
- Pattern recognition: authentication, error handling, database queries, web components
- Language-agnostic signature generation with identifier normalization
- Complexity scoring (0.0-1.0) for each detected pattern
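A rough sketch of how identifier normalization plus fuzzy matching can compare code across languages (illustrative only; reDUP's actual signature generation is more involved):

```python
import re
from difflib import SequenceMatcher

KEYWORDS = {"def", "function", "return", "if", "else", "for", "while", "const", "let"}

def normalize(code):
    """Replace non-keyword identifiers with a placeholder so naming differences vanish."""
    def repl(match):
        word = match.group(0)
        return word if word in KEYWORDS else "ID"
    return re.sub(r"[A-Za-z_]\w*", repl, code)

def similarity(a, b):
    """Fuzzy ratio (0.0-1.0) between two normalized snippets."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

py = "def add_totals(items): return sum(items)"
js = "function addTotals(items) { return items.reduce((a, b) => a + b, 0) }"
score = similarity(py, js)
```

After normalization, both snippets reduce to keyword-and-placeholder skeletons, so structurally similar functions score well regardless of naming.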
Example patterns detected:
- Express.js route handler ↔ Flask endpoint (70% similarity)
- Docker Compose service ↔ Kubernetes deployment (40% similarity)
- Auth middleware patterns across frameworks
Modular ts_extractor Architecture
The tree-sitter multi-language extractor has been refactored from a 782-line god module into a clean package:
redup/core/ts_extractor/
├── extractors/
│   ├── web.py        # JavaScript/TypeScript
│   ├── c_family.py   # C/C++
│   ├── dotnet.py     # C#
│   ├── ruby.py       # Ruby
│   ├── php.py        # PHP
│   └── ...           # 10+ language-specific modules
Benefits:
- Better maintainability (avg 100 lines per module vs 782)
- Easier to add new language extractors
- Shared base utilities for common operations
- Full backward compatibility maintained
Enriched TOON Reporter
The TOON format now includes actionable sections for practical refactoring:
- HOTSPOTS – Top 7 files with most duplicated lines (where to focus effort)
- QUICK_WINS – Low-risk, high-savings suggestions (do first)
- DEPENDENCY_RISK – Duplicates spanning multiple packages (cross-module risk)
- EFFORT_ESTIMATE – Time estimates per task with difficulty (easy/medium/hard)
LLM-Powered Refactoring Plans
Generate AI-assisted refactoring TODO lists from cross-project comparisons:
redup compare ./project-a ./project-b --refactor-plan --env .env --output report.json
- Uses litellm for flexible LLM provider support
- Compact metadata-only prompts for efficiency
- Structured JSON output with prioritized tasks
- Token usage tracking
Simplified Compare Reports
Cross-project comparison reports are now more compact and human-readable:
- Relative file paths instead of absolute
- Matches deduplicated by function pair
- Communities with compact member dicts
- Filtered trivial entries to reduce noise
- ~60% smaller JSON size
Installation
pip install redup
With optional dependencies:
pip install redup[all] # Everything
pip install redup[fuzzy] # rapidfuzz for better similarity matching
pip install redup[ast] # tree-sitter for multi-language AST
pip install redup[lsh] # datasketch for LSH near-duplicate detection
pip install redup[compare] # networkx for cross-project community detection
pip install redup[llm] # litellm for LLM-powered refactoring plans
Quick Start
CLI
# Scan current directory, output TOON to stdout
redup scan .
# Scan with JSON output saved to file
redup scan ./src --format json --output ./reports/
# Parallel scanning for large projects
redup scan . --parallel --max-workers 4
# Multi-language scanning with 35+ supported languages
redup scan . --ext ".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue"
# CI gate with thresholds
redup check . --max-groups 10 --max-lines 100
# Compare two scans
redup diff before.json after.json
# Cross-project comparison (merge vs extract decision)
redup compare ./project-a ./project-b --threshold 0.75
# With LLM-powered refactoring plan (requires litellm + .env with API keys)
redup compare ./project-a ./project-b --refactor-plan --env .env --output comparison.json
# Specify custom LLM model
redup compare ./project-a ./project-b --refactor-plan --llm-model openrouter/anthropic/claude-3.5-sonnet
# Initialize configuration
redup config --init
# Scan with all formats
redup scan . --format all --output ./redup_output/
# Only function-level duplicates (faster)
redup scan . --functions-only
# Custom thresholds
redup scan . --min-lines 5 --min-sim 0.9
# Show installed optional dependencies
redup info
Configuration
Create a redup.toml file:
[scan]
extensions = ".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue"
min_lines = 3
min_similarity = 0.85
include_tests = false
[lsh]
enabled = true
min_lines = 50
threshold = 0.8
[check]
max_groups = 10
max_lines = 100
[output]
format = "toon"
output = "redup_output"
[reporting]
include_snippets = true
generate_suggestions = true
Or use [tool.redup] in pyproject.toml. Environment variables with REDUP_ prefix override file settings.
Python API
from pathlib import Path
from redup import ScanConfig, analyze
from redup.reporters.toon_reporter import to_toon
from redup.reporters.json_reporter import to_json
config = ScanConfig(
root=Path("./my_project"),
extensions=[".py", ".js", ".ts", ".go", ".rs", ".java", ".rb", ".php", ".html", ".css"],
min_block_lines=3,
min_similarity=0.85,
)
result = analyze(config=config, function_level_only=True)
print(f"Found {result.total_groups} duplicate groups")
print(f"Lines recoverable: {result.total_saved_lines}")
# For LLM consumption
print(to_toon(result))
# For tooling / CI
Path("duplication.json").write_text(to_json(result))
Output Formats
TOON (LLM-optimized)
# redup/duplication | 15 groups | 86f 10453L | 2026-04-16
SUMMARY:
files_scanned: 86
total_lines: 10453
dup_groups: 15
dup_fragments: 36
saved_lines: 217
scan_ms: 3620
HOTSPOTS[7] (files with most duplication):
src/redup/core/ts_extractor.py dup=74L groups=4 frags=11 (0.7%)
src/redup/core/scanner_utils.py dup=70L groups=3 frags=3 (0.7%)
src/redup/core/scanner_loader.py dup=52L groups=1 frags=1 (0.5%)
DUPLICATES[15] (ranked by impact):
[E0001] ! EXAC _preload_files L=52 N=2 saved=52 sim=1.00
src/redup/core/scanner_loader.py:9-60 (_preload_files)
src/redup/core/scanner_utils.py:53-104 (_preload_files)
REFACTOR[15] (ranked by priority):
[1] extract_module → src/redup/core/utils/_preload_files.py
WHY: 2 occurrences of 52-line block across 2 files → saves 52 lines
FILES: src/redup/core/scanner_loader.py, src/redup/core/scanner_utils.py
QUICK_WINS[8] (low risk, high savings – do first):
[3] extract_function saved=26L → src/redup/core/utils/find_exact_duplicates_lazy.py
FILES: lazy_grouper.py
[4] extract_function saved=21L → src/redup/core/utils/_extract_functions_go.py
FILES: ts_extractor.py
DEPENDENCY_RISK[3] (duplicates spanning multiple packages):
validate_input packages=2 files=2
api/routes/users.py
services/auth/validate.py
EFFORT_ESTIMATE (total ≈ 8.7h):
hard _preload_files saved=52L ~156min
hard __init__ saved=36L ~108min
medium find_exact_duplicates_lazy saved=26L ~52min
easy _is_test_file saved=12L ~24min
METRICS-TARGET:
dup_groups: 15 → 0
saved_lines: 217 lines recoverable
JSON (machine-readable)
{
"summary": {
"total_groups": 3,
"total_saved_lines": 84
},
"groups": [
{
"id": "E0001",
"type": "exact",
"normalized_name": "calculate_tax",
"fragments": [
{"file": "billing.py", "line_start": 1, "line_end": 8},
{"file": "shipping.py", "line_start": 1, "line_end": 8}
],
"saved_lines_potential": 16
}
],
"refactor_suggestions": [
{
"priority": 1,
"action": "extract_function",
"new_module": "utils/calculate_tax.py",
"risk_level": "low"
}
]
}
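A sketch of how a CI script might consume this JSON report to enforce the same thresholds as `redup check --max-groups --max-lines` (field names taken from the sample above; this is an illustration, not reDUP's own gate logic):

```python
import json

def gate(report_text, max_groups=10, max_lines=100):
    """Return True when the report stays under both thresholds."""
    summary = json.loads(report_text)["summary"]
    return (summary["total_groups"] <= max_groups
            and summary["total_saved_lines"] <= max_lines)

sample = '{"summary": {"total_groups": 3, "total_saved_lines": 84}, "groups": []}'
passed = gate(sample)  # 3 <= 10 and 84 <= 100
```

A CI job could call `sys.exit(0 if passed else 1)` to fail the build when duplication grows past the agreed limits.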
Cross-Project Comparison
The redup compare command analyzes two separate projects to detect shared code and recommends a refactoring strategy:
- Merge projects – if >60% code overlap
- Extract shared library – if 5-60% overlap with well-defined clusters
- Keep separate – if <5% overlap
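The decision rule above can be sketched as a simple threshold function (boundary handling at exactly 5% and 60% is a guess, and the real recommender also weighs cluster quality):

```python
def recommend(overlap_pct):
    """Map overlap fraction to a strategy, per the thresholds documented above."""
    if overlap_pct > 0.60:
        return "merge_projects"
    if overlap_pct >= 0.05:
        return "extract_shared_lib"
    return "keep_separate"

decision = recommend(0.1523)  # the overlap from the sample report below
```

With 15.23% overlap this yields "extract_shared_lib", matching the sample recommendation.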
CLI Usage
# Basic comparison
redup compare ./project-a ./project-b --threshold 0.75
# With semantic similarity (slower, more accurate)
redup compare ./project-a ./project-b --semantic --threshold 0.70
# Multi-language projects
redup compare ./backend ./frontend --ext ".py,.js,.ts" --threshold 0.80
# Skip community detection (faster, no networkx required)
redup compare ./a ./b --no-community
# Generate LLM-powered refactoring plan (requires redup[llm])
redup compare ./a ./b --refactor-plan --env .env --output plan.json
Sample Output
Comparing project-a ↔ project-b (threshold=0.75)
         Cross-Project Comparison
┌──────────────────────────┬───────┐
│ Metric                   │ Value │
├──────────────────────────┼───────┤
│ Project A files          │ 42    │
│ Project B files          │ 38    │
│ Project A lines          │ 8500  │
│ Project B lines          │ 7200  │
│ Cross matches            │ 15    │
│ Shared LOC (potential)   │ 1200  │
└──────────────────────────┴───────┘
Recommendation: extract_shared_lib
15% overlap (1200 shared lines, 5 clusters). Extract to shared library.
Confidence: 80%
Top Communities (shared code candidates):
┌────┬─────────────────┬────────────┬─────┬─────────┐
│ ID │ Name            │ Similarity │ LOC │ Members │
├────┼─────────────────┼────────────┼─────┼─────────┤
│ 0  │ validate_input  │ 0.89       │ 180 │ 5       │
│ 1  │ parse_config    │ 0.82       │ 140 │ 4       │
│ 2  │ format_response │ 0.76       │ 100 │ 3       │
└────┴─────────────────┴────────────┴─────┴─────────┘
Report JSON Structure
{
"project_a": "./project-a",
"project_b": "./project-b",
"stats": {
"a": {"files": 42, "lines": 8500},
"b": {"files": 38, "lines": 7200}
},
"total_matches": 15,
"shared_loc_potential": 1200,
"recommendation": {
"decision": "extract_shared_lib",
"rationale": "15% overlap (1200 shared lines, 5 clusters). Extract to shared library.",
"overlap_pct": 0.1523,
"shared_loc": 1200,
"confidence": 0.8
},
"communities": [
{
"name": "validate_input",
"similarity": 0.89,
"loc": 180,
"members": [
{"project": "A", "file": "api/validators.py", "function": "validate_input"},
{"project": "B", "file": "utils/validation.py", "function": "validate_input"}
]
}
],
"matches": [...]
}
Algorithm Overview
The comparison uses a 3-tier similarity detection:
- Structural hash – exact AST matches (fast, O(n+m))
- LSH (Locality-Sensitive Hashing) – near-duplicates via MinHash
- Semantic similarity – CodeBERT embeddings (optional, slowest)
Matches are deduplicated by (function_a, function_b, file_a, file_b) with the highest similarity score retained.
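To make the MinHash tier concrete, here is a pure-stdlib sketch of the idea (reDUP uses datasketch when installed; this toy version trades speed for clarity):

```python
import hashlib

def minhash(tokens, num_perm=64):
    """MinHash signature: per-'permutation' minimum of salted hashes over the token set."""
    sig = []
    for i in range(num_perm):
        salt = str(i).encode()
        sig.append(min(int(hashlib.md5(salt + t.encode()).hexdigest(), 16)
                       for t in tokens))
    return sig

def estimated_similarity(a, b):
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

a = {"def", "total", "price", "qty", "return"}
b = {"def", "total", "price", "qty", "tax"}
est = estimated_similarity(a, b)  # true Jaccard here is 4/6 ≈ 0.67
```

The estimate converges to the true Jaccard similarity as num_perm grows, which is why LSH can find near-duplicates without pairwise diffing every block.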
Community Detection
Requires networkx (pip install redup[compare]).
Uses greedy modularity communities on a similarity graph where:
- Nodes = functions from both projects
- Edges = similarity score (filtered by --threshold)
- Communities = clusters of mutually similar functions
Each community gets a generated name based on the longest common prefix of its member functions (e.g., validate_* → validate_input).
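The naming step above can be sketched with the standard library's common-prefix helper (the minimum prefix length and the fallback rule are assumptions for illustration):

```python
from os.path import commonprefix

def community_name(members):
    """Name a community by the longest common prefix of its member function names,
    falling back to the first member when no meaningful prefix is shared."""
    prefix = commonprefix(members).rstrip("_")
    return prefix if len(prefix) >= 3 else members[0]

name = community_name(["validate_input", "validate_inputs", "validate_input_strict"])
```

For these members the shared prefix is "validate_input", so the community is named after it.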
Architecture
src/redup/
├── __init__.py              # Public API
├── __main__.py              # python -m redup
├── mcp_server.py            # MCP server entry point (re-exports from mcp package)
├── mcp/                     # MCP server package
│   ├── __init__.py          # Public MCP API
│   ├── handlers.py          # Tool handlers
│   ├── schemas.py           # JSON-RPC schemas
│   ├── server.py            # JSON-RPC server core
│   └── utils.py             # Shared utilities
├── core/
│   ├── models.py            # Pydantic data models
│   ├── scanner.py           # File discovery + block extraction
│   ├── scanner/             # Scanner package
│   │   ├── __init__.py      # Public scanner API
│   │   ├── cache.py         # Memory cache
│   │   ├── filters.py       # File filtering
│   │   ├── loader.py        # File preloading
│   │   └── types.py         # Scanner types
│   ├── hasher.py            # SHA-256 / structural fingerprinting
│   ├── matcher.py           # Fuzzy similarity comparison
│   ├── planner.py           # Refactoring suggestion generator
│   ├── pipeline.py          # Legacy: re-exports from pipeline package
│   ├── pipeline/            # Pipeline package (new)
│   │   ├── __init__.py      # analyze(), analyze_optimized(), analyze_parallel()
│   │   ├── phases.py        # scan_phase(), process_blocks()
│   │   ├── duplicate_finder.py  # Duplicate finding phases
│   │   └── groups.py        # Group creation, deduplication
│   └── ts_extractor/        # Tree-sitter extraction (35+ languages)
│       ├── __init__.py      # Public API
│       ├── main.py          # Core extraction API
│       ├── dispatcher.py    # Language routing
│       ├── config.py        # Language registry
│       └── extractors/      # Per-language extractors
├── reporters/
│   ├── json_reporter.py     # JSON output
│   ├── yaml_reporter.py     # YAML output
│   └── toon_reporter.py     # TOON output (LLM-optimized)
└── cli_app/
    └── main.py              # Typer CLI
Analysis Pipeline
1. SCAN Walk project, read files, extract function-level + sliding-window blocks
2. HASH Generate exact (SHA-256) and structural (normalized AST) fingerprints
3. GROUP Bucket by hash, keep only groups with 2+ blocks from different locations
4. MATCH Verify candidates with fuzzy similarity (SequenceMatcher / rapidfuzz)
5. DEDUP Remove overlapping groups (keep highest-impact)
6. PLAN Generate prioritized refactoring suggestions with risk assessment
7. REPORT Export to JSON / YAML / TOON
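The DEDUP step (5) can be sketched as a greedy pass that keeps the highest-impact group and drops any lower-impact group whose line range overlaps it in the same file (a simplification of reDUP's actual logic):

```python
def dedup_groups(groups):
    """Greedy dedup sketch. groups: (impact, file, start_line, end_line) tuples.
    Highest impact wins; overlapping lower-impact groups in the same file are dropped."""
    kept = []
    for group in sorted(groups, reverse=True):      # highest impact first
        impact, file, start, end = group
        overlaps = any(
            f == file and start <= e and end >= s   # closed-interval overlap test
            for _, f, s, e in kept
        )
        if not overlaps:
            kept.append(group)
    return kept

groups = [(52, "utils.py", 10, 61), (26, "utils.py", 40, 65), (21, "extract.py", 5, 25)]
result = dedup_groups(groups)
```

Here the 26-line group overlaps the 52-line group in utils.py and is discarded, while the group in extract.py survives.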
Recent Improvements (v0.5.0)
Modular Architecture Refactoring
Major internal restructuring for better maintainability and extensibility:
MCP Server Package
The MCP server has been split from a 675-line monolith into a clean package:
redup/mcp/
├── __init__.py   # Public API
├── handlers.py   # 8 tool handlers
├── schemas.py    # JSON-RPC schemas
├── server.py     # Server core
└── utils.py      # Utilities
- 82% code reduction in main file
- Backward compatible: mcp_server.py re-exports all APIs
- Better testability: isolated handlers can be tested independently
Pipeline Package
The analysis pipeline (714 lines) now lives in a modular package:
redup/core/pipeline/
├── __init__.py           # analyze(), analyze_optimized(), analyze_parallel()
├── phases.py             # scan_phase(), process_blocks()
├── duplicate_finder.py   # find_exact_groups(), find_structural_groups(), etc.
└── groups.py             # deduplicate_groups(), blocks_to_group(), etc.
- 66% reduction in main orchestrator file
- Phases can be used independently for custom workflows
- Cleaner separation of concerns
Scanner Improvements
The scanner has been refactored with extracted helpers:
- _init_strategy() – strategy initialization
- _process_single_file() – per-file processing
- _extract_blocks_for_file() – block extraction
- Reduced CC and fan-out in the main scan_project() function
Sprint 1 Refactoring Complete
- Reduced mean cyclomatic complexity from 4.2 to 3.5
- Eliminated all critical functions (CC > 10): 2 → 0
- Achieved HEALTHY status with no structural issues
- Dispatch pattern implementation for AST node processing
- Modular TOON reporter split into 5 focused functions
- CLI refactoring with helper functions for better maintainability
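The dispatch pattern mentioned above can be sketched as follows: a flat mapping from node type to handler replaces a long if/elif chain, which is how a function like _process_ast_node can shed complexity. Handler names and node types here are illustrative, not reDUP's actual code.

```python
# Hypothetical handlers for two AST node kinds.
def handle_function(node):
    return f"function {node['name']}"

def handle_class(node):
    return f"class {node['name']}"

# One dispatch table replaces an if/elif ladder, so adding a node type
# means adding one entry rather than another branch.
HANDLERS = {
    "function_definition": handle_function,
    "class_definition": handle_class,
}

def process_node(node):
    handler = HANDLERS.get(node["type"])
    return handler(node) if handler else None  # unknown node types are skipped

result = process_node({"type": "class_definition", "name": "Scanner"})
```

Cyclomatic complexity stays constant as the table grows, since process_node itself has only one branch.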
Technical Achievements
- _process_ast_node: CC=14 → CC=6 (dispatch dict pattern)
- to_toon: CC=12 → CC=8 (5 helper functions)
- CLI scan(): fan-out=18 → ≤10 (4 helper functions)
- Code quality: 0 high-complexity functions
- Test coverage: 64/64 tests passing (100%)
Quality Metrics
- Health status: HEALTHY (no critical issues)
- Mean cyclomatic complexity: 3.5 (target: ≤ 3.0)
- Maximum CC: 9 (target ≤ 10 achieved)
- Code maintainability: significantly improved
- Duplication: minimal (2 groups, 6 lines – acceptable patterns)
Code Architecture
- Dispatch tables for extensible AST processing
- Single responsibility functions throughout codebase
- Clean separation of concerns in CLI pipeline
- Type safety improvements with proper annotations
- Error handling enhanced for edge cases
Integration with wronai Toolchain
reDUP is part of the wronai developer toolchain:
- code2llm – static analysis engine (health diagnostics, complexity)
- reDUP – deep duplication analysis and refactoring planning
- code2docs – automatic documentation generation
- vallm – validation of LLM-generated code proposals
Typical workflow:
1. code2llm analyzes the project → .toon diagnostics
2. redup finds duplicates → duplication.toon.yaml
3. Feed both to an LLM for targeted refactoring
4. vallm validates the LLM's proposals before merging
Why reDUP?
- LLM-ready: TOON format optimized for LLM consumption
- Actionable: Generates concrete refactoring suggestions
- Prioritized: Ranks duplicates by impact and risk
- Integrated: Works seamlessly with wronai toolchain
- Fast: Scans 1000+ lines in < 1 second
- Clean: No syntax warnings, professional output
Development
git clone https://github.com/semcod/redup.git
cd redup
pip install -e ".[dev]"
pytest
License
Licensed under Apache-2.0.
Author
Tom Sapletta
File details
Details for the file redup-0.4.22.tar.gz.
File metadata
- Download URL: redup-0.4.22.tar.gz
- Upload date:
- Size: 119.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3ea827eaaa41c550abab4a79018f5a3198615f865204d27e2080c735ceed086f |
| MD5 | cccad17251e1be4a03c7c5a2ca5f57ba |
| BLAKE2b-256 | f1422daf8cdc0160dc5791584fe4435f204174b2dcda3b7569e2bac4c111f242 |
File details
Details for the file redup-0.4.22-py3-none-any.whl.
File metadata
- Download URL: redup-0.4.22-py3-none-any.whl
- Upload date:
- Size: 131.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04602349d077cdcc4b9bc5e7e744b946c8e18c875fc0510553323ffae83d3789 |
| MD5 | 3d2d9abcdcd9d6ce2c4b839ea407df03 |
| BLAKE2b-256 | ea231d59721bc8823407ef4abc2312ed8f341646ca6636b91a0df2684d4debd7 |