reDUP

Code duplication analyzer and refactoring planner for LLMs.

AI Cost Tracking

  • 🤖 LLM usage: $7.5000 (61 commits)
  • 👤 Human dev: ~$1532 (15.3h @ $100/h, 30min dedup)

Generated on 2026-04-16 using openrouter/qwen/qwen3-coder-next


reDUP scans codebases for duplicated functions, blocks, and structural patterns, then builds a prioritized refactoring map that LLMs can consume to eliminate redundancy systematically.

Features

  • Exact duplicate detection via SHA-256 block hashing
  • Structural clone detection: same AST shape, different variable names
  • LSH near-duplicate detection for large code blocks (>50 lines)
  • Multi-language support: 35+ languages via tree-sitter (Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, C#, Ruby, PHP, Bash, SQL, HTML, CSS, Lua, Scala, Kotlin, Swift, Objective-C, JSON, YAML, TOML, XML, Markdown, GraphQL, Dockerfile, Makefile, Nginx, Vim, Svelte, Vue, and more)
  • Parallel scanning for large projects (2x+ performance improvement)
  • Fuzzy near-duplicate matching via SequenceMatcher / rapidfuzz
  • Function-level analysis using Python AST and tree-sitter extraction
  • Impact scoring: prioritizes duplicates by saved_lines × similarity
  • Refactoring planner: generates concrete extract/inline suggestions
  • Multiple output formats: JSON, YAML, TOON, Markdown
  • Configuration system: TOML files and environment variables
  • CLI commands: scan, diff, check, config, info
  • CI integration with configurable quality gates
  • Clean output: no syntax warnings from external libraries
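
The fuzzy near-duplicate matching listed above can be illustrated with the standard library alone. This is a minimal sketch, not reDUP's actual matcher: the function names `normalize`, `similarity`, and `is_near_duplicate` are hypothetical, and the line-level comparison is an assumption about how blocks might be compared.

```python
from difflib import SequenceMatcher


def normalize(block: str) -> list[str]:
    """Strip indentation and drop blank lines so layout does not affect the score."""
    return [line.strip() for line in block.splitlines() if line.strip()]


def similarity(a: str, b: str) -> float:
    """Line-level similarity in [0.0, 1.0], computed as 2 * matches / total lines."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_near_duplicate(a: str, b: str, min_similarity: float = 0.85) -> bool:
    """Apply a threshold comparable to the min_similarity setting (assumed semantics)."""
    return similarity(a, b) >= min_similarity
```

Comparing normalized lines rather than raw characters keeps the score insensitive to re-indentation, at the cost of treating a one-character change as a whole-line mismatch.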

Installation

pip install redup

With optional dependencies:

pip install "redup[all]"       # Everything
pip install "redup[fuzzy]"     # rapidfuzz for better similarity matching
pip install "redup[ast]"       # tree-sitter for multi-language AST
pip install "redup[lsh]"       # datasketch for LSH near-duplicate detection

Quick Start

CLI

# Scan current directory, output TOON to stdout
redup scan .

# Scan with JSON output saved to file
redup scan ./src --format json --output ./reports/

# Parallel scanning for large projects
redup scan . --parallel --max-workers 4

# Multi-language scanning with 35+ supported languages
redup scan . --ext ".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue"

# CI gate with thresholds
redup check . --max-groups 10 --max-lines 100

# Compare two scans
redup diff before.json after.json

# Initialize configuration
redup config --init

# Scan with all formats
redup scan . --format all --output ./redup_output/

# Only function-level duplicates (faster)
redup scan . --functions-only

# Custom thresholds
redup scan . --min-lines 5 --min-sim 0.9

# Show installed optional dependencies
redup info

Configuration

Create a redup.toml file:

[scan]
extensions = ".py,.js,.ts,.go,.rs,.java,.rb,.php,.html,.css,.sql,.lua,.scala,.kt,.swift,.m,.json,.yaml,.toml,.xml,.md,.graphql,.dockerfile,.svelte,.vue"
min_lines = 3
min_similarity = 0.85
include_tests = false

[lsh]
enabled = true
min_lines = 50
threshold = 0.8

[check]
max_groups = 10
max_lines = 100

[output]
format = "toon"
output = "redup_output"

[reporting]
include_snippets = true
generate_suggestions = true

Or use [tool.redup] in pyproject.toml. Environment variables with REDUP_ prefix override file settings.
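
The override rule can be pictured as an overlay of environment variables on top of the parsed file settings. The sketch below is illustrative only: the `REDUP_<SECTION>_<KEY>` naming scheme, the type coercion, and the function names are assumptions, not reDUP's documented behavior.

```python
import copy
from collections.abc import Mapping

PREFIX = "REDUP_"


def _coerce(raw: str, current: object) -> object:
    """Convert the raw string to the type of the value it replaces."""
    if isinstance(current, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(current, int):
        return int(raw)
    if isinstance(current, float):
        return float(raw)
    return raw


def apply_env_overrides(settings: Mapping[str, dict], env: Mapping[str, str]) -> dict:
    """Return a copy of settings with REDUP_<SECTION>_<KEY> values applied (assumed scheme)."""
    merged = copy.deepcopy(dict(settings))
    for name, raw in env.items():
        if not name.startswith(PREFIX):
            continue
        # "REDUP_SCAN_MIN_LINES" -> section "scan", key "min_lines"
        section, _, key = name[len(PREFIX):].lower().partition("_")
        if not key:
            continue
        table = merged.setdefault(section, {})
        table[key] = _coerce(raw, table.get(key))
    return merged
```

In real use, `settings` would come from parsing `redup.toml` and `env` would be `os.environ`; passing both as arguments keeps the function easy to test.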

Python API

from pathlib import Path
from redup import ScanConfig, analyze
from redup.reporters.toon_reporter import to_toon
from redup.reporters.json_reporter import to_json

config = ScanConfig(
    root=Path("./my_project"),
    extensions=[".py", ".js", ".ts", ".go", ".rs", ".java", ".rb", ".php", ".html", ".css"],
    min_block_lines=3,
    min_similarity=0.85,
)

result = analyze(config=config, function_level_only=True)

print(f"Found {result.total_groups} duplicate groups")
print(f"Lines recoverable: {result.total_saved_lines}")

# For LLM consumption
print(to_toon(result))

# For tooling / CI
Path("duplication.json").write_text(to_json(result))

Output Formats

TOON (LLM-optimized)

# redup/duplication | 3 groups | 12f 4200L | 2026-03-22

SUMMARY:
  files_scanned: 12
  total_lines:   4200
  dup_groups:    3
  saved_lines:   84

DUPLICATES[3] (ranked by impact):
  [E0001] !! EXAC  calculate_tax  L=8 N=3 saved=16 sim=1.00
      billing.py:1-8  (calculate_tax)
      shipping.py:1-8  (calculate_tax)
      returns.py:1-8  (calculate_tax)

REFACTOR[1] (ranked by priority):
  [1] ○ extract_function   → utils/calculate_tax.py
      WHY: 3 occurrences of 8-line block across 3 files - saves 16 lines
      FILES: billing.py, shipping.py, returns.py
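
The numbers in the sample above are consistent with a simple rule: keeping one copy of an L-line block that occurs N times saves L × (N − 1) lines, and the impact score is saved_lines × similarity. The sketch below encodes that reading. The formula for saved lines is inferred from the sample (L=8, N=3, saved=16), and the `Group` class and `rank` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    group_id: str
    lines: int          # L: length of the duplicated block
    occurrences: int    # N: number of copies found
    similarity: float   # sim: 1.00 for exact duplicates

    @property
    def saved_lines(self) -> int:
        # One copy stays after extraction, so N - 1 copies are removed.
        return self.lines * (self.occurrences - 1)

    @property
    def impact(self) -> float:
        # Impact scoring as described in the feature list: saved_lines x similarity.
        return self.saved_lines * self.similarity


def rank(groups: list[Group]) -> list[Group]:
    """Order groups from highest to lowest impact."""
    return sorted(groups, key=lambda g: g.impact, reverse=True)


calculate_tax = Group("E0001", lines=8, occurrences=3, similarity=1.00)
# 8 * (3 - 1) = 16, matching saved=16 in the sample report
```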

JSON (machine-readable)

{
  "summary": {
    "total_groups": 3,
    "total_saved_lines": 84
  },
  "groups": [
    {
      "id": "E0001",
      "type": "exact",
      "normalized_name": "calculate_tax",
      "fragments": [
        {"file": "billing.py", "line_start": 1, "line_end": 8},
        {"file": "shipping.py", "line_start": 1, "line_end": 8}
      ],
      "saved_lines_potential": 16
    }
  ],
  "refactor_suggestions": [
    {
      "priority": 1,
      "action": "extract_function",
      "new_module": "utils/calculate_tax.py",
      "risk_level": "low"
    }
  ]
}
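
Because the JSON report exposes `summary.total_groups` and `summary.total_saved_lines`, a custom quality gate can be written against it directly. This is a sketch of such a gate, not the implementation of `redup check`; in particular, comparing `max_lines` against `total_saved_lines` is an assumption, and `check_report` is a hypothetical helper.

```python
import json
from pathlib import Path


def check_report(report: dict, max_groups: int = 10, max_lines: int = 100) -> list[str]:
    """Return a list of human-readable threshold violations (empty means pass)."""
    summary = report["summary"]
    failures = []
    if summary["total_groups"] > max_groups:
        failures.append(
            f"duplicate groups: {summary['total_groups']} > {max_groups}"
        )
    if summary["total_saved_lines"] > max_lines:
        failures.append(
            f"recoverable lines: {summary['total_saved_lines']} > {max_lines}"
        )
    return failures


def check_file(path: str, **thresholds: int) -> list[str]:
    """Load a report written by a JSON reporter and check it."""
    return check_report(json.loads(Path(path).read_text()), **thresholds)
```

A CI step could call `check_file("duplication.json")` and exit with a non-zero status when the returned list is not empty.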

Architecture

src/redup/
├── __init__.py            # Public API
├── __main__.py            # python -m redup
├── core/
│   ├── models.py          # Pydantic data models
│   ├── scanner.py         # File discovery + block extraction
│   ├── hasher.py          # SHA-256 / structural fingerprinting
│   ├── matcher.py         # Fuzzy similarity comparison
│   ├── planner.py         # Refactoring suggestion generator
│   └── pipeline.py        # Orchestrator: scan → hash → match → plan
├── reporters/
│   ├── json_reporter.py   # JSON output
│   ├── yaml_reporter.py   # YAML output
│   └── toon_reporter.py   # TOON output (LLM-optimized)
└── cli_app/
    └── main.py            # Typer CLI
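
To make "structural fingerprinting" concrete, here is one way a fingerprint that ignores identifier names could be built with Python's `ast` module. It is a sketch under assumptions, not the contents of `hasher.py`: the class and function names are hypothetical, and renaming identifiers in order of first appearance is just one possible normalization.

```python
import ast
import hashlib


class _RenameIdentifiers(ast.NodeTransformer):
    """Replace names with positional placeholders (v0, v1, ...) in visit order."""

    def __init__(self) -> None:
        self._aliases: dict[str, str] = {}

    def _alias(self, name: str) -> str:
        return self._aliases.setdefault(name, f"v{len(self._aliases)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = "_fn"  # the function's own name does not affect the shape
        self.generic_visit(node)
        return node

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = self._alias(node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node: ast.Name) -> ast.AST:
        # Note: builtins such as len are renamed too in this simple sketch.
        node.id = self._alias(node.id)
        return node


def structural_fingerprint(source: str) -> str:
    """SHA-256 of the normalized AST dump; line numbers are excluded by ast.dump."""
    tree = _RenameIdentifiers().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()
```

Two functions that differ only in their names and variable names produce the same fingerprint, while a changed operator or statement produces a different one.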

Analysis Pipeline

1. SCAN      Walk project, read files, extract function-level + sliding-window blocks
2. HASH      Generate exact (SHA-256) and structural (normalized AST) fingerprints
3. GROUP     Bucket by hash, keep only groups with 2+ blocks from different locations
4. MATCH     Verify candidates with fuzzy similarity (SequenceMatcher / rapidfuzz)
5. DEDUP     Remove overlapping groups (keep highest-impact)
6. PLAN      Generate prioritized refactoring suggestions with risk assessment
7. REPORT    Export to JSON / YAML / TOON
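
Steps 1 to 3 for exact duplicates can be sketched in a few lines: slide a fixed-size window over each file, hash every window with SHA-256, and keep the buckets that hold two or more locations. This is an illustration of the idea, not reDUP's scanner; the function names, the whitespace stripping, and the rule of skipping windows that contain blank lines are assumptions.

```python
import hashlib
from collections import defaultdict
from collections.abc import Iterator

Location = tuple[str, int, int]  # (file, first line, last line), 1-indexed


def window_hashes(path: str, text: str, min_lines: int = 3) -> Iterator[tuple[str, Location]]:
    """Yield (sha256 digest, location) for every min_lines-sized window."""
    lines = [line.strip() for line in text.splitlines()]
    for start in range(len(lines) - min_lines + 1):
        window = lines[start:start + min_lines]
        if not all(window):  # skip windows that contain blank lines
            continue
        digest = hashlib.sha256("\n".join(window).encode("utf-8")).hexdigest()
        yield digest, (path, start + 1, start + min_lines)


def group_duplicates(files: dict[str, str], min_lines: int = 3) -> dict[str, list[Location]]:
    """Bucket windows by hash and keep only buckets with 2+ locations."""
    buckets: dict[str, list[Location]] = defaultdict(list)
    for path, text in files.items():
        for digest, location in window_hashes(path, text, min_lines):
            buckets[digest].append(location)
    return {d: locs for d, locs in buckets.items() if len(locs) >= 2}
```

A real pipeline would then merge adjacent matching windows into longer blocks and pass the candidates to the fuzzy matching and overlap removal steps.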

Recent Improvements (v0.2.0)

🎯 Sprint 1 Refactoring Complete

  • Reduced cyclomatic complexity from CC̄=4.2 to CC̄=3.5
  • Eliminated all critical functions (CC > 10): 2 → 0
  • Achieved HEALTHY status with no structural issues
  • Dispatch pattern implementation for AST node processing
  • Modular TOON reporter split into 5 focused functions
  • CLI refactoring with helper functions for better maintainability

🚀 Technical Achievements

  • _process_ast_node: CC=14 → CC=6 (dispatch dict pattern)
  • to_toon: CC=12 → CC=8 (5 helper functions)
  • CLI scan(): fan-out=18 → ≤10 (4 helper functions)
  • Code quality: 0 high-complexity functions
  • Tests: 64/64 passing (100% pass rate)

📊 Quality Metrics

  • Health status: ✅ HEALTHY (no critical issues)
  • Cyclomatic complexity: CC̄=3.5 (target ≤ 3.0)
  • Maximum CC: 9 (target ≤ 10 achieved)
  • Code maintainability: Significantly improved
  • Duplication: Minimal (2 groups, 6 lines - acceptable patterns)

🔧 Code Architecture

  • Dispatch tables for extensible AST processing
  • Single responsibility functions throughout codebase
  • Clean separation of concerns in CLI pipeline
  • Type safety improvements with proper annotations
  • Error handling enhanced for edge cases

Integration with wronai Toolchain

reDUP is part of the wronai developer toolchain:

  • code2llm: static analysis engine (health diagnostics, complexity)
  • reDUP: deep duplication analysis and refactoring planning
  • code2docs: automatic documentation generation
  • vallm: validation of LLM-generated code proposals

📈 Typical workflow:

  1. code2llm analyzes the project → .toon diagnostics
  2. redup finds duplicates → duplication.toon.yaml
  3. Feed both to an LLM for targeted refactoring
  4. vallm validates the LLM's proposals before merging

🎯 Why reDUP?

  • LLM-ready: TOON format optimized for LLM consumption
  • Actionable: Generates concrete refactoring suggestions
  • Prioritized: Ranks duplicates by impact and risk
  • Integrated: Works seamlessly with wronai toolchain
  • Fast: Scans 1000+ lines in < 1 second
  • Clean: No syntax warnings, professional output

Development

git clone https://github.com/semcod/redup.git
cd redup
pip install -e ".[dev]"
pytest

License

Licensed under Apache-2.0.

Author

Tom Sapletta
