reDUP

Code duplication analyzer and refactoring planner for LLMs.

reDUP scans codebases for duplicated functions, blocks, and structural patterns — then builds a prioritized refactoring map that LLMs can consume to eliminate redundancy systematically.

Features

  • Exact duplicate detection via SHA-256 block hashing
  • Structural clone detection — same AST shape, different variable names
  • Fuzzy near-duplicate matching via SequenceMatcher / rapidfuzz
  • Function-level analysis using Python AST extraction
  • Impact scoring — prioritizes duplicates by saved_lines × similarity
  • Refactoring planner — generates concrete extract/inline suggestions
  • Three output formats: JSON (tooling), YAML (humans), TOON (LLMs)
  • CLI with typer + rich for interactive use
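To make the difference between exact and structural detection concrete, here is a minimal sketch using only the standard library. It is not reDUP's implementation: the `_Normalizer` class and both fingerprint functions are hypothetical names, and the normalization rule (replace every identifier with a placeholder, then hash the AST dump) is an assumption about what "same AST shape, different variable names" could mean in practice.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Hypothetical helper: erase identifiers so only the tree shape remains."""

    def visit_Name(self, node: ast.Name) -> ast.AST:
        # Variables (and called function names) all become "_".
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node: ast.arg) -> ast.AST:
        # Parameter names and annotations are dropped.
        node.arg = "_"
        node.annotation = None
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = "_"
        self.generic_visit(node)
        return node


def exact_fingerprint(source: str) -> str:
    """SHA-256 of the raw text: any renamed variable changes the hash."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def structural_fingerprint(source: str) -> str:
    """SHA-256 of the identifier-free AST dump: renames do not change the hash."""
    tree = _Normalizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()
```

Two functions that differ only in naming get different exact fingerprints but the same structural fingerprint, while changing an operator (for example `*` to `+`) changes both.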

Installation

pip install redup

With optional dependencies:

pip install "redup[all]"     # Everything
pip install "redup[fuzzy]"   # rapidfuzz for better similarity matching
pip install "redup[ast]"     # tree-sitter for multi-language AST
pip install "redup[lsh]"     # datasketch for LSH near-duplicate detection
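The optional extras suggest a fallback pattern: use the faster library when it is installed, otherwise rely on the standard library. The sketch below shows one common way to write that for similarity scoring. The `similarity` function is a hypothetical name, not part of reDUP's API, and the two backends can return slightly different scores for the same pair of strings.

```python
from difflib import SequenceMatcher

try:
    # Optional accelerator; only available if rapidfuzz is installed.
    from rapidfuzz import fuzz as _fuzz
except ImportError:
    _fuzz = None


def similarity(a: str, b: str) -> float:
    """Return a similarity score in the range 0.0 to 1.0 (hypothetical helper)."""
    if _fuzz is not None:
        # rapidfuzz reports a percentage in the range 0 to 100.
        return _fuzz.ratio(a, b) / 100.0
    # Standard-library fallback; already in the range 0.0 to 1.0.
    return SequenceMatcher(None, a, b).ratio()
```

A threshold such as `--min-sim 0.9` would then be compared against this normalized score, assuming the CLI value uses the same 0 to 1 scale.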

Quick Start

CLI

# Scan current directory, output TOON to stdout
redup scan .

# Scan ./src and write JSON output under ./reports/
redup scan ./src --format json --output ./reports/

# Scan and write all formats (JSON, YAML, TOON)
redup scan . --format all --output ./redup_output/

# Only function-level duplicates (faster)
redup scan . --functions-only

# Custom thresholds
redup scan . --min-lines 5 --min-sim 0.9

# Show installed optional dependencies
redup info

Python API

from pathlib import Path
from redup import ScanConfig, analyze
from redup.reporters.toon_reporter import to_toon
from redup.reporters.json_reporter import to_json

config = ScanConfig(
    root=Path("./my_project"),
    extensions=[".py"],
    min_block_lines=3,
    min_similarity=0.85,
)

result = analyze(config=config, function_level_only=True)

print(f"Found {result.total_groups} duplicate groups")
print(f"Lines recoverable: {result.total_saved_lines}")

# For LLM consumption
print(to_toon(result))

# For tooling / CI
Path("duplication.json").write_text(to_json(result))

Output Formats

TOON (LLM-optimized)

# redup/duplication | 3 groups | 12f 4200L | 2026-03-22

SUMMARY:
  files_scanned: 12
  total_lines:   4200
  dup_groups:    3
  saved_lines:   84

DUPLICATES[3] (ranked by impact):
  [E0001] !! EXAC  calculate_tax  L=8 N=3 saved=16 sim=1.00
      billing.py:1-8  (calculate_tax)
      shipping.py:1-8  (calculate_tax)
      returns.py:1-8  (calculate_tax)

REFACTOR[1] (ranked by priority):
  [1] ○ extract_function   → utils/calculate_tax.py
      WHY: 3 occurrences of 8-line block across 3 files — saves 16 lines
      FILES: billing.py, shipping.py, returns.py
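The numbers in the sample above can be reproduced with a small sketch. The `saved_lines × similarity` impact score comes from the feature list; the formula `saved_lines = (N - 1) × L` is an assumption inferred from the sample (`L=8 N=3 saved=16`), on the reasoning that one copy must remain after extraction. The `Group` class and `rank` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """Hypothetical stand-in for one duplicate group."""

    name: str
    lines: int          # L: length of the duplicated block
    occurrences: int    # N: number of copies found
    similarity: float   # sim: 1.0 for exact duplicates

    @property
    def saved_lines(self) -> int:
        # Assumed formula: every copy except one can be removed.
        return (self.occurrences - 1) * self.lines

    @property
    def impact(self) -> float:
        # Impact score as described in the feature list.
        return self.saved_lines * self.similarity


def rank(groups):
    """Order groups from highest to lowest impact."""
    return sorted(groups, key=lambda g: g.impact, reverse=True)
```

Under these assumptions, an exact 8-line block with 3 copies scores 16, and a longer but less similar near-duplicate can still outrank it if its weighted saving is larger.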

JSON (machine-readable)

{
  "summary": {
    "total_groups": 3,
    "total_saved_lines": 84
  },
  "groups": [
    {
      "id": "E0001",
      "type": "exact",
      "normalized_name": "calculate_tax",
      "fragments": [
        {"file": "billing.py", "line_start": 1, "line_end": 8},
        {"file": "shipping.py", "line_start": 1, "line_end": 8},
        {"file": "returns.py", "line_start": 1, "line_end": 8}
      ],
      "saved_lines_potential": 16
    }
  ],
  "refactor_suggestions": [
    {
      "priority": 1,
      "action": "extract_function",
      "new_module": "utils/calculate_tax.py",
      "risk_level": "low"
    }
  ]
}
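Because the JSON report is meant for tooling, a CI job could read it and fail the build when duplication exceeds a budget. The sketch below relies only on the keys visible in the sample above (`summary.total_saved_lines`, `groups[].id`, `normalized_name`, `fragments[].file`, `saved_lines_potential`); the function names and the idea of a line budget are illustrative assumptions, not part of reDUP.

```python
import json


def load_report(text: str) -> dict:
    """Parse the JSON report text into a dictionary."""
    return json.loads(text)


def over_budget(report: dict, max_saved_lines: int) -> bool:
    """True if the report shows more recoverable lines than the budget allows."""
    return report["summary"]["total_saved_lines"] > max_saved_lines


def describe_groups(report: dict) -> list[str]:
    """One human-readable line per duplicate group, for CI logs."""
    lines = []
    for group in report.get("groups", []):
        files = ", ".join(frag["file"] for frag in group["fragments"])
        lines.append(
            f'{group["id"]} {group["normalized_name"]}: '
            f'{group["saved_lines_potential"]} lines recoverable ({files})'
        )
    return lines
```

In a pipeline, this could be fed the contents of `duplication.json` from the Python API example, printing `describe_groups(...)` and exiting with a non-zero status when `over_budget(...)` is true.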

Architecture

src/redup/
├── __init__.py            # Public API
├── __main__.py            # python -m redup
├── core/
│   ├── models.py          # Pydantic data models
│   ├── scanner.py         # File discovery + block extraction
│   ├── hasher.py          # SHA-256 / structural fingerprinting
│   ├── matcher.py         # Fuzzy similarity comparison
│   ├── planner.py         # Refactoring suggestion generator
│   └── pipeline.py        # Orchestrator: scan → hash → match → plan
├── reporters/
│   ├── json_reporter.py   # JSON output
│   ├── yaml_reporter.py   # YAML output
│   └── toon_reporter.py   # TOON output (LLM-optimized)
└── cli_app/
    └── main.py            # Typer CLI

Analysis Pipeline

1. SCAN      Walk project, read files, extract function-level + sliding-window blocks
2. HASH      Generate exact (SHA-256) and structural (normalized AST) fingerprints
3. GROUP     Bucket by hash, keep only groups with 2+ blocks from different locations
4. MATCH     Verify candidates with fuzzy similarity (SequenceMatcher / rapidfuzz)
5. DEDUP     Remove overlapping groups (keep highest-impact)
6. PLAN      Generate prioritized refactoring suggestions with risk assessment
7. REPORT    Export to JSON / YAML / TOON
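Steps 1 to 3 can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the code in `scanner.py` or `hasher.py`: the function names are hypothetical, files are passed in as an in-memory mapping, and lines are stripped of surrounding whitespace before hashing (an assumption about normalization).

```python
import hashlib
from collections import defaultdict


def sliding_blocks(path, text, window):
    """Yield (path, start_line, block_text) for each window of `window` lines."""
    lines = [line.strip() for line in text.splitlines()]
    for i in range(len(lines) - window + 1):
        chunk = lines[i:i + window]
        if all(chunk):  # simplification: skip windows that contain blank lines
            yield path, i + 1, "\n".join(chunk)


def group_exact(files, window=3):
    """Bucket blocks by SHA-256 digest and keep buckets with 2+ locations.

    `files` maps a path to its source text. The result maps each digest to a
    list of (path, first_line, last_line) tuples.
    """
    buckets = defaultdict(list)
    for path, text in files.items():
        for p, start, block in sliding_blocks(path, text, window):
            digest = hashlib.sha256(block.encode("utf-8")).hexdigest()
            buckets[digest].append((p, start, start + window - 1))
    return {h: locs for h, locs in buckets.items() if len(locs) >= 2}
```

Overlapping windows inside a long duplicated region each produce their own bucket here, which is why a later step that removes overlapping groups (step 5) is needed.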

Integration with wronai Toolchain

reDUP is part of the wronai developer toolchain:

  • code2llm — static analysis engine (health diagnostics, complexity)
  • reDUP — deep duplication analysis and refactoring planning
  • code2docs — automatic documentation generation
  • vallm — validation of LLM-generated code proposals

Typical workflow:

  1. code2llm analyzes the project → .toon diagnostics
  2. redup finds duplicates → duplication.toon
  3. Feed both to an LLM for targeted refactoring
  4. vallm validates the LLM's proposals before merging

Development

git clone https://github.com/semcod/redup.git
cd redup
pip install -e ".[dev]"
pytest

License

Apache License 2.0 - see LICENSE for details.

Author

Created by Tom Sapletta - tom@sapletta.com
