Automated code duplicate detection and refactoring library

Recator 🔧

Recator - Automated code duplicate detection and refactoring library for Python

📋 Overview

Recator is a Python library that detects and refactors duplicated code across multiple programming languages. It relies on simple heuristics rather than LLMs, runs on CPU only, and supports Python, JavaScript, Java, C/C++, and more.

✨ Features

  • Multi-language Support: Python, JavaScript, Java, C/C++, C#, PHP, Ruby, Go, Rust, Kotlin, Swift
  • Multiple Detection Algorithms:
    • Exact duplicate detection (hash-based)
    • Token-based similarity detection
    • Fuzzy matching using sequence comparison
    • Structural similarity detection (same structure, different names)
  • Automated Refactoring Strategies:
    • Extract Method - for duplicate code blocks
    • Extract Class - for structural duplicates
    • Extract Module - for file-level duplicates
    • Parameterize - for similar code with differences
  • Safe Mode: Creates .refactored versions without modifying originals
  • CPU Efficient: Uses simple heuristics, no GPU or LLM required
  • Configurable: Adjustable thresholds and parameters

🚀 Installation

# Basic installation
pip install recator

# Or install from source
git clone https://github.com/pyfunc/recator.git
cd recator
pip install -e .

# Install with development dependencies
pip install -e ".[dev]"

# Install with advanced features
pip install -e ".[advanced]"

📖 Usage

Command Line Interface

# Basic analysis
recator /path/to/project

# Verbose analysis with custom parameters
recator /path/to/project -v --min-lines 6 --threshold 0.9

# Preview refactoring suggestions
recator /path/to/project --refactor

# Show duplicate code snippets (first N) during analysis
recator /path/to/project --analyze --show-snippets --max-show 5 -v

# Interactive selection of duplicates to refactor (dry-run preview)
recator /path/to/project --refactor --interactive --dry-run

# Refactor on demand by selecting duplicates (IDs or ranges)
# Example selects IDs 1, 3, 4, 5
recator /path/to/project --refactor --select 1,3-5 --dry-run

# Apply refactoring (creates .refactored files)
recator /path/to/project --refactor --apply

# Analyze specific languages only
recator /path/to/project --languages python javascript

# Exclude patterns
recator /path/to/project --exclude "*.test.js" "build/*"

# Save results to JSON
recator /path/to/project --output results.json

# Show duplicate code snippets (once per duplicate) and all occurrences
recator /path/to/project --analyze --show-snippets --max-show 0 --max-blocks 0 -v

# Suppress overlapping/near-identical groups (default) vs show all
recator /path/to/project --analyze -v                         # suppressed
recator /path/to/project --analyze -v --no-suppress-duplicates  # no suppression

# Control snippet preview size in verbose mode
recator /path/to/project --analyze -v --snippet-lines 12
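
The --exclude option shown above takes glob-style patterns. How Recator matches them internally is not documented here; the sketch below assumes shell-style globbing via the standard library's fnmatch module, and the helper name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase

def is_excluded(rel_path, patterns):
    """Return True if a project-relative path matches any exclude pattern.

    Assumption: patterns behave like shell globs, where `*` also
    matches path separators (this is how fnmatch treats it).
    """
    # Normalize Windows separators so patterns can always use "/".
    rel_path = rel_path.replace("\\", "/")
    return any(fnmatchcase(rel_path, pattern) for pattern in patterns)

patterns = ["*.test.js", "build/*"]
print(is_excluded("src/app.test.js", patterns))    # test file, matched by *.test.js
print(is_excluded("build/out/main.js", patterns))  # anything under build/
print(is_excluded("src/app.js", patterns))         # not excluded
```

Quoting the patterns on the command line, as in the example above, keeps the shell from expanding them before Recator sees them.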

Python API

from recator import Recator

# Initialize with project path
recator = Recator('/path/to/project')

# Analyze for duplicates
results = recator.analyze()
print(f"Found {results['duplicates_found']} duplicates")

# Get detailed duplicate information
for duplicate in results['duplicates']:
    print(f"Type: {duplicate['type']}")
    print(f"Files: {duplicate.get('files', [])}")
    print(f"Confidence: {duplicate.get('confidence', 0)}")

# Preview refactoring
preview = recator.refactor_duplicates(dry_run=True)
print(f"Estimated LOC reduction: {preview['estimated_loc_reduction']}")

# Apply refactoring
refactoring_results = recator.refactor_duplicates(dry_run=False)
print(f"Modified {len(refactoring_results['modified_files'])} files")

Custom Configuration

from recator import Recator

config = {
    'min_lines': 5,                    # Minimum lines for duplicate
    'min_tokens': 40,                  # Minimum tokens for duplicate
    'similarity_threshold': 0.90,      # Similarity threshold (0-1)
    'languages': ['python', 'java'],   # Languages to analyze
    'exclude_patterns': ['*.min.js'],  # Patterns to exclude
    'safe_mode': True,                 # Don't modify originals
}

recator = Recator('/path/to/project', config)
results = recator.analyze()

🔍 Detection Algorithms

1. Exact Duplicate Detection

Finds identical code blocks using hash comparison.
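
As a rough illustration of the idea (not Recator's actual implementation), the sketch below slides a fixed-size window over the non-blank lines of each file and groups windows with identical content. It uses a plain dict, which hashes the key internally; the function name find_exact_blocks and its parameters are hypothetical.

```python
from collections import defaultdict

def find_exact_blocks(files, min_lines=4):
    """Group identical windows of `min_lines` consecutive non-blank lines.

    `files` maps a file name to its source text. Returns a list of groups,
    each a list of (file_name, start_line) pairs for one repeated window.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Keep original 1-based line numbers; ignore indentation and blank lines.
        lines = [(no, ln.strip())
                 for no, ln in enumerate(text.splitlines(), 1) if ln.strip()]
        for k in range(len(lines) - min_lines + 1):
            window = tuple(ln for _, ln in lines[k:k + min_lines])
            seen[window].append((name, lines[k][0]))
    # Only windows that occur more than once are duplicates.
    return [locations for locations in seen.values() if len(locations) > 1]

files = {
    "a.py": "x = 1\ny = 2\nz = 3\nprint(x)\n",
    "b.py": "import os\n\nx = 1\n  y = 2\nz = 3\nprint(x)\n",
}
print(find_exact_blocks(files))
```

A real detector would also merge overlapping windows into maximal blocks; this sketch reports each window separately.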

2. Token-based Detection

Compares token sequences to find duplicates that may have different formatting.
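
One way to picture this, under the assumption of a very simple regex tokenizer (Recator's real tokenizer may differ), is to compare overlapping token n-grams so that whitespace and line breaks stop mattering. The names tokenize and token_similarity are hypothetical.

```python
import re

# Identifiers, numbers, or any single non-space character (operators, brackets).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")

def tokenize(code):
    """Split source text into a flat list of tokens, ignoring whitespace."""
    return TOKEN_RE.findall(code)

def token_similarity(a, b, n=3):
    """Jaccard similarity over n-grams of tokens, in the range 0.0 to 1.0."""
    def shingles(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    sa, sb = shingles(tokenize(a)), shingles(tokenize(b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Same tokens, different formatting: treated as identical.
print(token_similarity("total = price * qty", "total=price*qty"))
# Different operator and operand: only partially similar.
print(token_similarity("total = price * qty", "total = price + tax"))
```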

3. Fuzzy Matching

Uses sequence matching algorithms to find similar (but not identical) code.
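
The standard library's difflib.SequenceMatcher is one such sequence matching algorithm. Whether Recator uses it is an assumption; the sketch below only shows how a similarity ratio like the one behind --threshold could be computed, and the function name fuzzy_similarity is hypothetical.

```python
from difflib import SequenceMatcher

def fuzzy_similarity(a, b):
    """Line-level similarity of two code fragments, from 0.0 to 1.0."""
    a_lines = [ln.strip() for ln in a.splitlines() if ln.strip()]
    b_lines = [ln.strip() for ln in b.splitlines() if ln.strip()]
    # ratio() is 2 * matched_lines / (len(a_lines) + len(b_lines)).
    return SequenceMatcher(None, a_lines, b_lines).ratio()

block_a = """
resp = session.get(url)
resp.raise_for_status()
data = resp.json()
return data
"""
block_b = """
resp = session.get(url)
resp.raise_for_status()
data = resp.text
return data
"""
# Three of four lines match, so the ratio is 2 * 3 / 8.
print(fuzzy_similarity(block_a, block_b))  # 0.75
```

With a threshold of 0.9 this pair would not be reported; lowering the threshold to 0.7 would include it.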

4. Structural Detection

Identifies code with the same structure but different variable/function names.
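
A minimal sketch of this idea, assuming Python input and a regex-based normalizer (not necessarily what Recator does): every non-keyword identifier is replaced by a placeholder numbered in order of first appearance, so two fragments with the same shape produce the same signature. The name structural_signature is hypothetical.

```python
import keyword
import re

IDENT_RE = re.compile(r"[A-Za-z_]\w*")

def structural_signature(code):
    """Rename identifiers to ID0, ID1, ... in order of first appearance.

    Limitation of this sketch: words inside string literals and comments
    are renamed too, because no real parsing is done.
    """
    names = {}

    def rename(match):
        word = match.group(0)
        if keyword.iskeyword(word):
            return word  # keep language keywords such as def/return
        return names.setdefault(word, f"ID{len(names)}")

    normalized = IDENT_RE.sub(rename, code)
    return " ".join(normalized.split())  # collapse whitespace

a = "def area(w, h):\n    return w * h"
b = "def volume(x, y):\n    return x * y"
print(structural_signature(a))
print(structural_signature(a) == structural_signature(b))
```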

🛠️ Refactoring Strategies

Extract Method

# Before: Duplicate blocks in multiple places
def process_user(user):
    # validation block (duplicate)
    if not user.email:
        raise ValueError("Email required")
    if "@" not in user.email:
        raise ValueError("Invalid email")
    # ... processing

def update_user(user):
    # validation block (duplicate)
    if not user.email:
        raise ValueError("Email required")
    if "@" not in user.email:
        raise ValueError("Invalid email")
    # ... updating

# After: Extracted method
def validate_user_email(user):
    if not user.email:
        raise ValueError("Email required")
    if "@" not in user.email:
        raise ValueError("Invalid email")

def process_user(user):
    validate_user_email(user)
    # ... processing

def update_user(user):
    validate_user_email(user)
    # ... updating

Extract Module

Creates shared modules for file-level duplicates.

Parameterize

Converts similar code with differences into parameterized functions.
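
A hand-written before/after illustration in the style of the Extract Method example; the function names are hypothetical and the output Recator actually generates may look different.

```python
# Before: two near-identical functions that differ only in a constant
def apply_student_discount(price):
    return round(price * (1 - 0.10), 2)

def apply_senior_discount(price):
    return round(price * (1 - 0.15), 2)

# After: the differing constant becomes a parameter
def apply_discount(price, rate):
    return round(price * (1 - rate), 2)

# Call sites pass the value that used to be hard-coded
print(apply_discount(80.0, 0.10) == apply_student_discount(80.0))
print(apply_discount(80.0, 0.15) == apply_senior_discount(80.0))
```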

📊 Example Output

╔═══════════════════════════════════════════╗
║        RECATOR - Code Refactoring Bot     ║
║     Eliminate Code Duplicates with Ease   ║
╚═══════════════════════════════════════════╝

🔍 Initializing Recator for: /home/user/project
🔎 Analyzing project for duplicates...

📊 Analysis Results:
  • Total files scanned: 45
  • Files parsed: 42
  • Duplicates found: 8

📋 Duplicate Details:
  [1] Type: exact_block
      Files: utils.py, helpers.py, validation.py
      Confidence: 100%
      Lines: 12

  [2] Type: fuzzy
      Files: api_client.py, http_handler.py
      Confidence: 87%
      Lines: 25

🔧 Refactoring Preview:
  • Total actions: 8
  • Estimated LOC reduction: 147
  • Affected files: 12

✅ Done!

🔧 Configuration File

Create a recator.json configuration file:

{
  "min_lines": 4,
  "min_tokens": 30,
  "similarity_threshold": 0.85,
  "languages": ["python", "javascript", "java"],
  "exclude_patterns": [
    "*.min.js",
    "*.min.css",
    "node_modules/*",
    ".git/*",
    "build/*",
    "dist/*"
  ],
  "safe_mode": true
}

Use with: recator /path/to/project --config recator.json
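
The same file can also be loaded by hand and passed to the Python API. The sketch below is an assumption about how that might look: the helper load_config and the DEFAULTS dictionary are hypothetical (the default values are simply copied from the sample file above), and only the 0 to 1 range of similarity_threshold comes from this README.

```python
import json
from pathlib import Path

# Hypothetical fallback values, copied from the sample recator.json above.
DEFAULTS = {
    "min_lines": 4,
    "min_tokens": 30,
    "similarity_threshold": 0.85,
    "safe_mode": True,
}

def load_config(path, defaults=DEFAULTS):
    """Read a JSON config file and fill in missing keys from `defaults`."""
    user = json.loads(Path(path).read_text(encoding="utf-8"))
    merged = {**defaults, **user}  # values from the file win
    if not 0 <= merged["similarity_threshold"] <= 1:
        raise ValueError("similarity_threshold must be between 0 and 1")
    return merged

# Usage, following the Custom Configuration section:
#   from recator import Recator
#   recator = Recator('/path/to/project', load_config('recator.json'))
```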

🧪 Examples

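See examples/1/ for a minimal TypeScript example with intentionally duplicated blocks. TypeScript is grouped with JavaScript in the supported languages list, which is why the commands pass --languages javascript: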

# From repository root
recator examples/1 --analyze --languages javascript \
  --min-lines 7 --min-tokens 15 \
  --show-snippets --max-show 0 --max-blocks 0 -v

# Interactive refactor preview for selected duplicates
recator examples/1 --refactor --interactive --dry-run --show-snippets

# Compare suppression behavior
recator examples/1 --analyze --languages javascript --min-lines 7 -v
recator examples/1 --analyze --languages javascript --min-lines 7 -v --no-suppress-duplicates

🧩 Duplicate Snippet Display & On-demand Refactor

  • Show snippets: Use --show-snippets with --analyze to print representative code blocks for duplicates (e.g., exact blocks or token previews). Control output size with --max-show.
  • On-demand refactor: Use --interactive to choose duplicates interactively, or --select 1,3-5 to pass IDs directly. Combine with --refactor and --dry-run for a safe preview. Use --apply --no-dry-run to apply changes where supported.
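
For reference, a selection string such as 1,3-5 expands to the IDs 1, 3, 4 and 5. The sketch below shows one way to parse that syntax; parse_selection is a hypothetical helper, not Recator's own parser.

```python
def parse_selection(spec):
    """Expand a string like "1,3-5" into a sorted list of unique IDs."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"invalid range: {part}")
            ids.update(range(low, high + 1))  # ranges are inclusive
        else:
            ids.add(int(part))
    return sorted(ids)

print(parse_selection("1,3-5"))  # [1, 3, 4, 5]
```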

Tip: Start with strict thresholds (a higher --min-lines and --threshold) and relax them gradually to avoid excessive output on large codebases.

🧱 Portability Notes

Recator uses a pure-Python, stable 64-bit hash (FNV-1a) to identify identical fragments. This avoids relying on OpenSSL-backed hashlib algorithms, so it works even in environments where md5/sha* are unavailable.
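
For readers unfamiliar with FNV-1a, a minimal pure-Python version of the 64-bit variant is sketched below using the published offset basis and prime. It illustrates the algorithm only; the function name fnv1a_64 is an assumption and Recator's own code may differ in detail.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF

def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of `data`."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte                       # FNV-1a: XOR first...
        h = (h * FNV64_PRIME) & MASK_64  # ...then multiply, keeping 64 bits
    return h

# Unlike the built-in hash() on strings, the result is the same in every process.
print(hex(fnv1a_64(b"a")))  # 0xaf63dc4c8601ec8c
```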

🏗️ Architecture

recator/
├── __init__.py       # Main Recator class
├── scanner.py        # File scanning and reading
├── analyzer.py       # Code parsing and tokenization
├── detector.py       # Duplicate detection algorithms
├── refactor.py       # Refactoring strategies
└── cli.py            # Command-line interface

📝 Supported Languages

  • Python (.py)
  • JavaScript/TypeScript (.js, .jsx, .ts, .tsx)
  • Java (.java)
  • C/C++ (.c, .cpp, .cc, .cxx, .h, .hpp)
  • C# (.cs)
  • PHP (.php)
  • Ruby (.rb)
  • Go (.go)
  • Rust (.rs)
  • Kotlin (.kt)
  • Swift (.swift)

⚙️ How It Works

  1. Scanning: Traverses project directory to find source files
  2. Parsing: Tokenizes and parses code into analyzable structures
  3. Detection: Applies multiple algorithms to find duplicates
  4. Analysis: Groups and ranks duplicates by confidence
  5. Refactoring: Suggests or applies appropriate refactoring strategies
  6. Output: Generates modified files or preview reports

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the Apache License 2.0.

🙏 Acknowledgments

Built using only the Python standard library for maximum compatibility and efficiency.

📮 Support

For issues and questions, please open an issue on GitHub.


Made with ❤️ for cleaner, more maintainable code
