
AST and CFG-based code clone detector for Python focused on architectural duplication


CodeClone


CodeClone is a Python code clone detector based on normalized ASTs and Control Flow Graphs (CFGs). It discovers architectural duplication and, when run in CI, blocks new copy-paste from entering your codebase.


Why CodeClone

CodeClone focuses on architectural duplication, not text similarity. It detects structural patterns through:

  • Normalized AST analysis — robust to renaming, formatting, and minor refactors
  • Control Flow Graphs — captures execution logic, not just syntax
  • Strict, explainable matching — clear signals, not fuzzy heuristics
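
To illustrate why a normalized AST is robust to renaming and formatting, here is a minimal sketch using only the standard library. It is not CodeClone's actual normalizer: the `_Normalizer` class, the placeholder scheme, and `structural_fingerprint` are hypothetical names chosen for this example.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers with positional placeholders (illustrative only)."""

    def __init__(self) -> None:
        self._names: dict[str, str] = {}

    def _placeholder(self, name: str) -> str:
        # The same original name always maps to the same placeholder.
        return self._names.setdefault(name, f"v{len(self._names)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = "f"  # the function's own name is not structural
        self.generic_visit(node)
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = self._placeholder(node.id)
        return node


def structural_fingerprint(source: str) -> str:
    """Hash the normalized tree of a source snippet."""
    tree = _Normalizer().visit(ast.parse(source))
    # ast.dump omits line/column info by default, so layout does not matter.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()
```

Two functions that differ only in identifier names and blank lines produce the same fingerprint here, while changing an operator produces a different one. This sketch also renames builtins such as `len`, which a real normalizer would likely treat more carefully.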

Unlike token-based tools, CodeClone compares structure and control flow, which makes it well suited to finding:

  • Repeated service/orchestration patterns
  • Duplicated guard/validation blocks
  • Copy-pasted handler logic across modules
  • Recurring internal segments in large functions

Core Capabilities

Three Detection Levels:

  1. Function clones (CFG fingerprint): strong structural signal for cross-layer duplication

  2. Block clones (statement windows): detects repeated local logic patterns

  3. Segment clones (report-only): internal function repetition for explainability; not used for baseline gating
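
The "statement windows" idea behind block clones can be sketched as a sliding window over each function body. This is an assumption-laden illustration, not CodeClone's implementation: the window size of 3, the `block_windows` name, and the lack of identifier normalization are all choices made for brevity.

```python
import ast
import hashlib


def block_windows(source: str, window: int = 3) -> dict[str, list[tuple[str, int, int]]]:
    """Map window hash -> [(function name, start line, end line), ...]."""
    found: dict[str, list[tuple[str, int, int]]] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        body = node.body
        for i in range(len(body) - window + 1):
            chunk = body[i : i + window]
            # ast.dump ignores positions, so only statement structure is hashed.
            text = "\n".join(ast.dump(stmt) for stmt in chunk)
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            end = chunk[-1].end_lineno or chunk[-1].lineno
            found.setdefault(digest, []).append((node.name, chunk[0].lineno, end))
    # Keep only windows that occur more than once.
    return {h: locs for h, locs in found.items() if len(locs) > 1}
```

Calling `block_windows(source)` on a module returns only the windows that repeat, each with the locations where it was seen.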

CI-Ready Features:

  • Deterministic output with stable ordering
  • Reproducible artifacts for audit trails
  • Baseline-driven gating to prevent new duplication
  • Fast incremental analysis with intelligent caching

Installation

pip install codeclone

Requirements: Python 3.10+


Quick Start

Basic Analysis

# Analyze current directory
codeclone .

# Check version
codeclone --version

Generate Reports

codeclone . \
  --html .cache/codeclone/report.html \
  --json .cache/codeclone/report.json \
  --text .cache/codeclone/report.txt

CI Integration

# 1. Generate baseline once (commit to repo)
codeclone . --update-baseline

# 2. Add to CI pipeline
codeclone . --ci

The --ci preset is equivalent to --fail-on-new --no-color --quiet.


Baseline Workflow

Baselines capture the current state of duplication in your codebase. Once committed, they serve as the reference point for CI checks.

Key points (contract-level):

  • Baseline file is versioned (codeclone.baseline.json) and used to classify clones as NEW vs KNOWN.
  • Compatibility is gated by schema_version, fingerprint_version, and python_tag.
  • Baseline trust is gated by meta.generator.name (codeclone) and integrity (payload_sha256).
  • In CI preset (--ci), an untrusted baseline is a contract error (exit 2).

Full contract details: docs/book/06-baseline.md
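
Conceptually, NEW vs KNOWN classification is a set comparison between the clone groups found in the current run and those recorded in the baseline. The sketch below assumes groups are identified by string keys; `classify` and `should_fail` are hypothetical helpers, not part of CodeClone's API.

```python
def classify(current: set[str], baseline: set[str]) -> dict[str, list[str]]:
    """Split current clone-group keys into NEW and KNOWN relative to a baseline."""
    return {
        "new": sorted(current - baseline),    # not recorded in the baseline
        "known": sorted(current & baseline),  # already accepted duplication
    }


def should_fail(split: dict[str, list[str]]) -> bool:
    # Mirrors the idea behind --fail-on-new: any NEW group fails the gate.
    return bool(split["new"])
```

Groups that exist only in the baseline (duplication that has since been removed) do not appear in either list, so cleaning up clones never fails the gate in this model.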


Exit Codes

CodeClone uses a deterministic exit code contract:

Code  Meaning
0     Success: run completed without gating failures
2     Contract error: baseline missing/untrusted, invalid output extensions, incompatible versions, unreadable source files in CI/gating
3     Gating failure: new clones detected or threshold exceeded
5     Internal error: unexpected exception

Priority: Contract errors (2) override gating failures (3) when both occur.

Full contract details: docs/book/03-contracts-exit-codes.md
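
If you drive CodeClone from a Python script rather than directly from the CI config, the exit code contract above can be mapped to readable outcomes. This is a minimal sketch: `describe_exit_code` and `run_gate` are hypothetical helper names, and the only CLI behavior assumed is the `--ci` preset and the exit codes listed above.

```python
import subprocess

# Exit codes as documented in the table above.
EXIT_MEANINGS = {
    0: "success",
    2: "contract error",
    3: "gating failure",
    5: "internal error",
}


def describe_exit_code(code: int) -> str:
    """Translate a CodeClone exit code into a short label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(path: str = ".") -> int:
    """Run `codeclone <path> --ci` and print a human-readable outcome."""
    result = subprocess.run(["codeclone", path, "--ci"], check=False)
    print(f"codeclone: {describe_exit_code(result.returncode)}")
    return result.returncode
```

A wrapper like this can treat exit code 2 differently from 3, for example by asking the author to regenerate the baseline instead of reporting new clones.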

Debug Support:

# Show detailed error information
codeclone . --debug

# Or via environment variable
CODECLONE_DEBUG=1 codeclone .

Reports

Supported Formats

  • HTML (--html) — Interactive web report with filtering
  • JSON (--json) — Machine-readable structured data
  • Text (--text) — Plain text summary

Report Schema (JSON v1.1)

The JSON report uses a compact deterministic layout:

  • Top-level: meta, files, groups, groups_split, group_item_layout
  • Optional top-level: facts
  • groups_split provides explicit NEW / KNOWN separation per section
  • meta.groups_counts provides deterministic per-section aggregates
  • meta follows a shared canonical contract across HTML/JSON/TXT

Canonical report contract: docs/book/08-report.md

Minimal shape (v1.1):

{
  "meta": {
    "report_schema_version": "1.1",
    "codeclone_version": "1.4.0",
    "python_version": "3.13",
    "python_tag": "cp313",
    "baseline_path": "/path/to/codeclone.baseline.json",
    "baseline_fingerprint_version": "1",
    "baseline_schema_version": "1.0",
    "baseline_python_tag": "cp313",
    "baseline_generator_name": "codeclone",
    "baseline_generator_version": "1.4.0",
    "baseline_payload_sha256": "<sha256>",
    "baseline_payload_sha256_verified": true,
    "baseline_loaded": true,
    "baseline_status": "ok",
    "cache_path": "/path/to/.cache/codeclone/cache.json",
    "cache_used": true,
    "cache_status": "ok",
    "cache_schema_version": "1.3",
    "files_skipped_source_io": 0,
    "groups_counts": {
      "functions": {
        "total": 0,
        "new": 0,
        "known": 0
      },
      "blocks": {
        "total": 0,
        "new": 0,
        "known": 0
      },
      "segments": {
        "total": 0,
        "new": 0,
        "known": 0
      }
    }
  },
  "files": [],
  "groups": {
    "functions": {},
    "blocks": {},
    "segments": {}
  },
  "groups_split": {
    "functions": {
      "new": [],
      "known": []
    },
    "blocks": {
      "new": [],
      "known": []
    },
    "segments": {
      "new": [],
      "known": []
    }
  },
  "group_item_layout": {
    "functions": [
      "file_i",
      "qualname",
      "start",
      "end",
      "loc",
      "stmt_count",
      "fingerprint",
      "loc_bucket"
    ],
    "blocks": [
      "file_i",
      "qualname",
      "start",
      "end",
      "size"
    ],
    "segments": [
      "file_i",
      "qualname",
      "start",
      "end",
      "size",
      "segment_hash",
      "segment_sig"
    ]
  },
  "facts": {
    "blocks": {}
  }
}
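
A short sketch of consuming this report from Python. It relies only on keys shown in the shape above, plus one assumption: that each group item is stored as an array whose positions follow `group_item_layout`. The function names are hypothetical.

```python
import json
from typing import Any


def load_report(path: str) -> dict[str, Any]:
    """Read a JSON report written with --json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(report: dict[str, Any]) -> dict[str, int]:
    """Return the number of NEW groups per section from meta.groups_counts."""
    counts = report["meta"]["groups_counts"]
    return {section: counts[section]["new"] for section in sorted(counts)}


def decode_item(report: dict[str, Any], section: str, item: list[Any]) -> dict[str, Any]:
    """Pair a positional item with the field names from group_item_layout.

    Assumption: items are arrays in layout order.
    """
    fields = report["group_item_layout"][section]
    return dict(zip(fields, item))
```

For example, `summarize(load_report(".cache/codeclone/report.json"))` gives a per-section count of new groups that a CI step could print or post as a comment.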

Cache

The cache is only an optimization layer and is never a source of truth.

  • Default path: <root>/.cache/codeclone/cache.json
  • Schema version: v1.3
  • Compatibility includes analysis profile (min_loc, min_stmt)
  • An invalid or oversized cache is ignored with a warning and rebuilt (fail-open)

Full contract details: docs/book/07-cache.md
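
The fail-open behavior described above can be sketched as a loader that never raises on a bad cache. This is an illustration only: the `schema_version` key name inside the cache file and the `MAX_CACHE_BYTES` limit are assumptions, not CodeClone's actual values.

```python
import json
import os
import warnings
from typing import Any

CACHE_SCHEMA_VERSION = "1.3"        # schema version listed above
MAX_CACHE_BYTES = 50 * 1024 * 1024  # hypothetical size limit for this sketch


def load_cache(path: str) -> dict[str, Any]:
    """Return cached data, or an empty cache if anything looks wrong."""
    try:
        if os.path.getsize(path) > MAX_CACHE_BYTES:
            raise ValueError("cache file too large")
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        # "schema_version" is an assumed key name for the stored version.
        if not isinstance(data, dict) or data.get("schema_version") != CACHE_SCHEMA_VERSION:
            raise ValueError("incompatible cache schema")
        return data
    except FileNotFoundError:
        return {}  # first run: nothing to warn about
    except (OSError, ValueError) as exc:
        # Fail open: warn, ignore the cache, and let the caller rebuild it.
        warnings.warn(f"ignoring cache at {path}: {exc}")
        return {}
```

Because the loader returns an empty cache on any problem, a corrupted file costs analysis time but can never change the results.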


Pre-commit Integration

repos:
  - repo: local
    hooks:
      - id: codeclone
        name: CodeClone
        entry: codeclone
        language: system
        pass_filenames: false
        args: [ ".", "--ci" ]
        types: [ python ]

What CodeClone Is (and Is Not)

CodeClone Is

  • A structural clone detector for Python
  • A CI guard against new duplication
  • A deterministic analysis tool with auditable outputs

CodeClone Is Not

  • A linter or code formatter
  • A semantic equivalence prover
  • A runtime execution analyzer

How It Works

High-level Pipeline:

  1. Parse — Python source → AST
  2. Normalize — AST → canonical structure
  3. CFG Construction — per-function control flow graph
  4. Fingerprinting — stable hash computation
  5. Grouping — function/block/segment clone groups
  6. Determinism — stable ordering for reproducibility
  7. Baseline Comparison — new vs known clones (when requested)
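
Steps 5 and 6 can be sketched as bucketing units by fingerprint and then sorting both the groups and their members. The `Unit` record and `group_clones` function are hypothetical and only approximate the idea; they are not CodeClone's internal data model.

```python
from collections import defaultdict
from typing import NamedTuple


class Unit(NamedTuple):
    """One analyzed function or block (illustrative fields only)."""
    path: str
    qualname: str
    start: int
    fingerprint: str


def group_clones(units: list[Unit]) -> dict[str, list[Unit]]:
    """Group units by fingerprint with a stable, input-order-independent result."""
    buckets: dict[str, list[Unit]] = defaultdict(list)
    for unit in units:
        buckets[unit.fingerprint].append(unit)
    # A clone group needs at least two members. Sorting keys and members means
    # the output does not depend on file-system traversal order.
    return {
        fp: sorted(members)
        for fp, members in sorted(buckets.items())
        if len(members) > 1
    }
```

Feeding the same units in a different order yields an identical mapping, which is what makes diffs between two report artifacts meaningful.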
