AST and CFG-based code clone detector for Python focused on architectural duplication
CodeClone
CodeClone is a Python code clone detector based on normalized ASTs and control flow graphs (CFGs).
It finds architectural duplication and, when run in CI, blocks new copy-paste from entering your codebase.
Why CodeClone
CodeClone focuses on architectural duplication, not text similarity. It detects structural patterns through:
- Normalized AST analysis — robust to renaming, formatting, and minor refactors
- Control Flow Graphs — captures execution logic, not just syntax
- Strict, explainable matching — clear signals, not fuzzy heuristics
Unlike token-based tools, CodeClone compares structure and control flow, making it ideal for finding:
- Repeated service/orchestration patterns
- Duplicated guard/validation blocks
- Copy-pasted handler logic across modules
- Recurring internal segments in large functions
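To make "robust to renaming" concrete, here is a minimal sketch built on the standard library `ast` module. It is not CodeClone's actual normalizer: the `_Normalizer` class and `normalized_dump` helper are hypothetical names, and the rules (blank out identifiers, keep only the type of each constant) are assumptions chosen for illustration.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace names and constants with placeholders (illustrative rules only)."""

    def visit_Name(self, node: ast.Name) -> ast.AST:
        # Keep the load/store context, drop the identifier itself.
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = "_"
        node.annotation = None
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_Constant(self, node: ast.Constant) -> ast.AST:
        # Keep only the constant's type, not its value.
        placeholder = ast.Constant(value=type(node.value).__name__)
        return ast.copy_location(placeholder, node)


def normalized_dump(source: str) -> str:
    """Return a canonical string for `source` that ignores names and formatting."""
    tree = _Normalizer().visit(ast.parse(source))
    # Line and column attributes are excluded, so layout does not matter.
    return ast.dump(tree, include_attributes=False)
```

Under these rules, two functions that differ only in function, parameter, and variable names produce the same dump, while replacing a `for` loop with a `while` loop produces a different one.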
Core Capabilities
Three Detection Levels:
- Function clones (CFG fingerprint): strong structural signal for cross-layer duplication
- Block clones (statement windows): detects repeated local logic patterns
- Segment clones (report-only): internal function repetition for explainability; not used for baseline gating
CI-Ready Features:
- Deterministic output with stable ordering
- Reproducible artifacts for audit trails
- Baseline-driven gating to prevent new duplication
- Fast incremental analysis backed by an on-disk cache
Installation
pip install codeclone
Requirements: Python 3.10+
Quick Start
Basic Analysis
# Analyze current directory
codeclone .
# Check version
codeclone --version
Generate Reports
codeclone . \
  --html .cache/codeclone/report.html \
  --json .cache/codeclone/report.json \
  --text .cache/codeclone/report.txt
CI Integration
# 1. Generate baseline once (commit to repo)
codeclone . --update-baseline
# 2. Add to CI pipeline
codeclone . --ci
The --ci preset is equivalent to --fail-on-new --no-color --quiet.
Baseline Workflow
Baselines capture the current state of duplication in your codebase. Once committed, they serve as the reference point for CI checks.
Key points (contract-level):
- Baseline file is versioned (codeclone.baseline.json) and used to classify clones as NEW vs KNOWN.
- Compatibility is gated by schema_version, fingerprint_version, and python_tag.
- Baseline trust is gated by meta.generator.name (codeclone) and integrity (payload_sha256).
- In CI preset (--ci), an untrusted baseline is a contract error (exit 2).
Full contract details: docs/book/06-baseline.md
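The trust and NEW/KNOWN rules above can be pictured with a small sketch. The real baseline layout is defined in docs/book/06-baseline.md; everything below is an assumption made for illustration, including the `payload` and `fingerprints` keys, the placement of `payload_sha256` under `meta`, and the canonical JSON encoding used for hashing.

```python
import hashlib
import json


def payload_digest(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (the canonicalization is assumed)."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def is_trusted(baseline: dict) -> bool:
    """Trust requires the expected generator name and a matching payload digest."""
    meta = baseline.get("meta", {})
    generator_ok = meta.get("generator", {}).get("name") == "codeclone"
    digest_ok = meta.get("payload_sha256") == payload_digest(baseline.get("payload", {}))
    return generator_ok and digest_ok


def classify(current: set[str], baseline: dict) -> dict[str, list[str]]:
    """Split current fingerprints into NEW and KNOWN relative to the baseline."""
    known = set(baseline.get("payload", {}).get("fingerprints", []))
    return {"new": sorted(current - known), "known": sorted(current & known)}
```

Editing the payload by hand without updating the digest makes `is_trusted` return False, which mirrors the contract that an untrusted baseline is an error under --ci rather than something silently accepted.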
Exit Codes
CodeClone uses a deterministic exit code contract:
| Code | Meaning |
|---|---|
| 0 | Success: run completed without gating failures |
| 2 | Contract error: baseline missing/untrusted, invalid output extensions, incompatible versions, unreadable source files in CI/gating |
| 3 | Gating failure: new clones detected or threshold exceeded |
| 5 | Internal error: unexpected exception |
Priority: Contract errors (2) override gating failures (3) when both occur.
Full contract details: docs/book/03-contracts-exit-codes.md
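The precedence rule can be expressed in a few lines. The numeric codes come from the table above; the `ExitCode` enum and `resolve_exit_code` function are hypothetical names for this sketch, not part of CodeClone's API.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    CONTRACT_ERROR = 2
    GATING_FAILURE = 3
    INTERNAL_ERROR = 5


def resolve_exit_code(contract_error: bool, gating_failure: bool) -> ExitCode:
    """Pick the exit code for a completed run; contract errors take priority."""
    if contract_error:
        return ExitCode.CONTRACT_ERROR
    if gating_failure:
        return ExitCode.GATING_FAILURE
    return ExitCode.SUCCESS
```

A CI wrapper can rely on the same ordering: exit 2 means the setup (baseline, outputs, versions) needs fixing before a gating result of 3 is meaningful.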
Debug Support:
# Show detailed error information
codeclone . --debug
# Or via environment variable
CODECLONE_DEBUG=1 codeclone .
Reports
Supported Formats
- HTML (--html): interactive web report with filtering
- JSON (--json): machine-readable structured data
- Text (--text): plain text summary
Report Schema (JSON v1.1)
The JSON report uses a compact deterministic layout:
- Top-level: meta, files, groups, groups_split, group_item_layout
- Optional top-level: facts
- groups_split provides explicit NEW / KNOWN separation per section
- meta.groups_counts provides deterministic per-section aggregates
- meta follows a shared canonical contract across HTML/JSON/TXT
Canonical report contract: docs/book/08-report.md
Minimal shape (v1.1):
{
  "meta": {
    "report_schema_version": "1.1",
    "codeclone_version": "1.4.0",
    "python_version": "3.13",
    "python_tag": "cp313",
    "baseline_path": "/path/to/codeclone.baseline.json",
    "baseline_fingerprint_version": "1",
    "baseline_schema_version": "1.0",
    "baseline_python_tag": "cp313",
    "baseline_generator_name": "codeclone",
    "baseline_generator_version": "1.4.0",
    "baseline_payload_sha256": "<sha256>",
    "baseline_payload_sha256_verified": true,
    "baseline_loaded": true,
    "baseline_status": "ok",
    "cache_path": "/path/to/.cache/codeclone/cache.json",
    "cache_used": true,
    "cache_status": "ok",
    "cache_schema_version": "1.3",
    "files_skipped_source_io": 0,
    "groups_counts": {
      "functions": {
        "total": 0,
        "new": 0,
        "known": 0
      },
      "blocks": {
        "total": 0,
        "new": 0,
        "known": 0
      },
      "segments": {
        "total": 0,
        "new": 0,
        "known": 0
      }
    }
  },
  "files": [],
  "groups": {
    "functions": {},
    "blocks": {},
    "segments": {}
  },
  "groups_split": {
    "functions": {
      "new": [],
      "known": []
    },
    "blocks": {
      "new": [],
      "known": []
    },
    "segments": {
      "new": [],
      "known": []
    }
  },
  "group_item_layout": {
    "functions": [
      "file_i",
      "qualname",
      "start",
      "end",
      "loc",
      "stmt_count",
      "fingerprint",
      "loc_bucket"
    ],
    "blocks": [
      "file_i",
      "qualname",
      "start",
      "end",
      "size"
    ],
    "segments": [
      "file_i",
      "qualname",
      "start",
      "end",
      "size",
      "segment_hash",
      "segment_sig"
    ]
  },
  "facts": {
    "blocks": {}
  }
}
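A consumer of the JSON report might expand the compact layout as sketched below. The minimal shape shows empty groups, so the row format here is an assumption: each group is taken to map a key to a list of arrays whose positions follow group_item_layout, with file_i indexing into files. Check docs/book/08-report.md before relying on it. `decode_groups` and `new_counts` are hypothetical helper names.

```python
def decode_groups(report: dict, section: str) -> dict[str, list[dict]]:
    """Expand positional rows of one section into dicts keyed by field name."""
    fields = report["group_item_layout"][section]
    files = report["files"]
    decoded = {}
    for group_key, rows in report["groups"][section].items():
        items = []
        for row in rows:
            item = dict(zip(fields, row))
            # Assumed: file_i is an index into the top-level "files" list.
            item["file"] = files[item.pop("file_i")]
            items.append(item)
        decoded[group_key] = items
    return decoded


def new_counts(report: dict) -> dict[str, int]:
    """Read the per-section NEW totals from meta.groups_counts."""
    return {
        section: counts["new"]
        for section, counts in report["meta"]["groups_counts"].items()
    }
```

`new_counts` relies only on fields visible in the minimal shape, so it is the safer of the two helpers for a quick CI summary.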
Cache
The cache is an optimization layer only; it is never a source of truth.
- Default path: <root>/.cache/codeclone/cache.json
- Schema version: v1.3
- Compatibility includes analysis profile (min_loc, min_stmt)
- Invalid or oversized cache is ignored with warning and rebuilt (fail-open)
Full contract details: docs/book/07-cache.md
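Fail-open loading can be sketched as follows: any problem with the cache yields an empty result and a warning, never an error. The file layout (`schema_version`, `profile`, `entries` keys), the size limit, and the `load_cache` name are all assumptions for illustration; the real contract is in docs/book/07-cache.md.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger(__name__)

CACHE_SCHEMA_VERSION = "1.3"
MAX_CACHE_BYTES = 50 * 1024 * 1024  # hypothetical limit, not CodeClone's actual value


def load_cache(path: Path, profile: dict) -> dict:
    """Return cached entries, or {} whenever the cache cannot be used (fail-open)."""
    try:
        if path.stat().st_size > MAX_CACHE_BYTES:
            log.warning("cache ignored: file too large")
            return {}
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # Covers a missing file, unreadable bytes, and malformed JSON.
        log.warning("cache ignored: %s", exc)
        return {}
    if not isinstance(data, dict) or data.get("schema_version") != CACHE_SCHEMA_VERSION:
        log.warning("cache ignored: schema mismatch")
        return {}
    if data.get("profile") != profile:
        # A different min_loc/min_stmt profile makes old entries incomparable.
        log.warning("cache ignored: analysis profile changed")
        return {}
    entries = data.get("entries")
    return entries if isinstance(entries, dict) else {}
```

Because every failure path returns an empty dict, the caller simply re-analyzes all files and writes a fresh cache.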
Pre-commit Integration
repos:
  - repo: local
    hooks:
      - id: codeclone
        name: CodeClone
        entry: codeclone
        language: system
        pass_filenames: false
        args: [ ".", "--ci" ]
        types: [ python ]
What CodeClone Is (and Is Not)
CodeClone Is
- A structural clone detector for Python
- A CI guard against new duplication
- A deterministic analysis tool with auditable outputs
CodeClone Is Not
- A linter or code formatter
- A semantic equivalence prover
- A runtime execution analyzer
How It Works
High-level Pipeline:
- Parse — Python source → AST
- Normalize — AST → canonical structure
- CFG Construction — per-function control flow graph
- Fingerprinting — stable hash computation
- Grouping — function/block/segment clone groups
- Determinism — stable ordering for reproducibility
- Baseline Comparison — new vs known clones (when requested)
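The fingerprinting, grouping, and determinism steps can be sketched together. This assumes each code unit has already been reduced to a canonical string by the earlier stages; `fingerprint` and `group_clones` are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from collections import defaultdict


def fingerprint(canonical: str) -> str:
    """Stable hash of a canonical form; identical structure gives an identical digest."""
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_clones(units):
    """Group (file, qualname, canonical_form) triples into clone groups.

    The result is sorted by fingerprint, and members are sorted within each
    group, so the output does not depend on the order files were scanned in.
    """
    buckets = defaultdict(list)
    for file, qualname, canonical in units:
        buckets[fingerprint(canonical)].append((file, qualname))
    return [
        (fp, sorted(members))
        for fp, members in sorted(buckets.items())
        if len(members) > 1  # a group needs at least two members to be a clone
    ]
```

Sorting at both levels is what makes repeated runs byte-for-byte comparable, which is the property baseline gating depends on.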
Learn more:
- Architecture: docs/architecture.md
- CFG semantics: docs/cfg.md
Documentation Map
Use this map to pick the right level of detail:
- Contract book (canonical contracts/specs): docs/book/
  - Start here: docs/book/00-intro.md
  - Exit codes and precedence: docs/book/03-contracts-exit-codes.md
  - Baseline contract (schema/trust/integrity): docs/book/06-baseline.md
  - Cache contract (schema/integrity/fail-open): docs/book/07-cache.md
  - Report contract (schema v1.1 + NEW/KNOWN split): docs/book/08-report.md
  - CLI behavior: docs/book/09-cli.md
  - HTML rendering: docs/book/10-html-render.md
  - Determinism policy: docs/book/12-determinism.md
  - Compatibility/versioning rules: docs/book/14-compatibility-and-versioning.md
- Deep dives:
  - Architecture narrative: docs/architecture.md
  - CFG semantics: docs/cfg.md