Code Similarity Engine

Code Abstraction Opportunities Emerge Naturally

Code that performs similar functions can often be abstracted, but traditional tools are not robust to syntactic differences and therefore miss many abstraction opportunities. Code Similarity Engine (CSE) is a simple CLI tool that embeds code semantically, compares the embeddings, and then applies k-clustering to find regions of code with high semantic similarity. Because it compares meaning rather than syntax, and because reranking and minimum similarity thresholds filter out weak matches, it can find abstractable regions reliably and systematically. That alone shows you where you can make abstractions, but not how. Using Qwen3 0.6B locally, CSE goes further and provides a step-by-step guide to abstracting these similarities, reducing code duplication and giving your project clearer design and syntax.
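To make the embed, compare, and cluster idea concrete, here is a minimal sketch in plain Python. It is not CSE's implementation: the toy vectors stand in for real model embeddings, the chunk names and the `group_similar` helper are hypothetical, and simple threshold-linked grouping (union-find) is swapped in for the k-clustering step. The `threshold` and `min_cluster` defaults mirror the documented `--threshold` and `--min-cluster` defaults.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_similar(embeddings, threshold=0.80, min_cluster=2):
    """Group chunk names that are linked by similarity >= threshold.

    Hypothetical helper: uses union-find grouping, not CSE's clustering.
    """
    parent = {name: name for name in embeddings}

    def find(name):
        # Walk up to the group representative, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Link every pair of chunks whose similarity clears the threshold.
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            parent[find(a)] = find(b)

    # Collect groups and drop those smaller than min_cluster.
    clusters = {}
    for name in embeddings:
        clusters.setdefault(find(name), []).append(name)
    return [sorted(c) for c in clusters.values() if len(c) >= min_cluster]


# Toy 3-dimensional "embeddings"; real ones would come from the embedding model.
toy = {
    "load_user": [0.9, 0.1, 0.0],
    "load_order": [0.85, 0.15, 0.05],
    "render_page": [0.0, 0.2, 0.95],
}
print(group_similar(toy))  # [['load_order', 'load_user']]
```

Raising the threshold makes grouping stricter: with these toy vectors, a threshold of 0.999 leaves no cluster at all.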

ALWAYS Private, NO TELEMETRY!

We explicitly chose offline models for everything, so this program should never phone home or download anything after the initial download of the LLM, embedding, and reranking models.

Installation

pip install code-similarity-engine

After installation, download the required models (~1.4 GB):

cse --download-models

Or use the dedicated command:

cse-download-models

Usage

# Basic analysis
cse ./src

# With verbose output
cse ./src -v

# Focus on specific files
cse ./src --focus "*.swift" --focus "*.py"

# Higher precision threshold
cse ./src --threshold 0.90

# With LLM analysis (explains similarities)
cse ./src --analyze

# With reranking (improves cluster quality)
cse ./src --rerank

# Output as JSON
cse ./src -o json > report.json

# Output as Markdown
cse ./src -o markdown > report.md

# Check model status
cse --model-status

Options

cse <path> [options]

Core Options:
  -t, --threshold FLOAT      Similarity threshold 0.0-1.0 (default: 0.80)
  -m, --min-cluster INT      Minimum chunks per cluster (default: 2)
  -o, --output FORMAT        text | markdown | json (default: text)
  -v, --verbose              Show progress for all stages

Filtering:
  -f, --focus PATTERN        Only analyze matching paths (repeatable)
  -e, --exclude PATTERN      Glob patterns to exclude (repeatable)
  -l, --lang LANG            Force language detection

LLM Analysis:
  --analyze / --no-analyze   Use LLM to explain clusters
  --llm-model PATH           Path to LLM GGUF model (auto-detected)
  --max-analyze INT          Max clusters to analyze (default: 20)

Reranking:
  --rerank / --no-rerank     Use reranker to improve cluster quality
  --rerank-model PATH        Path to reranker GGUF model (auto-detected)
  --rerank-threshold FLOAT   Minimum score to keep (default: 0.5)

Model Management:
  --download-models          Download all required models and exit
  --model-status             Show status of all models and exit
  --clear-cache              Clear cached embeddings
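If you want to drive `cse` from a script (for example in CI), one option is to assemble the argument list in Python. The sketch below only uses flags listed above; the `build_cse_command` helper and its parameter names are hypothetical and not part of the package.

```python
import shlex


def build_cse_command(path, threshold=None, focus=(), output=None,
                      analyze=False, rerank=False):
    """Return the argument list for a cse run (hypothetical helper)."""
    cmd = ["cse", path]
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    for pattern in focus:
        # --focus is repeatable, so emit one flag per pattern.
        cmd += ["--focus", pattern]
    if output is not None:
        cmd += ["--output", output]
    if analyze:
        cmd.append("--analyze")
    if rerank:
        cmd.append("--rerank")
    return cmd


cmd = build_cse_command("./src", threshold=0.9, focus=["*.py"],
                        output="json", rerank=True)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run it without going through a shell, so glob patterns such as `*.py` reach `cse` unexpanded.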

Language Support

Language                 Parser
Python                   tree-sitter
Swift                    tree-sitter
Rust                     tree-sitter
JavaScript/TypeScript    tree-sitter
Go                       tree-sitter
Other                    sliding window
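For languages without a tree-sitter parser, the table lists a sliding window. As a rough illustration of what sliding-window chunking means in general, the sketch below splits a file into overlapping line windows. The function name, the `window` and `stride` parameters, and their default values are assumptions for illustration; CSE's actual chunk sizes are not documented here.

```python
def sliding_window_chunks(text, window=40, stride=20):
    """Yield (start_line, chunk_text) pairs over overlapping line windows.

    Hypothetical sketch; window and stride are illustrative values.
    stride must be positive.
    """
    lines = text.splitlines()
    if not lines:
        return
    start = 0
    while True:
        # start_line is 1-based so it matches editor line numbers.
        yield start + 1, "\n".join(lines[start:start + window])
        if start + window >= len(lines):
            break  # this window already reached the end of the file
        start += stride


source = "\n".join(f"line {i}" for i in range(1, 11))
for start_line, chunk in sliding_window_chunks(source, window=4, stride=3):
    print(start_line, len(chunk.splitlines()))
```

Choosing a stride smaller than the window makes consecutive chunks overlap, so a similar region that straddles a window boundary still appears whole in at least one chunk.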

Models

CSE uses three Qwen3 GGUF models (~1.4 GB total):

Model                   Size      Purpose
Qwen3-Embedding-0.6B    610 MB    Code embeddings
Qwen3-0.6B              378 MB    LLM analysis
Qwen3-Reranker-0.6B     378 MB    Cluster quality

Models are downloaded to ~/.cache/cse/models/ on first use or via --download-models.
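Besides `cse --model-status`, you can inspect the model directory yourself. The sketch below lists GGUF files and their sizes; it assumes the models are stored as `*.gguf` files directly inside the directory, which is a guess about the layout, and the `list_gguf_models` helper is hypothetical.

```python
from pathlib import Path


def list_gguf_models(model_dir):
    """Return {file name: size in MB} for every .gguf file in model_dir.

    Hypothetical helper; assumes a flat directory of *.gguf files.
    """
    directory = Path(model_dir).expanduser()
    if not directory.is_dir():
        return {}  # nothing downloaded yet, or a different location
    return {
        p.name: round(p.stat().st_size / 1_000_000, 1)
        for p in sorted(directory.glob("*.gguf"))
    }


print(list_gguf_models("~/.cache/cse/models"))
```

An empty result only means no `*.gguf` files were found at that path; `cse --model-status` remains the authoritative check.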

License

GPL-3.0-or-later
