Code Similarity Engine

Code Abstraction Opportunities Emerge Naturally

Code that performs similar functions can often be abstracted, but traditional tools are not robust to syntactic differences and therefore miss many abstraction opportunities. Code Similarity Engine (CSE) is a simple CLI tool that performs semantic embedding, comparison, and then k-clustering to find regions of code with high semantic similarity. Because it compares meaning rather than syntax, and because reranking and a minimum similarity threshold filter out weak matches, it can find abstraction candidates reliably and systematically. On its own, that shows you where you can make abstractions, but not how. Using Qwen3 0.6B locally, CSE goes a step further and provides a step-by-step guide to abstracting these similarities, reducing code duplication and giving your project a clearer design and syntax.
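The embed, compare, and cluster idea above can be illustrated with a minimal sketch. This is not CSE's implementation: the toy vectors stand in for real embeddings, the function names `cosine` and `group_similar` are hypothetical, and the grouping step is a simple union-find over pairs that pass the threshold, which may differ from the clustering the tool actually uses.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_similar(vectors, threshold=0.80, min_cluster=2):
    """Group vector indices that are linked by pairs at or above the threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that is similar enough.
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    # Drop groups smaller than the minimum cluster size.
    return sorted(g for g in groups.values() if len(g) >= min_cluster)


# Toy 3-dimensional vectors standing in for real code-chunk embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
clusters = group_similar(toy_embeddings)
print(clusters)
```

The defaults mirror the documented `--threshold` (0.80) and `--min-cluster` (2) values; raising the threshold yields fewer, tighter groups.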

ALWAYS Private, NO TELEMETRY!

We explicitly chose offline models for everything, so this program should never phone home or download anything after the initial download of the LLM, embedding, and reranking models.

Installation

pip install code-similarity-engine

After installation, download the required models (~1.4 GB):

cse --download-models

Or use the dedicated command:

cse-download-models

Usage

# Basic analysis
cse ./src

# With verbose output
cse ./src -v

# Focus on specific files
cse ./src --focus "*.swift" --focus "*.py"

# Higher precision threshold
cse ./src --threshold 0.90

# With LLM analysis (explains similarities)
cse ./src --analyze

# With reranking (improves cluster quality)
cse ./src --rerank

# Output as JSON
cse ./src -o json > report.json

# Output as Markdown
cse ./src -o markdown > report.md

# Check model status
cse --model-status

Options

cse <path> [options]

Core Options:
  -t, --threshold FLOAT      Similarity threshold 0.0-1.0 (default: 0.80)
  -m, --min-cluster INT      Minimum chunks per cluster (default: 2)
  -o, --output FORMAT        text | markdown | json (default: text)
  -v, --verbose              Show progress for all stages

Filtering:
  -f, --focus PATTERN        Only analyze matching paths (repeatable)
  -e, --exclude PATTERN      Glob patterns to exclude (repeatable)
  -l, --lang LANG            Force language detection

LLM Analysis:
  --analyze / --no-analyze   Use LLM to explain clusters
  --llm-model PATH           Path to LLM GGUF model (auto-detected)
  --max-analyze INT          Max clusters to analyze (default: 20)

Reranking:
  --rerank / --no-rerank     Use reranker to improve cluster quality
  --rerank-model PATH        Path to reranker GGUF model (auto-detected)
  --rerank-threshold FLOAT   Minimum score to keep (default: 0.5)

Model Management:
  --download-models          Download all required models and exit
  --model-status             Show status of all models and exit
  --clear-cache              Clear cached embeddings
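If you want to drive the CLI from a script, the options above can be assembled into an argument list. The helper below is a sketch only: `build_cse_command` is a hypothetical name, not part of the package, and it covers just a handful of the documented flags.

```python
import shlex


def build_cse_command(path, threshold=None, focus=(), exclude=(),
                      output=None, analyze=False, rerank=False):
    """Assemble a cse argument list from a few of the documented options."""
    cmd = ["cse", path]
    if threshold is not None:
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        cmd += ["--threshold", str(threshold)]
    # --focus and --exclude are repeatable, so emit one flag per pattern.
    for pattern in focus:
        cmd += ["--focus", pattern]
    for pattern in exclude:
        cmd += ["--exclude", pattern]
    if output is not None:
        if output not in ("text", "markdown", "json"):
            raise ValueError("output must be text, markdown, or json")
        cmd += ["--output", output]
    if analyze:
        cmd.append("--analyze")
    if rerank:
        cmd.append("--rerank")
    return cmd


cmd = build_cse_command("./src", threshold=0.9,
                        focus=["*.swift", "*.py"], output="json", rerank=True)
print(shlex.join(cmd))
```

Once `cse` is installed, the resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids glob patterns being expanded by the shell.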

Language Support

Language                 Parser
Python                   tree-sitter
Swift                    tree-sitter
Rust                     tree-sitter
JavaScript/TypeScript    tree-sitter
Go                       tree-sitter
Other                    sliding window
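For languages without a tree-sitter parser, a sliding window splits a file into overlapping chunks of lines. The sketch below shows the general technique; the function name `sliding_window_chunks` and the window and overlap sizes are assumptions for illustration, not the values CSE uses.

```python
def sliding_window_chunks(text, window=20, overlap=5):
    """Split text into overlapping line-based chunks.

    Returns a list of (start_line, end_line, chunk_text) tuples,
    with 1-indexed, inclusive line numbers.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    lines = text.splitlines()
    step = window - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(lines), step):
        block = lines[start:start + window]
        chunks.append((start + 1, start + len(block), "\n".join(block)))
        # Stop once the window has reached the end of the file.
        if start + window >= len(lines):
            break
    return chunks


sample = "\n".join(f"line {n}" for n in range(1, 11))
chunks = sliding_window_chunks(sample, window=4, overlap=2)
for start, end, _ in chunks:
    print(start, end)
```

The overlap means a function that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text between neighbours.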

Models

CSE uses three Qwen3 GGUF models (~1.4 GB total):

Model                   Size      Purpose
Qwen3-Embedding-0.6B    610 MB    Code embeddings
Qwen3-0.6B              378 MB    LLM analysis
Qwen3-Reranker-0.6B     378 MB    Cluster quality

Models are downloaded to ~/.cache/cse/models/ on first use or via --download-models.
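To see what is taking up space in the model directory, a short script can list the files there. This is a sketch under assumptions: it supposes the models sit directly in that directory with a `.gguf` extension, and `list_models` is a hypothetical helper. `cse --model-status` remains the supported way to check model state.

```python
from pathlib import Path

# Default location stated above; the flat *.gguf layout is an assumption.
DEFAULT_MODELS_DIR = Path.home() / ".cache" / "cse" / "models"


def list_models(models_dir=DEFAULT_MODELS_DIR):
    """Return sorted (file name, size in MB) pairs for .gguf files in models_dir."""
    models_dir = Path(models_dir)
    if not models_dir.is_dir():
        return []  # nothing downloaded yet, or a different layout
    return sorted(
        (p.name, round(p.stat().st_size / 1e6, 1))
        for p in models_dir.glob("*.gguf")
    )


for name, size_mb in list_models():
    print(f"{name}: {size_mb} MB")
```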

License

GPL-3.0-or-later
