Code Similarity Engine
Code Abstraction Opportunities Emerge Naturally
Code that performs similar functions can often be abstracted, but traditional tools are not robust against syntactic differences and therefore miss many abstraction opportunities. Code Similarity Engine is a simple CLI tool that performs semantic embedding, comparison, and then k-clustering to find regions of code with high semantic similarity. Because it compares meaning rather than syntax, and because reranking and a minimum similarity threshold filter out weak matches, it can find abstraction candidates reliably and systematically. On its own, this shows where you can make abstractions, but not how. Using Qwen3 0.6B locally, the tool goes further and provides a step-by-step guide to abstracting these similarities, reducing code duplication and giving your project a clearer design and syntax.
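To make the embed, compare, and cluster pipeline concrete, here is a minimal sketch in plain Python. It is not the tool's actual implementation: the toy 3-dimensional vectors stand in for real Qwen3 embeddings, and the grouping step (linking any two chunks whose cosine similarity meets the threshold) is an assumed stand-in for whatever clustering CSE performs internally. The names `cosine` and `group_similar` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_similar(embeddings, threshold=0.80, min_cluster=2):
    """Group chunk indices linked by similarity >= threshold (union-find).

    Assumption: this linkage rule is illustrative only, not CSE's algorithm.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # Drop groups smaller than min_cluster (mirrors the --min-cluster option).
    return sorted(c for c in clusters.values() if len(c) >= min_cluster)


# Toy vectors standing in for real code embeddings.
toy = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
clusters = group_similar(toy)
```

With these toy vectors, chunks 0 and 1 end up in one group and chunks 2 and 3 in another, while chunk 4 has no close neighbor and is dropped by the minimum cluster size.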
ALWAYS Private, NO TELEMETRY!
We explicitly chose to use offline models for everything, so this program should never phone home or download anything after the initial download of the LLM, embedding, and reranking models.
Installation
pip install code-similarity-engine
After installation, download the required models (~1.4 GB):
cse --download-models
Or use the dedicated command:
cse-download-models
Usage
# Basic analysis
cse ./src
# With verbose output
cse ./src -v
# Focus on specific files
cse ./src --focus "*.swift" --focus "*.py"
# Higher precision threshold
cse ./src --threshold 0.90
# With LLM analysis (explains similarities)
cse ./src --analyze
# With reranking (improves cluster quality)
cse ./src --rerank
# Output as JSON
cse ./src -o json > report.json
# Output as Markdown
cse ./src -o markdown > report.md
# Check model status
cse --model-status
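If you want to drive `cse` from a script, the commands above can be assembled programmatically. The sketch below uses only flags shown in this README; the helper names `build_cse_command` and `run_json_report` are hypothetical, and it assumes that `-o json` writes nothing but the JSON report to stdout (as the redirect to `report.json` suggests). The structure of the report is not assumed.

```python
import json
import subprocess


def build_cse_command(path, threshold=None, focus=(), output=None, rerank=False):
    """Assemble an argv list using only flags documented in this README."""
    cmd = ["cse", path]
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    for pattern in focus:
        # --focus is repeatable, so emit one flag per pattern.
        cmd += ["--focus", pattern]
    if rerank:
        cmd.append("--rerank")
    if output is not None:
        cmd += ["-o", output]
    return cmd


def run_json_report(path, **options):
    """Run cse and parse stdout as JSON (assumes stdout holds only the report)."""
    cmd = build_cse_command(path, output="json", **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Passing the arguments as a list rather than a shell string means glob patterns such as `*.py` reach `cse` unexpanded, which is what the quoted patterns in the shell examples achieve.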
Options
cse <path> [options]
Core Options:
-t, --threshold FLOAT Similarity threshold 0.0-1.0 (default: 0.80)
-m, --min-cluster INT Minimum chunks per cluster (default: 2)
-o, --output FORMAT text | markdown | json (default: text)
-v, --verbose Show progress for all stages
Filtering:
-f, --focus PATTERN Only analyze matching paths (repeatable)
-e, --exclude PATTERN Glob patterns to exclude (repeatable)
-l, --lang LANG Force language detection
LLM Analysis:
--analyze / --no-analyze Use LLM to explain clusters
--llm-model PATH Path to LLM GGUF model (auto-detected)
--max-analyze INT Max clusters to analyze (default: 20)
Reranking:
--rerank / --no-rerank Use reranker to improve cluster quality
--rerank-model PATH Path to reranker GGUF model (auto-detected)
--rerank-threshold FLOAT Minimum score to keep (default: 0.5)
Model Management:
--download-models Download all required models and exit
--model-status Show status of all models and exit
--clear-cache Clear cached embeddings
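As a rough illustration of how `--rerank-threshold` and `--min-cluster` might interact, the sketch below drops cluster members whose reranker score falls below the minimum and then discards clusters that have become too small. This ordering is an assumption, not confirmed behavior, and the function name, the `(chunk_id, score)` representation, and the sample chunk labels are all hypothetical.

```python
def filter_cluster(members, rerank_threshold=0.5, min_cluster=2):
    """Keep members scoring >= rerank_threshold; return None if too few remain.

    members: list of (chunk_id, rerank_score) pairs (hypothetical format).
    """
    kept = [chunk_id for chunk_id, score in members if score >= rerank_threshold]
    return kept if len(kept) >= min_cluster else None


# Hypothetical chunk labels and scores, for illustration only.
strong = [("a.py:10", 0.91), ("b.py:42", 0.77), ("c.py:5", 0.31)]
weak = [("d.py:1", 0.62), ("e.py:9", 0.12)]

kept_strong = filter_cluster(strong)  # two members survive the threshold
kept_weak = filter_cluster(weak)      # only one survives, so the cluster is dropped
```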
Language Support
| Language | Parser |
|---|---|
| Python | tree-sitter |
| Swift | tree-sitter |
| Rust | tree-sitter |
| JavaScript/TypeScript | tree-sitter |
| Go | tree-sitter |
| Other | sliding window |
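For languages without a tree-sitter parser, the table lists a sliding-window fallback. A minimal sketch of that idea is shown below: the file is cut into overlapping line ranges so that a similar region is unlikely to be split across a chunk boundary. The window size of 20 lines and the overlap of 5 lines are assumptions chosen for illustration, not the tool's actual settings, and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, window=20, overlap=5):
    """Split text into overlapping chunks of `window` lines.

    Returns (start_line, end_line, chunk_text) tuples with 1-based, inclusive
    line numbers. Window and overlap defaults are illustrative assumptions.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    lines = text.splitlines()
    step = window - overlap
    chunks = []
    for start in range(0, len(lines), step):
        end = min(start + window, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            # The last window already reaches the end of the file.
            break
    return chunks


sample = "\n".join(f"line {i}" for i in range(1, 51))
chunks = sliding_window_chunks(sample)
```

For the 50-line sample this yields three chunks covering lines 1 to 20, 16 to 35, and 31 to 50.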
Models
CSE uses three Qwen3 GGUF models (~1.4 GB total):
| Model | Size | Purpose |
|---|---|---|
| Qwen3-Embedding-0.6B | 610 MB | Code embeddings |
| Qwen3-0.6B | 378 MB | LLM analysis |
| Qwen3-Reranker-0.6B | 378 MB | Cluster quality |
Models are downloaded to ~/.cache/cse/models/ on first use or via --download-models.
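`cse --model-status` is the supported way to check the models. If you only want to see what is on disk, a small script like the following lists the GGUF files in the cache directory. It assumes the models are stored as `.gguf` files directly inside `~/.cache/cse/models/`; the actual file names and layout are not documented here, and `list_local_models` is a hypothetical helper.

```python
from pathlib import Path


def list_local_models(models_dir=None):
    """Return sorted (file name, size in MB) pairs for .gguf files in models_dir.

    Assumes a flat directory of .gguf files; the real layout may differ.
    """
    if models_dir is None:
        models_dir = Path.home() / ".cache" / "cse" / "models"
    models_dir = Path(models_dir)
    if not models_dir.is_dir():
        return []
    return sorted(
        (p.name, round(p.stat().st_size / 1_000_000, 1))
        for p in models_dir.glob("*.gguf")
    )


if __name__ == "__main__":
    for name, size_mb in list_local_models():
        print(f"{name}: {size_mb} MB")
```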
License
GPL-3.0-or-later
Download files
File details
Details for the file code_similarity_engine-0.1.0.tar.gz.
File metadata
- Download URL: code_similarity_engine-0.1.0.tar.gz
- Upload date:
- Size: 42.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70bfc1696268c6912ba55606f43577a56436ac93badd344d99215df3f4c49864 |
| MD5 | 1125f63e6547c5d0ba90c3fdd6a95bad |
| BLAKE2b-256 | d239cdce05005c2d62321a032eb999d793c61ed5e95930fb121c1cf312a30251 |
File details
Details for the file code_similarity_engine-0.1.0-py3-none-any.whl.
File metadata
- Download URL: code_similarity_engine-0.1.0-py3-none-any.whl
- Upload date:
- Size: 51.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b4fe244203d2057d25ca12b429d0d63ad079a7c7b98160386935adf12e46c2fe |
| MD5 | ce15abf066fb8203b8f1888735bf51a7 |
| BLAKE2b-256 | 723d8bfc0b45d48cb09a9d7432ef613ad16da59581f57a8771b8c377f2759d04 |