Matheel: A CLI and Python package for source-code similarity detection.
Matheel
Matheel is a simple, function-based Python package and CLI for source-code similarity. It combines semantic embeddings, lexical similarity, chunking, preprocessing, and code-aware metrics without forcing a class-heavy API.
Installation
Use Python 3.10 to 3.12.
Base install:
pip install matheel
Optional extras:
pip install "matheel[chunking]"
pip install "matheel[chunking_code]"
pip install "matheel[metrics]"
pip install "matheel[model2vec]"
pip install "matheel[pylate]"
pip install "matheel[gradio]"
pip install "matheel[all]"
pip install "matheel[dev]"
matheel[all] installs the currently supported optional backends in one command: Chonkie code chunking, metrics runtime dependencies (RUBY graph/tree, TSED, CodeBERTScore), model2vec, PyLate, and the Gradio app dependencies.
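Because each extra only activates when its backend is importable, you can check which optional backends are present in the current environment with the standard library. This is a generic sketch; the module names below are assumptions inferred from the extras list, not an official Matheel API:

```python
from importlib.util import find_spec

# Assumed module names for the optional backends; consult the
# project's pyproject.toml extras for the authoritative list.
optional_backends = {
    "chunking": "chonkie",
    "model2vec": "model2vec",
    "pylate": "pylate",
    "gradio": "gradio",
}

available = {extra: find_spec(module) is not None
             for extra, module in optional_backends.items()}
print(available)
```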
Quick Start
The repo includes a small Java archive at sample_pairs.zip for quick validation.
CLI:
matheel compare sample_pairs.zip \
--model huggingface/CodeBERTa-small-v1 \
--feature-weight semantic=0.7 \
--feature-weight levenshtein=0.3 \
--threshold 0.2 \
--num 10
Python:
from matheel.similarity import get_sim_list
results = get_sim_list(
    "sample_pairs.zip",
    model_name="huggingface/CodeBERTa-small-v1",
    threshold=0.2,
    number_results=10,
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.3,
    },
)
print(results.head())
Supported Languages
- Chunking is language-agnostic by default because it can split any text.
- CodeBLEU-style metrics are intentionally scoped to Java, Python, C, and C++.
- Generic preprocessing works across languages, but code-aware metrics are most defensible in that four-language scope.
Supported Methods
Similarity features:
semantic, levenshtein, jaro_winkler, winnowing, gst, code_metric
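The levenshtein feature is conventionally a normalized edit-distance similarity. A minimal self-contained sketch of that idea (illustrative only, not Matheel's internal implementation):

```python
def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    # Map the distance into a [0, 1] similarity score.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))
```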
Code metrics:
codebleu, codebleu_ngram, codebleu_weighted_ngram, codebleu_syntax, codebleu_dataflow, crystalbleu, ruby, tsed, codebertscore
ruby now uses a full staged implementation (graph -> tree -> string), with optional runtime dependencies enabled via matheel[metrics].
Chunking methods:
- none
- Chonkie-backed when installed: code, chonkie_token, chonkie_sentence, chonkie_recursive, chonkie_fast
Vector backends:
auto, sentence_transformers, model2vec, pylate
Single-vector similarity functions:
cosine, dot, euclidean, manhattan
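These four names correspond to the standard vector comparisons. A plain-Python sketch of each (Matheel's actual implementation may differ, e.g. in how distances are converted into similarity scores):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of norms.
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)

def dot(u, v):
    # Unnormalized inner product.
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    # L2 distance: smaller means more alike.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # L1 distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))
```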
Sentence Transformers pooling methods:
mean, max, cls, lasttoken, mean_sqrt_len_tokens, weightedmean
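Pooling collapses a matrix of per-token embeddings into one sentence vector. The two simplest methods can be sketched as follows (an illustrative sketch of the concept, not Matheel's code):

```python
def mean_pooling(token_vectors):
    # Average each dimension across all token embeddings.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def max_pooling(token_vectors):
    # Take the per-dimension maximum across token embeddings.
    dim = len(token_vectors[0])
    return [max(vec[d] for vec in token_vectors) for d in range(dim)]
```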
auto inspects Hugging Face model metadata and routes to the correct backend when the model exposes a known library.
Core Parts
- Preprocessing: whitespace and comment normalization before any scoring.
- Chunking: Chonkie-backed document splitting with per-method options.
- Vectors: dense single-vector, learned static single-vector, and multivector late interaction.
- Lexical metrics and baselines: normalized Levenshtein, Jaro-Winkler, Winnowing, and Greedy String Tiling.
- Code metrics: built-in CodeBLEU-style metrics and CrystalBLEU.
- Comparison suite: run multiple configurations, rank them, and optionally write summary/detail artifacts.
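The feature weights seen throughout the examples combine the per-feature scores into a single ranking score. Assuming a simple weight-normalized sum (a plausible sketch, not necessarily Matheel's exact formula):

```python
def combine_features(scores: dict, weights: dict) -> float:
    # Weighted average of the feature scores, normalized by total weight.
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total

combined = combine_features(
    {"semantic": 0.92, "levenshtein": 0.61},
    {"semantic": 0.7, "levenshtein": 0.3},
)
```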
CLI
Compare a directory or ZIP archive:
matheel compare codes/ \
--model huggingface/CodeBERTa-small-v1 \
--vector-backend auto \
--max-token-length 256 \
--feature-weight semantic=0.6 \
--feature-weight levenshtein=0.2 \
--feature-weight jaro_winkler=0.1 \
--feature-weight code_metric=0.1 \
--similarity-function dot \
--pooling-method max \
--preprocess-mode basic \
--chunking-method code \
--chunk-language python \
--chunker-option include_line_numbers=true \
--code-metric codebleu \
--code-language python \
--threshold 0.5 \
--num 25
Run multiple configurations:
matheel compare-suite codes/ runs.json \
--summary-out results/summary.csv \
--details-dir results/runs
Python API
Pairwise scoring:
from matheel.similarity import calculate_similarity
score = calculate_similarity(
    "def add(a, b):\n return a + b\n",
    "def add(x, y):\n return x + y\n",
    model_name="huggingface/CodeBERTa-small-v1",
    vector_backend="auto",
    max_token_length=256,
    similarity_function="dot",
    pooling_method="max",
    preprocess_mode="basic",
    chunking_method="code",
    chunk_language="python",
    code_metric="codebleu",
    code_language="python",
    feature_weights={"semantic": 0.5, "code_metric": 0.5},
)
Directory or ZIP ranking:
from matheel.similarity import get_sim_list
results = get_sim_list(
    "sample_codes",
    model_name="huggingface/CodeBERTa-small-v1",
    threshold=0.4,
    number_results=50,
    vector_backend="auto",
    max_token_length=256,
    chunking_method="chonkie_token",
    chunk_size=120,
    chunk_overlap=20,
    similarity_function="cosine",
    pooling_method="mean",
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.15,
        "jaro_winkler": 0.05,
        "winnowing": 0.05,
        "gst": 0.05,
    },
)
Hugging Face Routing
Matheel can inspect Hugging Face model metadata and route automatically:
- sentence-transformers models go to the Sentence Transformers path
- model2vec models go to the model2vec static path
- PyLate models go to the multivector late-interaction path
If metadata is unavailable, Matheel falls back to simple name and tag heuristics, then defaults to the Sentence Transformers path.
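That fallback cascade can be pictured as a routing function over whatever metadata is known. The function below is a hypothetical sketch of the described behavior; the names and heuristic rules are assumptions, not Matheel's actual code:

```python
def route_backend(library_name=None, tags=(), model_name=""):
    """Pick a vector backend from Hugging Face metadata, falling back
    to name/tag heuristics and finally to Sentence Transformers."""
    known = {
        "sentence-transformers": "sentence_transformers",
        "model2vec": "model2vec",
        "pylate": "pylate",
    }
    if library_name in known:          # explicit library metadata wins
        return known[library_name]
    for tag in tags:                   # tag heuristics
        if tag in known:
            return known[tag]
    lowered = model_name.lower()       # name heuristics
    if "model2vec" in lowered:
        return "model2vec"
    if "pylate" in lowered or "colbert" in lowered:
        return "pylate"
    return "sentence_transformers"     # default path
```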
Docs
- Docs index: docs/index.md
- Quick usage: docs/usage.md
- Preprocessing: docs/preprocessing.md
- Chunking: docs/chunking.md
- Vectors: docs/vectors.md
- Edit distance and feature weights: docs/lexical.md
- Code metrics: docs/code_metrics.md
- Comparison suite: docs/comparison_suite.md
The docs/ folder is already structured for a later GitHub Pages setup if you decide to publish the docs site from the repository.
Examples
- Quick archive check: examples/sample_pairs_demo.py
- Preprocessing: examples/preprocessing_demo.py
- Chunking: examples/chunking_demo.py
- Vector backends: examples/vectors_demo.py
- Edit distance and feature weights: examples/lexical_demo.py
- Code metrics: examples/code_metrics_demo.py
- Comparison suite: examples/comparison_suite_demo.py
Gradio
The Gradio demo stays in gradio_app/ and is aligned with the Hugging Face Space setup. The UI supports embeddings, lexical metrics, baseline algorithms (Winnowing and GST), and the code-aware metrics (CodeBLEU, CrystalBLEU, RUBY, TSED, CodeBERTScore), with metric-specific advanced fields. The core package and CLI can read either a ZIP archive or a directory; the Gradio upload flow remains ZIP-based.
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
File details
Details for the file matheel-0.3.1.tar.gz.
File metadata
- Download URL: matheel-0.3.1.tar.gz
- Upload date:
- Size: 54.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9eb63f03738302745aa9c3e1513613cdf05a4941c24cf555a38238c45672e72e |
| MD5 | b43e4c4e6c36350f21126241590a1372 |
| BLAKE2b-256 | 8417b98ff98db207514d30f50e9972875c834eb7dd2af9beefa4175868785e85 |
File details
Details for the file matheel-0.3.1-py3-none-any.whl.
File metadata
- Download URL: matheel-0.3.1-py3-none-any.whl
- Upload date:
- Size: 45.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 399b004d9c7851db2ff8341fec2ed84936848a2e111a005c27124a98be49c7dd |
| MD5 | a982d20a7802d35d7591816da3236088 |
| BLAKE2b-256 | e49b95371a2fc83aac5ef1d2f24879a69947ed92937b18307070313a1abb5431 |