Matheel: A CLI and Python package for source-code similarity detection.
Matheel
Matheel is a Python package and CLI for source-code similarity. It combines semantic embeddings, lexical similarity, chunking, preprocessing, and code evaluation metrics in one workflow.
Demos
- Hugging Face Space demo: buelfhood/matheel-framework
- Gradio Colab notebook: Open in Colab
- Examples Colab notebook: Open in Colab
Installation
Matheel supports Python 3.10 through 3.12. Installation can take some time.
Base install:
pip install matheel
Optional extras:
pip install "matheel[chunking]"
pip install "matheel[chunking_code]"
pip install "matheel[metrics]"
pip install "matheel[model2vec]"
pip install "matheel[pylate]"
pip install "matheel[gradio]"
pip install "matheel[all]"
pip install "matheel[dev]"
matheel[all] installs the currently supported optional backends in one command: Chonkie code chunking, metrics runtime dependencies (RUBY graph/tree, TSED, CodeBERTScore), model2vec, PyLate, and the Gradio app dependencies.
Matheel now ships a native CodeBLEU implementation that uses tree_sitter_language_pack for parser resolution, so real syntax/dataflow scoring no longer depends on installing the pip codebleu package. The pip package is still useful for validation/comparison work if you want to cross-check the native scores on selected examples; Matheel does not currently claim exact pip parity on every possible input.
Quick Start
The repository root includes sample_pairs.zip, a small Java archive with:
- code_1.java
- code_3_plag.java
- additional plagiarised and non-plagiarised comparisons
CLI archive comparison:
matheel compare sample_pairs.zip \
--model huggingface/CodeBERTa-small-v1 \
--feature-weight semantic=0.7 \
--feature-weight levenshtein=0.3 \
--threshold 0.2 \
--num 10
Python pairwise scoring with the sample pair:
from zipfile import ZipFile
from matheel.similarity import calculate_similarity
with ZipFile("sample_pairs.zip") as archive:
    code_a = archive.read("code_1.java").decode("utf-8")
    code_b = archive.read("code_3_plag.java").decode("utf-8")

score = calculate_similarity(
    code_a,
    code_b,
    model_name="huggingface/CodeBERTa-small-v1",
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.3,
    },
)
print(round(score, 4))
Supported Scope
Supported languages:
- Chunking remains text-first and can run on any source text.
- Preprocessing heuristics and code-aware metrics are now regression-tested for a unified 20-language scope:
Java, Python, C, C++, Go, JavaScript, TypeScript, Kotlin, Scala, Swift, Solidity, Dart, PHP, Ruby, Rust, C#, Lua, Julia, R, and Objective-C (objc).
- Native CodeBLEU with real syntax/dataflow scoring covers that same 20-language scope.
Similarity features:
semantic, levenshtein, jaro_winkler, winnowing, gst, code_metric
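Conceptually, each enabled feature produces a score in [0, 1], and the final result is a weighted combination of those scores. A minimal sketch of the weighting logic (illustrative only — the placeholder scores and the normalisation step are assumptions, not Matheel's actual implementation):

```python
def combine_features(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-feature similarity scores in [0, 1]."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

# Example: a semantic score of 0.9 and a levenshtein score of 0.6,
# weighted 0.7 / 0.3 as in the Quick Start example.
combined = combine_features(
    {"semantic": 0.9, "levenshtein": 0.6},
    {"semantic": 0.7, "levenshtein": 0.3},
)
print(round(combined, 2))  # → 0.81
```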
Code metrics:
codebleu, codebleu_ngram, codebleu_weighted_ngram, codebleu_syntax, codebleu_dataflow, crystalbleu, ruby, tsed, codebertscore
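The n-gram component of CodeBLEU-style metrics is, at its core, clipped n-gram precision over token sequences. A rough illustration of that building block (a simplification for intuition, not Matheel's native codebleu code):

```python
from collections import Counter

def ngram_precision(candidate: list[str], reference: list[str], n: int = 2) -> float:
    """Clipped n-gram precision: the fraction of candidate n-grams found in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())

tokens_a = "def add ( a , b ) : return a + b".split()
tokens_b = "def add ( x , y ) : return x + y".split()
print(round(ngram_precision(tokens_a, tokens_b), 3))  # → 0.364
```

Only the shared structural bigrams (`def add`, `( :`, `: return`, etc.) survive the renaming, which is why identifier changes alone do not drive this component to zero.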
Chunking methods:
none, code, chonkie_token, chonkie_sentence, chonkie_recursive, chonkie_fast
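Token-based chunkers such as chonkie_token slide a fixed-size window over the token stream, with some overlap between neighbouring chunks so that context is not cut off at chunk boundaries. A toy sketch of the idea (chunk_tokens is a hypothetical helper; the real chunkers operate on tokenizer output, not plain strings):

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Fixed-size token windows with overlap, in the spirit of chonkie_token."""
    step = chunk_size - chunk_overlap
    return [
        tokens[i:i + chunk_size]
        for i in range(0, max(len(tokens) - chunk_overlap, 1), step)
    ]

tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
# → ['t0', 't1', 't2', 't3']
# → ['t3', 't4', 't5', 't6']
# → ['t6', 't7', 't8', 't9']
```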
Vector backends:
auto, sentence_transformers, model2vec, pylate
Single-vector similarity functions:
cosine, dot, euclidean, manhattan
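For reference, the four single-vector functions compute the following on a pair of embedding vectors (plain-Python sketches of the standard definitions; note that euclidean and manhattan are distances, and how Matheel maps them onto a similarity scale is not shown here):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot product over the norms."""
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def dot(a: list[float], b: list[float]) -> float:
    """Raw inner product; scale-sensitive, unlike cosine."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a: list[float], b: list[float]) -> float:
    """L2 distance (a dissimilarity, not a similarity)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a: list[float], b: list[float]) -> float:
    """L1 distance (a dissimilarity, not a similarity)."""
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = [1.0, 0.0], [1.0, 1.0]
print(round(cosine(a, b), 4))  # → 0.7071
```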
Sentence Transformers pooling methods:
mean, max, cls, lasttoken, mean_sqrt_len_tokens, weightedmean
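Each pooling method collapses a (num_tokens, dim) matrix of token embeddings into a single sentence vector. An illustrative NumPy sketch of the simpler ones, mirroring the usual Sentence Transformers pooling semantics (weightedmean is omitted here because it requires position weights; this is not Matheel's code):

```python
import numpy as np

def pool(token_embeddings: np.ndarray, method: str) -> np.ndarray:
    """Collapse (num_tokens, dim) token embeddings into one sentence vector."""
    if method == "mean":
        return token_embeddings.mean(axis=0)
    if method == "max":
        return token_embeddings.max(axis=0)
    if method == "cls":
        return token_embeddings[0]   # first token's embedding
    if method == "lasttoken":
        return token_embeddings[-1]  # last token's embedding
    if method == "mean_sqrt_len_tokens":
        return token_embeddings.sum(axis=0) / np.sqrt(len(token_embeddings))
    raise ValueError(f"unknown pooling method: {method}")

emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(pool(emb, "max"))  # → [1. 1.]
```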
auto inspects Hugging Face model metadata and routes to the correct backend when the model exposes a known library.
CLI
Compare a directory or ZIP archive:
matheel compare codes/ \
--model huggingface/CodeBERTa-small-v1 \
--vector-backend auto \
--max-token-length 256 \
--feature-weight semantic=0.6 \
--feature-weight levenshtein=0.2 \
--feature-weight jaro_winkler=0.1 \
--feature-weight code_metric=0.1 \
--similarity-function dot \
--pooling-method max \
--preprocess-mode basic \
--chunking-method code \
--chunk-language python \
--chunker-option include_line_numbers=true \
--code-metric codebleu \
--code-language python \
--threshold 0.5 \
--num 25
Run multiple configurations:
matheel compare-suite codes/ runs.json \
--summary-out results/summary.csv \
--details-dir results/runs
Python API
Pairwise scoring:
from matheel.similarity import calculate_similarity
code_a = "def add(a, b):\n return a + b\n"
code_b = "def add(x, y):\n return x + y\n"
score = calculate_similarity(
    code_a,
    code_b,
    model_name="huggingface/CodeBERTa-small-v1",
    vector_backend="auto",
    max_token_length=256,
    similarity_function="dot",
    pooling_method="max",
    preprocess_mode="basic",
    chunking_method="code",
    chunk_language="python",
    code_metric="codebleu",
    code_language="python",
    feature_weights={"semantic": 0.5, "code_metric": 0.5},
)
Directory or ZIP ranking:
from matheel.similarity import get_sim_list
results = get_sim_list(
    "sample_pairs.zip",
    model_name="huggingface/CodeBERTa-small-v1",
    threshold=0.4,
    number_results=10,
    vector_backend="auto",
    max_token_length=256,
    chunking_method="chonkie_token",
    chunk_size=120,
    chunk_overlap=20,
    similarity_function="cosine",
    pooling_method="mean",
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.15,
        "jaro_winkler": 0.05,
        "winnowing": 0.05,
        "gst": 0.05,
    },
)
print(results.head())
print(results.head())
Docs
- Docs folder landing page: docs/README.md
- Canonical docs index: docs/index.md
- Quick usage: docs/usage.md
- Preprocessing: docs/preprocessing.md
- Chunking: docs/chunking.md
- Vectors and routing: docs/vectors.md
- Lexical metrics and baselines: docs/lexical.md
- Code metrics: docs/code_metrics.md
- Comparison suite: docs/comparison_suite.md
- Release checklist: docs/release_checklist.md
Testing
Install the development dependencies before running the test suite:
pip install -e ".[dev]"
Default test runs are intended to be fast and offline-friendly:
pytest
Run the Ruff lint check before opening a pull request:
ruff check .
Real-model integration tests are opt-in because they may need optional backends, cached model weights, or network access:
pytest -m integration
Examples
- Colab walkthrough: examples/matheel_examples_colab.ipynb
- Quick archive check: examples/sample_pairs_demo.py
- Preprocessing: examples/preprocessing_demo.py
- Chunking: examples/chunking_demo.py
- Vector backends: examples/vectors_demo.py
- Lexical metrics and baselines: examples/lexical_demo.py
- Code metrics: examples/code_metrics_demo.py
- Comparison suite: examples/comparison_suite_demo.py
All examples use the same sample pair from sample_pairs.zip: code_1.java and code_3_plag.java.
Gradio
The Gradio demo stays in gradio_app/ and is aligned with the Hugging Face Space setup. The UI supports embeddings, lexical metrics, baseline algorithms (Winnowing and GST), and the code-aware metrics (CodeBLEU, CrystalBLEU, RUBY, TSED, CodeBERTScore), with metric-specific advanced fields. The core package and CLI can read either a ZIP archive or a directory; the Gradio upload flow remains ZIP-based.
Acknowledgments
Matheel builds on several open-source libraries:
- Sentence Transformers
- Chonkie
- model2vec
- PyLate
- RapidFuzz
- tree-sitter-language-pack
- NetworkX
- APTED
- bert-score
- Gradio
The project also depends on the standard scientific Python stack and related tooling, including NumPy, pandas, Click, SentencePiece, and func-timeout.
License
This project is licensed under the MIT License.
File details
Details for the file matheel-0.3.5.tar.gz.
File metadata
- Download URL: matheel-0.3.5.tar.gz
- Upload date:
- Size: 68.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e4e6f4d5f8c6a99b77a817aa60287aed69212e8f732245c295dcbc96a7e01ae |
| MD5 | 3e96bb9bcb14614aa8bae49bb5826abc |
| BLAKE2b-256 | 1edfcfc4c8013e09b8f7009771de75a189bd9afd514e6248b5b011e33a94452b |
Provenance
The following attestation bundles were made for matheel-0.3.5.tar.gz:
Publisher: publish.yml on FahadEbrahim/matheel

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: matheel-0.3.5.tar.gz
- Subject digest: 1e4e6f4d5f8c6a99b77a817aa60287aed69212e8f732245c295dcbc96a7e01ae
- Sigstore transparency entry: 1418951633
- Sigstore integration time:
- Permalink: FahadEbrahim/matheel@55ee9c4915b9b54f26de69ad5314fca45e9b582d
- Branch / Tag: refs/tags/v0.3.5
- Owner: https://github.com/FahadEbrahim
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@55ee9c4915b9b54f26de69ad5314fca45e9b582d
- Trigger Event: release
File details
Details for the file matheel-0.3.5-py3-none-any.whl.
File metadata
- Download URL: matheel-0.3.5-py3-none-any.whl
- Upload date:
- Size: 52.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2629824e6befbce59722de5fc9c8ddd78a64f22b89e4c12297ab503ddc3e4701 |
| MD5 | 0b418280910c8e193ae6fb477bbc24ea |
| BLAKE2b-256 | 4467e4973c34dc582e1c5fe6b35e9fe6cac7975aed38c6fab1b6ca731ea19d09 |
Provenance
The following attestation bundles were made for matheel-0.3.5-py3-none-any.whl:
Publisher: publish.yml on FahadEbrahim/matheel

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: matheel-0.3.5-py3-none-any.whl
- Subject digest: 2629824e6befbce59722de5fc9c8ddd78a64f22b89e4c12297ab503ddc3e4701
- Sigstore transparency entry: 1418951724
- Sigstore integration time:
- Permalink: FahadEbrahim/matheel@55ee9c4915b9b54f26de69ad5314fca45e9b582d
- Branch / Tag: refs/tags/v0.3.5
- Owner: https://github.com/FahadEbrahim
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@55ee9c4915b9b54f26de69ad5314fca45e9b582d
- Trigger Event: release