Matheel: A CLI and Python package for source-code similarity detection.
This is the repository for the demonstration paper "Matheel: A Hybrid Source Code Plagiarism Detection Software".
Matheel
Matheel is a Python package for source-code similarity analysis. It keeps a simple, function-based interface while supporting preprocessing, chunking, multiple vector backends, and code-aware metrics.
Features
- Semantic similarity with transformer embeddings, static hashed vectors, or multivector late interaction.
- Lexical similarity with Levenshtein and Jaro-Winkler components.
- Optional code-aware scoring with CodeBLEU-style components and CrystalBLEU.
- Shared core reused by the CLI, Python API, and Gradio app.
- Comparison suite for running multiple configurations and writing publication-friendly summary tables.
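To make the "static hashed vectors" idea concrete, here is a minimal sketch of the hashing trick followed by cosine similarity. It is an illustration under stated assumptions, not Matheel's implementation: the function names `hashed_vector` and `cosine`, the regex tokenizer, and the use of MD5 as a stable hash are all hypothetical choices made for this example.

```python
import hashlib
import math
import re


def hashed_vector(code: str, dim: int = 512) -> list[float]:
    """Map code tokens into a fixed-size count vector via the hashing trick."""
    vec = [0.0] * dim
    # Assumed tokenizer: identifiers, integer literals, and single symbols.
    for token in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        # A stable hash (unlike the built-in hash()) gives the same bucket on every run.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


same = cosine(hashed_vector("int value = 1;"), hashed_vector("int value = 1;"))
different = cosine(hashed_vector("int value = 1;"), hashed_vector("print('hello')"))
print(round(same, 3), round(different, 3))
```

No model download is needed for a vector like this, which is why a hashed backend is attractive for fast baselines.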
Installation
Use Python 3.10 to 3.12. For Apple Silicon, a clean Python 3.12 virtual environment is the safest default.
Example local setup:
python3.12 -m venv .venv
. .venv/bin/activate
Base install:
pip install matheel
Install optional CodeBLEU package support:
pip install "matheel[metrics]"
Install development tools:
pip install "matheel[dev]"
CLI Usage
Basic comparison over a ZIP archive or a directory:
matheel compare codes/ \
  --model Salesforce/codet5p-110m-embedding \
  --preprocess-mode basic \
  --chunking-method tokens \
  --chunk-size 120 \
  --vector-backend multivector \
  --code-metric codebleu \
  --code-metric-weight 0.2 \
  --threshold 0.5 \
  --num 50
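The effect of `--chunking-method tokens` with a given `--chunk-size` can be pictured with the sketch below. It assumes whitespace-separated tokens, which may differ from the tokenization Matheel actually uses, and `chunk_tokens` is a hypothetical helper name, not part of the package.

```python
def chunk_tokens(code: str, chunk_size: int = 120) -> list[str]:
    """Split code into consecutive chunks of at most chunk_size whitespace tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = code.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


sample = "int total = 0 ; for ( int i = 0 ; i < n ; i ++ ) total += i ;"
chunks = chunk_tokens(sample, chunk_size=8)
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Each chunk can then be embedded separately, which is what a multivector backend needs as input.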
Run a comparison suite from a JSON config file:
matheel compare-suite codes/ runs.json \
  --summary-out results/summary.csv \
  --details-dir results/runs \
  --format csv
Example runs.json:
[
  {
    "run_name": "baseline",
    "model_name": "Salesforce/codet5p-110m-embedding",
    "number_results": 25
  },
  {
    "run_name": "mv_codebleu",
    "model_name": "Salesforce/codet5p-110m-embedding",
    "chunking_method": "tokens",
    "chunk_size": 120,
    "vector_backend": "multivector",
    "code_metric": "codebleu",
    "code_metric_weight": 0.2,
    "number_results": 25
  }
]
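When a suite has many runs, it can be easier to generate the config than to write it by hand. The sketch below builds the same two runs shown above with the standard `json` module. The uniqueness check on `run_name` is a precaution added for this example, not a documented Matheel requirement.

```python
import json

# Settings shared by every run in the suite.
base = {
    "model_name": "Salesforce/codet5p-110m-embedding",
    "number_results": 25,
}

runs = [
    {"run_name": "baseline", **base},
    {
        "run_name": "mv_codebleu",
        **base,
        "chunking_method": "tokens",
        "chunk_size": 120,
        "vector_backend": "multivector",
        "code_metric": "codebleu",
        "code_metric_weight": 0.2,
    },
]

# Assumed precaution: distinct names keep per-run outputs easy to tell apart.
names = [run["run_name"] for run in runs]
if len(names) != len(set(names)):
    raise ValueError("run_name values should be unique")

runs_json = json.dumps(runs, indent=2)
print(runs_json)
```

Saving `runs_json` to `runs.json` (for example with `pathlib.Path("runs.json").write_text(runs_json)`) produces a file that can be passed to `matheel compare-suite`.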
Python API Usage
Pairwise similarity:
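from matheel.similarity import calculate_similarity

# The three positional weights appear to be, in order:
# semantic (embedding), Levenshtein, and Jaro-Winkler.
score = calculate_similarity(
    "int value = 1;",  # first code snippet
    "int value = 1;",  # second code snippet
    0.7,
    0.2,
    0.1,
    "Salesforce/codet5p-110m-embedding",  # embedding model
    preprocess_mode="basic",
    vector_backend="multivector",
    chunking_method="tokens",
    chunk_size=120,
    code_metric="codebleu",
    code_metric_weight=0.2,
)
print(score)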
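The three positional numbers in the call above (0.7, 0.2, 0.1) read as weights for a hybrid score. The sketch below shows one way such a weighted combination could work, assuming the order is semantic, Levenshtein, Jaro-Winkler. The helper names are hypothetical, the Levenshtein code is a textbook dynamic-programming version, and the semantic and Jaro-Winkler inputs are placeholder values rather than computed scores.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def hybrid_score(semantic: float, levenshtein: float, jaro_winkler: float,
                 weights: tuple[float, float, float] = (0.7, 0.2, 0.1)) -> float:
    """Weighted sum of the three component similarities."""
    w_semantic, w_levenshtein, w_jaro = weights
    return w_semantic * semantic + w_levenshtein * levenshtein + w_jaro * jaro_winkler


lexical = levenshtein_similarity("int value = 1;", "int count = 1;")
# 0.9 and 0.95 stand in for a real embedding score and a real Jaro-Winkler score.
print(hybrid_score(semantic=0.9, levenshtein=lexical, jaro_winkler=0.95))
```

Weights that sum to 1 keep the combined score in the same range as its components.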
Archive-wide ranking from a ZIP file or a directory:
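from matheel.similarity import get_sim_list

# Positional arguments appear to mirror the CLI: the three component weights,
# the model, then the threshold (--threshold) and result count (--num).
results = get_sim_list(
    "sample_codes",  # ZIP archive or directory
    0.7,
    0.2,
    0.1,
    "Salesforce/codet5p-110m-embedding",
    0.5,
    50,
    preprocess_mode="basic",
    chunking_method="tokens",
    chunk_size=120,
    vector_backend="multivector",
    code_metric="crystalbleu",
    code_metric_weight=0.15,
)
print(results)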
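The threshold and result-count arguments suggest a filter-then-rank step over all file pairs. The following is a rough sketch of that idea on made-up scores. The function `rank_pairs`, the dictionary input, and the inclusive `>=` comparison are assumptions for illustration; the actual return type and cut-off rule of `get_sim_list` may differ.

```python
def rank_pairs(scores: dict[tuple[str, str], float],
               threshold: float = 0.5,
               number_results: int = 50) -> list[tuple[str, str, float]]:
    """Keep pairs scoring at or above threshold, highest first, capped at number_results."""
    kept = [(a, b, score) for (a, b), score in scores.items() if score >= threshold]
    kept.sort(key=lambda row: row[2], reverse=True)
    return kept[:number_results]


# Hypothetical pairwise scores for three submissions.
example_scores = {
    ("a.java", "b.java"): 0.91,
    ("a.java", "c.java"): 0.42,
    ("b.java", "c.java"): 0.67,
}
for file_a, file_b, score in rank_pairs(example_scores, threshold=0.5, number_results=2):
    print(file_a, file_b, score)
```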
Comparison suite:
from matheel.comparison_suite import run_comparison_suite

summary, results_by_run = run_comparison_suite(
    "sample_codes",
    [
        {"run_name": "baseline", "model_name": "Salesforce/codet5p-110m-embedding"},
        {
            "run_name": "static_codebleu",
            "model_name": "Salesforce/codet5p-110m-embedding",
            "vector_backend": "static_hash",
            "static_vector_dim": 512,
            "code_metric": "codebleu",
            "code_metric_weight": 0.2,
        },
    ],
    summary_out="results/summary.csv",
    details_dir="results/runs",
)
print(summary)
Gradio App
The gradio_app/ folder contains the Gradio interface.
- CLI and Python API can read either a ZIP archive or a directory.
- Gradio keeps the upload flow ZIP-only.
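Accepting either a ZIP archive or a directory usually comes down to one loader that returns the same mapping for both. The sketch below uses only the standard library. `read_sources` is a hypothetical helper, not a Matheel function, and decoding as UTF-8 with replacement characters is an assumption made to keep the example short.

```python
import zipfile
from pathlib import Path


def read_sources(path: str) -> dict[str, str]:
    """Return {relative file name: text} for a directory or a ZIP archive."""
    source = Path(path)
    files: dict[str, str] = {}
    if source.is_dir():
        for item in sorted(source.rglob("*")):
            if item.is_file():
                name = item.relative_to(source).as_posix()
                files[name] = item.read_text(encoding="utf-8", errors="replace")
    elif zipfile.is_zipfile(source):
        with zipfile.ZipFile(source) as archive:
            for name in sorted(archive.namelist()):
                if not name.endswith("/"):  # skip directory entries
                    files[name] = archive.read(name).decode("utf-8", errors="replace")
    else:
        raise ValueError(f"{path!r} is neither a directory nor a ZIP archive")
    return files
```

With a loader of this shape, the rest of the pipeline does not need to know which input form was supplied.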
Notes
- Chunking is language-agnostic because it only splits text. Preprocessing is mostly generic, but its comment and directive handling is most reliable for Java, Python, C, and C++ style syntax.
- For publication claims, the code-aware metrics should be treated as officially scoped to Java, Python, C, and C++. CodeBLEU-style weighting is most defensible in that limited language set.
- static_hash is a lightweight dependency-free semantic backend for fast baselines.
- multivector reuses the selected embedding model over chunks and scores them with late-interaction MaxSim.
- CodeBLEU works with a local fallback implementation by default and uses the optional codebleu package automatically when installed.
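Late-interaction MaxSim, mentioned in the notes above, can be sketched in a few lines: each chunk vector from one file is matched with its most similar chunk vector from the other file, and the best matches are aggregated. The version below averages the maxima so the score stays in the cosine range. That normalization, and the names `cosine` and `maxsim`, are assumptions for this illustration and may not match how Matheel aggregates.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def maxsim(query_vectors: list[list[float]], doc_vectors: list[list[float]]) -> float:
    """Average, over query vectors, of the best cosine match among doc vectors."""
    if not query_vectors or not doc_vectors:
        return 0.0
    best_matches = [max(cosine(q, d) for d in doc_vectors) for q in query_vectors]
    return sum(best_matches) / len(best_matches)


# Toy 2-D "chunk embeddings": the first file has two chunks, the second has one.
file_a = [[1.0, 0.0], [0.0, 1.0]]
file_b = [[1.0, 0.0]]
print(maxsim(file_a, file_b), maxsim(file_b, file_a))
```

The score is asymmetric, as the two printed values show. Averaging both directions is one simple way to obtain a symmetric similarity.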
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
Acknowledgement