Matheel: A CLI and Python package for source-code similarity detection.

Matheel

Matheel is a function-based Python package and CLI for source-code similarity. It combines semantic embeddings, lexical similarity, chunking, preprocessing, and code-aware metrics in one workflow.

Demos

Installation

Use Python 3.10 to 3.12.

Base install:

pip install matheel

Optional extras:

pip install "matheel[chunking]"
pip install "matheel[chunking_code]"
pip install "matheel[metrics]"
pip install "matheel[model2vec]"
pip install "matheel[pylate]"
pip install "matheel[gradio]"
pip install "matheel[all]"
pip install "matheel[dev]"

matheel[all] installs the currently supported optional backends in one command: Chonkie code chunking, metrics runtime dependencies (RUBY graph/tree, TSED, CodeBERTScore), model2vec, PyLate, and the Gradio app dependencies.
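To see which optional backends are present in the current environment, a small standard-library check can help. This is a minimal sketch, not part of Matheel: the helper name `available_backends` is hypothetical, and the import names are assumed to match the PyPI distribution names.

```python
from importlib.util import find_spec

# Assumption: each optional backend exposes a top-level module whose name
# matches its PyPI distribution name. Adjust the tuple if that is not the case.
OPTIONAL_MODULES = ("chonkie", "model2vec", "pylate", "gradio")


def available_backends(modules=OPTIONAL_MODULES):
    """Return {module_name: bool} without actually importing the modules."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, present in available_backends().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

A missing entry points to the extra that still needs installing, for example `pip install "matheel[model2vec]"`.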

Quick Start

The repository root includes sample_pairs.zip, a small archive of Java files containing:

  • code_1.java
  • code_3_plag.java
  • additional plagiarized and non-plagiarized comparisons

CLI archive comparison:

matheel compare sample_pairs.zip \
  --model huggingface/CodeBERTa-small-v1 \
  --feature-weight semantic=0.7 \
  --feature-weight levenshtein=0.3 \
  --threshold 0.2 \
  --num 10

Python pairwise scoring with the sample pair:

from zipfile import ZipFile

from matheel.similarity import calculate_similarity

with ZipFile("sample_pairs.zip") as archive:
    code_a = archive.read("code_1.java").decode("utf-8")
    code_b = archive.read("code_3_plag.java").decode("utf-8")

score = calculate_similarity(
    code_a,
    code_b,
    model_name="huggingface/CodeBERTa-small-v1",
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.3,
    },
)

print(round(score, 4))
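The feature_weights argument assigns a weight to each similarity feature. As a rough mental model, and assuming the per-feature scores are blended as a normalised weighted average (this README does not state the exact formula), the combination could look like the sketch below. The function `combine_scores` and the example scores are hypothetical and only illustrate the arithmetic.

```python
def combine_scores(feature_scores, feature_weights):
    """Weighted average of per-feature scores, normalised by the total weight.

    Assumption: every score is already on a comparable 0..1 scale.
    """
    total_weight = sum(feature_weights.values())
    if total_weight <= 0:
        raise ValueError("feature weights must sum to a positive value")
    weighted_sum = sum(
        weight * feature_scores[name] for name, weight in feature_weights.items()
    )
    return weighted_sum / total_weight


# Illustrative per-feature scores, not real Matheel output.
scores = {"semantic": 0.9, "levenshtein": 0.5}
weights = {"semantic": 0.7, "levenshtein": 0.3}

print(round(combine_scores(scores, weights), 4))  # 0.7 * 0.9 + 0.3 * 0.5 = 0.78
```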

Supported Scope

Supported languages:

  • Chunking and preprocessing are text-first and can run on any source text.
  • Code-aware metrics are currently scoped to Java, Python, C, and C++.

Similarity features:

  • semantic
  • levenshtein
  • jaro_winkler
  • winnowing
  • gst
  • code_metric
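To give a feel for what a lexical feature such as levenshtein measures, here is a minimal pure-Python sketch of edit distance turned into a 0..1 similarity. It is an illustration only: the function names are hypothetical, and normalising by the longer string's length is an assumption, not necessarily how Matheel scales the score.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map edit distance to 0..1 by dividing by the longer length (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
```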

Code metrics:

  • codebleu
  • codebleu_ngram
  • codebleu_weighted_ngram
  • codebleu_syntax
  • codebleu_dataflow
  • crystalbleu
  • ruby
  • tsed
  • codebertscore
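The BLEU-style metrics in this list build on n-gram overlap between two token sequences. The sketch below shows clipped n-gram precision on whitespace tokens as a simplified illustration of that idea. It is not the CodeBLEU or CrystalBLEU implementation, and the function names are hypothetical; real implementations use language-aware tokenisation and add further components.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n=2):
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is counted at most as often as it appears in the
    reference (the "clipping" used by BLEU-style metrics).
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Renamed variables keep unigram overlap but break every bigram.
print(clipped_ngram_precision("return a + b", "return x + y", n=1))  # 0.5
print(clipped_ngram_precision("return a + b", "return x + y", n=2))  # 0.0
```

The second result illustrates why purely lexical n-gram overlap is sensitive to identifier renaming, which is one motivation for combining it with syntax-aware or embedding-based features.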

Chunking methods:

  • none
  • code
  • chonkie_token
  • chonkie_sentence
  • chonkie_recursive
  • chonkie_fast
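Token-based chunking is typically configured with a chunk size and an overlap, as in the chunk_size and chunk_overlap arguments used elsewhere in this README. The following sketch shows the sliding-window idea on an already tokenised sequence. It is a simplified illustration with a hypothetical helper name; the Chonkie-backed methods rely on a real tokenizer and may treat boundaries differently.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


tokens = "public static int add ( int a , int b )".split()
for chunk in chunk_tokens(tokens, chunk_size=5, chunk_overlap=2):
    print(chunk)
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks to embed.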

Vector backends:

  • auto
  • sentence_transformers
  • model2vec
  • pylate

Single-vector similarity functions:

  • cosine
  • dot
  • euclidean
  • manhattan
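For reference, the four functions can be written out on plain Python lists as below. This is a minimal sketch with one stated assumption: euclidean and manhattan are distances, so the sketch negates them so that larger values always mean "more similar". Whether Matheel uses this convention or rescales the result is not stated in this README.

```python
import math


def dot(u, v):
    """Sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def euclidean(u, v):
    """Negated straight-line distance (assumed convention: higher is more similar)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Negated sum of absolute differences (same assumed convention)."""
    return -sum(abs(a - b) for a, b in zip(u, v))


u, v = [1.0, 0.0], [0.0, 1.0]
for fn in (cosine, dot, euclidean, manhattan):
    print(fn.__name__, fn(u, v))
```

Cosine ignores vector length, while dot, euclidean, and manhattan are sensitive to it, so the choice matters most when the embeddings are not normalised.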

Sentence Transformers pooling methods:

  • mean
  • max
  • cls
  • lasttoken
  • mean_sqrt_len_tokens
  • weightedmean
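Pooling turns a sequence of per-token vectors into a single vector. The sketch below implements the six modes on plain Python lists as they are commonly defined. It is an illustration with a hypothetical `pool` helper, and it assumes no padding tokens; a real backend would also apply the attention mask.

```python
import math


def pool(token_vectors, method="mean"):
    """Reduce a list of equal-length token vectors to one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension

    if method == "mean":
        return [sum(col) / n for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "cls":
        return list(token_vectors[0])   # first token stands in for the sequence
    if method == "lasttoken":
        return list(token_vectors[-1])  # last token stands in for the sequence
    if method == "mean_sqrt_len_tokens":
        return [sum(col) / math.sqrt(n) for col in columns]
    if method == "weightedmean":
        weights = range(1, n + 1)       # later tokens receive larger weights
        total = sum(weights)
        return [sum(w * x for w, x in zip(weights, col)) / total for col in columns]
    raise ValueError(f"unknown pooling method: {method}")


vectors = [[1.0, 4.0], [3.0, 2.0]]
print(pool(vectors, "mean"))  # [2.0, 3.0]
print(pool(vectors, "max"))   # [3.0, 4.0]
```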

The auto backend inspects the model's Hugging Face metadata and routes to the matching backend when the model declares a known library.

CLI

Compare a directory or ZIP archive:

matheel compare codes/ \
  --model huggingface/CodeBERTa-small-v1 \
  --vector-backend auto \
  --max-token-length 256 \
  --feature-weight semantic=0.6 \
  --feature-weight levenshtein=0.2 \
  --feature-weight jaro_winkler=0.1 \
  --feature-weight code_metric=0.1 \
  --similarity-function dot \
  --pooling-method max \
  --preprocess-mode basic \
  --chunking-method code \
  --chunk-language python \
  --chunker-option include_line_numbers=true \
  --code-metric codebleu \
  --code-language python \
  --threshold 0.5 \
  --num 25

Run multiple configurations:

matheel compare-suite codes/ runs.json \
  --summary-out results/summary.csv \
  --details-dir results/runs

Python API

Pairwise scoring:

from matheel.similarity import calculate_similarity

code_a = "def add(a, b):\n    return a + b\n"
code_b = "def add(x, y):\n    return x + y\n"

score = calculate_similarity(
    code_a,
    code_b,
    model_name="huggingface/CodeBERTa-small-v1",
    vector_backend="auto",
    max_token_length=256,
    similarity_function="dot",
    pooling_method="max",
    preprocess_mode="basic",
    chunking_method="code",
    chunk_language="python",
    code_metric="codebleu",
    code_language="python",
    feature_weights={"semantic": 0.5, "code_metric": 0.5},
)

Directory or ZIP ranking:

from matheel.similarity import get_sim_list

results = get_sim_list(
    "sample_pairs.zip",
    model_name="huggingface/CodeBERTa-small-v1",
    threshold=0.4,
    number_results=10,
    vector_backend="auto",
    max_token_length=256,
    chunking_method="chonkie_token",
    chunk_size=120,
    chunk_overlap=20,
    similarity_function="cosine",
    pooling_method="mean",
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.15,
        "jaro_winkler": 0.05,
        "winnowing": 0.05,
        "gst": 0.05,
    },
)

print(results.head())

Docs

Examples

All examples use the same sample pair from sample_pairs.zip: code_1.java and code_3_plag.java.

Gradio

The Gradio demo lives in gradio_app/ and mirrors the Hugging Face Space setup. The UI supports embeddings, lexical metrics, the baseline algorithms (Winnowing and GST), and the code-aware metrics (CodeBLEU, CrystalBLEU, RUBY, TSED, CodeBERTScore), with advanced fields specific to each metric. The core package and CLI accept either a ZIP archive or a directory; the Gradio upload flow accepts ZIP archives only.

Acknowledgments

Matheel builds on several open-source libraries:

The project also depends on the standard scientific Python stack and related tooling, including NumPy, pandas, Click, SentencePiece, and func-timeout.

License

This project is licensed under the MIT License.
