Matheel: A CLI and Python package for source-code similarity detection.
Matheel
Matheel is a Python package and CLI for source-code similarity. It combines semantic embeddings, lexical similarity, chunking, preprocessing, and code evaluation metrics in one workflow.
Demos
- Hugging Face Space demo: buelfhood/matheel-framework
- Gradio Colab notebook: Open in Colab
- Examples Colab notebook: Open in Colab
Installation
Matheel supports Python 3.10 through 3.12. Installation can take some time.
Base install:
pip install matheel
Optional extras:
pip install "matheel[chunking]"
pip install "matheel[chunking_code]"
pip install "matheel[metrics]"
pip install "matheel[model2vec]"
pip install "matheel[pylate]"
pip install "matheel[gradio]"
pip install "matheel[all]"
pip install "matheel[dev]"
matheel[all] installs the currently supported optional backends in one command: Chonkie code chunking, metrics runtime dependencies (RUBY graph/tree, TSED, CodeBERTScore), model2vec, PyLate, and the Gradio app dependencies.
Matheel now ships a native CodeBLEU implementation that uses tree_sitter_language_pack for parser resolution, so real syntax/dataflow scoring no longer depends on installing the pip codebleu package. The pip package is still useful for validation/comparison work if you want to cross-check the native scores on selected examples; Matheel does not currently claim exact pip parity on every possible input.
Quick Start
The repository root includes sample_pairs.zip, a small Java archive with:
- code_1.java
- code_3_plag.java
- additional plagiarised and non-plagiarised comparisons
CLI archive comparison:
matheel compare sample_pairs.zip \
--model huggingface/CodeBERTa-small-v1 \
--feature-weight semantic=0.7 \
--feature-weight levenshtein=0.3 \
--threshold 0.2 \
--num 10
Python pairwise scoring with the sample pair:
from zipfile import ZipFile
from matheel.similarity import calculate_similarity
with ZipFile("sample_pairs.zip") as archive:
    code_a = archive.read("code_1.java").decode("utf-8")
    code_b = archive.read("code_3_plag.java").decode("utf-8")

score = calculate_similarity(
    code_a,
    code_b,
    model_name="huggingface/CodeBERTa-small-v1",
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.3,
    },
)
print(round(score, 4))
Supported Scope
Supported languages:
- Chunking remains text-first and can run on any source text.
- Preprocessing heuristics and code-aware metrics are now regression-tested for a unified 20-language scope:
Java, Python, C, C++, Go, JavaScript, TypeScript, Kotlin, Scala, Swift, Solidity, Dart, PHP, Ruby, Rust, C#, Lua, Julia, R, and Objective-C (objc).
- Native CodeBLEU with real syntax/dataflow scoring covers that same 20-language scope.
Similarity features:
semantic, levenshtein, jaro_winkler, winnowing, gst, code_metric
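Conceptually, each enabled feature produces a score in [0, 1], and the final result is a weighted combination of those scores. A minimal sketch of the weighting logic (illustrative only — the placeholder scores and the normalisation step are assumptions, not Matheel's actual implementation):

```python
def combine_features(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-feature similarity scores in [0, 1]."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

# Example: a semantic score of 0.9 and a levenshtein score of 0.6,
# weighted 0.7 / 0.3 as in the Quick Start example.
combined = combine_features(
    {"semantic": 0.9, "levenshtein": 0.6},
    {"semantic": 0.7, "levenshtein": 0.3},
)
print(round(combined, 2))  # → 0.81
```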
Code metrics:
codebleu, codebleu_ngram, codebleu_weighted_ngram, codebleu_syntax, codebleu_dataflow, crystalbleu, ruby, tsed, codebertscore
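The n-gram component of CodeBLEU-style metrics is, at its core, clipped n-gram precision over token sequences. A rough illustration of that building block (a simplification for intuition, not Matheel's native codebleu code):

```python
from collections import Counter

def ngram_precision(candidate: list[str], reference: list[str], n: int = 2) -> float:
    """Clipped n-gram precision: the fraction of candidate n-grams found in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())

tokens_a = "def add ( a , b ) : return a + b".split()
tokens_b = "def add ( x , y ) : return x + y".split()
print(round(ngram_precision(tokens_a, tokens_b), 3))  # → 0.364
```

Only the shared structural bigrams (`def add`, `( :`, `: return`, etc.) survive the renaming, which is why identifier changes alone do not drive this component to zero.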
Chunking methods:
none, code, chonkie_token, chonkie_sentence, chonkie_recursive, chonkie_fast
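Token-based chunkers such as chonkie_token slide a fixed-size window over the token stream, with some overlap between neighbouring chunks so that context is not cut off at chunk boundaries. A toy sketch of the idea (chunk_tokens is a hypothetical helper; the real chunkers operate on tokenizer output, not plain strings):

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Fixed-size token windows with overlap, in the spirit of chonkie_token."""
    step = chunk_size - chunk_overlap
    return [
        tokens[i:i + chunk_size]
        for i in range(0, max(len(tokens) - chunk_overlap, 1), step)
    ]

tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
# → ['t0', 't1', 't2', 't3']
# → ['t3', 't4', 't5', 't6']
# → ['t6', 't7', 't8', 't9']
```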
Vector backends:
auto, sentence_transformers, model2vec, pylate
Single-vector similarity functions:
cosine, dot, euclidean, manhattan
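For reference, the four single-vector functions compute the following on a pair of embedding vectors (plain-Python sketches of the standard definitions; note that euclidean and manhattan are distances, and how Matheel maps them onto a similarity scale is not shown here):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot product over the norms."""
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def dot(a: list[float], b: list[float]) -> float:
    """Raw inner product; scale-sensitive, unlike cosine."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a: list[float], b: list[float]) -> float:
    """L2 distance (a dissimilarity, not a similarity)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a: list[float], b: list[float]) -> float:
    """L1 distance (a dissimilarity, not a similarity)."""
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = [1.0, 0.0], [1.0, 1.0]
print(round(cosine(a, b), 4))  # → 0.7071
```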
Sentence Transformers pooling methods:
mean, max, cls, lasttoken, mean_sqrt_len_tokens, weightedmean
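Each pooling method collapses a (num_tokens, dim) matrix of token embeddings into a single sentence vector. An illustrative NumPy sketch of the simpler ones, mirroring the usual Sentence Transformers pooling semantics (weightedmean is omitted here because it requires position weights; this is not Matheel's code):

```python
import numpy as np

def pool(token_embeddings: np.ndarray, method: str) -> np.ndarray:
    """Collapse (num_tokens, dim) token embeddings into one sentence vector."""
    if method == "mean":
        return token_embeddings.mean(axis=0)
    if method == "max":
        return token_embeddings.max(axis=0)
    if method == "cls":
        return token_embeddings[0]   # first token's embedding
    if method == "lasttoken":
        return token_embeddings[-1]  # last token's embedding
    if method == "mean_sqrt_len_tokens":
        return token_embeddings.sum(axis=0) / np.sqrt(len(token_embeddings))
    raise ValueError(f"unknown pooling method: {method}")

emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(pool(emb, "max"))  # → [1. 1.]
```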
auto inspects Hugging Face model metadata and routes to the correct backend when the model exposes a known library.
CLI
Compare a directory or ZIP archive:
matheel compare codes/ \
--model huggingface/CodeBERTa-small-v1 \
--vector-backend auto \
--max-token-length 256 \
--feature-weight semantic=0.6 \
--feature-weight levenshtein=0.2 \
--feature-weight jaro_winkler=0.1 \
--feature-weight code_metric=0.1 \
--similarity-function dot \
--pooling-method max \
--preprocess-mode basic \
--chunking-method code \
--chunk-language python \
--chunker-option include_line_numbers=true \
--code-metric codebleu \
--code-language python \
--threshold 0.5 \
--num 25
Run multiple configurations:
matheel compare-suite codes/ runs.json \
--summary-out results/summary.csv \
--details-dir results/runs
Python API
Pairwise scoring:
from matheel.similarity import calculate_similarity
code_a = "def add(a, b):\n return a + b\n"
code_b = "def add(x, y):\n return x + y\n"
score = calculate_similarity(
    code_a,
    code_b,
    model_name="huggingface/CodeBERTa-small-v1",
    vector_backend="auto",
    max_token_length=256,
    similarity_function="dot",
    pooling_method="max",
    preprocess_mode="basic",
    chunking_method="code",
    chunk_language="python",
    code_metric="codebleu",
    code_language="python",
    feature_weights={"semantic": 0.5, "code_metric": 0.5},
)
Directory or ZIP ranking:
from matheel.similarity import get_sim_list
results = get_sim_list(
    "sample_pairs.zip",
    model_name="huggingface/CodeBERTa-small-v1",
    threshold=0.4,
    number_results=10,
    vector_backend="auto",
    max_token_length=256,
    chunking_method="chonkie_token",
    chunk_size=120,
    chunk_overlap=20,
    similarity_function="cosine",
    pooling_method="mean",
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.15,
        "jaro_winkler": 0.05,
        "winnowing": 0.05,
        "gst": 0.05,
    },
)
print(results.head())
print(results.head())
Docs
- Docs folder landing page: docs/README.md
- Canonical docs index: docs/index.md
- Quick usage: docs/usage.md
- Preprocessing: docs/preprocessing.md
- Chunking: docs/chunking.md
- Vectors and routing: docs/vectors.md
- Lexical metrics and baselines: docs/lexical.md
- Code metrics: docs/code_metrics.md
- Comparison suite: docs/comparison_suite.md
- Release checklist: docs/release_checklist.md
Testing
Install the development dependencies before running the test suite:
pip install -e ".[dev]"
Default test runs are intended to be fast and offline-friendly:
pytest
Run the Ruff lint check before opening a pull request:
ruff check .
Real-model integration tests are opt-in because they may need optional backends, cached model weights, or network access:
pytest -m integration
Examples
- Colab walkthrough: examples/matheel_examples_colab.ipynb
- Quick archive check: examples/sample_pairs_demo.py
- Preprocessing: examples/preprocessing_demo.py
- Chunking: examples/chunking_demo.py
- Vector backends: examples/vectors_demo.py
- Lexical metrics and baselines: examples/lexical_demo.py
- Code metrics: examples/code_metrics_demo.py
- Comparison suite: examples/comparison_suite_demo.py
All examples use the same sample pair from sample_pairs.zip: code_1.java and code_3_plag.java.
Gradio
The Gradio demo stays in gradio_app/ and is aligned with the Hugging Face Space setup. The UI supports embeddings, lexical metrics, baseline algorithms (Winnowing and GST), and the code-aware metrics (CodeBLEU, CrystalBLEU, RUBY, TSED, CodeBERTScore), with metric-specific advanced fields. The core package and CLI can read either a ZIP archive or a directory; the Gradio upload flow remains ZIP-based.
Acknowledgments
Matheel builds on several open-source libraries:
- Sentence Transformers
- Chonkie
- model2vec
- PyLate
- RapidFuzz
- tree-sitter-language-pack
- NetworkX
- APTED
- bert-score
- Gradio
The project also depends on the standard scientific Python stack and related tooling, including NumPy, pandas, Click, SentencePiece, and func-timeout.
License
This project is licensed under the MIT License.
File details
Details for the file matheel-0.3.5.tar.gz.
File metadata
- Download URL: matheel-0.3.5.tar.gz
- Upload date:
- Size: 68.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e4e6f4d5f8c6a99b77a817aa60287aed69212e8f732245c295dcbc96a7e01ae |
| MD5 | 3e96bb9bcb14614aa8bae49bb5826abc |
| BLAKE2b-256 | 1edfcfc4c8013e09b8f7009771de75a189bd9afd514e6248b5b011e33a94452b |
Provenance
The following attestation bundles were made for matheel-0.3.5.tar.gz:
Publisher: publish.yml on FahadEbrahim/matheel

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: matheel-0.3.5.tar.gz
- Subject digest: 1e4e6f4d5f8c6a99b77a817aa60287aed69212e8f732245c295dcbc96a7e01ae
- Sigstore transparency entry: 1418951633
- Sigstore integration time:
- Permalink: FahadEbrahim/matheel@55ee9c4915b9b54f26de69ad5314fca45e9b582d
- Branch / Tag: refs/tags/v0.3.5
- Owner: https://github.com/FahadEbrahim
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@55ee9c4915b9b54f26de69ad5314fca45e9b582d
- Trigger Event: release
File details
Details for the file matheel-0.3.5-py3-none-any.whl.
File metadata
- Download URL: matheel-0.3.5-py3-none-any.whl
- Upload date:
- Size: 52.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2629824e6befbce59722de5fc9c8ddd78a64f22b89e4c12297ab503ddc3e4701 |
| MD5 | 0b418280910c8e193ae6fb477bbc24ea |
| BLAKE2b-256 | 4467e4973c34dc582e1c5fe6b35e9fe6cac7975aed38c6fab1b6ca731ea19d09 |
Provenance
The following attestation bundles were made for matheel-0.3.5-py3-none-any.whl:
Publisher: publish.yml on FahadEbrahim/matheel

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: matheel-0.3.5-py3-none-any.whl
- Subject digest: 2629824e6befbce59722de5fc9c8ddd78a64f22b89e4c12297ab503ddc3e4701
- Sigstore transparency entry: 1418951724
- Sigstore integration time:
- Permalink: FahadEbrahim/matheel@55ee9c4915b9b54f26de69ad5314fca45e9b582d
- Branch / Tag: refs/tags/v0.3.5
- Owner: https://github.com/FahadEbrahim
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@55ee9c4915b9b54f26de69ad5314fca45e9b582d
- Trigger Event: release