
Matheel: A CLI and Python package for source-code similarity detection.


Matheel

Matheel is a simple, function-based Python package and CLI for source-code similarity detection. It combines semantic embeddings, lexical similarity, chunking, preprocessing, and code-aware metrics without forcing a class-heavy API.

Installation

Matheel supports Python 3.10 through 3.12.

Base install:

pip install matheel

Optional extras:

pip install "matheel[chunking]"
pip install "matheel[chunking_code]"
pip install "matheel[metrics]"
pip install "matheel[model2vec]"
pip install "matheel[pylate]"
pip install "matheel[gradio]"
pip install "matheel[all]"
pip install "matheel[dev]"

matheel[all] installs the currently supported optional backends in one command: Chonkie code chunking, metrics runtime dependencies (RUBY graph/tree, TSED, CodeBERTScore), model2vec, PyLate, and the Gradio app dependencies.

Quick Start

The repo includes a small Java archive at sample_pairs.zip for quick validation.

CLI:

matheel compare sample_pairs.zip \
  --model huggingface/CodeBERTa-small-v1 \
  --feature-weight semantic=0.7 \
  --feature-weight levenshtein=0.3 \
  --threshold 0.2 \
  --num 10
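
The two --feature-weight flags above blend the semantic and Levenshtein scores into a single ranking score. A minimal sketch of a weight-normalized blend (the exact combination rule Matheel uses internally may differ):

```python
def combine_features(scores: dict[str, float], weights: dict[str, float]) -> float:
    # Weight-normalized sum: the weights need not add up to 1.0.
    total = sum(weights.values())
    return sum(w * scores[name] for name, w in weights.items()) / total
```

Under this rule, with semantic=0.7 and levenshtein=0.3, a pair scoring 0.9 semantically and 0.5 lexically would land at 0.78.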

Python:

from matheel.similarity import get_sim_list

results = get_sim_list(
    "sample_pairs.zip",
    model_name="huggingface/CodeBERTa-small-v1",
    threshold=0.2,
    number_results=10,
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.3,
    },
)
print(results.head())

Supported Languages

  • Chunking is language-agnostic by default: text-based chunkers can split any input.
  • CodeBLEU-style metrics are intentionally scoped to Java, Python, C, and C++.
  • Generic preprocessing works across languages, but the code-aware metrics are most reliable within that four-language scope.

Supported Methods

Similarity features:

  • semantic
  • levenshtein
  • jaro_winkler
  • winnowing
  • gst
  • code_metric

Code metrics:

  • codebleu
  • codebleu_ngram
  • codebleu_weighted_ngram
  • codebleu_syntax
  • codebleu_dataflow
  • crystalbleu
  • ruby
  • tsed
  • codebertscore

ruby now uses a full staged implementation (graph -> tree -> string), with optional runtime dependencies enabled via matheel[metrics].
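
The staging can be pictured as trying each representation in order and falling through when a stage's optional dependencies are missing. A hypothetical sketch (the helper names and exception are illustrative, not Matheel's API):

```python
import difflib

class StageUnavailable(Exception):
    """Raised when a stage's optional dependency is missing (illustrative)."""

def graph_similarity(a: str, b: str) -> float:
    # Would compare program dependence graphs; needs optional deps.
    raise StageUnavailable

def tree_similarity(a: str, b: str) -> float:
    # Would compare parse trees; needs optional deps.
    raise StageUnavailable

def string_similarity(a: str, b: str) -> float:
    # Final fallback: plain string similarity.
    return difflib.SequenceMatcher(None, a, b).ratio()

def ruby_similarity(a: str, b: str) -> float:
    # Staged fallback: graph -> tree -> string.
    for stage in (graph_similarity, tree_similarity, string_similarity):
        try:
            return stage(a, b)
        except StageUnavailable:
            continue
    return 0.0
```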

Chunking methods:

  • none
  • Chonkie-backed when installed: code, chonkie_token, chonkie_sentence, chonkie_recursive, chonkie_fast

Vector backends:

  • auto
  • sentence_transformers
  • model2vec
  • pylate

Single-vector similarity functions:

  • cosine
  • dot
  • euclidean
  • manhattan

Sentence Transformers pooling methods:

  • mean
  • max
  • cls
  • lasttoken
  • mean_sqrt_len_tokens
  • weightedmean
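
For example, mean pooling averages the token embeddings that the attention mask marks as real tokens. A minimal sketch using plain lists (an illustration of the pooling idea, not Matheel's code):

```python
def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    # Keep only the embeddings at positions the mask marks as real tokens.
    kept = [emb for emb, m in zip(token_embeddings, attention_mask) if m]
    dims = len(kept[0])
    # Average each dimension across the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dims)]
```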

auto inspects Hugging Face model metadata and routes to the correct backend when the model exposes a known library.

Core Parts

  • Preprocessing: whitespace and comment normalization before any scoring.
  • Chunking: Chonkie-backed document splitting with per-method options.
  • Vectors: dense single-vector, learned static single-vector, and multivector late interaction.
  • Lexical metrics and baselines: normalized Levenshtein, Jaro-Winkler, Winnowing, and Greedy String Tiling.
  • Code metrics: built-in CodeBLEU-style metrics and CrystalBLEU.
  • Comparison suite: run multiple configurations, rank them, and optionally write summary/detail artifacts.
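
Of the lexical baselines, Winnowing is the least self-explanatory: it hashes overlapping k-grams and keeps each sliding window's minimum hash as a document fingerprint. A minimal sketch (k, the window size, and the hash choice are assumptions; Matheel's implementation may differ):

```python
import hashlib

def _kgram_hashes(text: str, k: int) -> list[int]:
    # Hash every k-gram of the (already normalized) text.
    return [
        int(hashlib.md5(text[i:i + k].encode()).hexdigest(), 16) & 0xFFFFFFFF
        for i in range(len(text) - k + 1)
    ]

def winnow(text: str, k: int = 5, window: int = 4) -> set[int]:
    hashes = _kgram_hashes(text, k)
    # Keep the minimum hash of each sliding window as a fingerprint.
    return {
        min(hashes[i:i + window])
        for i in range(max(len(hashes) - window + 1, 0))
    }

def winnowing_similarity(a: str, b: str) -> float:
    # Jaccard overlap of the two fingerprint sets.
    fa, fb = winnow(a), winnow(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```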

CLI

Compare a directory or ZIP archive:

matheel compare codes/ \
  --model huggingface/CodeBERTa-small-v1 \
  --vector-backend auto \
  --max-token-length 256 \
  --feature-weight semantic=0.6 \
  --feature-weight levenshtein=0.2 \
  --feature-weight jaro_winkler=0.1 \
  --feature-weight code_metric=0.1 \
  --similarity-function dot \
  --pooling-method max \
  --preprocess-mode basic \
  --chunking-method code \
  --chunk-language python \
  --chunker-option include_line_numbers=true \
  --code-metric codebleu \
  --code-language python \
  --threshold 0.5 \
  --num 25

Run multiple configurations:

matheel compare-suite codes/ runs.json \
  --summary-out results/summary.csv \
  --details-dir results/runs
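
The runs.json schema is not spelled out here; a plausible shape, assuming each entry names a run and mirrors the compare options (all keys below are illustrative):

```json
[
  {
    "name": "semantic-heavy",
    "model_name": "huggingface/CodeBERTa-small-v1",
    "feature_weights": {"semantic": 0.7, "levenshtein": 0.3},
    "threshold": 0.2
  },
  {
    "name": "lexical-only",
    "feature_weights": {"levenshtein": 0.5, "jaro_winkler": 0.5},
    "threshold": 0.4
  }
]
```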

Python API

Pairwise scoring:

from matheel.similarity import calculate_similarity

score = calculate_similarity(
    "def add(a, b):\n    return a + b\n",
    "def add(x, y):\n    return x + y\n",
    model_name="huggingface/CodeBERTa-small-v1",
    vector_backend="auto",
    max_token_length=256,
    similarity_function="dot",
    pooling_method="max",
    preprocess_mode="basic",
    chunking_method="code",
    chunk_language="python",
    code_metric="codebleu",
    code_language="python",
    feature_weights={"semantic": 0.5, "code_metric": 0.5},
)

Directory or ZIP ranking:

from matheel.similarity import get_sim_list

results = get_sim_list(
    "sample_codes",
    model_name="huggingface/CodeBERTa-small-v1",
    threshold=0.4,
    number_results=50,
    vector_backend="auto",
    max_token_length=256,
    chunking_method="chonkie_token",
    chunk_size=120,
    chunk_overlap=20,
    similarity_function="cosine",
    pooling_method="mean",
    feature_weights={
        "semantic": 0.7,
        "levenshtein": 0.15,
        "jaro_winkler": 0.05,
        "winnowing": 0.05,
        "gst": 0.05,
    },
)

Hugging Face Routing

Matheel can inspect Hugging Face model metadata and route automatically:

  • sentence-transformers models go to the Sentence Transformers path
  • model2vec models go to the model2vec static path
  • PyLate models go to the multivector late-interaction path

If metadata is unavailable, Matheel falls back to simple name and tag heuristics, then defaults to the Sentence Transformers path.

Docs

The docs/ folder is already structured for a GitHub Pages setup if you later decide to publish the docs site from the repository.

Examples

Gradio

The Gradio demo stays in gradio_app/ and is aligned with the Hugging Face Space setup. The UI supports embeddings, lexical metrics, baseline algorithms (Winnowing and GST), and the code-aware metrics (CodeBLEU, CrystalBLEU, RUBY, TSED, CodeBERTScore), with metric-specific advanced fields. The core package and CLI can read either a ZIP archive or a directory; the Gradio upload flow remains ZIP-based.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.



