Matheel: A CLI and Python package for source-code similarity detection.

Matheel

Matheel is a simple, function-based Python package and CLI for source-code similarity. It combines semantic embeddings, lexical similarity, chunking, preprocessing, and code-aware metrics without forcing a class-heavy API.

Installation

Matheel requires Python 3.10 through 3.12.

Base install:

pip install matheel

Optional extras:

pip install "matheel[metrics]"
pip install "matheel[chunking]"
pip install "matheel[model2vec]"
pip install "matheel[pylate]"
pip install "matheel[dev]"

Supported Languages

  • Chunking is language-agnostic by default because it can split any text.
  • CodeBLEU-style metrics are intentionally scoped to Java, Python, C, and C++.
  • Generic preprocessing works across languages, but the code-aware metrics are most reliable within those four languages.

Supported Methods

Similarity features:

  • semantic
  • levenshtein
  • jaro_winkler
  • code_metric
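
As a rough illustration of how per-feature scores like these can be blended, the sketch below computes a normalized Levenshtein similarity and a weighted average of feature scores. It is not Matheel's implementation: the function names `levenshtein_distance`, `levenshtein_similarity`, and `combine_scores` are hypothetical, and normalizing by the sum of the weights is an assumption.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def combine_scores(scores: dict, weights: dict) -> float:
    """Weighted average of feature scores (assumed normalization by weight sum)."""
    total = sum(weights.get(name, 0.0) for name in scores)
    if total <= 0:
        raise ValueError("at least one feature needs a positive weight")
    return sum(scores[name] * weights.get(name, 0.0) for name in scores) / total
```

Dividing by the weight sum keeps the result in [0, 1] even when the weights do not add up to 1.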

Code metrics:

  • codebleu
  • codebleu_ngram
  • codebleu_weighted_ngram
  • codebleu_syntax
  • codebleu_dataflow
  • crystalbleu
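
The n-gram component of BLEU-style metrics is built on clipped n-gram precision. The sketch below shows that idea on pre-tokenized input. It is an illustration only, not the code behind `codebleu_ngram`: the helper names are hypothetical, and whitespace tokenization in the usage is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" step from BLEU).
    """
    candidate_counts = ngrams(candidate, n)
    reference_counts = ngrams(reference, n)
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    matched = sum(
        min(count, reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return matched / total


candidate = "return a + b".split()
reference = "return x + y".split()
unigram_precision = clipped_ngram_precision(candidate, reference, 1)
```

Renamed identifiers lower this score quickly, which is one reason syntax and data-flow components are combined with it in CodeBLEU-style metrics.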

Chunking methods:

  • Built-in: none, lines, tokens, characters
  • Chonkie-backed when installed: code, codechunker, chonkie_code, chonkie_token, chonkie_word, chonkie_sentence, chonkie_recursive
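
To make the idea of size-and-overlap chunking concrete, here is a minimal sketch of a line-based chunker. The function `chunk_lines` is hypothetical and only approximates what a built-in `lines` method might do; Matheel's actual behavior and parameter handling may differ.

```python
def chunk_lines(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into chunks of `chunk_size` lines, sharing `chunk_overlap` lines."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    lines = text.splitlines()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + chunk_size]))
        if start + chunk_size >= len(lines):
            break  # the window already reached the end of the text
    return chunks
```

With `chunk_size=3` and `chunk_overlap=1`, a five-line input yields two chunks that share their boundary line.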

Vector backends:

  • auto
  • sentence_transformers
  • model2vec
  • pylate
  • static_hash

auto inspects Hugging Face model metadata and routes to the correct backend when the model exposes a known library.

CLI

Compare a directory or ZIP archive:

matheel compare codes/ \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --vector-backend auto \
  --feature-weight semantic=0.6 \
  --feature-weight levenshtein=0.2 \
  --feature-weight jaro_winkler=0.1 \
  --feature-weight code_metric=0.1 \
  --preprocess-mode basic \
  --chunking-method code \
  --chunk-language python \
  --chunker-option include_line_numbers=true \
  --code-metric codebleu \
  --code-language python \
  --threshold 0.5 \
  --num 25

Run multiple configurations:

matheel compare-suite codes/ runs.json \
  --summary-out results/summary.csv \
  --details-dir results/runs

Python API

Pairwise scoring:

from matheel.similarity import calculate_similarity

# Positional arguments: the two code strings, three numeric weights
# (presumably semantic, Levenshtein, and Jaro-Winkler), then the model name.
score = calculate_similarity(
    "def add(a, b):\n    return a + b\n",
    "def add(x, y):\n    return x + y\n",
    0.7,
    0.2,
    0.1,
    "sentence-transformers/all-MiniLM-L6-v2",
    vector_backend="auto",
    preprocess_mode="basic",
    chunking_method="code",
    chunk_language="python",
    code_metric="codebleu",
    code_language="python",
    # Named per-feature weights, passed here as a dict.
    feature_weights={"semantic": 0.5, "code_metric": 0.5},
)

Directory or ZIP ranking:

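from matheel.similarity import get_sim_list

# Positional arguments: the input path, three numeric weights, and the model
# name, followed by 0.4 and 50, which appear to correspond to the CLI's
# --threshold and --num options.
results = get_sim_list(
    "sample_codes",
    0.7,
    0.2,
    0.1,
    "sentence-transformers/all-MiniLM-L6-v2",
    0.4,
    50,
    vector_backend="auto",
    chunking_method="chonkie_token",
    chunk_size=120,
    chunk_overlap=20,
    # Named per-feature weights, passed here as a comma-separated string.
    feature_weights="semantic=0.7,levenshtein=0.15,jaro_winkler=0.15",
)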
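
The two examples above pass `feature_weights` once as a dict and once as a comma-separated `name=value` string. The sketch below shows one way such input could be normalized into a single dict. The function `parse_feature_weights` is hypothetical and is not part of Matheel's documented API.

```python
def parse_feature_weights(spec):
    """Turn a dict or a "name=value,name=value" string into {name: float}."""
    if isinstance(spec, dict):
        return {str(name): float(value) for name, value in spec.items()}

    weights = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and stray whitespace
        name, separator, value = item.partition("=")
        if not separator:
            raise ValueError(f"expected name=value, got {item!r}")
        weights[name.strip()] = float(value)
    return weights


weights = parse_feature_weights("semantic=0.7,levenshtein=0.15,jaro_winkler=0.15")
```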

Hugging Face Routing

Matheel can inspect Hugging Face model metadata and route automatically:

  • sentence-transformers models go to the Sentence Transformers path
  • model2vec models go to the model2vec static path
  • PyLate models go to the multivector late-interaction path

If metadata is unavailable, Matheel falls back to simple name and tag heuristics, then defaults to the Sentence Transformers path.
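
The routing order described above can be sketched as a small pure function. This is an illustration under assumptions, not Matheel's code: `route_backend` is a hypothetical name, and the exact library and tag strings it matches are guesses. In practice the `library_name` and `tags` inputs could come from `huggingface_hub.model_info(model_id)`, which needs network access and is therefore left out of the sketch.

```python
# Assumed mapping from Hugging Face library names to Matheel backend names.
KNOWN_LIBRARIES = {
    "sentence-transformers": "sentence_transformers",
    "model2vec": "model2vec",
    "pylate": "pylate",
}


def route_backend(library_name=None, tags=(), model_id=""):
    """Pick a backend from metadata, then tags, then the model name."""
    # 1. Trust explicit library metadata when it names a known library.
    if library_name and library_name.lower() in KNOWN_LIBRARIES:
        return KNOWN_LIBRARIES[library_name.lower()]

    # 2. Fall back to tags attached to the model.
    lowered_tags = {tag.lower() for tag in tags}
    for library, backend in KNOWN_LIBRARIES.items():
        if library in lowered_tags:
            return backend

    # 3. Fall back to a simple name heuristic.
    lowered_id = model_id.lower()
    for library in ("model2vec", "pylate"):
        if library in lowered_id:
            return KNOWN_LIBRARIES[library]

    # 4. Default to the Sentence Transformers path.
    return "sentence_transformers"
```

Keeping the decision separate from the metadata lookup makes the fallback order easy to test offline.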

Docs and Examples

Gradio

The Gradio demo lives in gradio_app/. The core package and CLI accept either a ZIP archive or a directory, while the Gradio upload flow accepts ZIP archives only.
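
Accepting either a directory or a ZIP archive can be handled with the standard library alone. The sketch below is a hypothetical helper, `load_sources`, that returns a mapping from relative file names to file contents. It shows the general approach and does not reflect how Matheel reads its input.

```python
import zipfile
from pathlib import Path


def load_sources(path):
    """Read every file under a directory or inside a ZIP archive as text."""
    path = Path(path)
    sources = {}

    if path.is_dir():
        for file in sorted(path.rglob("*")):
            if file.is_file():
                name = file.relative_to(path).as_posix()
                sources[name] = file.read_text(encoding="utf-8", errors="replace")
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    data = archive.read(info)
                    sources[info.filename] = data.decode("utf-8", errors="replace")
    else:
        raise ValueError(f"{path} is neither a directory nor a ZIP archive")

    return sources
```

Decoding with `errors="replace"` keeps a single badly encoded submission from aborting the whole run.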

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
