
mldebug


A lightweight Python package for comparing datasets and detecting unexpected changes in machine learning systems.

Why mldebug

Machine learning systems often degrade silently when input data changes, even when models and code remain unchanged.

Typical data-level causes include:

  • feature distribution drift
  • increasing missing values
  • unseen categorical values
  • mismatch between training and production data

mldebug makes these issues visible early by comparing datasets in a lightweight, schema-driven way and detecting unexpected changes before they impact model performance.

When To Use mldebug

Use mldebug for fast validation of ML datasets, especially in CI or pre-deployment checks.

It is a good fit for:

  • CI/CD validation pipelines
  • pre-deployment data checks
  • schema-based comparison between training and production data
  • lightweight integration into existing ML workflows

Not intended for:

  • full ML observability platforms
  • real-time production monitoring
  • long-term dashboards or alerting infrastructure

What It Does

mldebug compares:

  • a reference dataset (e.g. training data)
  • a current dataset (e.g. production data)

It runs a suite of checks and returns a structured report of detected issues.
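To illustrate the kind of check involved, a variance-drift test can be sketched in plain Python. This is a simplified sketch, not mldebug's implementation; it uses the "age" values from the Quick Start below:

```python
import statistics

def variance_drift(reference, current, threshold=2.0):
    # Flag drift when the current/reference variance ratio
    # falls outside [1/threshold, threshold].
    ratio = statistics.pvariance(current) / statistics.pvariance(reference)
    return (ratio > threshold or ratio < 1 / threshold), ratio

# "age" jumps from [20, 21, 22] to [30, 35, 40]: variance grows sharply.
drifted, ratio = variance_drift([20, 21, 22], [30, 35, 40])
print(drifted, round(ratio, 4))  # True 25.0
```

The ratio of 25.0 matches the variance_drift issue shown in the example output further down.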

Installation

pip install mldebug

Quick Start

Example Usage

from mldebug import run_checks, FeatureType
import numpy as np

reference = {
    "age": np.array([20, 21, 22]),
    "income": np.array([1000, 1200, 1100]),
    "country": np.array(["ES", "ES", "FR"]),
}

current = {
    "age": np.array([30, 35, 40]),
    "income": np.array([900, 800, 850]),
    "country": np.array(["ES", "DE", "DE"]),
}

schema = {
    "age": FeatureType.NUMERIC,
    "income": FeatureType.NUMERIC,
    "country": FeatureType.CATEGORICAL,
}

report = run_checks(reference=reference, current=current, schema=schema)

Output Inspection

Inspect Results

for issue in report.issues:
    print(issue)
[WARNING] variance_drift - age: variance drift detected (ratio=25.0000, threshold=2.0)
[WARNING] range_anomaly - age: 3 values outside [20.0000, 22.0000]
[WARNING] variance_drift - income: variance drift detected (ratio=0.2500, threshold=2.0)
[WARNING] range_anomaly - income: 3 values outside [1000.0000, 1200.0000]
[WARNING] psi_drift - country: PSI drift detected (18.0152)
[WARNING] unseen_categories - country: 1 unseen categories detected (e.g. ['DE'])
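The psi_drift check above is based on the Population Stability Index (PSI). A rough pure-Python sketch of the metric for categorical features follows; this is an illustration only, and mldebug's implementation (including how it smooths zero-count categories) may differ, which is why the value it produces does not match the 18.0152 reported above:

```python
import math
from collections import Counter

def psi(reference, current, eps=1e-6):
    # PSI = sum over categories of (p_cur - p_ref) * ln(p_cur / p_ref).
    # eps guards against zero probabilities for unseen categories.
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for cat in set(reference) | set(current):
        p_ref = max(ref_counts[cat] / len(reference), eps)
        p_cur = max(cur_counts[cat] / len(current), eps)
        total += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return total

# Identical distributions score 0; the shifted "country" feature scores high.
print(psi(["ES", "ES", "FR"], ["ES", "ES", "FR"]))  # 0.0
print(psi(["ES", "ES", "FR"], ["ES", "DE", "DE"]))
```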

Summary

print(report.summary())
{
  "total": 6,
  "by_severity": {
    "info": 0,
    "warning": 6,
    "critical": 0
  },
  "status": "issues_detected"
}
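In a CI pipeline, the summary can be used to gate a build. A minimal sketch operating on the summary dictionary shown above (the gate_on_issues helper is hypothetical, not part of mldebug):

```python
def gate_on_issues(summary, fail_on=("warning", "critical")):
    # Return a nonzero exit code when issues at the given
    # severities were detected, so the CI job fails.
    counts = summary["by_severity"]
    return 1 if any(counts.get(sev, 0) for sev in fail_on) else 0

summary = {
    "total": 6,
    "by_severity": {"info": 0, "warning": 6, "critical": 0},
    "status": "issues_detected",
}
print(gate_on_issues(summary))  # 1
# In a real check: sys.exit(gate_on_issues(report.summary()))
```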

Structured Output

print(report.to_dict())
{
  "issues": [
    {
      "name": "variance_drift",
      "metric": "variance_ratio",
      "severity": "warning",
      "message": "age: variance drift detected (ratio=25.0000, threshold=2.0)",
      "feature": "age",
      "value": 25.000000000000004,
      "threshold": 2.0
    },
    {
      "name": "range_anomaly",
      "metric": "out_of_range_count",
      "severity": "warning",
      "message": "age: 3 values outside [20.0000, 22.0000]",
      "feature": "age",
      "value": 3.0,
      "threshold": 0.0
    },
    {
      "name": "variance_drift",
      "metric": "variance_ratio",
      "severity": "warning",
      "message": "income: variance drift detected (ratio=0.2500, threshold=2.0)",
      "feature": "income",
      "value": 0.25,
      "threshold": 2.0
    },
    {
      "name": "range_anomaly",
      "metric": "out_of_range_count",
      "severity": "warning",
      "message": "income: 3 values outside [1000.0000, 1200.0000]",
      "feature": "income",
      "value": 3.0,
      "threshold": 0.0
    },
    {
      "name": "psi_drift",
      "metric": "psi",
      "severity": "warning",
      "message": "country: PSI drift detected (18.0152)",
      "feature": "country",
      "value": 18.01521528247136,
      "threshold": 0.2
    },
    {
      "name": "unseen_categories",
      "metric": "unseen_category_count",
      "severity": "warning",
      "message": "country: 1 unseen categories detected (e.g. ['DE'])",
      "feature": "country",
      "value": 1.0,
      "threshold": 0.0
    }
  ]
}
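Since to_dict() returns plain Python structures, the report is easy to serialize or post-process with the standard library. For example, grouping detected issues by feature, using an abbreviated version of the dictionary above:

```python
import json
from collections import defaultdict

report_dict = {
    "issues": [
        {"name": "variance_drift", "feature": "age", "severity": "warning"},
        {"name": "range_anomaly", "feature": "age", "severity": "warning"},
        {"name": "psi_drift", "feature": "country", "severity": "warning"},
    ]
}

# Collect the names of all checks that fired, keyed by feature.
by_feature = defaultdict(list)
for issue in report_dict["issues"]:
    by_feature[issue["feature"]].append(issue["name"])

print(json.dumps(by_feature, indent=2))
```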

Documentation

See documentation pages.

Status

Active development (v0.x). APIs may evolve before v1.0.0.

See CHANGELOG.md for version history.

Development

Requirements

  • uv (used for dependency management and for running tasks)

Setup

git clone https://github.com/anpenta/mldebug
cd mldebug
uv sync

Workflow

All tasks are managed via poe.

Run Tests

uv run poe test

Run Linting

uv run poe lint

Check Linting

uv run poe lint-check

Dependency Management

Dependencies are managed using uv and defined in pyproject.toml.

For local development:

uv sync

This installs dependencies and updates the environment as needed.

For CI and reproducible environments:

uv sync --frozen

This ensures the environment exactly matches the lock file without modifying it.

CI

This project uses CI to ensure:

  • code quality (linting and type checking)
  • correctness across supported Python versions
  • test coverage thresholds
  • reproducible builds
  • automated publishing on release tags

Local development runs against the active Python environment only.

See CI workflow for details.

Contributing

We welcome contributions.

  1. Clone the repository
  2. Create a feature branch
  3. Make your changes
  4. Ensure all CI checks pass
  5. Open a pull request

Citation

If you use mldebug in your work, please cite this software.

Preferred citation format is available in CITATION.cff or via GitHub's “Cite this repository” button.

License

See LICENSE.
