Lightweight orchestration layer that turns pandas DataFrames into front-end-ready JSON schemas, engineered to pair seamlessly with [mlform](https://github.com/UlloaSP/mlform).

MLSchema

Overview

mlschema accelerates form and contract generation by automatically deriving JSON field definitions from tabular data. The library applies a strategy-driven pipeline on top of pandas, validating every payload with Pydantic before it reaches your UI tier or downstream services.

  • Converts analytics data into stable JSON schemas in a few lines of code.
  • Keeps inference logic server-side; no external services or background workers required.
  • Ships with production-tested strategies for text, numeric, categorical, boolean, temporal, and two-axis series data.
  • Designed for synchronous use alongside mlform, yet fully usable on its own.

Key Features

  • Strategy registry that lets you opt into only the field types you want to expose.
  • Pydantic v2 models guarantee structural validity and embed domain-specific constraints.
  • Normalized dtype matching covers both pandas extension types and NumPy dtypes.
  • Deterministic JSON output (inputs / outputs) suitable for form engines and low-code tooling.
  • Fully typed public API with strict static analysis (Pyright) and comprehensive tests.

Requirements

  • Python >= 3.14, < 3.15
  • pandas >= 2.3.3, < 3.0.0
  • pydantic >= 2.12.3, < 3.0.0

All transitive dependencies are resolved automatically by your package manager.

Installation

uv add mlschema

Alternative package managers:

  • pip install mlschema
  • poetry add mlschema
  • conda install -c conda-forge mlschema
  • pipenv install mlschema

Pin a version (for example mlschema==0.1.3) when you need deterministic environments.

Quick Start

import pandas as pd
from mlschema import MLSchema
from mlschema.strategies import TextStrategy, NumberStrategy, CategoryStrategy

df = pd.DataFrame(
  {
    "name": ["Ada", "Linus", "Grace"],
    "score": [98.5, 86.0, 91.0],
    "role": pd.Categorical(["engineer", "engineer", "scientist"]),
  }
)

builder = MLSchema()
builder.register(TextStrategy())      # fallback for unsupported dtypes
builder.register(NumberStrategy())
builder.register(CategoryStrategy())

schema = builder.build(df)

Schema Output

The payload is ready to serialise to JSON and inject into your UI or downstream service:

{
  "inputs": [
  {"title": "name", "required": true, "type": "text"},
  {"title": "score", "required": true, "type": "number", "step": 0.1},
  {"title": "role", "required": true, "type": "category", "options": ["engineer", "scientist"]}
  ],
  "outputs": []
}
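
As a minimal sketch of that serialisation step, the snippet below round-trips a payload of the shape shown above through the standard json module. The payload is written out by hand here; whether builder.build returns a dict or an already-encoded string is an assumption to check against the mlschema documentation.

```python
import json

# Hand-written stand-in for the payload shown above (not produced by mlschema here).
payload = {
    "inputs": [
        {"title": "name", "required": True, "type": "text"},
        {"title": "score", "required": True, "type": "number", "step": 0.1},
        {
            "title": "role",
            "required": True,
            "type": "category",
            "options": ["engineer", "scientist"],
        },
    ],
    "outputs": [],
}

# Encode for transport to the UI tier; json.dumps preserves dict insertion order.
encoded = json.dumps(payload, indent=2)

# Decoding returns an equal structure, so nothing is lost in transit.
decoded = json.loads(encoded)
field_types = [field["type"] for field in decoded["inputs"]]
```

Python's True becomes JSON true in the encoded text, which matches the output format shown above.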

TextStrategy acts as the default fallback. Make sure it is registered when you want unsupported columns to degrade gracefully.

Series columns

Columns where each cell is a 2-element compound value, written as a tuple (v1, v2), a list [v1, v2], or a dict {"key1": v1, "key2": v2}, are handled automatically by SeriesStrategy. Sub-field schemas are inferred from the element dtypes via the registered strategies:

import pandas as pd
from datetime import date
from mlschema import MLSchema
from mlschema.strategies import TextStrategy, NumberStrategy, CategoryStrategy, DateStrategy, SeriesStrategy

df = pd.DataFrame({
    "sensor_id": pd.Categorical(["A", "B", "C"]),
    "readings": [
        (date(2024, 1, 1), 23.5),
        (date(2024, 1, 2), 24.1),
        (date(2024, 1, 3), 22.8),
    ],
})

builder = MLSchema()
builder.register(TextStrategy())     # fallback for unsupported dtypes
builder.register(NumberStrategy())
builder.register(CategoryStrategy()) # needed for sensor_id to be emitted as "category"
builder.register(DateStrategy())
builder.register(SeriesStrategy())   # claims compound-cell columns automatically

schema = builder.build(df)

The resulting schema:

{
  "inputs": [
    {"title": "sensor_id", "required": true, "type": "category", "options": ["A", "B", "C"]},
    {
      "title": "readings", "required": true, "type": "series",
      "field1": {"title": "field1", "required": true, "type": "date", "step": 1},
      "field2": {"title": "field2", "required": true, "type": "number", "step": 0.1}
    }
  ],
  "outputs": []
}

min_points and max_points can be set directly on SeriesField to document cardinality constraints; they are not inferred from data.

How It Works

  1. Registry orchestration – MLSchema keeps an in-memory registry of field strategies, keyed by a logical type_name and one or more pandas dtypes.
  2. Inference pipeline – each DataFrame column is normalised, matched against the registry, and dispatched to the first compatible strategy.
  3. Schema materialisation – strategies merge required metadata (title, type, required) with data-driven attributes, then dump the result through a Pydantic model.
  4. Structured output – the service returns the canonical {"inputs": [...], "outputs": []} payload that feeds mlform or any form rendering layer.
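
The steps above can be sketched with a toy registry. This is not mlschema's implementation: ToyStrategy, ToyRegistry, the lower-casing used as dtype normalisation, and the hard-coded required flag are all assumptions made for illustration, and columns are passed in as plain dtype names rather than as a DataFrame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStrategy:
    """Hypothetical stand-in for a field strategy."""

    type_name: str
    dtypes: tuple[str, ...]


class ToyRegistry:
    """Hypothetical registry keyed by type name and by dtype name."""

    def __init__(self, fallback_type: str = "text") -> None:
        self._by_name: dict[str, ToyStrategy] = {}
        self._by_dtype: dict[str, ToyStrategy] = {}
        self._fallback_type = fallback_type

    def register(self, strategy: ToyStrategy) -> None:
        # Reject duplicates up front so dispatch stays unambiguous.
        if strategy.type_name in self._by_name:
            raise ValueError(f"type name already registered: {strategy.type_name}")
        for dtype in strategy.dtypes:
            if dtype in self._by_dtype:
                raise ValueError(f"dtype already registered: {dtype}")
        self._by_name[strategy.type_name] = strategy
        for dtype in strategy.dtypes:
            self._by_dtype[dtype] = strategy

    def build(self, column_dtypes: dict[str, str]) -> dict:
        inputs = []
        for title, dtype in column_dtypes.items():
            # Normalise the dtype name (assumed: trim and lower-case), then dispatch.
            strategy = self._by_dtype.get(dtype.strip().lower())
            if strategy is None:
                # No direct match: degrade to the fallback strategy if one is registered.
                strategy = self._by_name.get(self._fallback_type)
            if strategy is None:
                raise LookupError(f"no strategy or fallback for dtype {dtype!r}")
            # "required" is fixed to True here purely to keep the sketch short.
            inputs.append({"title": title, "required": True, "type": strategy.type_name})
        return {"inputs": inputs, "outputs": []}


registry = ToyRegistry()
registry.register(ToyStrategy("text", ("object", "string")))
registry.register(ToyStrategy("number", ("int64", "float64")))

# "bool" has no registered strategy, so it falls back to "text".
schema = registry.build({"name": "object", "score": "float64", "flag": "bool"})
```

Keeping two lookup tables, one by type name and one by dtype, is one simple way to detect both kinds of duplicate registration before any column is processed.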

Built-in Strategies

Strategy class   | type name | Supported pandas dtypes         | Additional attributes
TextStrategy     | text      | object, string                  | minLength, maxLength, pattern, value, placeholder
NumberStrategy   | number    | int64, int32, float64, float32  | min, max, step, value, unit, placeholder
CategoryStrategy | category  | category                        | options, value
BooleanStrategy  | boolean   | bool, boolean                   | value
DateStrategy     | date      | datetime64[ns], datetime64      | min, max, value, step
SeriesStrategy   | series    | content-based (2-element cells) | field1, field2, min_points, max_points

Register only the strategies you need. Duplicate registrations raise explicit errors; use MLSchema.update() to swap implementations at runtime.

SeriesStrategy uses content-based detection instead of dtype matching — it automatically claims any object column whose cells are all 2-element tuples, lists, or dicts, and infers the sub-field schemas from the element dtypes via the registry.
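
Content-based detection of this kind can be approximated with a small predicate. The function below is a hypothetical sketch, not the check SeriesStrategy actually runs; in particular, refusing to claim an empty column is an assumption.

```python
from collections.abc import Iterable


def looks_like_series_column(cells: Iterable[object]) -> bool:
    """Return True when every cell is a 2-element tuple, list, or dict."""
    seen_any = False
    for cell in cells:
        seen_any = True
        # Strings also have a length, so check the container type explicitly.
        if not isinstance(cell, (tuple, list, dict)) or len(cell) != 2:
            return False
    # An empty column gives no evidence either way, so do not claim it.
    return seen_any


tuple_cells = [("2024-01-01", 23.5), ("2024-01-02", 24.1)]
dict_cells = [{"day": "2024-01-01", "value": 23.5}, {"day": "2024-01-02", "value": 24.1}]
names = ["Ada", "Linus", "Grace"]
two_char_strings = ["ab", "cd"]  # length 2, but not compound cells
triples = [(1, 2, 3)]
```

With a pandas object column, the same predicate could be applied to the column's values, for example by passing the Series itself as the iterable.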

Extending MLSchema

Create bespoke field types by pairing a custom Pydantic model with a strategy implementation:

from typing import Literal
from pandas import Series
from mlschema.core import BaseField, Strategy


class RatingField(BaseField):
  type: Literal["rating"] = "rating"
  min: float | None = None  # float, to match the float64 values returned below
  max: float | None = None
  precision: float = 0.5


class RatingStrategy(Strategy):
  def __init__(self) -> None:
    super().__init__(
      type_name="rating",
      schema_cls=RatingField,
      dtypes=("float64",),
    )

  def attributes_from_series(self, series: Series) -> dict:
    return {
      "min": float(series.min()),
      "max": float(series.max()),
    }

  • Use Strategy.dtypes to advertise the pandas dtypes your strategy understands.
  • Avoid mutating the incoming Series; treat it as read-only.
  • Reserved keys (title, type, required, description) are populated by the base class.

Reference the full guide at https://ulloasp.github.io/mlschema/usage/ for end-to-end patterns.

Validation & Error Handling

  • EmptyDataFrameError – raised when the DataFrame has no rows or columns.
  • FallbackStrategyMissingError – triggered if an unsupported dtype is encountered without a registered fallback.
  • StrategyNameAlreadyRegisteredError / StrategyDtypeAlreadyRegisteredError – guard against duplicate registrations.
  • Pydantic ValidationError / PydanticCustomError – surface invalid field constraints early (min/max, regex patterns, date ranges, etc.).

All mlschema-specific exceptions derive from mlschema.core.MLSchemaError, making it straightforward to trap library-level failures.

Tooling & Quality

  • Distributed as an MIT-licensed wheel and sdist built with Hatchling.
  • Strict typing (pyright) and linting (ruff) shipped with the repo.
  • Test suite powered by pytest and pytest-cov; coverage reports live alongside the source tree.
  • py.typed marker ensures type information propagates to downstream projects.

Contributing

Community contributions are welcome. Review the guidelines and pick an issue to get started:

Security

Please report security concerns privately by emailing pablo.ulloa.santin@udc.es. The coordinated disclosure process is documented at https://github.com/UlloaSP/mlschema/blob/main/SECURITY.md.

License

Released under the MIT License. Complete terms and third-party attributions are available at:


Made by Pablo Ulloa Santin and the MLSchema community.
