Lightweight preprocessing and reversible Modular Linear Tokenization (MLT) utilities for categorical and continuous data.

License: MIT

🧩 Module: light_mlt.py

Lightweight, reproducible preprocessing pipeline for tabular datasets that combines categorical encoding via Modular Linear Tokenization (MLT) with continuous feature scaling.
Designed for efficient fit/transform workflows with full reversibility and schema persistence.
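
To make the reversibility claim concrete for continuous columns, here is a minimal standard-library sketch of standardization and its inverse. It is not the package's code: the helper names fit_scaler, scale and unscale are hypothetical, and it only assumes the usual StandardScaler convention of subtracting the mean and dividing by the population standard deviation.

```python
import statistics

def fit_scaler(values):
    """Return (mean, std), using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against constant columns so that scaling never divides by zero.
    return mean, (std if std > 0 else 1.0)

def scale(values, mean, std):
    """Standardize: z = (x - mean) / std."""
    return [(v - mean) / std for v in values]

def unscale(scaled, mean, std):
    """Invert the standardization: x = z * std + mean."""
    return [z * std + mean for z in scaled]

mileage = [12.4, 25.8, 31.5, 18.7]
mean, std = fit_scaler(mileage)
scaled = scale(mileage, mean, std)
restored = unscale(scaled, mean, std)
```

Because the mean and standard deviation are the only state needed to undo the transform, persisting them is enough to recover the original values up to floating-point error.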

📘 Reference
Schmitz, T. (2025). light-mlt: Modular Linear Tokenization for Scalable Categorical Encoding.
DOI: 10.5281/zenodo.17467914


🔖 Key Features

  • Deterministic preprocessing for categorical and continuous data
  • Append-only vocabulary management (cats_map.pkl)
  • Fully reversible categorical encoding via MLT
  • Continuous feature scaling with StandardScaler
  • Persistent schema for consistent transformations
  • Minimal dependencies: numpy, pandas, scikit-learn
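
The append-only vocabulary idea can be illustrated with a short sketch. This is an assumption about the general technique, not the package's implementation: update_vocab is a hypothetical helper, and the real cats_map.pkl layout may differ (for example, it may be keyed per column).

```python
def update_vocab(vocab, labels):
    """Give each unseen label the next free id; never renumber existing labels."""
    for label in labels:
        if label not in vocab:
            vocab[label] = len(vocab)
    return vocab

# First fit: ids are assigned in order of first appearance.
vocab = update_vocab({}, ["São Paulo", "Curitiba", "São Paulo", "Florianópolis"])
first_fit = dict(vocab)

# Later fit on new data: known labels keep their ids, new labels are appended.
update_vocab(vocab, ["Curitiba", "Porto Alegre"])
```

Keeping ids stable across fits is what allows data encoded earlier to remain decodable after the vocabulary grows.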

📦 Artifacts

Each fit() operation generates or updates the following artifacts (default directory: light_mlt_artifacts/):

| File | Description |
|------|-------------|
| schema.pkl | Schema metadata (columns, types, MLT config) |
| scaler.pkl | Trained StandardScaler for continuous columns |
| cats_map.pkl | Append-only {label → id} mapping for categorical features |
| mlt_params.pkl | MLT parameters (p, n, M, Minv) per column |
| preprocessed.csv | Optional transformed dataset export |


🧪 Detailed Examples

Below are complete examples demonstrating how to use light_mlt.py for preprocessing, transformation, and reversibility.


1️⃣ Basic Usage

import pandas as pd
from light_mlt import fit_transform, inverse_transform

# Example dataset
df = pd.DataFrame({
    "city": ["São Paulo", "Curitiba", "São Paulo", "Florianópolis"],
    "vehicle": ["Truck", "Car", "Car", "Bus"],
    "mileage": [12.4, 25.8, 31.5, 18.7],
})

print("=== Original Data ===")
print(df)

# --- Step 1: Fit + Transform ---
df_t, path, token_cols, report = fit_transform(
    df,
    categorical_cols=["city", "vehicle"],
    continuous_cols=["mileage"],
    dir="light_mlt_artifacts/"
)

print("\n=== Transformed Data ===")
print(df_t.head())

print("\nGenerated token columns:", token_cols)
print("Report:", report)

# --- Step 2: Inverse Transform ---
df_rec = inverse_transform(df_t, dir="light_mlt_artifacts/")
print("\n=== Reconstructed Data ===")
print(df_rec)
