Lightweight preprocessing and reversible Modular Linear Tokenization (MLT) utilities for categorical and continuous data.
🧩 Module: light_mlt.py
Lightweight, reproducible preprocessing pipeline for tabular datasets — integrating categorical encoding via Modular Linear Tokenization (MLT) and continuous feature scaling.
Designed for efficient fit/transform workflows with full reversibility and schema persistence.
📘 Reference
Schmitz, T. (2025). light-mlt: Modular Linear Tokenization for Scalable Categorical Encoding.
DOI: 10.5281/zenodo.17467914
🔖 Key Features
- Deterministic preprocessing for categorical and continuous data
- Append-only vocabulary management (`cats_map.pkl`)
- Fully reversible categorical encoding via MLT
- Continuous feature scaling with `StandardScaler`
- Persistent schema for consistent transformations
- Minimal dependencies: `numpy`, `pandas`, `scikit-learn`
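To make "append-only vocabulary" concrete, here is a minimal sketch of a `{label → id}` mapping that only ever grows. The helper name `update_vocab` and the choice to start ids at 0 are assumptions for illustration; the actual contents of `cats_map.pkl` may be organised differently.

```python
def update_vocab(vocab, labels):
    """Add unseen labels in first-seen order; existing ids are never changed.

    Hypothetical helper, not part of light_mlt. Ids start at 0 by assumption.
    """
    for label in labels:
        if label not in vocab:
            vocab[label] = len(vocab)
    return vocab


vocab = {}

# First fit: three distinct cities receive ids in order of appearance.
update_vocab(vocab, ["São Paulo", "Curitiba", "São Paulo", "Florianópolis"])
first_fit = dict(vocab)

# Later fit with one new label: only "Joinville" is appended.
update_vocab(vocab, ["Curitiba", "Joinville"])

# Ids assigned during the first fit are still valid.
ids_are_stable = all(vocab[label] == idx for label, idx in first_fit.items())
print(vocab, ids_are_stable)
```

Because ids are never reassigned, tokens produced by an earlier fit can still be decoded after the vocabulary has grown.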
📦 Artifacts
Each fit() operation generates or updates the following artifacts (default directory: light_mlt_artifacts/):
| File | Description |
|---|---|
| `schema.pkl` | Schema metadata (columns, types, MLT config) |
| `scaler.pkl` | Trained `StandardScaler` for continuous columns |
| `cats_map.pkl` | Append-only {label → id} mapping for categorical features |
| `mlt_params.pkl` | MLT parameters (p, n, M, Minv) per column |
| `preprocessed.csv` | Optional transformed dataset export |
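The table lists MLT parameters (p, n, M, Minv) per column but does not say how they are used. One plausible reading, sketched below in pure Python, is that an integer id is written as n base-p digits, multiplied by an invertible matrix M modulo a prime p, and recovered with Minv. This is an assumption for illustration only: every function name here is hypothetical, and the real `light_mlt` implementation may differ.

```python
import random


def to_digits(value, p, n):
    """Base-p digits of value, least significant first, padded to length n."""
    if not 0 <= value < p ** n:
        raise ValueError("id must lie in [0, p**n)")
    digits = []
    for _ in range(n):
        value, d = divmod(value, p)
        digits.append(d)
    return digits


def from_digits(digits, p):
    """Inverse of to_digits."""
    value = 0
    for d in reversed(digits):
        value = value * p + d
    return value


def matvec_mod(M, v, p):
    """Matrix-vector product with every entry reduced modulo p."""
    return [sum(a * b for a, b in zip(row, v)) % p for row in M]


def inverse_mod(M, p):
    """Gauss-Jordan inverse of M modulo a prime p; raises if M is singular."""
    n = len(M)
    A = [[x % p for x in row] + [int(i == j) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular modulo p")
        A[col], A[pivot] = A[pivot], A[col]
        inv = pow(A[col][col], -1, p)  # modular inverse, Python 3.8+
        A[col] = [(x * inv) % p for x in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(x - f * y) % p for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]


def random_invertible(n, p, rng):
    """Draw random n x n matrices modulo p until one is invertible."""
    while True:
        M = [[rng.randrange(p) for _ in range(n)] for _ in range(n)]
        try:
            return M, inverse_mod(M, p)
        except ValueError:
            continue


def encode(idx, p, M):
    """Assumed encoding: id -> base-p digits -> M @ digits (mod p)."""
    return matvec_mod(M, to_digits(idx, p, len(M)), p)


def decode(token, p, Minv):
    """Assumed decoding: Minv @ token (mod p) -> digits -> id."""
    return from_digits(matvec_mod(Minv, token, p), p)


p, n = 31, 4  # illustrative values; p must be prime for inverse_mod
M, Minv = random_invertible(n, p, random.Random(0))
ids = [0, 1, 2, 1000, p ** n - 1]
tokens = [encode(i, p, M) for i in ids]
recovered = [decode(t, p, Minv) for t in tokens]
print(recovered == ids)
```

Under this reading, each id maps to a fixed-length vector of n values in [0, p), so a column can represent up to p**n distinct labels, and reversibility follows from M being invertible modulo p.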
🧪 Detailed Examples
Below are complete examples demonstrating how to use light_mlt.py for preprocessing, transformation, and reversibility.
1️⃣ Basic Usage
```python
import pandas as pd
from light_mlt import fit_transform, inverse_transform

# Example dataset: two categorical columns and one continuous column
df = pd.DataFrame({
    "city": ["São Paulo", "Curitiba", "São Paulo", "Florianópolis"],
    "vehicle": ["Truck", "Car", "Car", "Bus"],
    "mileage": [12.4, 25.8, 31.5, 18.7],
})
print("=== Original Data ===")
print(df)

# --- Step 1: Fit + Transform ---
# Fits the encoders and scaler, writes the artifacts to `dir`,
# and returns the transformed frame plus metadata.
df_t, path, token_cols, report = fit_transform(
    df,
    categorical_cols=["city", "vehicle"],
    continuous_cols=["mileage"],
    dir="light_mlt_artifacts/"
)
print("\n=== Transformed Data ===")
print(df_t.head())
print("\nGenerated token columns:", token_cols)
print("Report:", report)

# --- Step 2: Inverse Transform ---
# Reloads the artifacts from `dir` and reconstructs the original values.
df_rec = inverse_transform(df_t, dir="light_mlt_artifacts/")
print("\n=== Reconstructed Data ===")
print(df_rec)
```
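The continuous column in the example is scaled with `StandardScaler` and restored by the inverse transform. The sketch below reproduces that arithmetic in pure Python on the same `mileage` values, using the population standard deviation as scikit-learn does. The helper names `fit_scaler`, `scale`, and `unscale` are hypothetical and are not part of `light_mlt`.

```python
import math
import statistics


def fit_scaler(values):
    """Return (mean, std) using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against constant columns so that scaling never divides by zero.
    return mean, (std if std > 0 else 1.0)


def scale(values, mean, std):
    """z = (x - mean) / std"""
    return [(x - mean) / std for x in values]


def unscale(scaled, mean, std):
    """x = z * std + mean"""
    return [z * std + mean for z in scaled]


mileage = [12.4, 25.8, 31.5, 18.7]
mean, std = fit_scaler(mileage)
scaled = scale(mileage, mean, std)
restored = unscale(scaled, mean, std)

# The round trip is exact up to floating-point rounding.
round_trip_ok = all(math.isclose(a, b) for a, b in zip(mileage, restored))
print(scaled, round_trip_ok)
```

Since floating-point rounding can introduce tiny differences, comparing reconstructed continuous values with a tolerance (for example `math.isclose`) is safer than testing exact equality.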
File details
Details for the file light_mlt-0.1.3.tar.gz.
File metadata
- Download URL: light_mlt-0.1.3.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `1829b56f5e9a1c0ec22cb19ed298402d4d5328d0f417133edc7c291939eda18a` |
| MD5 | `9f5cbb9a9c836fb9e57da9a6a21ba1f8` |
| BLAKE2b-256 | `90e57905d6fde550e0ebc904611ebda7122d39242bd612b2aa2fcc81961ae43e` |
File details
Details for the file light_mlt-0.1.3-py3-none-any.whl.
File metadata
- Download URL: light_mlt-0.1.3-py3-none-any.whl
- Upload date:
- Size: 9.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `50c36fb5b22481d55f2f74313c844436ea528f73ecb8cbb70d72eefd032a1644` |
| MD5 | `682623ff4dc8f02fe418b5829e4e15cd` |
| BLAKE2b-256 | `997b9eb68ee74635eb66f03908b8717e21e4c553b3d90d856ba7d144e944c33b` |