Feature Transform
Build Scikit ColumnTransformers by specifying configs.
See also TorchArc to build PyTorch models by specifying architectures.
Installation
pip install feature_transform
Usage
- specify column transformers in a YAML spec file, e.g. at spec_filepath = "./example/spec/basic.yaml"
- import the package: import feature_transform as ft
- (optional) if you have a custom sklearn estimator/preprocessor, e.g. Dummy, register it with ft.register_class(Dummy)
- build with: col_tfm = ft.build(spec_filepath)
The returned object is a sklearn ColumnTransformer ready for normal use.
See more examples below, then see how it works at the end.
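For the optional registration step, a custom preprocessor only needs to follow the usual sklearn estimator conventions. The sketch below is a hypothetical Dummy class (the factor parameter and its behaviour are made up for illustration); the registration call itself is left as a comment because it requires feature_transform to be installed, and how a spec file then refers to the registered class is not shown here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Dummy(TransformerMixin, BaseEstimator):
    """Hypothetical custom preprocessor: multiplies every value by a constant factor."""

    def __init__(self, factor=1.0):
        # sklearn convention: store constructor kwargs unchanged so get_params/clone work
        self.factor = factor

    def fit(self, X, y=None):
        # nothing to learn; return self so fit_transform and pipelines work
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) * self.factor


# Registration as described in Usage (requires feature_transform):
# import feature_transform as ft
# ft.register_class(Dummy)

out = Dummy(factor=2.0).fit_transform([[1.0, 2.0], [3.0, 4.0]])
print(out.tolist())  # [[2.0, 4.0], [6.0, 8.0]]
```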
Example: build ColumnTransformer from spec file
from pathlib import Path
import joblib
import yaml
from sklearn import datasets
import feature_transform as ft
filepath = Path(".") / "feature_transform" / "example" / "spec" / "basic.yaml"
# The following are equivalent:
# 1. build from YAML spec file
col_tfm = ft.build(filepath)
# 2. build from dictionary
with filepath.open("r") as f:
    spec_dict = yaml.safe_load(f)
col_tfm = ft.build(spec_dict)
# 3. use the underlying Pydantic validator to build the col_tfm
spec = ft.Spec(**spec_dict)
col_tfm = spec.build()
Next, load demo data for examples below:
# ================================================
# Load demo data
x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)
x_df.columns
# Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
# 'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
# 'proanthocyanins', 'color_intensity', 'hue',
# 'od280/od315_of_diluted_wines', 'proline'],
# dtype='object')
Example: basic
Spec file: feature_transform/example/spec/basic.yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]
col_tfm = ft.build(ft.SPEC_DIR / "basic.yaml")
feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254, 0.80899739, 0.20143885],
# ...,
# save for later use
joblib.dump(col_tfm, "col_tfm.joblib")
# ... later, e.g. during batch inference
loaded_col_tfm = joblib.load("col_tfm.joblib")
feat_xs = loaded_col_tfm.transform(x_df)
Example: basic with pandas/polars dataframe
Spec file: feature_transform/example/spec/basic.yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]
col_tfm = ft.build(ft.SPEC_DIR / "basic.yaml")
# to use with dataframe, set output to "pandas" or "polars"
col_tfm.set_output(transform="pandas")
feat_x_df = col_tfm.fit_transform(x_df)
feat_x_df
# standardscaler__alcohol standardscaler__total_phenols robustscaler__ash
# 0 1.518613 0.808997 0.201439
# 1 0.246290 0.568648 -0.633094
# ...
feat_x_df.describe()
# standardscaler__alcohol standardscaler__total_phenols robustscaler__ash
# count 1.780000e+02 178.000000 178.000000
# mean -8.382808e-16 0.000000 0.018754
# std 1.002821e+00 1.002821 0.789479
# ...
# save for later use
joblib.dump(col_tfm, "col_tfm.joblib")
# ... later, e.g. during batch inference
loaded_col_tfm = joblib.load("col_tfm.joblib")
feat_x_df = loaded_col_tfm.transform(x_df)
Example: specify name; use int columns
Spec file: feature_transform/example/spec/name-intcol.yaml
transformers:
  - name: std
    transformer:
      preprocessing.StandardScaler:
    columns: [0, 5]
  - name: robust
    transformer:
      preprocessing.RobustScaler:
    columns: [2]
col_tfm = ft.build(ft.SPEC_DIR / "name-intcol.yaml")
feat_xs = col_tfm.fit_transform(x_df)
# array([[ 1.51861254, 0.80899739, 0.20143885],
# ...,
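For comparison, here is roughly what the named spec corresponds to in plain sklearn, next to the unnamed form. This is a sketch written directly against sklearn, not the code feature_transform runs; the small NumPy array is made-up data used only to show that integer columns select by position.

```python
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import RobustScaler, StandardScaler

# explicit names and integer column positions, as in name-intcol.yaml
named_col_tfm = ColumnTransformer(
    transformers=[
        ("std", StandardScaler(), [0, 5]),
        ("robust", RobustScaler(), [2]),
    ]
)

# no names given: make_column_transformer derives them from the class names
auto_col_tfm = make_column_transformer(
    (StandardScaler(), [0, 5]),
    (RobustScaler(), [2]),
)

named_names = [name for name, _, _ in named_col_tfm.transformers]
auto_names = [name for name, _, _ in auto_col_tfm.transformers]
print(named_names)  # ['std', 'robust']
print(auto_names)   # ['standardscaler', 'robustscaler']

# integer columns work on plain arrays as well as dataframes
toy_xs = np.arange(18, dtype=float).reshape(3, 6)
toy_feat_xs = named_col_tfm.fit_transform(toy_xs)
print(toy_feat_xs.shape)  # (3, 3)
```

The names matter mostly for dataframe output, where they become the column prefixes (e.g. standardscaler__alcohol in the pandas example above).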
Example: pipeline
Spec file: feature_transform/example/spec/pipeline.yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      Pipeline:
        - impute.SimpleImputer:
            strategy: constant
        - preprocessing.RobustScaler:
    columns: [ash]
col_tfm = ft.build(ft.SPEC_DIR / "pipeline.yaml")
feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254, 0.80899739, 0.20143885],
# ...,
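To make the pipeline entry concrete, the following sketch builds a comparable object by hand in sklearn. It uses make_pipeline for brevity; whether feature_transform constructs the Pipeline this way internally is an assumption. The toy dataframe is made up and includes one missing ash value so the imputer has something to do (the wine data itself has no missing values).

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# impute first, then scale: the steps run in the order they are listed
ash_pipeline = make_pipeline(SimpleImputer(strategy="constant"), RobustScaler())

col_tfm = make_column_transformer(
    (StandardScaler(), ["alcohol", "total_phenols"]),
    (ash_pipeline, ["ash"]),
)

# made-up rows with one missing value in "ash"
toy_df = pd.DataFrame(
    {
        "alcohol": [13.0, 12.5, 14.1],
        "total_phenols": [2.8, 2.1, 3.0],
        "ash": [2.4, np.nan, 2.7],
    }
)
feat_xs = col_tfm.fit_transform(toy_df)
print(feat_xs.shape)            # (3, 3)
print(np.isnan(feat_xs).any())  # False
```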
Example: ColumnTransformer settings
Spec file: feature_transform/example/spec/settings.yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash]
# use all processors
n_jobs: -1
# for more kwargs see https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html
col_tfm = ft.build(ft.SPEC_DIR / "settings.yaml")
feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254, 0.80899739, 0.20143885],
# ...,
Example: full X, y feature transform with save/load
Spec file (x): feature_transform/example/spec/wine/x.yaml
transformers:
  - transformer:
      preprocessing.StandardScaler:
    columns: [alcohol, total_phenols, flavanoids, nonflavanoid_phenols, od280/od315_of_diluted_wines]
  - transformer:
      preprocessing.RobustScaler:
    columns: [ash, alcalinity_of_ash, proanthocyanins, hue]
  - transformer:
      preprocessing.PowerTransformer:
    columns: [malic_acid, magnesium, color_intensity, proline]
n_jobs: -1
Spec file (y): feature_transform/example/spec/wine/y.yaml
transformers:
  - transformer:
      preprocessing.OneHotEncoder:
        sparse_output: False
    columns: [target]
import joblib
from sklearn import datasets
import feature_transform as ft
x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)
y_df = y_sr.to_frame() # ColumnTransformer takes only dataframe/matrix as input
x_col_tfm = ft.build(ft.SPEC_DIR / "wine" / "x.yaml")
y_col_tfm = ft.build(ft.SPEC_DIR / "wine" / "y.yaml")
# fit-transform
feat_xs = x_col_tfm.fit_transform(x_df)
feat_xs
# array([[ 1.51861254, 0.80899739, 1.03481896, ..., 1.69074868,
# 0.45145022, 1.06254129],
# ...,
feat_ys = y_col_tfm.fit_transform(y_df)
feat_ys
# array([[1., 0., 0.],
# ...,
# save for later use
joblib.dump(x_col_tfm, "x_col_tfm.joblib")
joblib.dump(y_col_tfm, "y_col_tfm.joblib")
# ... later, e.g. during batch inference
loaded_x_col_tfm = joblib.load("x_col_tfm.joblib")
feat_xs = loaded_x_col_tfm.transform(x_df)
feat_xs
# array([[ 1.51861254, 0.80899739, 1.03481896, ..., 1.69074868,
# 0.45145022, 1.06254129],
# ...,
Example: use helper to suggest spec
Most of the time, data preprocessing steps can be determined with rules of thumb; ft.suggest does exactly that (see feature_transform/helper.py for details). It produces a spec_dict that can be used directly with ft.build or saved for further editing.
x_df, y_sr = datasets.load_wine(return_X_y=True, as_frame=True)
# suggest spec_dict - use directly or save to yaml for further editing
spec_dict = ft.suggest(x_df)
col_tfm = ft.build(spec_dict)
# fit-transform
feat_xs = col_tfm.fit_transform(x_df)
feat_xs
# array([[ 0.8973384 , 0.20143885, -0.90697674, ..., 0.80804954,
# -0.43546273, 1.69074868],
# ...,
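If you prefer to edit the suggestion by hand, the spec dictionary can be written out as YAML and rebuilt from the file later. The sketch below uses a hand-written stand-in for the output of ft.suggest (the real suggestion will differ), and the dictionary layout is an assumption based on what yaml.safe_load returns for the spec files above.

```python
import tempfile
from pathlib import Path

import yaml

# hand-written stand-in for spec_dict = ft.suggest(x_df); layout assumed from the YAML specs
spec_dict = {
    "transformers": [
        {"transformer": {"preprocessing.StandardScaler": None}, "columns": ["alcohol"]},
        {"transformer": {"preprocessing.RobustScaler": None}, "columns": ["ash"]},
    ]
}

# write the spec to a YAML file; sort_keys=False keeps the key order as written
spec_path = Path(tempfile.mkdtemp()) / "suggested.yaml"
with spec_path.open("w") as f:
    yaml.safe_dump(spec_dict, f, sort_keys=False)

# reading it back gives the same dictionary
with spec_path.open("r") as f:
    reloaded = yaml.safe_load(f)
print(reloaded == spec_dict)  # True
```

After editing the file, ft.build(spec_path) builds from it in the same way as the other spec files.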
Example: more
See more examples:
- demo notebook from above: feature_transform/example/notebook/demo.py
- spec files: feature_transform/example/spec/
- unit tests: test/validator/test_spec.py
How does it work
Feature Transform simply builds a sklearn ColumnTransformer and its estimators/pipelines with a 1-1 mapping from a spec file:
- Spec is defined via Pydantic in feature_transform/validator/. This defines:
  - spec: the Estimator, Pipeline, ColumnTransformer
- If spec specifies:
  - transformers=list[(name, transformer, columns)], then use ColumnTransformer
  - transformers=list[(transformer, columns)], then use make_column_transformer with auto-generated names
See more in the pydantic spec definition:
- feature_transform/validator/spec.py: the spec used by feature_transform
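The mapping described above can be illustrated with a small stand-alone sketch. It is not the package's implementation (that lives in the Pydantic validators): the function names build_estimator and build_column_transformer are hypothetical, Pipeline entries and registered custom classes are left out, and class names are assumed to resolve relative to the sklearn package.

```python
import importlib

from sklearn.compose import ColumnTransformer, make_column_transformer


def build_estimator(est_spec):
    """Turn {"preprocessing.StandardScaler": kwargs_or_None} into an estimator instance."""
    dotted_name, kwargs = next(iter(est_spec.items()))
    module_name, class_name = dotted_name.rsplit(".", 1)
    module = importlib.import_module(f"sklearn.{module_name}")
    return getattr(module, class_name)(**(kwargs or {}))


def build_column_transformer(spec):
    """Hypothetical builder: named entries -> ColumnTransformer, else make_column_transformer."""
    # every top-level key other than "transformers" is passed through as a setting
    settings = {key: value for key, value in spec.items() if key != "transformers"}
    items = spec["transformers"]
    if all("name" in item for item in items):
        triples = [
            (item["name"], build_estimator(item["transformer"]), item["columns"])
            for item in items
        ]
        return ColumnTransformer(transformers=triples, **settings)
    pairs = [(build_estimator(item["transformer"]), item["columns"]) for item in items]
    return make_column_transformer(*pairs, **settings)


auto_tfm = build_column_transformer(
    {
        "transformers": [
            {"transformer": {"preprocessing.StandardScaler": None}, "columns": [0]}
        ],
        "n_jobs": 1,
    }
)
named_tfm = build_column_transformer(
    {
        "transformers": [
            {
                "name": "std",
                "transformer": {"preprocessing.StandardScaler": {"with_mean": False}},
                "columns": [0],
            }
        ]
    }
)
print(auto_tfm.transformers[0][0])   # standardscaler
print(named_tfm.transformers[0][0])  # std
```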
Guiding principles
The design of Feature Transform is guided as follows:
- simple: the module spec is straightforward:
  - it is simply sklearn class name with kwargs.
  - it supports official sklearn estimators, Pipeline, and custom-defined modules registered via ft.register_class
- expressive: it can be used to build both simple and advanced ColumnTransformer easily
- portable: it returns ColumnTransformer that can be used anywhere; it is not a framework.
- parametrizable: data-based feature transformation unlocks fast experimentation, e.g. by building logic for hyperparameter / data feature search
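As one illustration of the parametrizable point, spec dictionaries can be generated in plain Python and each candidate passed to ft.build. The sketch below only generates the candidates; the column grouping and the list of scalers are hypothetical choices, and the dictionary layout is an assumption based on the YAML specs shown earlier.

```python
from itertools import product

# candidate scalers to try for each group of columns
SCALERS = ["preprocessing.StandardScaler", "preprocessing.RobustScaler"]
# hypothetical grouping of wine columns
COLUMN_GROUPS = {"group_a": ["alcohol", "total_phenols"], "group_b": ["ash"]}


def make_spec(assignment):
    """Build one spec dict from a {group_name: scaler_class_name} assignment."""
    return {
        "transformers": [
            {"transformer": {scaler: None}, "columns": COLUMN_GROUPS[group]}
            for group, scaler in assignment.items()
        ]
    }


# one spec per combination of scaler choices across the column groups
candidate_specs = [
    make_spec(dict(zip(COLUMN_GROUPS, scalers)))
    for scalers in product(SCALERS, repeat=len(COLUMN_GROUPS))
]
print(len(candidate_specs))  # 4
```

Each entry in candidate_specs could then be built and scored inside whatever search loop you already use.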
Development
Setup
Install uv for dependency management if you haven't already. Then run:
# setup virtualenv
uv sync
Unit Tests
uv run pytest