mlexpy
Simple utilities for handling and managing exploratory and experimental machine learning development.
Installation
mlexpy can be installed via:
pip install mlexpy
- As a beta release, any input is highly encouraged and likely to be incorporated. Please feel free to open an issue.
Introduction:
Design principles:
mlexpy is not meant to be a tool deployed in a production prediction environment. Instead, it is meant to provide a simple structure to organize the different components of machine learning, to simplify basic ML exploration and experimentation, and hopefully to improve ML results via standardized and reproducible experimentation.
The core goal is to leverage fairly basic, yet powerful and clear, data structures and patterns to improve the "workspace" for ML development. Hopefully, this library can trivialize some common ML development tasks, allowing developers, scientists, and anyone else to spend more time on the ML investigations themselves, and less time coding up a reliable, readable, and reproducible exploratory codebase / script / notebook.
mlexpy provides no explicit ML models or data. Instead, it provides various tools to store, interact with, and wrap different models, methods, or datasets.
mlexpy is not meant to replace any work of a statistician or data scientist, but to improve their workflow with boilerplate code and "infrastructure" setup so they can perform better experiments faster.
High level library goals:
- Provide intuitive, standardizable, and reproducible ML experiments.
- Methodological understandability is more important than method simplicity and efficiency. Because this library is meant to aid in ML development, where it is often easy to lose track of exactly which steps are and were done in ultimately producing a prediction, mlexpy is meant to reduce the code written for ML development purely to the explicit steps taken in developing a model. For example, Pandas DataFrames are currently preferred over numpy ndarrays simply for column readability; thus, verbosity is preferred. Ideally, by dumping a majority of the behind-the-scenes code to mlexpy, this verbosity remains manageable.
- Ideally, mlexpy makes it simple to understand exactly what an ML pipeline and model are doing, easing collaboration between engineers, coworkers, and academics.
- mlexpy is not developed (yet?) for usage in large-scale deep-learning tasks. Perhaps later this will be of interest.
- mlexpy leverages an OOP framework. While this may be less intuitive for some practitioners, the benefits of becoming familiar with some OOP outweigh its learning curve.
- The three junctions of the ML development pipeline are (1) the data -> (2) the processing -> (3) the predictions. Thus, each one of these steps is meant to be identically reproducible independent of the others (to the extent possible). For example:
  - I have an update to my dataset -- I would like to know exactly how a previously developed and stored pipeline and model will perform on this new dataset.
  - I have a new model -- I would like to know how it performs given exactly the same data and processing steps.
  - I would like to try computing a feature differently in the processing steps. However, to compare to previous models, I would like to use exactly the same data, exactly the same process for the other features, and exactly the same model and training steps (ex. the same cross validation, parameter searching, and tuning).
  - Note: If the used features change, or the provided features change, all downstream processes must change too in order to accommodate the structural change, and exact comparisons of a single component (data, process, model) are no longer possible.
- Note: Currently, mlexpy only provides tooling for supervised learning.
Structure
mlexpy is made up of 2 core modules, processor and experiment. These 2 modules interact to provide a clean space for ML development, with limited need for boilerplate "infrastructure" code.
Expected usage:
At minimum, using mlexpy requires defining a class that inherits the processor.ProcessPipelineBase class and defines a .process_data() method. The ProcessPipelineBase class provides a variety of boilerplate ML tooling. An example workflow is shown below.
The plain language usage of mlexpy:
- Do all of your processing of raw data in the .process_data() method of your ProcessPipelineBase child class.
- Do all of your model experimentation via the experiment module. This is done with the {Classifier, Regression}Experiment class, which manages your test and train sets, file naming conventions, and random seeds internally. However, the user needs to define the pipeline class to use via the {Classifier, Regression}Experiment.set_pipeline() method.
These two classes provide the minimum infrastructure needed for a user to perform model training and evaluation in a highly structured manner, as outlined in the principles and goals above.
- A simple example case of using mlexpy can be found in examples/0_classification_example.ipynb.
Example pseudo-code
# Get the needed imports
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Get mlexpy modules
from mlexpy.processor import ProcessPipelineBase
from mlexpy.experiment import ClassifierExperiment

# (1) Load data
dataset = <...do-your-loading-here...>

# (2) Define your class to perform data processing
class SimplePipeline(ProcessPipelineBase):
    def __init__(self, <all initialization arguments>):
        super().__init__(<all initialization arguments>)

    def process_data(self, df: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy of the passed df
        df = df.copy()

        # Create some feature
        df["new_feature"] = df["existing_feature"] * df["other_existing_feature"]

        # ... and return the resulting df
        return df

# (3) Now you can run your experiments, using a random forest for example.
# Define the experiment
simple_experiment = ClassifierExperiment(
    train_setup=dataset.train_data,
    test_setup=dataset.test_data,
    cv_split_count=20,
    model_tag="example_development_model",
    process_tag="example_development_process",
    model_dir=Path.cwd(),
)

# Set the pipeline to use...
simple_experiment.set_pipeline(SimplePipeline)

# Now begin the experimentation, starting with the data processing...
processed_datasets = simple_experiment.process_data()

# ... then train the model...
trained_model = simple_experiment.train_model(
    RandomForestClassifier(),
    processed_datasets,
    # params=model_algorithm.hyperparams,  # If passed, a cross-validation search is performed (but this is slow).
)

# Get the predictions and evaluate the performance.
predictions = simple_experiment.predict(processed_datasets, trained_model)
results = simple_experiment.evaluate_predictions(processed_datasets, predictions=predictions)
Data structures and functions
mlexpy uses a few namedtuple data structures for storing data. These act as immutable objects containing the training and test splits, and they can be easily understood via their named fields and dot notation.
- MLSetup is a named tuple meant to store the data for an ML prediction "run". It stores 2 variables:
  - MLSetup.obs stores the actual features dataframe
  - MLSetup.labels stores the respective ground truth values
- ExperimentSetup is a higher level structure meant to provide all data for an experiment. Using the mlexpy.pipeline_utils.get_stratified_train_test_data() function will return an ExperimentSetup named tuple, containing 2 MLSetup named tuples:
  - ExperimentSetup.train_data: the training features and labels (stored as an MLSetup)
  - ExperimentSetup.test_data: the testing features and labels (stored as an MLSetup)
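For illustration, here is a minimal sketch of building and accessing these structures via dot notation, using the get_stratified_train_test_data() function described below (features_df and labels_series are hypothetical placeholders for your own data):

import numpy as np
from mlexpy import pipeline_utils

# features_df is a hypothetical pd.DataFrame of features;
# labels_series is a hypothetical pd.Series of ground truth labels.
experiment_setup = pipeline_utils.get_stratified_train_test_data(
    train_data=features_df,
    label_data=labels_series,
    random_state=np.random.RandomState(10),
    test_frac=0.3,
)

# Dot-notation access to the named fields:
train_features = experiment_setup.train_data.obs     # the training features dataframe
train_labels = experiment_setup.train_data.labels    # the respective ground truth values
test_features = experiment_setup.test_data.obs
test_labels = experiment_setup.test_data.labels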
- pipeline_utils.get_stratified_train_test_data(train_data: pd.DataFrame, label_data: pd.Series, random_state: np.random.RandomState, test_frac: float = 0.3) -> ExperimentSetup: Performs a simple training and testing split of the data, however by default stratifies the dataset.
- utils.initial_filtering(df: pd.DataFrame, column_mask_functions: Dict[str, List[Callable]]) -> pd.DataFrame: A simple function to perform any known filtering of the dataset that might be desired prior to any training/testing split, ex. dropping nans, removing non-applicable cases, dropping duplicates, etc. To simplify this task, and reduce the needed boilerplate code, this function only needs a dictionary whose values provide boolean outputs of True on the records to keep for a given column. Any record with at least 1 value of False will be dropped from the dataset. An example dictionary is shown below:
from typing import Callable, Dict, List

DROP_PATTERNS = ["nan", "any-other-substring-i-dont-want"]

def response_to_drop(s: str) -> bool:
    """For any string, return True (keep the record) only if none of the substrings in DROP_PATTERNS are present in the string s."""
    return not any(drop_pattern in s for drop_pattern in DROP_PATTERNS)

def non_numeric_response(s: str) -> bool:
    """For any string, return True (keep the record) only if the string is of a numeric value (ex. '2')."""
    return s.isnumeric()

column_filter_dict: Dict[str, List[Callable]] = {
    "response": [response_to_drop],
    "time_in_seconds": [non_numeric_response],
}
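This dictionary is then passed to utils.initial_filtering(). A minimal usage sketch (the raw_df contents are hypothetical):

import pandas as pd
from mlexpy import utils

raw_df = pd.DataFrame(
    {"response": ["good", "nan", "fine"], "time_in_seconds": ["2", "10", "n/a"]}
)

# Any record with at least one False from its column's callables is dropped:
# here row 1 fails response_to_drop ("nan") and row 2 fails non_numeric_response ("n/a"),
# so only the first row is kept.
filtered_df = utils.initial_filtering(raw_df, column_mask_functions=column_filter_dict)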
processor module
The processor module is meant to provide a clean space to perform any processing / feature engineering, in a structured manner. At minimum, this translates to defining the .process_data() method. A description of the critical methods is provided below. Any method that must be overridden by the child class will raise a NotImplementedError until it is.
.process_data(self, df: pd.DataFrame, training: bool = True) -> pd.DataFrame:
This method performs your feature engineering. Note that this method (and the supplementary methods provided) is built to easily handle processing both when training and when testing. A suggested template is:
def process_data(self, df: pd.DataFrame, training: bool = True) -> pd.DataFrame:
    """Perform here all steps of the data processing for feature engineering."""
    logger.info(f"Beginning to process all data of size {df.shape}.")

    # First, do any "hard-coded" transformations or feature engineering, meaning anything
    # that does not need a model (such as multiplying values, or domain specific
    # calculations), producing the feature_df used below...
    feature_df = <...your-code-here...>

    logger.info(f"... processing complete for training={training}.")

    # ... now, handle any cases where features require a model (such as PCA, statistical
    # scaling, onehot-encoding) to be trained -- these must never be fit on the testing data...
    if training:
        self.fit_model_based_features(feature_df)
        feature_df = self.transform_model_based_features(feature_df)
        self.dump_feature_based_models()  # this is optional
    else:
        # ... because we are not training, we use the models already fit on the training data.
        feature_df = self.transform_model_based_features(feature_df)

    # Return the processed dataframe
    return feature_df
.fit_model_based_features(self, df: pd.DataFrame) -> None:
This method performs all of the fitting of models to be used in model based features. Once fit, these models are stored in a DefaultOrderedDictionary with dataframe column names as keys, and the list of models to apply to that column as values. This dictionary applies the models in the exact same order as they were fit -- this way the steps of a pipeline are preserved. An example is provided below showing how a standard scaler is fit for every numerical column, and what the .fit_scaler() method might look like.
# (assumes: from sklearn.preprocessing import MinMaxScaler, StandardScaler)

def fit_model_based_features(self, df: pd.DataFrame) -> None:
    """Do here the model fitting for any model based transformations."""
    for column in df.columns:
        if df[column].dtype not in ("float", "int"):
            continue
        self.fit_scaler(df[column])

def fit_scaler(
    self, feature_data: pd.Series, standard_scaling: bool = True
) -> pd.Series:
    """Perform the feature scaling here. If this is a prediction method, then load and fit."""
    if standard_scaling:
        logger.info(f"Fitting a standard scaler to {feature_data.name}.")
        scaler = StandardScaler()
    else:
        logger.info(f"Fitting a minmax scaler to {feature_data.name}.")
        scaler = MinMaxScaler()
    # ... the fitted scaler then needs to be stored in the pipeline's ordered model
    # dictionary (keyed by the column name) so that .transform_model_based_features()
    # can apply it later (the storage call is elided here).
{Classification, Regression}Experiment module
The {Classification, Regression}Experiment modules provide a clean space to perform model training in a structured manner; a classification class and a regression class are provided for the two problem types. For the remainder of this description a classification case is used, however the regression case is similar. The Experiment classes require a pipeline attribute to be set via the .set_pipeline() method, where you should pass your developed child of ProcessPipelineBase as an argument. Via the Experiment class you can run .process_data(), which will perform all processing defined in your ProcessPipelineBase child class's .process_data() method.
.process_data(self, process_method: Optional[str] = None) -> ExperimentSetup:
This method performs all data processing for both the training and testing data. The process_method argument is the name of the method you would like your processor class (your ProcessPipelineBase child class) to use to process the data. By default this is .process_data(), however it does not need to be. In this manner you can experiment with different processing methods in your pipeline class, and keep them stored in code, by simply passing different method names to this function, as sketched below.
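A hedged sketch of keeping two processing recipes side by side and selecting one by name (the alternative method name here is hypothetical):

class SimplePipeline(ProcessPipelineBase):
    def process_data(self, df: pd.DataFrame, training: bool = True) -> pd.DataFrame:
        ...  # the default processing recipe

    def process_data_unscaled(self, df: pd.DataFrame, training: bool = True) -> pd.DataFrame:
        ...  # an alternative recipe, ex. one that skips statistical scaling

# Run the default recipe...
default_datasets = simple_experiment.process_data()
# ... or select the alternative recipe by name, keeping everything else identical.
unscaled_datasets = simple_experiment.process_data(process_method="process_data_unscaled")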
Model training in the ExperimentBase class:
Training an ML model is very simple using mlexpy. All of the necessary code for model training, and for basic hyperparameter search, is defined in the ExperimentBase class. A model can be trained via the .train_model() method. If a hyperparameter space is provided to this method, then training includes a cross-validated search through the hyperparameter space. In the example above, the following snippet performs model training:
# ... then train the model...
trained_model = simple_experiment.train_model(
RandomForestClassifier(),
processed_datasets,
    # params=model_algorithm.hyperparams,  # If passed, a cross-validation search is performed (but this is slow).
)
Hyperparameter tuning
By default, if a hyperparameter space is provided, a randomized search of 20 iterations with 5-fold cross validation will be performed. If using a classifier, the scoring function for cross validation will be f1_macro; for regression it will be rmse (taken as a negative for maximization).
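A hedged sketch of providing such a space: the parameter-name -> candidate-values mapping below follows sklearn's randomized search conventions, and the exact expected format is an assumption here (see examples/5_cv_search_for_hyperpareters.py for the canonical version):

from sklearn.ensemble import RandomForestClassifier

# Hypothetical search space for the random forest used above.
rf_hyperparams = {
    "n_estimators": [50, 100, 250],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

trained_model = simple_experiment.train_model(
    RandomForestClassifier(),
    processed_datasets,
    params=rf_hyperparams,  # triggers the randomized, cross-validated search described above
)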
Evaluation
Each of the {Classification, Regression}Experiment classes defines a relevant method for model evaluation, with the respective metrics. Each class type stores a dictionary of metric_name -> metric_callable mappings, and evaluation simply operates over every key -> value pair in this dictionary. Custom metrics can be provided via the .add_metric() method, however they need to accept the labels and predictions as input, ex: my_metric(labels: Iterable, predictions: Iterable). A sketch is shown below.
Note: If predictions are passed as class probabilities, make sure your metric function raises an error if only the predicted class is provided.
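A minimal sketch of adding a custom metric; the metric itself follows the required (labels, predictions) interface, while the add_metric() argument order (callable plus a metric name) is an assumption for illustration:

from typing import Iterable

from sklearn.metrics import balanced_accuracy_score

def my_balanced_accuracy(labels: Iterable, predictions: Iterable) -> float:
    # Matches the required my_metric(labels, predictions) interface.
    return balanced_accuracy_score(labels, predictions)

# The (callable, name) argument order is assumed here.
simple_experiment.add_metric(my_balanced_accuracy, "balanced_accuracy")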
Notes:
More detailed usage can be found in the examples and docstrings, including:
- examples/0_classification_example.ipynb: A notebook showing a number of applications of mlexpy.
- examples/1_scripted_example.py: Showing a rough outline of the suggested file and module structure when using mlexpy.
- examples/2_store_all_models_example.py: Showing how to call the model i/o tooling to store trained models, and to load trained models generating identical predictions.
- examples/3_loading_models_example.py: Showing how to load stored model definitions to evaluate an existing method on "new" data.
- examples/4_initial_filtering_and_binary_class_example.py: Showing how to use the initial filtering tooling (mlexpy.utils.initial_filtering()) and a case of binary classification.
- examples/5_cv_search_for_hyperpareters.py: Showing how to perform hyperparameter search, and an example of a model_definition file to store ML model definitions.
- examples/6_regression_pca_example.py: Showing how to perform a regression example, and how to add a possible PCA method for dimensionality reduction.
Roadmap / TODOs:
- Expand to numpy.ndarrays?
- Test and/or expand to general statistical prediction models (beyond the sklearn framework).
- Add in readthedocs.