
A low-code library for machine learning pipelines

Project description

BlitzML

Automate machine learning pipelines rapidly

Install BlitzML

pip install blitzml

Classification

from blitzml.tabular import Classification
import pandas as pd

# prepare your dataframes
train_df = pd.read_csv("auxiliary/datasets/banknote/train.csv")
test_df = pd.read_csv("auxiliary/datasets/banknote/test.csv")

# create the pipeline
auto = Classification(train_df, test_df, classifier='RF', n_estimators=50)

# first perform data preprocessing
auto.preprocess()
# second train the model
auto.train_the_model()

# After training the model we can generate:
auto.gen_pred_df(auto.test_df)
auto.gen_metrics_dict()

# We can get their values using:
pred_df = auto.pred_df
metrics_dict = auto.metrics_dict

print(pred_df.head())
print(metrics_dict)

Available Classifiers

  • Random Forest 'RF'
  • LinearDiscriminantAnalysis 'LDA'
  • Support Vector Classifier 'SVC'
  • KNeighborsClassifier 'KNN'
  • GaussianNB 'GNB'
  • LogisticRegression 'LR'
  • AdaBoostClassifier 'AB'
  • GradientBoostingClassifier 'GB'
  • DecisionTreeClassifier 'DT'
  • MLPClassifier 'MLP'

Parameters

classifier
options: {'RF','LDA','SVC','KNN','GNB','LR','AB','GB','DT','MLP', 'auto', 'custom'}, default = 'RF'
auto: selects the best scoring classifier based on f1-score
custom: enables providing a custom classifier through the *file_path* and *class_name* parameters (see the sketch after this list)
file_path
when using 'custom' classifier, pass the path of the file containing the custom class, default = 'none'
class_name
when using 'custom' classifier, pass the class name through this parameter, default = 'none'
feature_selection
options: {'correlation', 'importance', 'none'}, default = 'none'
correlation: use feature columns with the highest correlation with the target
importance: use feature columns that are important for the model to predict the target
none: use all feature columns
validation_percentage
the fraction of the training data held out for validation (a value between 0 and 1), default = 0.1
average_type
the averaging method used for the metrics when performing multiclass classification, default = 'macro'
cross_validation_k_folds
number of folds for cross validation; if 1, no cross validation is performed, default = 1
**kwargs
optional parameters for the chosen classifier; the available parameters are listed in the sklearn docs
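
For instance, the 'custom' option lets you plug in your own classifier through *file_path* and *class_name*. A minimal sketch: the file custom_classifier.py and the class MyClassifier are hypothetical, the class is assumed to follow the scikit-learn estimator interface (fit/predict), and train_df/test_df are the dataframes from the example above.

# custom_classifier.py (hypothetical) could hold any sklearn-style estimator:
#
#     from sklearn.ensemble import ExtraTreesClassifier
#
#     class MyClassifier(ExtraTreesClassifier):
#         pass

auto = Classification(
    train_df,
    test_df,
    classifier='custom',
    file_path='custom_classifier.py',
    class_name='MyClassifier',
)
auto.run()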

Attributes

train_df
the preprocessed train dataset (after running Classification.preprocess())
test_df
the preprocessed test dataset (after running Classification.preprocess())
model
the trained model (after running Classification.train_the_model())
pred_df
the prediction dataframe (test_df + predicted target) (after running Classification.gen_pred_df(Classification.test_df))
metrics_dict
the validation metrics (after running Classification.gen_metrics_dict())
{
    "accuracy": acc,
    "f1": f1,
    "precision": pre,
    "recall": recall,
    "hamming_loss": h_loss,
    "cross_validation_score":cv_score, returns None if cross_validation_k_folds==1
}

Methods

preprocess()
perform preprocessing on train_df and test_df
train_the_model()
train the chosen classifier on the train_df
accuracy_history()
accuracy scores when varying the sampling size of the train_df (after running Classification.train_the_model()); see the plotting sketch after this list.
returns:
{
    'x':train_df_sample_sizes,
    'y1':train_scores_mean,
    'y2':test_scores_mean,
    'title':title
}
gen_pred_df(test_df)
generates the prediction dataframe and assigns it to the pred_df attribute
gen_metrics_dict()
generates the validation metrics and assigns them to the metrics_dict attribute
run()
a shortcut that runs the following methods:
preprocess()
train_the_model()
gen_pred_df(Classification.test_df)
gen_metrics_dict()
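
The dictionary returned by accuracy_history() maps directly onto a learning-curve plot. A minimal sketch, assuming matplotlib is installed and auto is the trained pipeline from the example above:

import matplotlib.pyplot as plt

history = auto.accuracy_history()

# learning curve: mean train/test accuracy vs. training sample size
plt.plot(history['x'], history['y1'], label='train accuracy')
plt.plot(history['x'], history['y2'], label='test accuracy')
plt.xlabel('train_df sample size')
plt.ylabel('accuracy')
plt.title(history['title'])
plt.legend()
plt.show()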

Regression

from blitzml.tabular import Regression
import pandas as pd

# prepare your dataframes
train_df = pd.read_csv("auxiliary/datasets/house prices/train.csv")
test_df = pd.read_csv("auxiliary/datasets/house prices/test.csv")

# create the pipeline
auto = Regression(train_df, test_df, regressor='RF')

# first perform data preprocessing
auto.preprocess()
# second train the model
auto.train_the_model()

# After training the model we can generate:
auto.gen_pred_df(auto.test_df)
auto.gen_metrics_dict()

# We can get their values using:
pred_df = auto.pred_df
metrics_dict = auto.metrics_dict

print(pred_df.head())
print(metrics_dict)

Available Regressors

  • Random Forest 'RF'
  • Support Vector Regressor 'SVR'
  • KNeighborsRegressor 'KNN'
  • Lasso Regressor 'LSS'
  • LinearRegression 'LR'
  • Ridge Regressor 'RDG'
  • GaussianProcessRegressor 'GPR'
  • GradientBoostingRegressor 'GB'
  • DecisionTreeRegressor 'DT'
  • MLPRegressor 'MLP'

Parameters

regressor
options: {'RF','SVR','KNN','LSS','LR','RDG','GPR','GB','DT','MLP', 'auto', 'custom'}, default = 'RF'
auto: selects the best scoring regressor based on r2 score
custom: enables providing a custom regressor through *file_path* and *class_name* parameters
file_path
when using 'custom' regressor, pass the path of the file containing the custom class, default = 'none'
class_name
when using 'custom' regressor, pass the class name through this parameter, default = 'none'
feature_selection
options: {'correlation', 'importance', 'none'}, default = 'none'
correlation: use feature columns with the highest correlation with the target
importance: use feature columns that are important for the model to predict the target
none: use all feature columns
validation_percentage
the fraction of the training data held out for validation (a value between 0 and 1), default = 0.1
cross_validation_k_folds
number of folds for cross validation; if 1, no cross validation is performed, default = 1 (see the sketch after this list)
**kwargs
optional parameters for the chosen regressor; the available parameters are listed in the sklearn docs
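
As a sketch of the selection and validation options above, the following asks the pipeline to pick the best regressor by r2 score and to report a 5-fold cross validation score; train_df and test_df are the dataframes from the example above.

auto = Regression(
    train_df,
    test_df,
    regressor='auto',
    cross_validation_k_folds=5,
)
auto.run()

# None only when cross_validation_k_folds == 1
print(auto.metrics_dict['cross_validation_score'])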

Attributes

train_df
the preprocessed train dataset (after running Regression.preprocess())
test_df
the preprocessed test dataset (after running Regression.preprocess())
model
the trained model (after running Regression.train_the_model())
pred_df
the prediction dataframe (test_df + predicted target) (after running Regression.gen_pred_df(Regression.test_df))
metrics_dict
the validation metrics (after running Regression.gen_metrics_dict())
{
    "r2_score": r2,
    "mean_squared_error": mse,
    "root_mean_squared_error": rmse,
    "mean_absolute_error" : mae,
    "cross_validation_score":cv_score, returns None if cross_validation_k_folds==1
}

Methods

preprocess()
perform preprocessing on train_df and test_df
train_the_model()
train the chosen regressor on the train_df
RMSE_history()
RMSE scores when varying the sampling size of the train_df (after running Regression.train_the_model()).
returns:
{
    'x':train_df_sample_sizes,
    'y1':train_scores_mean,
    'y2':test_scores_mean,
    'title':title
}
gen_pred_df(test_df)
generates the prediction dataframe and assigns it to the pred_df attribute
gen_metrics_dict()
generates the validation metrics and assigns them to the metrics_dict attribute
run()
a shortcut that runs the following methods:
preprocess()
train_the_model()
gen_pred_df(Regression.test_df)
gen_metrics_dict()

Time-series

Time series is a special case of Regression, but the TimeSeries pipeline adds some extra functionality:

  • a stationarity test (IsStationary(); see the sketch after the quickstart below).
  • conversion of the series to stationary.
  • reversal of that conversion on the predicted values.

The dataset must also contain a DateTime column, even if the DataType of that column is Object (see the parsing sketch below).
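
TimeSeries accepts an object-typed DateTime column, but parsing it explicitly makes the intent clear. A minimal pandas sketch; the column name 'date' is hypothetical:

import pandas as pd

df = pd.read_csv("train_dataset.csv")
# parse the (possibly object-typed) column into a proper datetime dtype
df['date'] = pd.to_datetime(df['date'])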

from blitzml.tabular import TimeSeries 
import pandas as pd

# prepare your dataframes
train_df = pd.read_csv("train_dataset.csv")
test_df = pd.read_csv("test_dataset.csv")

# create the pipeline
auto = TimeSeries(train_df, test_df, regressor='RF')

# Perform the entire process:
auto.run()

# We can get their values using:
pred_df = auto.pred_df
metrics_dict = auto.metrics_dict

print(pred_df.head())
print(metrics_dict)
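
The stationarity helper listed above can then be called on the fitted pipeline. A hedged sketch; it assumes IsStationary() is an instance method that reports whether the series is stationary:

# assumption: IsStationary() reports the stationarity of the series
print(auto.IsStationary())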

Clustering

from blitzml.unsupervised import Clustering
import pandas as pd

# prepare your dataframe
train_df = pd.read_csv("auxiliary/datasets/customer personality/train.csv")

# create the pipeline
auto = Clustering(train_df, clustering_algorithm='KM')

# first perform data preprocessing
auto.preprocess()
# second train the model
auto.train_the_model()

# After training the model we can generate:
auto.gen_pred_df()
auto.gen_metrics_dict()

# We can get their values using:
print(auto.pred_df.head())
print(auto.metrics_dict)

Available Clustering Algorithms

  • K-Means 'KM'
  • Affinity Propagation 'AP'
  • Agglomerative Clustering 'AC'
  • Mean Shift 'MS'
  • Spectral Clustering 'SC'
  • Birch 'Birch'
  • Bisecting K-Means 'BKM'
  • OPTICS 'OPTICS'
  • DBSCAN 'DBSCAN'

Parameters

clustering_algorithm
options: {"KM", "AP", "AC", "MS", "SC", "Birch", "BKM", "OPTICS", "DBSCAN", 'auto', 'custom'}, default = 'KM' auto: selects the best scoring clustering algorithm based on silhouette score custom: enables providing a custom clustering algorithm through *file_path* and *class_name* parameters file_path when using 'custom' clustering_algorithm, pass the path of the file containing the custom class, default = 'none'
class_name when using 'custom' clustering_algorithm, pass the class name through this parameter, default = 'none' feature_selection options: {'importance', 'none'}, default = 'none' importance: use feature columns that are important for the model to predict the target none: use all feature columns **kwargs optional parameters for the chosen clustering_algorithm. you can find available parameters in the sklearn docs

Attributes

train_df
the preprocessed train dataset (after running Clustering.preprocess())
model
the trained model (after running Clustering.train_the_model())
pred_df
the prediction dataframe (train_df + predicted cluster labels) (after running Clustering.gen_pred_df())
metrics_dict
the clustering metrics (after running Clustering.gen_metrics_dict())
{
    "silhouette_score": sil_score,
    "calinski_harabasz_score": cal_har_score,
    "davies_bouldin_score": dav_boul_score,
    "n_clusters": n
}

Methods

preprocess()
perform preprocessing on train_df
train_the_model()
train the chosen clustering algorithm on the train_df
clustering_visualization()
2-d visualization of the data points with their corresponding labels (after dimensionality reduction using Principal Component Analysis); see the plotting sketch after this list.
returns:
{
    'principal_component_1': pc1,
    'principal_component_2': pc2,
    'cluster_labels': labels,
    'title': title
}
gen_pred_df()
generates the prediction dataframe and assigns it to the pred_df attribute
gen_metrics_dict()
generates the clustering metrics and assigns them to the metrics_dict attribute
run()
a shortcut that runs the following methods:
preprocess()
train_the_model()
gen_pred_df()
gen_metrics_dict()
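
The dictionary returned by clustering_visualization() maps directly onto a 2-d scatter plot. A minimal sketch, assuming matplotlib is installed and auto is the trained Clustering pipeline from the example above:

import matplotlib.pyplot as plt

viz = auto.clustering_visualization()

# data points in PCA space, colored by cluster label
plt.scatter(
    viz['principal_component_1'],
    viz['principal_component_2'],
    c=viz['cluster_labels'],
)
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.title(viz['title'])
plt.show()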

Development

  • Clone the repo
  • run pip install virtualenv
  • run python -m virtualenv venv
  • run . ./venv/bin/activate on UNIX-based systems, or . ./venv/Scripts/activate.ps1 on Windows
  • run pip install -r requirements.txt
  • run pre-commit install

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

blitzml-0.17.0.tar.gz (16.3 kB)

Built Distribution

blitzml-0.17.0-py3-none-any.whl (21.8 kB)

File details

Details for the file blitzml-0.17.0.tar.gz.

File metadata

  • Download URL: blitzml-0.17.0.tar.gz
  • Upload date:
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.4

File hashes

Hashes for blitzml-0.17.0.tar.gz
  • SHA256: bbbbf9c759faf2b544635457c2dc30fc05c0e1a7f5d36bde3026671becf5fe7c
  • MD5: e0591df0eb0b5499b952d49794ee3ed7
  • BLAKE2b-256: 5b28cc54a36ecedff33977d8f378a2a5c76d1c7e5cd963abf38cd251203b0ce5

See more details on using hashes here.

File details

Details for the file blitzml-0.17.0-py3-none-any.whl.

File metadata

  • Download URL: blitzml-0.17.0-py3-none-any.whl
  • Upload date:
  • Size: 21.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.4

File hashes

Hashes for blitzml-0.17.0-py3-none-any.whl
  • SHA256: 51f2d81a93adfc95cde5d22e49ef724894a61263e676a3c838e20e419ff274d1
  • MD5: eabd4536b3d1fcbe60fdb3321a0da952
  • BLAKE2b-256: a143b45e70d145ae60e7179c646bdf4549b1c6fa588bad9d2fcc356322d83190

See more details on using hashes here.
