A framework to organize the process of designing supervised machine learning systems

Project description

A framework to ease the burden of organizing the code of a supervised machine learning system.

It provides decorators that manage data and pass it between common steps in building a machine learning system, such as:

- loading the dataset
- preprocessing
- feature generation
- model definition

While doing this, it keeps the global namespace free of the clutter that an endless chain of intermediate feature and model variables would otherwise create.

In addition, it makes it easy to put new, real-life data through the exact same pipeline that the training data went through.

Installation

Install simply via pip (Python 3):

$ pip install stickbugml

Dependencies:

- Python 3
- sklearn
- pandas
- numpy

Example

Note: there is also an example for use in Jupyter notebooks.

First, import this library:

import stickbugml
from stickbugml.decorators import dataset, preprocess, feature, model

Load your dataset:

import seaborn as sns # provides the example Titanic dataset
import pandas as pd

@dataset(train_valid_test=(0.6, 0.2, 0.2)) # define your train/validation/test data splits
def raw_dataset():
    titanic_dataset = sns.load_dataset('titanic')

    # Drop NaN rows for simplicity
    titanic_dataset.dropna(inplace=True)

    # Extract X and y
    X = titanic_dataset.drop('survived', axis=1)
    y = titanic_dataset['survived']
    return X, y

print(raw_dataset.head()) # yes, this does work! raw_dataset is now a pandas DataFrame

(Optionally) do some pre-processing:

@preprocess
def preprocessed_dataset(X):
    # Encode categorical columns
    categorical_column_names = [
            'sex', 'embarked', 'class',
            'who', 'adult_male', 'deck',
            'embark_town', 'alive', 'alone']

    X = pd.get_dummies(X,
                       columns=categorical_column_names,
                       prefix=categorical_column_names)

    return X

print(preprocessed_dataset.head()) # See the first code block for an explanation

Generate some features:

from sklearn import decomposition

@feature('pca')
def pca_feature(X):
    # Project the preprocessed data onto its first three principal components
    pca = decomposition.PCA(n_components=3)
    pca_out = pca.fit_transform(X)  # shape: (n_samples, 3)

    # Keep the rows aligned with the samples in X
    return pd.DataFrame(pca_out, index=X.index)

# Preview the generated feature
print(pca_feature.head()) # See the first code block for an explanation
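
You can register as many features as you like; judging by pca_feature above, each decorated function receives the preprocessed X and returns a DataFrame. As a minimal sketch of a second feature (the family_size name is made up for illustration; the sibsp and parch columns come from the seaborn Titanic dataset, not from this library):

@feature('family_size')
def family_size_feature(X):
    # Combine the sibling/spouse and parent/child counts into one column
    return pd.DataFrame({'family_size': X['sibsp'] + X['parch']}, index=X.index)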

And define your machine learning model(s):

import xgboost as xgb

@model('xgboost')
def xgboost_model():
    def define(num_columns):
        return None # xgboost models aren't pre-defined


    def train(model, params, train, validation):
        params['objective'] = 'binary:logistic' # Static parameters can be defined here
        params['eval_metric'] = 'logloss'

        d_train = xgb.DMatrix(train['X'], label=train['y'])
        d_valid = xgb.DMatrix(validation['X'], label=validation['y'])

        watchlist = [(d_train, 'train'), (d_valid, 'valid')]

        trained_model = xgb.train(params, d_train, 2000, watchlist, early_stopping_rounds=50, verbose_eval=10)

        return trained_model

    def predict(model, X):
        return model.predict(xgb.DMatrix(X))

    return define, train, predict

Now you can train your model, trying out different parameters if you want:

stickbugml.train('xgboost', {
    'max_depth': 7,
    'eta': 0.01
})
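
If a parameter set underperforms, and assuming train can simply be re-run with a new parameter dict, you could try, say, a shallower tree with a higher learning rate:

stickbugml.train('xgboost', {
    'max_depth': 5,
    'eta': 0.1
})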

The library keeps the test set's ground-truth values locked away so your models can't train on them. After training, have the framework evaluate your model for you:

logloss_score = stickbugml.evaluate('xgboost')
print(logloss_score)

You can add more models and features if desired.
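
As a minimal sketch of a second model (assuming the same define/train/predict contract shown above; the 'logreg' name and hyperparameters are made up for illustration), a scikit-learn logistic regression could be wrapped like this:

from sklearn.linear_model import LogisticRegression

@model('logreg')
def logreg_model():
    def define(num_columns):
        # scikit-learn estimators can be constructed up front
        return LogisticRegression(max_iter=1000)

    def train(model, params, train, validation):
        model.set_params(**params)
        model.fit(train['X'], train['y'])
        return model

    def predict(model, X):
        # Return the probability of the positive class, matching the xgboost model
        return model.predict_proba(X)[:, 1]

    return define, train, predict

stickbugml.train('logreg', {'C': 0.5})
print(stickbugml.evaluate('logreg'))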

Since this library is built with real-world use in mind, you can easily get predictions for new data:

raw_X = pd.read_csv('2018_titanic_manifest.csv') # It will probably sink, but we don't know who will survive
processed_X = stickbugml.process(raw_X) # Process the data
del raw_X # Gotta keep that namespace clean, right?

y = stickbugml.predict('xgboost', processed_X) # Make predictions

print(y)

Contributing & Feedback

If you have any problems, or would like a new feature, submit an Issue.

If you want to help out, feel free to submit a Pull Request.

License

This project is licensed under the Apache 2.0 License.

Download files

Download the file for your platform.

Source Distribution

stickbugml-1.0.4.tar.gz (9.0 kB, Source)

Built Distribution

stickbugml-1.0.4-py3-none-any.whl (9.4 kB, Python 3)

File details

Details for the file stickbugml-1.0.4.tar.gz.

File metadata

  • Download URL: stickbugml-1.0.4.tar.gz
  • Size: 9.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for stickbugml-1.0.4.tar.gz:

  • SHA256: 02b9525b7304d470841943886c9469aba6e672c7d690f7648f3af685f36e8f51
  • MD5: 1a2a3fffdfc218264a91c13a8bc51268
  • BLAKE2b-256: bbc26683f8588c755f50c22471d91006478f05153518ad5c367c2345f6432e59


File details

Details for the file stickbugml-1.0.4-py3-none-any.whl.

File hashes

Hashes for stickbugml-1.0.4-py3-none-any.whl:

  • SHA256: 32ada768dac15ac3765a6a2e5e00dc513f3c8ad194b3b619ee3c4df6852438f4
  • MD5: d181dff9e705bf26a2e323735030bc9b
  • BLAKE2b-256: b02fd56affa33dbc9b158fc2a48dd25ddef41bb0375eb01c023259db4e1a320c

