data science preprocessing toolkit and container


dstkc

data science toolkit and storage container

This toolkit organizes your data and your data science models so they are easy to work with: the class acts as both a pre-processor and a storage container.

:param df: Pandas dataframe

:param model: data science model

:param y_col: column in df that contains dependent variable

:param x_cols: columns in df that contain independent variables

:param train_test_split_params: parameters for scikit-learn's train_test_split function; see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html for reference
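Any keyword accepted by train_test_split can be passed through here; a typical set (the values are illustrative) might be:

train_test_split_params = {
    'test_size': 0.3,    # hold out 30% of the rows for testing
    'random_state': 42,  # fix the split for reproducibility
    'shuffle': True,     # shuffle before splitting (sklearn's default)
}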

Other attributes include:

:attr train_data: pandas dataframe with all training data

:attr test_data: pandas dataframe with all testing data

:attr x_data: pandas dataframe with all x data

:attr y_data: pandas dataframe with all y data

:attr x_test: pandas dataframe with all x testing data

:attr x_train: pandas dataframe with all x training data

:attr x_test_array: numpy array with all x testing data

:attr x_train_array: numpy array with all x training data

:attr y_test: pandas dataframe with all y testing data

:attr y_train: pandas dataframe with all y training data

:attr y_test_array: numpy array with all y testing data

:attr y_train_array: numpy array with all y training data

:attr model: store your model here for later use

:attr predictions: store your model's predictions here

:attr score: store a scoring or performance metric here

:attr notes: a place for you to store any and all notes

:attr misc_container: dict-style container for storing anything else you might need
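Before the full walkthrough, here is a minimal sketch of how the parameters map onto those attributes (the toy dataframe, and passing a plain column name for y_col, are illustrative assumptions; the worked example below passes an Index slice instead):

import pandas as pd
from sklearn.linear_model import LogisticRegression

from dstkc.tkc import DataScienceToolKit

toy_df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5, 6, 7, 8],
    'x2': [8, 7, 6, 5, 4, 3, 2, 1],
    'y':  [0, 0, 0, 0, 1, 1, 1, 1],
})

dstk = DataScienceToolKit(
    df=toy_df,
    model=LogisticRegression(),
    y_col='y',
    x_cols=['x1', 'x2'],
    train_test_split_params={'test_size': 0.25},
)

print(dstk.x_train)        # dataframe view, handy for inspection
print(dstk.x_train_array)  # numpy view, handy for modeling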

Here is an example of a few use cases:

import pandas as pd
from operator import attrgetter
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

from dstkc.tkc import DataScienceToolKit

Here we read in the Iris data set (because it's a classic):

def example_main():
    df = pd.read_csv(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
        names=[
             'sepal_length',
             'sepal_width',
             'petal_length',
             'petal_width',
             'species'
        ]
    )

We need to handle a bit of data pre-processing ourselves; currently the toolkit doesn't handle nulls or string-to-numeric conversion:

    df['species'] = df['species'].apply(
        lambda x: 0 if x == 'Iris-setosa' else 1 if x == 'Iris-versicolor' else 2
    )
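If your data might contain nulls, you would need to handle those yourself as well; the bluntest approach (an illustrative choice, not something the toolkit does for you) is simply:

    df = df.dropna()  # drop any rows containing missing values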

There are two useful cases for this toolkit as it stands. One is cycling through columns when you're unsure which combination of columns to use:

    all_x_cols = df.columns[:-1]
    y_col = df.columns[-1:]

Here we are going to try different combinations of columns and store the results. Note that there is no data processing involved beyond the cleaning we already did on the dataframe.

Naming the model is not technically useful in this instance, but it is something to revisit in the future:

    toolkit_storage_container = []
    for i in range(1, len(all_x_cols)):
        model_name = 'knn'

        x_cols = all_x_cols[i:]

The class uses train_test_split from sklearn; the final argument holds the parameters for that function call. See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html for details.

        dstk = DataScienceToolKit(
            df=df,
            model=KNeighborsClassifier(),
            y_col=y_col,
            x_cols=x_cols,
            train_test_split_params={
                'test_size': 0.3
            }
        )

Each container can store notes, among other handy values. There is also a miscellaneous container, which acts as a dictionary, in case you want something else to float around with all of your data and models.

And just like that, you have all of your data ready to go, in one place!

        dstk.notes = model_name

        print(dstk.x_train)  # your data as a dataframe, for inspection and debugging
        print(dstk.x_train_array)  # your data as an array, for modeling

        print(dstk.y_train)  # your data as a dataframe, for inspection and debugging
        print(dstk.y_train_array)  # your data as an array, for modeling

        print(dstk.x_test)  # your data as a dataframe, for inspection and debugging
        print(dstk.x_test_array)  # your data as an array, for modeling

        print(dstk.y_test)  # your data as a dataframe, for inspection and debugging
        print(dstk.y_test_array)  # your data as an array, for modeling

Please note that we fit and score the model using the model's own methods, so any model can be used; this is not limited to scikit-learn (a model-agnostic sketch appears after the example):

        dstk.model.fit(dstk.x_train_array, dstk.y_train_array)

We can also store the model's predictions and score, in whatever form you would like:

        dstk.predictions = dstk.model.predict(dstk.x_test_array)
        dstk.score = dstk.model.score(
            dstk.x_test_array, dstk.y_test_array
        )

        # we're going to store this for later; this is where the real use case comes in
        toolkit_storage_container.append(dstk)

Now that we have finished iterating over the different column sets, maybe we want to know which one had the best performance:

    best_dstk = max(toolkit_storage_container, key=attrgetter('score'))

Now we may want to know what the columns were, or inspect that dataset outside of our code:

    print(best_dstk.x_cols)
    # best_dstk.train_data.to_csv('./train_data.csv')
    # best_dstk.test_data.to_csv('./test_data.csv')

Most importantly, all of the data, along with the model, performance information, predictions, arrays, and dataframes, now sits together in one place. You can use any feature of the model for inspection, or build on any aspect of the work so far, without altering any of your prior code.

Another use case is comparing model performance, not just column-set performance (now names/notes become more important):

    model_dict = {
        'sgdc': SGDClassifier(),
        'gauss': GaussianNB(),
        'knn': KNeighborsClassifier(),
        'dtc': DecisionTreeClassifier(),
        'rfc': RandomForestClassifier(),
    }

    # re-initializing this for the new example
    toolkit_storage_container = []

    for key_, value_ in model_dict.items():
        model_name = key_
        new_model = value_

Note that in this instance we are not declaring x_cols, because we are going to use all columns other than the y_col, which is the toolkit's default behaviour. Also, if the y_col were in the first position of the dataframe, it would not need to be specified either:

        dstk = DataScienceToolKit(
            df=df,
            model=new_model,
            y_col=y_col,
            train_test_split_params={
                'test_size': 0.3
            }
        )

        dstk.notes = model_name

        dstk.model.fit(
            dstk.x_train_array, dstk.y_train_array
        )
        dstk.predictions = dstk.model.predict(
            dstk.x_test_array
        )
        dstk.score = dstk.model.score(
            dstk.x_test_array, dstk.y_test_array
        )

Here is an example of using the miscellaneous container to store a confusion matrix for later:

        dstk.misc_container['confusion_matrix'] = confusion_matrix(
            dstk.y_test_array, dstk.predictions
        )

        toolkit_storage_container.append(dstk)

This time around, we want to know which model had the best score:

    best_dstk = max(toolkit_storage_container, key=attrgetter('score'))

And yet again, we have all of the relevant information: which model was best, and what its performance was. We also have all of the models, in case we want to test or compare any aspect of them further. What's more, one can combine the two cases and iterate over both column sets and model choices (as sketched below) to quickly home in on interesting results, without the data cleaning, splitting, and storage chores that usually cause (well, at least for me) headaches:

    print(best_dstk.notes)
    print(best_dstk.score)
    print(best_dstk.misc_container['confusion_matrix'])
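The two cases combine naturally. Here is a sketch of iterating over both column sets and models at once, reusing the names defined above; each model is cloned so every container gets its own unfitted copy:

    from sklearn.base import clone  # local import, used only for this sketch

    combined_container = []
    for model_name, base_model in model_dict.items():
        for i in range(1, len(all_x_cols)):
            dstk = DataScienceToolKit(
                df=df,
                model=clone(base_model),  # fresh, unfitted copy for each container
                y_col=y_col,
                x_cols=all_x_cols[i:],
                train_test_split_params={'test_size': 0.3}
            )
            dstk.notes = '{} | {}'.format(model_name, list(all_x_cols[i:]))
            dstk.model.fit(dstk.x_train_array, dstk.y_train_array)
            dstk.score = dstk.model.score(dstk.x_test_array, dstk.y_test_array)
            combined_container.append(dstk)

    best_dstk = max(combined_container, key=attrgetter('score'))
    print(best_dstk.notes, best_dstk.score)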

    print(
        '''
        Hope you enjoy, and find this useful!
        '''
    )


if __name__ == '__main__':
    example_main()
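Finally, because the container only ever calls the model's own fit/predict/score methods, nothing here is scikit-learn specific. A minimal sketch with a hypothetical, dependency-free estimator (MajorityClassModel is made up for illustration):

import numpy as np


class MajorityClassModel:
    """Hypothetical baseline model: always predicts the most common training class."""

    def fit(self, x, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]  # remember the modal class
        return self

    def predict(self, x):
        return np.full(len(x), self.majority_)  # predict the modal class for every row

    def score(self, x, y):
        return float(np.mean(self.predict(x) == y))  # plain accuracy

Any object implementing that interface can be dropped into DataScienceToolKit via the model argument, exactly like the scikit-learn estimators above.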
