
Preprocessing for ML models made easy.


Prep-ML

What is Prep-ML?

prep-ml is an open-source preprocessing library that simplifies data processing and streamlines the transformations applied to your data before you feed it to the machine learning algorithm of your choice.


Why Prep-ML?

Production-grade machine learning is quite different from typical notebook development. Notebooks are aimed at fast development, interactive code, and visual feedback, while production scripts must serve models to large audiences or companies.

For example, suppose one of your model's key features is DATE_OF_BIRTH. In production, because database schemas vary, the feature may arrive under any of its synonyms (say, DOB or BIRTH_DATE). This is the gap prep-ml tries to fill, with a design heavily inspired by ETL tools and other design patterns.


Installation

$ pip install prep-ml


Documentation

This is the schema currently supported by the library. It can be supplied as a Python dict or a JSON string.

{
    "FEATURE_NAME": {
        "required": bool,
        "encoding": str,
        "alias": str,
        "imputation": str,
        "derived_eq": str
    }
}
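As a rough illustration of the two accepted forms, the sketch below builds a small mapper as a Python dict and converts it to a JSON string with the standard library. The feature names are taken from the usage example further down; how prep-ml parses the string internally is not shown here and is not assumed.

```python
import json

# A small mapper written as a Python dict (booleans are True/False).
mapper = {
    "GENDER": {"required": True, "encoding": "ohe"},
    "FIRST_NAME": {"required": False},
}

# The same mapper as a JSON string (booleans become true/false).
json_mapper = json.dumps(mapper)
print(json_mapper)

# Round-tripping the string gives back an identical dict.
assert json.loads(json_mapper) == mapper
```

Note that a hand-written JSON mapper must use lowercase true and false, since JSON does not accept Python's True and False.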

Schema Definitions:

required: bool

accepted values: True, False

determines if the feature is required for the model.

  • If required is set to False, the FEATURE_NAME is dropped and excluded from further processing.
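The effect of required can be pictured with plain pandas. This is a minimal sketch, not prep-ml's own code: the sample dataframe is made up, and treating a missing required key as False is an assumption made only for this illustration.

```python
import pandas as pd

# Hypothetical mapper and input data, for illustration only.
mapper = {
    "EMPLOYEE_ID": {"required": True},
    "FIRST_NAME": {"required": False},
    "RATING": {"required": True},
}
df = pd.DataFrame({
    "EMPLOYEE_ID": [1, 2],
    "FIRST_NAME": ["Ann", "Bob"],
    "RATING": [4, 5],
})

# Keep only the features flagged as required.
# Assumption: a feature without a "required" key is treated as not required.
keep = [name for name, spec in mapper.items() if spec.get("required", False)]
filtered = df[keep]
print(filtered.columns.to_list())
```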

encoding: str

accepted values: label, ohe

performs the given encoding strategy on the FEATURE_NAME.

  • If encoding is set to "label", LabelEncoding or OrdinalEncoding is performed on the FEATURE_NAME
  • If encoding is set to "ohe", OneHotEncoding is performed on the FEATURE_NAME
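To show what the two strategies produce, here is a small sketch that uses pandas stand-ins (category codes and get_dummies) rather than the encoders prep-ml wraps. The data is invented, and the one-hot column names here follow pandas conventions, which differ from the GENDER_1, GENDER_2 style shown in the usage output below.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "DESIGNATION": ["Engineer", "Manager", "Engineer", "Analyst"],
    "GENDER": ["F", "M", "F", "M"],
})

# "label"-style encoding: one integer per category.
# Categories are sorted, so Analyst=0, Engineer=1, Manager=2.
label_codes = df["DESIGNATION"].astype("category").cat.codes

# "ohe"-style encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(df["GENDER"], prefix="GENDER", dtype=int)

print(label_codes.tolist())
print(one_hot)
```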

alias: str

accepted values: any string

this is a synonym or alias for the given FEATURE_NAME.

  • For example, if alias is set to "FEATURE_OTHER_NAME", a column with that name in the input is mapped to FEATURE_NAME.
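Conceptually, alias mapping amounts to a column rename. The sketch below shows that idea with pandas; it is an assumption about the behaviour described above, not prep-ml's implementation, and the sample data is made up.

```python
import pandas as pd

# Hypothetical mapper: the key is FEATURE_NAME, "alias" is the incoming column name.
mapper = {
    "EMPLOYEE_ID": {"alias": "EMP_ID"},
    "AGE": {"alias": "DOB"},
}
df = pd.DataFrame({
    "EMP_ID": [101, 102],
    "DOB": ["01/15/1990", "07/04/1985"],
})

# Build alias -> FEATURE_NAME pairs, but only for aliases present in the input.
rename_map = {
    spec["alias"]: name
    for name, spec in mapper.items()
    if "alias" in spec and spec["alias"] in df.columns
}
renamed = df.rename(columns=rename_map)
print(renamed.columns.to_list())
```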

imputation: str

accepted values: mean, median, most_frequent

performs the given imputation strategy on the FEATURE_NAME. This is a wrapper of SimpleImputer.

  • If "mean", then replace missing values using the mean for the FEATURE_NAME. Can only be used with numeric data.
  • If "median", then replace missing values using the median for the FEATURE_NAME. Can only be used with numeric data.
  • If "most_frequent", then replace missing values using the most frequent value for the FEATURE_NAME. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
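The three strategies can be worked through by hand with pandas. This sketch deliberately uses fillna instead of SimpleImputer, so it only illustrates the values each strategy would fill in; the RATING values are invented.

```python
import pandas as pd

# Hypothetical ratings with two missing entries.
rating = pd.Series([4.0, None, 3.0, 3.0, 5.0, None, 5.0])

# "mean": the observed values are 4, 3, 3, 5, 5, so the mean is 4.0.
mean_filled = rating.fillna(rating.mean())

# "median": the sorted observed values are 3, 3, 4, 5, 5, so the median is 4.0.
median_filled = rating.fillna(rating.median())

# "most_frequent": 3.0 and 5.0 both occur twice; mode() returns them sorted,
# so taking the first entry picks the smallest, 3.0.
mode_filled = rating.fillna(rating.mode().iloc[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```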

derived_eq: str

accepted values: eval equation as a string

evaluates the given expression and assigns the result to FEATURE_NAME. The expression must refer to the dataframe as df.

  • For example, for the DOB feature above, if derived_eq is set to "pd.to_datetime(df.DOB, format='%m/%d/%Y')", the expression is evaluated and the result is assigned to FEATURE_NAME. Note that df refers to the input dataframe you provide.
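One way to picture this is Python's built-in eval with pd and df made available to the expression. The sketch below is an assumption about the general mechanism, not a copy of prep-ml's code, and the sample dataframe is made up.

```python
import pandas as pd

# Hypothetical input data, for illustration only.
df = pd.DataFrame({"DOB": ["01/15/1990", "07/04/1985"]})

# The expression exactly as it would appear in the mapper.
derived_eq = "pd.to_datetime(df.DOB, format='%m/%d/%Y')"

# Evaluate the string with only pd and df visible to it,
# then assign the result back to the feature column.
df["DOB"] = eval(derived_eq, {"pd": pd, "df": df})

print(df["DOB"].dt.year.tolist())
```

Because eval executes whatever the string contains, a mapper with derived_eq entries should only come from a trusted source.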

Methods:

from_dict(mapper, dataframe) -- reads the dict and processes the input dataframe.

from_json(json_mapper, dataframe) -- reads the json and processes the input dataframe.

get_data() -- fetches the processed data.


Usage Example:

This is the input data:

[image: input_data]

Data Explanation:

This is randomly generated data for demo purposes. All references are fictitious.

This is employee data for a company. It has various features, which are self-explanatory.

Ideally, we want to remove the name columns (FIRST_NAME and LAST_NAME), as they are unique per row and serve no purpose in the model; transform DOB into a derived feature called AGE; encode GENDER, DESIGNATION, and PROMOTED; and impute RATING.

The following driver code does this:

from prep_ml.pre_processor import Prep
import pandas as pd

prep_ob = {
    "EMPLOYEE_ID": {
        "alias": "EMP_ID",
        "required": True
    },
    "FIRST_NAME": {
        "required": False,
    },
    "LAST_NAME": {
        "required": False,
    },
    "AGE": {
        "required": True,
        "alias": "DOB",
        "derived_eq": "(pd.Timestamp('now') - pd.to_datetime(df.AGE, format='%m/%d/%Y')).astype('<m8[Y]')"
    },
    "GENDER": {
        "required": True,
        "encoding": 'ohe'
    },
    "RATING": {
        "required": True,
        "imputation": 'most_frequent'
    },
    "DESIGNATION": {
        "required": True,
        "encoding": 'label'
    },
    "PROMOTED": {
        "required": True,
        "encoding": 'label'
    }
}

df = pd.read_csv('tests/MOCK_DATA.csv')

p = Prep.from_dict(prep_ob, df)
rdf = p.get_data()
print(rdf.columns.to_list())

Output

['EMPLOYEE_ID', 'DESIGNATION', 'PROMOTED', 'RATING', 'AGE', 'GENDER_1', 'GENDER_2', 'GENDER_3']

The output in a data viewer is as follows:

[image: readme_processed]
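The AGE expression in the mapper casts a timedelta to year units, which newer pandas releases may not accept. As a hedged alternative, the sketch below derives whole years from calendar fields instead. It is a swapped-in technique, not the library's method; the sample dates and the fixed reference date are made up so the result is reproducible.

```python
import pandas as pd

# Hypothetical input data and a fixed "today" for a reproducible result.
df = pd.DataFrame({"DOB": ["01/15/1990", "07/04/1985"]})
today = pd.Timestamp("2021-05-02")

dob = pd.to_datetime(df["DOB"], format="%m/%d/%Y")

# True where the birthday has already occurred in the reference year.
had_birthday = (dob.dt.month < today.month) | (
    (dob.dt.month == today.month) & (dob.dt.day <= today.day)
)

# Year difference, minus one if the birthday is still to come.
age = today.year - dob.dt.year - (~had_birthday).astype(int)

print(age.tolist())
```

In real use, pd.Timestamp('now') would replace the fixed date, and the same logic could be written as a single derived_eq string.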


Future Development Roadmap

  • Performance improvements.
  • Add support for more imputation and encoding strategies.
  • Support for feature scaling.
  • Support for multiple schemas.
  • Support for multiple input sources.
  • Support for enforcing column types.
  • Feasibility for model training.

Changelog

2nd May, 2021 :: v0.1.0:

  • This is a very early development version. It needs further development and code optimization.

