Solve a different task which could help solve the main task

task_substitution

Solve an auxiliary task using ML.

This library was created using nbdev; please check it out.

Task substitution is a method of solving an auxiliary problem (with different features and a different target) in order to better understand the initial problem and solve it more efficiently.

Let's take a look at a standard machine learning task. In the figure below you see a regression task with features f1, f2, f3 and target variable y.

We want to build a model on the above dataset to predict the unknown y values in the test dataset shown below.

Exploratory Data Analysis

The first step we take to solve the problem is to look at the data. There can be many features with missing values or outliers, and these need to be understood. It is possible that there is a relationship between a missing value and the values of other features.
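One quick way to check for such a relationship is to correlate a missingness indicator with the other features. The snippet below is a minimal sketch on synthetic data, not part of this library: the feature names f1, f2, f3 follow the example above, and the rule that makes f3 go missing more often when f1 is large is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
f3 = rng.normal(size=n)

# Synthetic assumption: f3 is more likely to be missing when f1 is large.
f3[rng.random(n) < 1 / (1 + np.exp(-2 * f1))] = np.nan

# 1.0 where f3 is missing, 0.0 otherwise.
is_missing = np.isnan(f3).astype(float)

# Correlation between the missingness indicator and each other feature.
corrs = {
    name: np.corrcoef(is_missing, col)[0, 1]
    for name, col in {"f1": f1, "f2": f2}.items()
}
for name, corr in corrs.items():
    print(f"corr(missing(f3), {name}) = {corr:+.2f}")
```

A correlation far from zero (here for f1) suggests the values are not missing at random, which is a hint that other features carry information about the missing ones.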

Recover Missing Values

A feature can have missing values because of a data recording issue, a bug, etc. For numerical features we often replace a missing value with the mean or median as an approximation. Sometimes we replace missing values with a sentinel such as -9999 so that the model treats them differently, and sometimes we leave them as they are, since libraries like xgboost and lightgbm can handle nulls. Let's look at the following dataset.

Here we have a feature f3 with missing values. Since it is a numerical feature, we can treat f3 as the target and reframe this as a regression problem in which we try to predict the missing values.

The above setup is identical to the original regression task: here we would build a model that uses f1 and f2 to predict f3. So instead of using the mean, median, etc., we can build a model to restore missing values, which can help us solve the original problem efficiently.

We have to be careful not to overfit when building such models.
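The reframing above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the library's implementation: `impute_with_model` is a hypothetical helper, a least-squares linear model stands in for the LightGBM model the library uses, and the data is synthetic. Part of the rows with a known f3 is held out so that the imputer can be compared against a mean-imputation baseline, which is one simple guard against overfitting.

```python
import numpy as np

def impute_with_model(X, target, test_size=0.2, seed=41):
    """Fill NaNs in `target` with predictions from a linear fit on X.

    Hypothetical helper: a least-squares model stands in for a
    gradient-boosted regressor.
    """
    rng = np.random.default_rng(seed)
    known = ~np.isnan(target)
    A = np.column_stack([X, np.ones(len(X))])  # append an intercept column

    # Hold out part of the known rows to measure imputation quality.
    idx = rng.permutation(np.flatnonzero(known))
    n_val = max(1, int(len(idx) * test_size))
    val_idx, fit_idx = idx[:n_val], idx[n_val:]

    coef, *_ = np.linalg.lstsq(A[fit_idx], target[fit_idx], rcond=None)
    rmse_model = np.sqrt(np.mean((A[val_idx] @ coef - target[val_idx]) ** 2))
    rmse_mean = np.sqrt(np.mean((target[fit_idx].mean() - target[val_idx]) ** 2))

    # Only the missing entries are replaced; observed values are kept.
    filled = target.copy()
    filled[~known] = A[~known] @ coef
    return filled, rmse_model, rmse_mean

# Synthetic data: f3 depends on f1 and f2, with roughly 20% of values missing.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=500), rng.normal(size=500)
f3 = 2.0 * f1 - 1.0 * f2 + rng.normal(scale=0.1, size=500)
f3[rng.random(500) < 0.2] = np.nan

f3_filled, rmse_model, rmse_mean = impute_with_model(np.column_stack([f1, f2]), f3)
print(f"holdout RMSE: model={rmse_model:.3f}, mean baseline={rmse_mean:.3f}")
```

If the holdout error of the model is not clearly lower than that of the mean baseline, the simpler replacement is probably the safer choice.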

Check Train Test Distributions

Whenever we train a model we want to use it on a new sample, but what if the new sample comes from a different distribution than the data on which the model was trained? When we deploy our solutions to production we want to be very careful about this change, as our models can fail if there is a mismatch between the train and test sets. We can pose this problem as an auxiliary task: create a new binary target y, where 1 means the row comes from the test set and 0 means it comes from the train set, and then train a model to predict whether a row comes from train or test. If the performance is high (e.g. an AUC score well above 0.5, the value expected when the two sets are indistinguishable), we can conclude that the train and test sets come from different distributions. Of course, we need to remove the original target from the analysis.

In the above images you can see two different datasets. We want to verify whether these two come from the same distribution.

Consider the first dataset as the training set and the second as the test set for this example.

We create a new target called is_test, which denotes whether a row belongs to the test set or not.

Then we combine the training and test sets and train a model to predict whether a row comes from train or test. If our model performs well, then we know that these two datasets are from different distributions.

We would still have to dig deeper to confirm that this is the case, but the above method can help identify which features have drifted apart between the train and test datasets. If you look at the feature importance of the model that was used to separate train from test, you can identify such features.
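The procedure above can be sketched end to end with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: `auc_score` and `train_test_auc` are hypothetical helpers, a logistic regression trained by gradient descent stands in for the LightGBM classifier, and the absolute weights on standardized features stand in for feature importance. A linear model like this mostly picks up shifts in the mean; a tree-based model can detect other kinds of differences as well.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney U) formula; assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def train_test_auc(train, test, test_size=0.2, seed=41, lr=0.1, steps=500):
    """Fit a classifier for is_test; return holdout AUC and feature weights."""
    X = np.vstack([train, test])
    is_test = np.r_[np.zeros(len(train)), np.ones(len(test))]  # new binary target
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize features
    X = np.column_stack([X, np.ones(len(X))])  # append an intercept column

    # Random split of the combined rows into fit and validation parts.
    idx = np.random.default_rng(seed).permutation(len(X))
    n_val = int(len(X) * test_size)
    val, fit = idx[:n_val], idx[n_val:]

    # Logistic regression trained with plain gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X[fit] @ w))
        w -= lr * X[fit].T @ (p - is_test[fit]) / len(fit)
    return auc_score(is_test[val], X[val] @ w), w[:-1]

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 3))
same = rng.normal(size=(1000, 3))     # same distribution as train
shifted = rng.normal(size=(1000, 3))
shifted[:, 1] += 1.5                  # the second feature has drifted

auc_same, _ = train_test_auc(train, same)
auc_shift, weights = train_test_auc(train, shifted)
print(f"AUC (same distribution):    {auc_same:.2f}")
print(f"AUC (shifted distribution): {auc_shift:.2f}")
print("most separating feature index:", int(np.argmax(np.abs(weights))))
```

An AUC close to 0.5 means the classifier cannot tell the two sets apart, while a clearly higher AUC, together with one dominant weight, points to the feature that drifted.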

Install

For an editable install, use the following:

git clone https://github.com/numb3r33/task_substitution.git
pip install -e task_substitution

How to use

Recover Missing Values

Currently we only support missing value recovery for numerical features; we plan to extend support to other feature types as well. The library currently uses a LightGBM model to recover missing values.

import numpy as np
from sklearn.metrics import mean_squared_error
from task_substitution.recover_missing import *

# train: a pandas DataFrame loaded beforehand; drop the original target first.
train = train.drop('original_target', axis=1)

model_args = {
    'objective': 'regression',
    'learning_rate': 0.1,
    'num_leaves': 31,
    'min_data_in_leaf': 100,
    'num_boost_round': 100,
    'verbosity': -1,
    'seed': 41
}

split_args = {
    'test_size': .2,
    'random_state': 41
}

# target_fld: feature with missing values.
# cat_flds: categorical features in the original dataset.
# ignore_flds: features you want to ignore. ( these won't be used by LightGBM model to recover missing values)

rec = RecoverMissing(target_fld='f3',
                     cat_flds=[],
                     ignore_flds=['f2'],
                     perf_fn=lambda tr,pe: np.sqrt(mean_squared_error(tr, pe)),
                     split_args=split_args,
                     model_args=model_args
                    )

train_recovered = rec.run(train)

Check Train Test Distributions

We use a LightGBM model to predict whether a row comes from the test or the train distribution.

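import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from task_substitution.train_test_similarity import *

# train, test: pandas DataFrames loaded beforehand; drop the original target first.
train = train.drop('original_target', axis=1)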

split_args = {'test_size': 0.2, 'random_state': 41}

model_args = {
    'num_boost_round': 100,
    'objective': 'binary',
    'learning_rate': 0.1,
    'num_leaves': 31,
    'nthread': -1,
    'verbosity': -1,
    'seed': 41
}

# cat_flds: categorical features in the original dataset.
# ignore_flds: features you want to ignore. ( these won't be used by LightGBM model )

tts = TrainTestSimilarity(cat_flds=[], 
                          ignore_flds=None,
                          perf_fn=roc_auc_score,
                          split_args=split_args, 
                          model_args=model_args)
tts.run(train, test)

# to get feature importance
fig, ax = plt.subplots(1, figsize=(16, 10))
lgb.plot_importance(tts.trained_model, ax=ax, max_num_features=5, importance_type='gain')

Contributing

If you want to contribute to task_substitution, please refer to the contribution guidelines.
