Adversarial validation for train-test datasets

The code is written for Python 3.

What is Adversarial Validation?

The objective of any predictive modelling project is to create a model using the training data, and afterwards apply this model to the test data. However, for the best results it is essential that the training data is a representative sample of the data we intend to apply the model to (i.e. the test data); otherwise our model will, at best, under-perform, or at worst, be completely useless.

*Adversarial Validation* is a simple way to find out whether our test data and our training data are similar. We combine our train and test data, labeling them with, say, a 0 for the training data and a 1 for the test data, mix them up, then see if we are able to correctly re-identify them using a binary classifier.

If we cannot classify them correctly, i.e. the area under the [receiver operating characteristic curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (ROC AUC) is close to 0.5, then the two sets are indistinguishable and we are good to go.

However, if we can classify them (ROC AUC noticeably above 0.5) then we have a problem, either with the whole dataset or, more likely, with particular features that probably follow different distributions in the test and train datasets. In that case we can look at the feature that is most out of place. The problem may be, for example, that some values appear only in the training data and not in the test data. If a single feature contributes very strongly to the AUC, it may well be a good idea to remove that feature from the model.
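One way to find the feature that is most out of place is to run the same train-versus-test check on each column separately. The sketch below is not part of the adval package: the function name `per_feature_auc` is hypothetical, it assumes numeric columns without missing values, and it uses a shallow scikit-learn decision tree purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def per_feature_auc(train, test, seed=0):
    """Rank shared columns by how well each one alone separates train from test.

    Hypothetical helper: assumes numeric columns with no missing values.
    """
    # Only columns present in both frames can be compared
    # (this also skips a target column that exists only in train).
    shared = [c for c in train.columns if c in test.columns]

    # Origin label: 0 = training row, 1 = test row.
    origin = np.r_[np.zeros(len(train)), np.ones(len(test))]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    scores = {}
    for col in shared:
        x = pd.concat([train[col], test[col]], ignore_index=True).to_frame()
        model = DecisionTreeClassifier(max_depth=3, random_state=seed)
        # Out-of-fold probabilities, so the AUC is not inflated by overfitting.
        proba = cross_val_predict(model, x, origin, cv=cv, method="predict_proba")[:, 1]
        scores[col] = roc_auc_score(origin, proba)

    # Highest AUC first: these columns differ most between the two sets.
    return pd.Series(scores).sort_values(ascending=False)
```

Columns with an AUC close to 0.5 look the same in both sets; a column near the top of the ranking with a high AUC is a candidate for closer inspection or removal.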

Adversarial Validation to reduce overfitting

The key to avoiding overfitting is to create a situation where the local cross-validation (CV) score is representative of the competition score. When the ROC AUC is close to 0.5, our local data is representative of the test data, so the local CV score should be representative of the public leaderboard (LB) score.

Procedure:

  • drop the training data target column

  • label the test and train data with 0 and 1 (it doesn’t really matter which is which)

  • combine the training and test data into one big dataset

  • perform the binary classification, for example using XGBoost

  • look at our AUC ROC score
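The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the adval implementation: the function name `adversarial_auc` is hypothetical, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to keep the example self-contained, and the features are assumed to be numeric with no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_auc(train, test, target, seed=0):
    """Cross-validated ROC AUC of a classifier that tries to tell train from test.

    Hypothetical helper: assumes numeric features with no missing values.
    """
    # Step 1: drop the training target column.
    features = train.drop(columns=[target])
    shared = [c for c in features.columns if c in test.columns]

    # Steps 2-3: label the rows (0 = train, 1 = test) and combine them.
    combined = pd.concat([features[shared], test[shared]], ignore_index=True)
    origin = np.r_[np.zeros(len(features)), np.ones(len(test))]

    # Step 4: binary classification; shuffled folds mix the two sets.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    proba = cross_val_predict(model, combined, origin, cv=cv, method="predict_proba")[:, 1]

    # Step 5: the ROC AUC; about 0.5 means train and test look alike.
    return roc_auc_score(origin, proba)
```

The score is computed on out-of-fold predictions, so a flexible classifier cannot push it above 0.5 simply by memorising rows.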

Installation

Fast install:

pip install adval

Example on Mobile Price Classification Dataset

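import pandas as pd
from validation.adval import adVal

# Load the two splits as DataFrames.
# Assumption: the file names below are placeholders; adjust them to your copy of the dataset.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# In this dataset:
# target column = "price_range"
# ID column = "id"
# Run the module
k = adVal(train, test, "price_range", "id")

# Get the AUC score
k.auc_score()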
