Skip to main content

vtreat is a pandas.DataFrame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.

Project description

This is the Python version of the vtreat data preparation system (also available as an R package).

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.

vtreat takes an input DataFrame that has a specified column called "the outcome variable" (or "y") that is the quantity to be predicted (and must not have missing values). Other input columns are possible explanatory variables (typically numeric or categorical/string-valued, these columns may have missing values) that the user later wants to use to predict "y". In practice such an input DataFrame may not be immediately suitable for machine learning procedures that often expect only numeric explanatory variables, and may not tolerate missing values.

To solve this, vtreat builds a transformed DataFrame where all explanatory variable columns have been transformed into a number of numeric explanatory variable columns, without missing values. The vtreat implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified "y" or dependent/outcome column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed DataFrame is suitable for a wide range of supervised learning methods from linear regression, through gradient boosted machines.

The idea is: you can take a DataFrame of messy real world data and easily, faithfully, reliably, and repeatably prepare it for machine learning using documented methods using vtreat. Incorporating vtreat into your machine learning workflow lets you quickly work with very diverse structured data.

Worked examples can be found here.

For more detail please see here: arXiv:1611.09477 stat.AP (the documentation describes the R version, however all of the examples can be found worked in Python here).

vtreat is available as a Python/Pandas package, and also as an R package.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vtreat-0.5.0.tar.gz (27.9 kB view details)

Uploaded Source

Built Distribution

vtreat-0.5.0-py3-none-any.whl (20.5 kB view details)

Uploaded Python 3

File details

Details for the file vtreat-0.5.0.tar.gz.

File metadata

  • Download URL: vtreat-0.5.0.tar.gz
  • Upload date:
  • Size: 27.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.4

File hashes

Hashes for vtreat-0.5.0.tar.gz
Algorithm Hash digest
SHA256 4dfd6403ddee3a85afef6a47aad07d4d724129818f444ff39466e0688c11d172
MD5 bb084f8743c8c8241fd4128f181af2be
BLAKE2b-256 6b322ca8e4735c5762c804b7192b160aa2af41c77e589c51ef7f487707b8a2f1

See more details on using hashes here.

File details

Details for the file vtreat-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: vtreat-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 20.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.4

File hashes

Hashes for vtreat-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b8221eb238381e4b8ae78a2bf5da5d5594b73587e81d357a65972fecc67962b4
MD5 f885d52b97ee94778fc254f6e91257f6
BLAKE2b-256 26241fdb7dfb976a449216b7242617378ca1db6a43364a4c2099677ba48045e1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page