Imputation for tables with missing values

DataWig - Imputation for Tables

DataWig learns models to impute missing values in tables.

For each to-be-imputed column, DataWig trains a supervised machine learning model to predict the observed values in that column using the data from other columns.
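The idea of training on rows where the column is observed and predicting where it is missing can be illustrated with a small standard-library sketch. This is not DataWig's implementation: the token-voting classifier below is a deliberately simple stand-in for DataWig's learned models, and the function names (`fit_token_votes`, `impute`) and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def fit_token_votes(rows, input_columns, output_column):
    """Count how often each input token co-occurs with each observed label."""
    votes = defaultdict(Counter)
    for row in rows:
        label = row.get(output_column)
        if label is None:  # rows with a missing target are not training data
            continue
        for col in input_columns:
            for token in str(row.get(col) or "").lower().split():
                votes[token][label] += 1
    return votes


def impute(rows, input_columns, output_column, votes):
    """Return copies of the rows with missing targets filled by majority vote."""
    imputed = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows stay untouched
        if row.get(output_column) is None:
            tally = Counter()
            for col in input_columns:
                for token in str(row.get(col) or "").lower().split():
                    tally.update(votes.get(token, {}))
            if tally:  # leave the value missing if no token was seen in training
                row[output_column] = tally.most_common(1)[0][0]
        imputed.append(row)
    return imputed


# Hypothetical toy table: the last row has no brand.
rows = [
    {"item_name": "Galaxy S9 phone", "brand": "Samsung"},
    {"item_name": "Galaxy Tab tablet", "brand": "Samsung"},
    {"item_name": "iPhone 8 phone", "brand": "Apple"},
    {"item_name": "iPhone X", "brand": "Apple"},
    {"item_name": "Galaxy Note", "brand": None},
]
votes = fit_token_votes(rows, ["item_name"], "brand")
filled = impute(rows, ["item_name"], "brand", votes)
```

The structure is the part that carries over: the model only ever sees labels from rows where the target column is observed, and it is applied only to rows where that column is missing.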

See our user-guide and extended documentation here.

Dependencies

DataWig requires:

  • Python3
  • MXNet 1.3.0
  • numpy
  • pandas
  • scikit-learn

Installation with pip

CPU

pip3 install datawig

GPU

To run DataWig on a GPU, your installation of Apache MXNet (Incubating) must include the GPU bindings. Depending on your version of CUDA, you can install a matching build by running the following:

wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
pip3 install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
rm requirements.gpu-cu${CUDA_VERSION}.txt

where ${CUDA_VERSION} can be 75 (7.5), 80 (8.0), 90 (9.0), or 91 (9.1).
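The mapping from a CUDA release to the `${CUDA_VERSION}` suffix can be sketched as a small helper. The function name `gpu_requirements_url` is hypothetical and not part of DataWig. The URL pattern and the four supported values are taken from the commands above.

```python
# Suffixes listed above: 75 (7.5), 80 (8.0), 90 (9.0), 91 (9.1).
SUPPORTED_CUDA = {"7.5": "75", "8.0": "80", "9.0": "90", "9.1": "91"}

BASE_URL = "https://raw.githubusercontent.com/awslabs/datawig/master/requirements"


def gpu_requirements_url(cuda_version):
    """Hypothetical helper: build the requirements-file URL for a CUDA version."""
    try:
        suffix = SUPPORTED_CUDA[cuda_version]
    except KeyError:
        raise ValueError(f"unsupported CUDA version: {cuda_version!r}") from None
    return f"{BASE_URL}/requirements.gpu-cu{suffix}.txt"


url = gpu_requirements_url("9.0")
```

Rejecting unknown versions up front avoids downloading a requirements file that does not exist.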

Running DataWig

The DataWig API expects your data as a pandas DataFrame. Here is an example of how the dataframe might look:

[Figure: datawig dataframe example]
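As a rough stand-in for such a table, the following sketch builds a small DataFrame with missing values in one column. The data is made up for illustration. The column names (`item_name`, `description`, `brand`) are borrowed from the SimpleImputer example in this README, and splitting on observed versus missing values is just one plausible way to prepare the data, not something DataWig requires.

```python
import pandas as pd

# Hypothetical example data; only the column names come from this README.
df = pd.DataFrame({
    "item_name": ["Galaxy S9", "iPhone 8", "Pixel 2", "Galaxy Note"],
    "description": ["Android phone", "iOS phone", "Android phone", "Android phablet"],
    "brand": ["Samsung", "Apple", None, None],
})

# Rows with an observed brand can serve as training data;
# rows where brand is missing are the ones to impute.
df_train = df[df["brand"].notnull()]
df_to_impute = df[df["brand"].isnull()]
```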

For most use cases, the SimpleImputer class is the best starting point. You provide the name of the column you would like to impute values for (called output_column below) and the names of the columns whose values you consider useful for imputation (called input_columns below).

    from datawig import SimpleImputer
    import pandas as pd

    df_train = pd.read_csv('/path/to/train/data.csv')
    df_test = pd.read_csv('/path/to/test/data.csv')

    # Initialize a SimpleImputer model
    imputer = SimpleImputer(
        input_columns=['item_name', 'description'],  # columns containing information about the column we want to impute
        output_column='brand',  # the column we'd like to impute values for
        output_path='imputer_model'  # stores model data and metrics
    )

    # Fit an imputer model on the train data
    imputer.fit(train_df=df_train)

    # Impute missing values and return original dataframe with predictions
    imputed = imputer.predict(df_test)

For more control over the types of models and preprocessing, the Imputer class lets you specify all relevant model features and parameters directly.

For details on usage, refer to the provided examples.

Executing Tests

Clone the repository and set up a virtual environment in the root directory of the package:

python3 -m venv venv

Install the package from local sources:

./venv/bin/pip install -e .

Install the development requirements and run the tests:

./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest

Acknowledgments

Thanks to David Greenberg for the package name.

Building documentation

git clone git@github.com:awslabs/datawig.git
cd datawig/docs
make html
open _build/html/index.html

Download files

Filename                         Size     File type  Python version  Upload date
datawig-0.1.5-py3-none-any.whl   56.8 kB  Wheel      py3             Oct 11, 2018
datawig-0.1.5.tar.gz             43.3 kB  Source     None            Oct 11, 2018
