Imputation for tables with missing values
DataWig - Imputation for Tables
DataWig learns models to impute missing values in tables.
For each to-be-imputed column, DataWig trains a supervised machine learning model to predict the observed values in that column using the data from other columns.
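The idea can be illustrated with a deliberately simple stand-in model. The sketch below is not DataWig's implementation (DataWig trains MXNet-based models). It assumes a table stored as a list of dicts and uses a plain token-vote classifier; the function names `fit_token_votes` and `impute` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def fit_token_votes(rows, input_columns, output_column):
    """Count, for each input token, which observed output values it co-occurs with."""
    votes = defaultdict(Counter)
    for row in rows:
        label = row.get(output_column)
        if label is None:  # train only on rows where the target is observed
            continue
        for col in input_columns:
            for token in str(row.get(col) or "").lower().split():
                votes[token][label] += 1
    return votes


def impute(rows, votes, input_columns, output_column):
    """Return copies of rows with missing output values filled by majority vote."""
    imputed = []
    for row in rows:
        new_row = dict(row)
        if new_row.get(output_column) is None:
            tally = Counter()
            for col in input_columns:
                for token in str(row.get(col) or "").lower().split():
                    tally.update(votes.get(token, {}))
            if tally:  # leave the value missing if no token was seen in training
                new_row[output_column] = tally.most_common(1)[0][0]
        imputed.append(new_row)
    return imputed


# Made-up example table: the last row has a missing brand
rows = [
    {"item_name": "acme anvil", "description": "heavy steel anvil", "brand": "Acme"},
    {"item_name": "globex phone", "description": "smart phone", "brand": "Globex"},
    {"item_name": "acme rocket", "description": "steel rocket", "brand": None},
]
votes = fit_token_votes(rows, ["item_name", "description"], "brand")
result = impute(rows, votes, ["item_name", "description"], "brand")
print(result[2]["brand"])
```

The model is trained only on rows where the target column is observed and then applied to rows where it is missing, which is the same train/predict split the description above implies.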
Dependencies
DataWig requires:
- Python 3
- MXNet 1.3.0
- numpy
- pandas
- scikit-learn
Installation with pip
CPU
> pip3 install datawig
GPU
To run DataWig on a GPU, make sure your installation of Apache MXNet (Incubating) includes the GPU bindings. Depending on your version of CUDA, you can do this by running the following:
> wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
> pip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
> rm requirements.gpu-cu${CUDA_VERSION}.txt
where ${CUDA_VERSION} can be 75 (7.5), 80 (8.0), 90 (9.0), or 91 (9.1).
Running DataWig
The DataWig API expects your data as a pandas DataFrame. Here is an example of how the dataframe might look:
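A minimal sketch of such a DataFrame, assuming the three columns used in the SimpleImputer example in this section (item_name, description, brand). The rows are invented for illustration, and a missing brand is represented as None:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the example in this README.
df = pd.DataFrame(
    {
        "item_name": ["running shoes", "wool socks", "trail jacket"],
        "description": [
            "lightweight shoes for road running",
            "warm merino socks",
            "waterproof shell for hiking",
        ],
        "brand": ["BrandA", None, "BrandB"],
    }
)

# Rows where the to-be-imputed column is missing
missing_mask = df["brand"].isna()
print(df[missing_mask])
```

Rows selected by `missing_mask` are the ones an imputation model would be asked to fill in.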
For most use cases, the SimpleImputer class is the best starting point. DataWig expects you to provide the name of the column you would like to impute values for (called output_column below) and the names of columns containing values that you deem useful for imputation (called input_columns below).
from datawig import SimpleImputer
import pandas as pd

df_train = pd.read_csv('/path/to/train/data.csv')
df_test = pd.read_csv('/path/to/test/data.csv')

# Initialize a SimpleImputer model
imputer = SimpleImputer(
    input_columns=['item_name', 'description'],  # columns containing information about the column we want to impute
    output_column='brand',  # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
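A common follow-up is to fill only the missing entries of the original column with the predictions. The sketch below uses plain pandas on a stand-in DataFrame rather than real DataWig output. The column name `brand_imputed` is an assumption about how the predictions are labeled and should be checked against the columns of the DataFrame actually returned by `predict`.

```python
import pandas as pd

# Stand-in for the DataFrame returned by imputer.predict(df_test).
# The prediction column name "brand_imputed" is an assumption; inspect
# imputed.columns to confirm the actual name.
imputed = pd.DataFrame(
    {
        "item_name": ["acme anvil", "globex phone"],
        "brand": ["Acme", None],
        "brand_imputed": ["Acme", "Globex"],
    }
)

# Keep observed values and use predictions only where the original is missing
imputed["brand_filled"] = imputed["brand"].fillna(imputed["brand_imputed"])
print(imputed[["brand", "brand_imputed", "brand_filled"]])
```

Writing the result to a new column (here the hypothetical `brand_filled`) keeps the observed and predicted values side by side for inspection.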
For more control over the types of models and preprocessing, the Imputer class allows you to specify all relevant model features and parameters directly.
For details on usage, refer to the provided examples.
Executing Tests
Clone the repository from git and set up a virtualenv in the root directory of the package:
python3 -m venv venv
Install the package from local sources:
./venv/bin/pip install -e .
Run tests:
./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest
Acknowledgments
Thanks to David Greenberg for the package name.