Imputation for tables with missing values

DataWig - Imputation for Tables

DataWig learns models to impute missing values in tables.

For each to-be-imputed column, DataWig trains a supervised machine learning model to predict the observed values in that column from the values in the other columns.
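
To make the idea concrete, here is a minimal sketch using only the Python standard library. It is not DataWig's actual model: a simple most-frequent-value lookup stands in for the learned predictor, and the function names `fit_lookup_model` and `impute`, as well as the toy rows, are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_lookup_model(rows, input_column, output_column):
    """Learn the most frequent output value per input value.

    Only rows where the output value is observed are used for training.
    """
    counts = defaultdict(Counter)
    for row in rows:
        target = row.get(output_column)
        if target is not None:
            counts[row[input_column]][target] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


def impute(rows, model, input_column, output_column):
    """Return copies of the rows with missing outputs filled in by the model."""
    imputed = []
    for row in rows:
        new_row = dict(row)
        if new_row.get(output_column) is None:
            # Stays None if the model has never seen this input value.
            new_row[output_column] = model.get(new_row[input_column])
        imputed.append(new_row)
    return imputed


# Toy table: "brand" is the to-be-imputed column, "category" the input column.
rows = [
    {"category": "phone", "brand": "Acme"},
    {"category": "phone", "brand": "Acme"},
    {"category": "phone", "brand": "Globex"},
    {"category": "shoe", "brand": "Initech"},
    {"category": "phone", "brand": None},
    {"category": "shoe", "brand": None},
]

model = fit_lookup_model(rows, "category", "brand")
filled = impute(rows, model, "category", "brand")
```

The pattern is the same one described above: train on the rows where the target column is observed, then predict it for the rows where it is missing.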

Installation

The easiest way to install the package is with a virtualenv and pip.

Set up virtualenv in the root dir of the package:

python3.6 -m venv venv

Install the package:

./venv/bin/pip install -e .

Install the development requirements and run the tests:

./venv/bin/pip install -r requirements/requirements.dev.txt
./venv/bin/python -m pytest

Usage

The imputation API expects your data as a pandas DataFrame.
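
As a rough illustration of what such a DataFrame might look like, the sketch below builds a small table with missing values and separates the rows that can serve as training data from the rows that need imputation. The example values and the variable name `df_to_impute` are made up for this sketch; only the column names are taken from the usage example in this README.

```python
import pandas as pd

# Toy table: "brand" is missing for the last two rows.
df = pd.DataFrame(
    {
        "item_name": ["red running shoe", "usb cable 2m", "trail running shoe", "usb-c charger"],
        "bullet_point": ["lightweight sole", "braided cable", "waterproof upper", "fast charging"],
        "brand": ["Acme", "Globex", None, None],
    }
)

output_column = "brand"

# Rows with an observed target value can be used for training;
# rows with a missing target value are the ones to impute.
missing = df[output_column].isna()
df_train = df[~missing]
df_to_impute = df[missing]
```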

For most use cases, the SimpleImputer class is the best starting point. Provide the name of the column you want to impute (called output_column below) and the names of the columns whose values you consider useful for predicting it (called input_columns below).

   import pandas as pd
   from datawig import SimpleImputer

   # Load training and test data (some test data is stored in the test/resources folder)
   df_train = pd.read_csv("training_data.csv")
   df_test = pd.read_csv("testing_data_files.csv")

   # This is where the model artifacts and metrics will be stored
   output_path = "imputer_model"

   # Initialize and train the imputer
   imputer = SimpleImputer(
       input_columns=["item_name", "bullet_point"],  # columns containing information about the column we want to impute
       output_column="brand",  # the column we'd like to impute values for
       output_path=output_path,  # directory for model artifacts and metrics
   ).fit(train_df=df_train)

   # Impute missing values on test data
   imputed = imputer.predict(df_test)
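
Once predictions are available, a common follow-up step is to fill only the originally missing cells and leave observed values untouched. The sketch below does this with plain pandas on a toy stand-in for the prediction result. The helper `fill_missing` is hypothetical, and the column name `brand_imputed` is an assumption about how the predictions are named; inspect the columns of the DataFrame returned by `predict` before relying on it.

```python
import pandas as pd


def fill_missing(df, output_column, predicted_column):
    """Return a copy of df where missing values in output_column
    are replaced by the values in predicted_column."""
    result = df.copy()
    result[output_column] = result[output_column].fillna(result[predicted_column])
    return result


# Toy stand-in for the DataFrame returned by imputer.predict(df_test).
# The "brand_imputed" column name is assumed, not confirmed by this README.
imputed = pd.DataFrame(
    {
        "item_name": ["red running shoe", "usb cable 2m"],
        "brand": ["Acme", None],
        "brand_imputed": ["Acme", "Globex"],
    }
)

completed = fill_missing(imputed, "brand", "brand_imputed")
```

Keeping the observed values and the predictions in separate columns until this point makes it easy to compare them on rows where the true value is known.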
