Extension of fastai.tabular for larger-than-memory datasets with Dask

BigTabular

This library replicates much of the functionality of the tabular data application in the fastai library so that it works with larger-than-memory datasets. Pandas, which fastai.tabular uses for data transformations, is replaced with Dask DataFrames.

Most of the Dask implementations were written as they were needed for a personal project and then refactored to match the fastai API more closely. The Jupyter notebooks closely follow the flow of those in fastai.tabular, and most of the examples and tests were replicated.

When not to use BigTabular

Don’t use this library when you don’t need to use Dask. The Dask website gives the following guidance:

Dask DataFrames are often used either when …

  1. Your data is too big
  2. Your computation is too slow and other techniques don’t work

You should probably stick to just using pandas if …

  1. Your data is small
  2. Your computation is fast (subsecond)
  3. There are simpler ways to accelerate your computation, like avoiding .apply or Python for loops and using a built-in pandas method instead.
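The last point in the Dask guidance can be illustrated with a small pandas sketch. The column name and values below are made up for illustration; the comparison only shows that a row-by-row .apply and a built-in vectorized method can produce the same result.

```python
import pandas as pd

# Hypothetical toy data, not taken from any real dataset.
df = pd.DataFrame({"hours_per_week": [40, 45, 32, 40, 50]})

# Slower pattern: a Python function is called once per row.
overtime_apply = df["hours_per_week"].apply(lambda h: max(h - 40, 0))

# Faster pattern: built-in methods operate on the whole column at once.
overtime_vectorized = (df["hours_per_week"] - 40).clip(lower=0)

# Both approaches yield the same values.
same_result = overtime_apply.equals(overtime_vectorized)
```

If a rewrite like this makes the computation fast enough, pandas alone is probably sufficient and neither Dask nor BigTabular is needed.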

Install

pip install bigtabular

How to use

Refer to the tutorial for a more detailed usage example.

Get a Dask DataFrame:

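import dask.dataframe as dd
import pandas as pd
from fastai.data.external import URLs, untar_data

path = untar_data(URLs.ADULT_SAMPLE)  # download and extract the Adult sample dataset
ddf = dd.from_pandas(pd.read_csv(path/'adult.csv'))  # wrap the pandas DataFrame in a Dask DataFrame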
ddf.head()
   age  workclass         fnlwgt  education    education-num  marital-status      occupation       relationship   race                sex     capital-gain  capital-loss  hours-per-week  native-country  salary
0  49   Private           101320  Assoc-acdm   12.0           Married-civ-spouse  <NA>             Wife           White               Female  0             1902          40              United-States   >=50k
1  44   Private           236746  Masters      14.0           Divorced            Exec-managerial  Not-in-family  White               Male    10520         0             45              United-States   >=50k
2  38   Private           96185   HS-grad      NaN            Divorced            <NA>             Unmarried      Black               Female  0             0             32              United-States   <50k
3  38   Self-emp-inc      112847  Prof-school  15.0           Married-civ-spouse  Prof-specialty   Husband        Asian-Pac-Islander  Male    0             0             40              United-States   >=50k
4  42   Self-emp-not-inc  82297   7th-8th      NaN            Married-civ-spouse  Other-service    Wife           Black               Female  0             0             50              United-States   <50k

Create dataloaders. Some of the columns are continuous (like age), and we will treat them as floating-point numbers that we can feed to our model directly. Others are categorical (like workclass or education), and we will convert them to unique indices that we feed to embedding layers. We can specify the categorical and continuous column names, as well as the name of the dependent variable, in the DaskDataLoaders factory methods:

dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [DaskCategorify, DaskFillMissing, DaskNormalize])
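To make the three preprocessing steps more concrete, here is a conceptual sketch in plain Python of what categorifying, filling missing values, and normalizing do to a column. It is not BigTabular's implementation, which works on Dask DataFrames; the toy values and variable names are hypothetical, and details such as reserving index 0 for missing categories follow the usual fastai convention as an assumption.

```python
from statistics import mean, median, pstdev

# Hypothetical toy columns; None marks a missing value.
workclass = ["Private", "Self-emp-inc", None, "Private"]
education_num = [12.0, None, 15.0, 9.0]

# Categorify: map each category to an integer index.
# Index 0 is reserved for missing values (assumed convention).
categories = sorted({c for c in workclass if c is not None})
vocab = {cat: i for i, cat in enumerate(categories, start=1)}
workclass_idx = [vocab.get(c, 0) for c in workclass]

# FillMissing: replace missing values with the median and
# keep an indicator column recording where values were missing.
fill_value = median(v for v in education_num if v is not None)
education_num_na = [v is None for v in education_num]
education_num_filled = [fill_value if v is None else v for v in education_num]

# Normalize: subtract the mean and divide by the standard deviation.
mu = mean(education_num_filled)
sigma = pstdev(education_num_filled)
education_num_norm = [(v - mu) / sigma for v in education_num_filled]
```

The integer indices are what an embedding layer would consume, and the indicator column plays the role of education-num_na in the results table further down.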

Create a Learner:

learn = dask_learner(dls, metrics=accuracy)

Train the model for one epoch:

learn.fit_one_cycle(1)
epoch  train_loss  valid_loss  accuracy  time
0      0.359618    0.356699    0.836550  00:51

We can then have a look at some predictions:

learn.show_results()
   workclass  education  marital-status  occupation  relationship  race  education-num_na  age        fnlwgt     education-num  salary  salary_pred
0  5          13         1               5           2             5     1                 0.402007   0.446103   1.537303       1       1
1  5          16         1               0           3             5     1                 0.768961   -1.380161  -0.033487      0       0
2  5          12         5               7           4             3     1                 -0.919026  5.286263   -0.426185      0       0
3  5          16         5               13          2             3     2                 0.181835   -0.467029  -0.033487      0       0
4  5          13         5               5           2             5     2                 -0.698853  -0.308706  -0.033487      0       0
5  5          10         3               0           1             5     2                 0.255226   -1.457680  -0.033487      1       1
6  1          10         3               1           1             5     2                 2.016603   -0.117934  -0.033487      1       0
7  3          12         5               2           4             5     1                 -1.139198  -0.574889  -0.426185      0       0
8  5          1          5               0           4             5     1                 -1.579542  -0.441000  -1.604277      0       0
