
Extension of fastai.tabular for larger-than-memory datasets with Dask


BigTabular

This library replicates much of the functionality of the tabular data application in the fastai library to work with larger-than-memory datasets. Pandas, which is used for data transformations in fastai.tabular, is replaced with Dask DataFrames.

Most of the Dask implementations were written as they were needed for a personal project, then refactored to match the fastai API more closely. The flow of the Jupyter notebooks follows that of the fastai.tabular notebooks closely, and most of the examples and tests were replicated.

When not to use BigTabular

Don’t use this library when you don’t need to use Dask. The Dask website gives the following guidance:

Dask DataFrames are often used either when …

  1. Your data is too big
  2. Your computation is too slow and other techniques don’t work

You should probably stick to just using pandas if …

  1. Your data is small
  2. Your computation is fast (subsecond)
  3. There are simpler ways to accelerate your computation, like avoiding .apply or Python for loops and using a built-in pandas method instead.

Install

pip install bigtabular

How to use

Refer to the tutorial for a more detailed usage example.

Get a Dask DataFrame:

from fastai.tabular.all import *
import dask.dataframe as dd
import pandas as pd

path = untar_data(URLs.ADULT_SAMPLE)
ddf = dd.from_pandas(pd.read_csv(path/'adult.csv'))
ddf.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse <NA> Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced <NA> Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

Create dataloaders. Some of the columns are continuous (like age) and we will treat them as float numbers we can feed our model directly. Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the DaskDataLoaders factory methods:

dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [DaskCategorify, DaskFillMissing, DaskNormalize])

Create a Learner:

learn = dask_learner(dls, metrics=accuracy)

Train the model for one epoch:

learn.fit_one_cycle(1)
epoch train_loss valid_loss accuracy time
0 0.359618 0.356699 0.836550 00:51

We can then have a look at some predictions:

learn.show_results()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary salary_pred
0 5 13 1 5 2 5 1 0.402007 0.446103 1.537303 1 1
1 5 16 1 0 3 5 1 0.768961 -1.380161 -0.033487 0 0
2 5 12 5 7 4 3 1 -0.919026 5.286263 -0.426185 0 0
3 5 16 5 13 2 3 2 0.181835 -0.467029 -0.033487 0 0
4 5 13 5 5 2 5 2 -0.698853 -0.308706 -0.033487 0 0
5 5 10 3 0 1 5 2 0.255226 -1.457680 -0.033487 1 1
6 1 10 3 1 1 5 2 2.016603 -0.117934 -0.033487 1 0
7 3 12 5 2 4 5 1 -1.139198 -0.574889 -0.426185 0 0
8 5 1 5 0 4 5 1 -1.579542 -0.441000 -1.604277 0 0
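Note that the continuous columns in the output above are shown in their normalized form. To map a value back to the original scale, invert the transform with the statistics computed during preprocessing; the mean and std below are made-up illustrative numbers, not the actual statistics of the ADULT sample:

```python
# Hypothetical training statistics for 'age' -- illustrative only.
age_mean, age_std = 38.6, 13.6

def denormalize(z, mean, std):
    """Invert (x - mean) / std to recover the original-scale value."""
    return z * std + mean

# A normalized age of ~0.402 maps back to roughly mean + 0.402 * std.
print(denormalize(0.402007, age_mean, age_std))
```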
