BigTabular

Extension of fastai.tabular for larger-than-memory datasets with Dask
This library replicates much of the functionality of the tabular data application in the fastai library, adapted to work with larger-than-memory datasets. Pandas, which is used for data transformations in fastai.tabular, is replaced with Dask DataFrames.
Most of the Dask implementations were written as they were needed for a personal project, then refactored to match the fastai API more closely. The Jupyter notebooks closely follow those of fastai.tabular, and most of the examples and tests were replicated.
When not to use BigTabular
Don’t use this library when you don’t need to use Dask. The Dask website gives the following guidance:
Dask DataFrames are often used either when …
- Your data is too big
- Your computation is too slow and other techniques don’t work
You should probably stick to just using pandas if …
- Your data is small
- Your computation is fast (subsecond)
- There are simpler ways to accelerate your computation, like avoiding .apply or Python for loops and using a built-in pandas method instead.
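That last point is worth illustrating. Before reaching for Dask, it is often enough to replace row-wise `.apply` calls with pandas' built-in vectorized operations. A minimal sketch (the column name and multiplier are illustrative, not from the library):

```python
import pandas as pd

df = pd.DataFrame({"hours-per-week": [40, 45, 32, 50]})

# Slow: calls a Python function once per row
slow = df["hours-per-week"].apply(lambda h: h * 52)

# Fast: a single vectorized operation on the whole column
fast = df["hours-per-week"] * 52

assert slow.equals(fast)
```

If the vectorized version is already fast enough, plain pandas is the simpler choice.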
Install
pip install bigtabular
How to use
Refer to the tutorial for a more detailed usage example.
Get a Dask DataFrame:
import dask.dataframe as dd
import pandas as pd
from fastai.tabular.all import untar_data, URLs

path = untar_data(URLs.ADULT_SAMPLE)
ddf = dd.from_pandas(pd.read_csv(path/'adult.csv'))
ddf.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | <NA> | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | <NA> | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
Create dataloaders. Some of the columns are continuous (like age) and we will treat them as floats that we can feed to our model directly. Others are categorical (like workclass or education) and we will convert them to unique indices that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the DaskDataLoaders factory methods:
dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
cont_names = ['age', 'fnlwgt', 'education-num'],
procs = [DaskCategorify, DaskFillMissing, DaskNormalize])
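Conceptually, the three procs encode categories as integer indices, fill missing values while recording a `*_na` flag, and standardise the continuous columns. A rough plain-pandas sketch of these steps (illustrative only, not BigTabular's actual implementation, which operates on Dask DataFrames; the `+ 1` mirrors fastai's convention of reserving index 0 for unseen categories):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"workclass": ["Private", "Private", "Self-emp-inc"],
                   "education-num": [12.0, np.nan, 15.0]})

# Categorify: map each category to an integer index
df["workclass"] = pd.Categorical(df["workclass"]).codes + 1

# FillMissing: record missingness in a *_na flag, then fill with the median
df["education-num_na"] = df["education-num"].isna().astype("int64") + 1
df["education-num"] = df["education-num"].fillna(df["education-num"].median())

# Normalize: standardise the continuous column
mean, std = df["education-num"].mean(), df["education-num"].std()
df["education-num"] = (df["education-num"] - mean) / std
```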
Create a Learner:
learn = dask_learner(dls, metrics=accuracy)
Train the model for one epoch:
learn.fit_one_cycle(1)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.359618 | 0.356699 | 0.836550 | 00:51 |
We can then have a look at some predictions:
learn.show_results()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 13 | 1 | 5 | 2 | 5 | 1 | 0.402007 | 0.446103 | 1.537303 | 1 | 1 |
| 1 | 5 | 16 | 1 | 0 | 3 | 5 | 1 | 0.768961 | -1.380161 | -0.033487 | 0 | 0 |
| 2 | 5 | 12 | 5 | 7 | 4 | 3 | 1 | -0.919026 | 5.286263 | -0.426185 | 0 | 0 |
| 3 | 5 | 16 | 5 | 13 | 2 | 3 | 2 | 0.181835 | -0.467029 | -0.033487 | 0 | 0 |
| 4 | 5 | 13 | 5 | 5 | 2 | 5 | 2 | -0.698853 | -0.308706 | -0.033487 | 0 | 0 |
| 5 | 5 | 10 | 3 | 0 | 1 | 5 | 2 | 0.255226 | -1.457680 | -0.033487 | 1 | 1 |
| 6 | 1 | 10 | 3 | 1 | 1 | 5 | 2 | 2.016603 | -0.117934 | -0.033487 | 1 | 0 |
| 7 | 3 | 12 | 5 | 2 | 4 | 5 | 1 | -1.139198 | -0.574889 | -0.426185 | 0 | 0 |
| 8 | 5 | 1 | 5 | 0 | 4 | 5 | 1 | -1.579542 | -0.441000 | -1.604277 | 0 | 0 |