BigTabular

Extension of fastai.tabular for larger-than-memory datasets with Dask
This library replicates much of the functionality of the tabular data application in the fastai library, adapted to work with larger-than-memory datasets. Pandas, which is used for data transformations in fastai.tabular, is replaced with Dask DataFrames.
Most of the Dask implementations were written as they were needed for a personal project, then refactored to match the fastai API more closely. The Jupyter notebooks closely follow those of fastai.tabular, and most of the examples and tests were replicated.
When not to use BigTabular
Don’t use this library when you don’t need to use Dask. The Dask website gives the following guidance:
Dask DataFrames are often used either when …
- Your data is too big
- Your computation is too slow and other techniques don’t work
You should probably stick to just using pandas if …
- Your data is small
- Your computation is fast (subsecond)
- There are simpler ways to accelerate your computation, like avoiding .apply or Python for loops and using a built-in pandas method instead.
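That last point is worth illustrating. Before reaching for Dask, it is often enough to replace row-wise `.apply` calls with pandas' built-in vectorized operations. A minimal sketch (the column name and multiplier are illustrative, not from the library):

```python
import pandas as pd

df = pd.DataFrame({"hours-per-week": [40, 45, 32, 50]})

# Slow: calls a Python function once per row
slow = df["hours-per-week"].apply(lambda h: h * 52)

# Fast: a single vectorized operation on the whole column
fast = df["hours-per-week"] * 52

assert slow.equals(fast)
```

If the vectorized version is already fast enough, plain pandas is the simpler choice.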
Install
pip install bigtabular
How to use
Refer to the tutorial for a more detailed usage example.
Get a Dask DataFrame:
import dask.dataframe as dd
import pandas as pd
from fastai.tabular.all import untar_data, URLs

path = untar_data(URLs.ADULT_SAMPLE)
ddf = dd.from_pandas(pd.read_csv(path/'adult.csv'))
ddf.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | <NA> | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | <NA> | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
Create dataloaders. Some of the columns are continuous (like age) and we will treat them as floats that we can feed to our model directly. Others are categorical (like workclass or education) and we will convert them to unique indices that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the DaskDataLoaders factory methods:
dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
cont_names = ['age', 'fnlwgt', 'education-num'],
procs = [DaskCategorify, DaskFillMissing, DaskNormalize])
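Conceptually, the three procs encode categories as integer indices, fill missing values while recording a `*_na` flag, and standardise the continuous columns. A rough plain-pandas sketch of these steps (illustrative only, not BigTabular's actual implementation, which operates on Dask DataFrames; the `+ 1` mirrors fastai's convention of reserving index 0 for unseen categories):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"workclass": ["Private", "Private", "Self-emp-inc"],
                   "education-num": [12.0, np.nan, 15.0]})

# Categorify: map each category to an integer index
df["workclass"] = pd.Categorical(df["workclass"]).codes + 1

# FillMissing: record missingness in a *_na flag, then fill with the median
df["education-num_na"] = df["education-num"].isna().astype("int64") + 1
df["education-num"] = df["education-num"].fillna(df["education-num"].median())

# Normalize: standardise the continuous column
mean, std = df["education-num"].mean(), df["education-num"].std()
df["education-num"] = (df["education-num"] - mean) / std
```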
Create a Learner:
learn = dask_learner(dls, metrics=accuracy)
Train the model for one epoch:
learn.fit_one_cycle(1)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.359618 | 0.356699 | 0.836550 | 00:51 |
We can then have a look at some predictions:
learn.show_results()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 13 | 1 | 5 | 2 | 5 | 1 | 0.402007 | 0.446103 | 1.537303 | 1 | 1 |
| 1 | 5 | 16 | 1 | 0 | 3 | 5 | 1 | 0.768961 | -1.380161 | -0.033487 | 0 | 0 |
| 2 | 5 | 12 | 5 | 7 | 4 | 3 | 1 | -0.919026 | 5.286263 | -0.426185 | 0 | 0 |
| 3 | 5 | 16 | 5 | 13 | 2 | 3 | 2 | 0.181835 | -0.467029 | -0.033487 | 0 | 0 |
| 4 | 5 | 13 | 5 | 5 | 2 | 5 | 2 | -0.698853 | -0.308706 | -0.033487 | 0 | 0 |
| 5 | 5 | 10 | 3 | 0 | 1 | 5 | 2 | 0.255226 | -1.457680 | -0.033487 | 1 | 1 |
| 6 | 1 | 10 | 3 | 1 | 1 | 5 | 2 | 2.016603 | -0.117934 | -0.033487 | 1 | 0 |
| 7 | 3 | 12 | 5 | 2 | 4 | 5 | 1 | -1.139198 | -0.574889 | -0.426185 | 0 | 0 |
| 8 | 5 | 1 | 5 | 0 | 4 | 5 | 1 | -1.579542 | -0.441000 | -1.604277 | 0 | 0 |