
Extension of fastai.tabular for larger-than-memory datasets with Dask


BigTabular

This library replicates much of the functionality of the tabular data application in the fastai library to work with larger-than-memory datasets. Pandas, which is used for data transformations in fastai.tabular, is replaced with Dask DataFrames.

Most of the Dask implementations were written as they were needed for a personal project, then refactored to match the fastai API more closely. The flow of the Jupyter notebooks follows that of the fastai.tabular notebooks closely, and most of the examples and tests were replicated.

When not to use BigTabular

Don’t use this library when you don’t need to use Dask. The Dask website gives the following guidance:

Dask DataFrames are often used either when …

  1. Your data is too big
  2. Your computation is too slow and other techniques don’t work

You should probably stick to just using pandas if …

  1. Your data is small
  2. Your computation is fast (subsecond)
  3. There are simpler ways to accelerate your computation, like avoiding .apply or Python for loops and using a built-in pandas method instead.

Install

pip install bigtabular

How to use

Refer to the tutorial for a more detailed usage example.

Get a Dask DataFrame:

from fastai.tabular.all import *
import dask.dataframe as dd
import pandas as pd

path = untar_data(URLs.ADULT_SAMPLE)
ddf = dd.from_pandas(pd.read_csv(path/'adult.csv'))
ddf.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse <NA> Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced <NA> Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

Create dataloaders. Some of the columns are continuous (like age) and we will treat them as float numbers we can feed our model directly. Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the DaskDataLoaders factory methods:

dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [DaskCategorify, DaskFillMissing, DaskNormalize])

Create a Learner:

learn = dask_learner(dls, metrics=accuracy)

Train the model for one epoch:

learn.fit_one_cycle(1)
epoch train_loss valid_loss accuracy time
0 0.359618 0.356699 0.836550 00:51

We can then have a look at some predictions:

learn.show_results()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary salary_pred
0 5 13 1 5 2 5 1 0.402007 0.446103 1.537303 1 1
1 5 16 1 0 3 5 1 0.768961 -1.380161 -0.033487 0 0
2 5 12 5 7 4 3 1 -0.919026 5.286263 -0.426185 0 0
3 5 16 5 13 2 3 2 0.181835 -0.467029 -0.033487 0 0
4 5 13 5 5 2 5 2 -0.698853 -0.308706 -0.033487 0 0
5 5 10 3 0 1 5 2 0.255226 -1.457680 -0.033487 1 1
6 1 10 3 1 1 5 2 2.016603 -0.117934 -0.033487 1 0
7 3 12 5 2 4 5 1 -1.139198 -0.574889 -0.426185 0 0
8 5 1 5 0 4 5 1 -1.579542 -0.441000 -1.604277 0 0
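Note that the continuous columns in the output above are shown in their normalized form. To map a value back to the original scale, invert the transform with the statistics computed during preprocessing; the mean and std below are made-up illustrative numbers, not the actual statistics of the ADULT sample:

```python
# Hypothetical training statistics for 'age' -- illustrative only.
age_mean, age_std = 38.6, 13.6

def denormalize(z, mean, std):
    """Invert (x - mean) / std to recover the original-scale value."""
    return z * std + mean

# A normalized age of ~0.402 maps back to roughly mean + 0.402 * std.
print(denormalize(0.402007, age_mean, age_std))
```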
