sidetable builds simple but useful summary tables of your data

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language
- Python :: 3.6

Project description

sidetable

sidetable is a combination of a supercharged pandas value_counts plus crosstab that builds simple but useful summary tables of your pandas DataFrame.

Usage is straightforward. Install and import sidetable. Then access it through the new .st accessor on your DataFrame.

For the Titanic data: df.st.freq(['class']) will build a frequency table like this:

	class	Count	Percent	Cumulative Count	Cumulative Percent
0	Third	491	0.551066	491	0.551066
1	First	216	0.242424	707	0.79349
2	Second	184	0.20651	891	1

You can also summarize missing values with df.st.missing():

	Missing	Total	Percent
deck	688	891	0.772166
age	177	891	0.198653
embarked	2	891	0.00224467
embark_town	2	891	0.00224467
survived	0	891	0
pclass	0	891	0
sex	0	891	0
sibsp	0	891	0
parch	0	891	0
fare	0	891	0
class	0	891	0
who	0	891	0
adult_male	0	891	0
alive	0	891	0
alone	0	891	0

sidetable has several useful features:

See total counts and their relative percentages in one table. This is roughly equivalent to combining the output of value_counts() and value_counts(normalize=True) into one table.
Include cumulative totals and percentages to better understand your thresholds. The Pareto principle applies to many different scenarios and this function makes it easy to see how your data is cumulatively distributed.
Aggregate multiple columns together to see frequency counts for grouped data.
Provide a threshold point above which all data is grouped into a single bucket. This is useful for quickly identifying the areas to focus your analysis.
Get a count of the missing values in your data.

Quickstart

For the impatient:

$ pip install sidetable

import sidetable
import pandas as pd

# Create your DataFrame
df = pd.read_csv(myfile.csv)

# Build a frequency table for one or more columns
df.st.freq(['column1', 'column2'])

# See what data is missing
df.st.missing()

That's it.

Read on for more details and more examples of what you can do.

Rationale

The idea behind sidetable is that there are a handful of useful data analysis tasks that you might run on any data set early in the data analysis process. While each of these tasks can be done in a handful of lines of pandas code, it is a lot of typing and difficult to remember.

In addition to providing useful functionality, this project is also a test to see how to build custom accessors using some of pandas relatively new API. I am hopeful this can serve as a model for other projects whether open source or just for your own usage.

The solutions in sidetable are heavily based on three sources:

This tweet thread by Peter Baumgartner
An excellent article by Steve Miller that lays out many of the code concepts incorporated into sidetable.
Ted Petrou's post on finding the percentage of missing values in a DataFrame.

I very much appreciate the work that all three authors did to point me in this direction.

Installation

$ pip install sidetable

This is the preferred method to install sidetable, as it will always install the most recent stable release. sidetable requires pandas 1.0 or higher and no additional dependencies. It should run anywhere that pandas runs.

Usage

import pandas as pd
import sidetable
import seaborn as sns

df = sns.load_dataset('titanic')

sidetable uses the pandas DataFrame accessor api to add a .st accessor to all of your DataFrames. Once you import sidetable you are ready to go. In these examples, I will be using seaborn's Titanic dataset as an example but seaborn is not a direct dependency.

If you have used value_counts() before, you have probably wished it were easier to combine the values with percentage distribution.

df['class'].value_counts()

Third     491
First     216
Second    184
Name: class, dtype: int64

df['class'].value_counts(normalize=True)

Third     0.551066
First     0.242424
Second    0.206510
Name: class, dtype: float64

Which can be done, but is messy and a lot of typing and remembering:

pd.concat([df['class'].value_counts().rename('count'), 
        df['class'].value_counts(normalize=True).rename('percentage')], axis=1)

	count	percentage
Third	491	0.551066
First	216	0.242424
Second	184	0.20651

Using sidetable is much simpler and you get cumulative totals, percents and more flexibility:

df.st.freq(['class'])

	class	Count	Percent	Cumulative Count	Cumulative Percent
0	Third	491	0.551066	491	0.551066
1	First	216	0.242424	707	0.79349
2	Second	184	0.20651	891	1

In addition, you can group columns together. If we want to see the breakdown among class and sex:

df.st.freq(['sex', 'class'])

	sex	class	Count	Percent	Cumulative Count	Cumulative Percent
0	male	Third	347	0.38945	347	0.38945
1	female	Third	144	0.161616	491	0.551066
2	male	First	122	0.136925	613	0.687991
3	male	Second	108	0.121212	721	0.809203
4	female	First	94	0.105499	815	0.914703
5	female	Second	76	0.0852974	891	1

You can use as many groupings as you would like.

By default, sidetable counts the data. However, you can specify a value argument to indicate that the data should be summed based on the data in another column. For this data set, we can see how the fares are distributed by class:

df.st.freq(['class'], value='fare')

	class	fare	Percent	Cumulative fare	Cumulative Percent
0	First	18177.4	0.633493	18177.4	0.633493
1	Third	6714.7	0.234011	24892.1	0.867504
2	Second	3801.84	0.132496	28693.9	1

Another feature of sidetable is that you can specify a threshold. For many data analysis, you may want to break down into large groupings to focus on and ignore others. You can use the thresh argument to define a threshold and group all entries above that threshold into an "Other" grouping:

df.st.freq(['class', 'who'], value='fare', thresh=.80)

	class	who	fare	Percent	Cumulative fare	Cumulative Percent
0	First	woman	9492.94	0.330834	9492.94	0.330834
1	First	man	7848.18	0.273513	17341.1	0.604348
2	Third	man	3617.53	0.126073	20958.6	0.73042
3	Second	man	1886.36	0.0657406	22845	0.796161
4	Others	Others	5848.95	0.203839	28693.9	1

You can further customize by specifying the label to use for all the others:

df.st.freq(['class', 'who'], value='fare', thresh=.80, other_label='All others')

	class	who	fare	Percent	Cumulative fare	Cumulative Percent
0	First	woman	9492.94	0.330834	9492.94	0.330834
1	First	man	7848.18	0.273513	17341.1	0.604348
2	Third	man	3617.53	0.126073	20958.6	0.73042
3	Second	man	1886.36	0.0657406	22845	0.796161
4	All others	All others	5848.95	0.203839	28693.9	1

Finally, sidetable includes a summary table that shows the missing values in your data by count and percentage of total missing values in a column.

df.st.missing()

	Missing	Total	Percent
deck	688	891	0.772166
age	177	891	0.198653
embarked	2	891	0.00224467
embark_town	2	891	0.00224467
survived	0	891	0
pclass	0	891	0
sex	0	891	0
sibsp	0	891	0
parch	0	891	0
fare	0	891	0
class	0	891	0
who	0	891	0
adult_male	0	891	0
alive	0	891	0
alone	0	891	0

Caveats

sidetable supports grouping on any data type in a pandas DataFrame. This means that you could try something like:

df.st.freq(['fare'])

In some cases where there are a fairly small discrete number of this may be useful. However, if you have a lot of unique values, you should bin the data first. In the example, above the data would include 248 rows and not be terribly useful.

One alternative could be:

df['fare_bin'] = pd.qcut(df['fare'], q=4, labels=['low', 'medium', 'high', 'x-high'])
df.st.freq(['fare_bin'])

	fare_bin	Count	Percent	Cumulative Count	Cumulative Percent
0	medium	224	0.251403	224	0.251403
1	low	223	0.250281	447	0.501684
2	x-high	222	0.249158	669	0.750842
3	high	222	0.249158	891	1

The other caveat is that null or missing values can cause data to drop out while aggregating. For instance, if we look at the deck variable, there are a lot of missing values.

df.st.freq(['deck'])

	deck	Count	Percent	Cumulative Count	Cumulative Percent
0	C	59	0.29064	59	0.29064
1	B	47	0.231527	106	0.522167
2	D	33	0.162562	139	0.684729
3	E	32	0.157635	171	0.842365
4	A	15	0.0738916	186	0.916256
5	F	13	0.0640394	199	0.980296
6	G	4	0.0197044	203	1

The total cumulative count only goes up to 203 not the 891 we have seen in other examples. Future versions of sidetable may handle this differently. For now, it is up to you to decide how best to handle unknowns. For example, this version of the Titanic data set has a categorical value for deck so using fillna requires and extra step:

df['deck_fillna'] = df['deck'].cat.add_categories('UNK').fillna('UNK')
df.st.freq(['deck_fillna'])

	deck_fillna	Count	Percent	Cumulative Count	Cumulative Percent
0	UNK	688	0.772166	688	0.772166
1	C	59	0.0662177	747	0.838384
2	B	47	0.0527497	794	0.891134
3	D	33	0.037037	827	0.928171
4	E	32	0.0359147	859	0.964085
5	A	15	0.016835	874	0.98092
6	F	13	0.0145903	887	0.995511
7	G	4	0.00448934	891	1

Another variant of this is that there might be certain groupings where there are no valid counts.

For instance, if we look at the deck and class:

df.st.freq(['deck', 'class'])

	deck	class	Count	Percent	Cumulative Count	Cumulative Percent
0	C	First	59	0.29064	59	0.29064
1	B	First	47	0.231527	106	0.522167
2	D	First	29	0.142857	135	0.665025
3	E	First	25	0.123153	160	0.788177
4	A	First	15	0.0738916	175	0.862069
5	F	Second	8	0.0394089	183	0.901478
6	F	Third	5	0.0246305	188	0.926108
7	G	Third	4	0.0197044	192	0.945813
8	E	Second	4	0.0197044	196	0.965517
9	D	Second	4	0.0197044	200	0.985222
10	E	Third	3	0.0147783	203	1

There are only 11 combinations. If we want to see all - even if there are not any passengers fitting that criteria, use clip_0=False

df.st.freq(['deck', 'class'], clip_0=False)

	deck	class	Count	Percent	Cumulative Count	Cumulative Percent
0	C	First	59	0.29064	59	0.29064
1	B	First	47	0.231527	106	0.522167
2	D	First	29	0.142857	135	0.665025
3	E	First	25	0.123153	160	0.788177
4	A	First	15	0.0738916	175	0.862069
5	F	Second	8	0.0394089	183	0.901478
6	F	Third	5	0.0246305	188	0.926108
7	G	Third	4	0.0197044	192	0.945813
8	E	Second	4	0.0197044	196	0.965517
9	D	Second	4	0.0197044	200	0.985222
10	E	Third	3	0.0147783	203	1
11	G	Second	0	0	203	1
12	G	First	0	0	203	1
13	F	First	0	0	203	1
14	D	Third	0	0	203	1
15	C	Third	0	0	203	1
16	C	Second	0	0	203	1
17	B	Third	0	0	203	1
18	B	Second	0	0	203	1
19	A	Third	0	0	203	1
20	A	Second	0	0	203	1

In many cases this might be too much data, but sometimes the fact that a combination is missing could be insightful.

TODO

Handle NaN values more effectively
Offer binning options for continuous variables
Offer more options, maybe plotting?

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given. If you have a new idea for a simple table that we should add, please submit a ticket.

For more info please click here

Credits

This package was created with Cookiecutter and the oldani/cookiecutter-simple-pypackage project template. The code used in this package is heavily based on the posts from Peter Baumgartner, Steve Miller and Ted Petrou. Thank you!

Cookiecutter
oldani/cookiecutter-simple-pypackage
Peter Baumgartner - tweet thread
Steve Miller - article
Ted Petrou - post

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language
- Python :: 3.6

Release history Release notifications | RSS feed

0.9.1

Oct 29, 2022

0.9.0

Aug 19, 2021

0.8.0

Nov 29, 2020

0.7.0

Aug 21, 2020

0.6.0

Jul 4, 2020

0.5.0

Jun 22, 2020

0.4.0

Jun 22, 2020

0.4.dev0 pre-release

Jun 22, 2020

0.3.0

May 31, 2020

0.2.0

May 28, 2020

This version

0.1.0

May 26, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sidetable-0.1.0.tar.gz (14.9 kB view hashes)

Uploaded May 26, 2020 Source

Hashes for sidetable-0.1.0.tar.gz

Hashes for sidetable-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`49e1438691f6ceb2381bc25ef68e87609221e828985579f743654f2dbb84f3c3`
MD5	`b2cc442af9f0901a418a91688db49b39`
BLAKE2b-256	`3727db60e6ba14ceaf6fa0b73156bec397c9375030e76571460c51673460622b`