A package that splits a DataFrame into two groups by a percentile of a chosen column, bootstraps from the upper and lower groups, draws scatter plots and histograms, and returns p-values from t-tests along with Pearson's correlation coefficient.
df_percentile Package
Background
This package splits a pandas DataFrame into two groups at a specified percentile of a chosen column.
It is useful for identifying trends in data sets with a feature such as price or salary. For example, take a NYC rent listing data set with columns including price, bedrooms, and bathrooms:
In [7]: df.head()
Out[7]:
   bedrooms  bathrooms  latitude  longitude  price
0         3        1.5   40.7145   -73.9425   3000
1         2        1.0   40.7947   -73.9667   5465
2         1        1.0   40.7388   -74.0018   2850
3         1        1.0   40.7539   -73.9677   3275
4         4        1.0   40.8241   -73.9493   3350
If you split the df by price, the scatter method helps you visualize how features such as bathrooms and bedrooms are distributed in the higher-rent group versus the lower-rent group. The create_df method runs a t-test between the two groups on each column you specified when instantiating the object, and returns the p-values along with the group means of each feature. The package can also bootstrap from the two groups and create a histogram of the distribution of sample means, along with the 95% confidence intervals.
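To make the split concrete, here is a minimal sketch of dividing rows at a percentile of one column. It uses only the standard library so each step is visible; it is not the package's implementation. The helper names `percentile_threshold` and `split_by_percentile` are hypothetical, and both the linear interpolation and the choice to put rows equal to the cutoff in the lower group are assumptions.

```python
from statistics import mean


def percentile_threshold(values, percentile):
    """Return the value at `percentile` (0-100), using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * percentile / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_by_percentile(rows, column, percentile):
    """Split rows (dicts) into (upper, lower) groups around the cutoff."""
    cutoff = percentile_threshold([row[column] for row in rows], percentile)
    # Assumption: rows exactly at the cutoff fall into the lower group.
    upper = [row for row in rows if row[column] > cutoff]
    lower = [row for row in rows if row[column] <= cutoff]
    return upper, lower


# The five listings from df.head() above (latitude/longitude omitted).
listings = [
    {"bedrooms": 3, "bathrooms": 1.5, "price": 3000},
    {"bedrooms": 2, "bathrooms": 1.0, "price": 5465},
    {"bedrooms": 1, "bathrooms": 1.0, "price": 2850},
    {"bedrooms": 1, "bathrooms": 1.0, "price": 3275},
    {"bedrooms": 4, "bathrooms": 1.0, "price": 3350},
]

upper, lower = split_by_percentile(listings, "price", 70)
print("upper group size:", len(upper), "lower group size:", len(lower))
print("mean bedrooms (upper):", mean(row["bedrooms"] for row in upper))
print("mean bedrooms (lower):", mean(row["bedrooms"] for row in lower))
```

With a real DataFrame the same idea is usually expressed with a quantile lookup and a boolean mask; the plain-Python version is only meant to show the logic.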
Instructions
First initialize the object with:
- df_percentile(df, col_names, col_to_separate): in this example, bedrooms and bathrooms are the features of interest, and the data is separated by price.
In [12]: new_df = df_percentile(df, ['bedrooms','bathrooms'], 'price')
This dataframe is separated by price
- create_df(percentile): splitting at the 70th percentile of price (the top 30% of rents versus the rest) gives very small p-values for both features.
In [13]: new_df.create_df(70)
Out[13]:
           p-values  upper_bound_means  lower_bound_means
bedrooms        0.0           2.319909           1.159946
bathrooms       0.0           1.522683           1.030201
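For readers who want to see what a two-group t-test computes, here is a rough sketch using only the standard library. It assumes Welch's unequal-variance form, which may not be the variant the package uses. The p-value is a large-sample normal approximation, not an exact t-distribution value. The function names and the sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a  # squared standard error of mean A
    se2_b = variance(sample_b) / n_b  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, dof


def approx_two_sided_p(t):
    """Approximate two-sided p-value by treating t as standard normal.

    Only reasonable for large samples; an exact value needs the
    t-distribution CDF, which the standard library does not provide.
    """
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical bedroom counts, for illustration only.
upper_bedrooms = [3, 2, 4, 3, 2, 3, 4, 2]
lower_bedrooms = [1, 1, 2, 1, 0, 1, 2, 1]

t_stat, dof = welch_t(upper_bedrooms, lower_bedrooms)
p_value = approx_two_sided_p(t_stat)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}, approximate p = {p_value:.2g}")
```

The samples here are small, so the approximate p-value is only indicative; with thousands of listings per group, as in the rent example, the approximation is much closer.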
- bootstrap(percentile, col_name, n_simulations = 10000, ci = 95)
In [16]: new_df.bootstrap(70, 'RAA')
# 'RAA' is a column from a separate project of mine that split MLB relief pitchers by salary and measured each group's performance
- bootstrap_stats()
In [6]: data.bootstrap_stats()
The 95% confidence intervals for the upper bound group ranges from -1.1944029850746265 to 2.716417910447761.
The lower bound group ranges from -2.529032258064516 and -0.44516129032258067.
Means of the distribution:
Upper Bound Group: 0.7459731343283582
Lower Bound Group:-1.4869800000000002
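The bootstrap step can be sketched as follows: resample one group with replacement many times, record each resample's mean, and read the confidence interval from the percentiles of those means. This is a sketch of the general percentile-bootstrap technique, not the package's code. The parameter names `n_simulations` and `ci` mirror the signature shown above, while the function names, the `seed` argument, and the data are hypothetical.

```python
import random
from statistics import mean


def bootstrap_means(sample, n_simulations=10000, seed=None):
    """Resample `sample` with replacement; return the mean of each resample."""
    rng = random.Random(seed)
    size = len(sample)
    return [mean(rng.choices(sample, k=size)) for _ in range(n_simulations)]


def percentile_interval(means, ci=95):
    """Return (lower, upper) bounds of the central `ci` percent of `means`."""
    ordered = sorted(means)
    tail = (100 - ci) / 2 / 100
    lower_index = int(tail * (len(ordered) - 1))
    upper_index = int((1 - tail) * (len(ordered) - 1))
    return ordered[lower_index], ordered[upper_index]


# Hypothetical values for one group, for illustration only.
group = [0.4, -1.2, 2.7, 0.9, 1.5, -0.3, 0.8, 1.1]

means = bootstrap_means(group, n_simulations=2000, seed=0)
low, high = percentile_interval(means, ci=95)
print(f"95% interval for the mean: {low:.3f} to {high:.3f}")
print(f"mean of the bootstrap distribution: {mean(means):.3f}")
```

Running the same two functions on the upper and lower groups separately gives the two intervals and the two distribution means that bootstrap_stats() reports.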
- corr(percentile, col_name)
In [25]: new_df.corr(95, 'bathrooms')
For the lower bound group:
The correlation coefficent is 0.5437659981793554 and the p-value is 0.0
For the higher bound pitcher group:
The correlation coefficent is 0.03580173165337624 and the p-value is 0.07722030909098343
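As a reference for what the correlation coefficient measures, here is a small standard-library sketch of Pearson's r. It assumes the method correlates the chosen column with the column used for the split (price here), which the description above does not state explicitly. The function name `pearson_r` is hypothetical, and the p-value is omitted because it needs a t-distribution CDF.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations; the 1/(n-1) factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# bathrooms and price from the five listings in df.head() above.
bathrooms = [1.5, 1.0, 1.0, 1.0, 1.0]
price = [3000, 5465, 2850, 3275, 3350]

r = pearson_r(bathrooms, price)
print(f"r = {r:.4f}")
```

Five rows are far too few to say anything about the data; they only show the arithmetic. Applying the same function to each group separately would give one coefficient per group, as in the output above.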
- scatter(percentile, col_name)
In [39]: new_df.scatter(80, 'bedrooms')