
A package that splits a pandas DataFrame by percentile on a column, bootstraps from the upper and lower groups, creates a scatter plot and a histogram, and returns p-values from t-tests along with Pearson correlation coefficients.

Project description

df_percentile Package

Background

This package splits a pandas DataFrame into two groups at a specified percentile of a chosen column.

It can be useful for identifying trends in data sets with features such as price or salary. For example, consider a NYC rent listing database with columns including price, bedrooms, and bathrooms:

In [7]: df.head()

Out[7]: 
   bedrooms  bathrooms  latitude  longitude  price
0         3        1.5   40.7145   -73.9425   3000
1         2        1.0   40.7947   -73.9667   5465
2         1        1.0   40.7388   -74.0018   2850
3         1        1.0   40.7539   -73.9677   3275
4         4        1.0   40.8241   -73.9493   3350

If you split the DataFrame by price, the scatter method helps you visualize how features such as bathrooms and bedrooms are distributed in the higher-rent group versus the lower-rent group. The create_df method runs a t-test between the two groups on each column you specified when instantiating the object, and returns the p-values together with the group means of each feature. The package can also bootstrap from the two groups and create a histogram of the distribution of sample means, along with the 95% confidence intervals.
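The split described above can be approximated in plain pandas. This is a minimal sketch, not the package's actual implementation: the helper name `split_by_percentile` is hypothetical, and sending rows equal to the cutoff to the upper group is an assumption made for illustration.

```python
import pandas as pd


def split_by_percentile(df, col_to_separate, percentile):
    """Return (upper, lower) groups split at a percentile of one column.

    Hypothetical helper: rows equal to the cutoff go to the upper group.
    """
    cutoff = df[col_to_separate].quantile(percentile / 100)
    upper = df[df[col_to_separate] >= cutoff]
    lower = df[df[col_to_separate] < cutoff]
    return upper, lower


# The five rows shown in the df.head() output above
df = pd.DataFrame({
    "bedrooms": [3, 2, 1, 1, 4],
    "bathrooms": [1.5, 1.0, 1.0, 1.0, 1.0],
    "price": [3000, 5465, 2850, 3275, 3350],
})

upper, lower = split_by_percentile(df, "price", 70)
```

With a percentile of 70, `upper` holds the most expensive listings (roughly the top 30%) and `lower` holds the rest.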

Instructions

First initialize the object with:

  1. df_percentile(df, col_names, col_to_separate)
     Here bedrooms and bathrooms are the features of interest, and the data is separated by price.
In [12]: new_df = df_percentile(df, ['bedrooms','bathrooms'], 'price')
This dataframe is separated by price
  • create_df(percentile)
    Splitting at the 70th percentile of rent price (the top 30% versus the rest) gives very small p-values for both features.
In [13]: new_df.create_df(70)

Out[13]: 
           p-values  upper_bound_means  lower_bound_means
bedrooms        0.0           2.319909           1.159946
bathrooms       0.0           1.522683           1.030201
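A table shaped like the one above could be built with SciPy's independent-samples t-test. This is a sketch under assumptions: the source does not say which t-test variant the package runs (Welch's unequal-variance test is used here), and the function name `ttest_table` and the synthetic listings are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_table(df, col_names, col_to_separate, percentile):
    """Compare upper and lower groups on each feature (hypothetical helper)."""
    cutoff = df[col_to_separate].quantile(percentile / 100)
    upper = df[df[col_to_separate] >= cutoff]
    lower = df[df[col_to_separate] < cutoff]

    rows = {}
    for col in col_names:
        # Welch's t-test: does not assume equal variances (an assumption here)
        result = stats.ttest_ind(upper[col], lower[col], equal_var=False)
        rows[col] = {
            "p-values": result.pvalue,
            "upper_bound_means": upper[col].mean(),
            "lower_bound_means": lower[col].mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data for illustration only: price rises with bedroom count
rng = np.random.default_rng(0)
bedrooms = rng.integers(0, 5, size=500)
price = 1500 + 800 * bedrooms + rng.normal(0, 300, size=500)
listings = pd.DataFrame({"bedrooms": bedrooms, "price": price})

table = ttest_table(listings, ["bedrooms"], "price", 70)
```

A small p-value only says the group means differ more than chance would explain; the two mean columns show the direction and size of the difference.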
  • bootstrap(percentile, col_name, n_simulations=10000, ci=95)
In [16]: new_df.bootstrap(70, 'RAA')
# This call comes from a separate project of mine, which split MLB relief pitchers by salary and measured each group's performance

[Figure: Bootstrap]

  • bootstrap_stats()
      In [6]: data.bootstrap_stats()

      The 95% confidence intervals for the upper bound group ranges from -1.1944029850746265 to 2.716417910447761.
      The lower bound group ranges from -2.529032258064516 and -0.44516129032258067.
      Means of the distribution: 
      Upper Bound Group: 0.7459731343283582
      Lower Bound Group:-1.4869800000000002
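Figures like these come from resampling a group with replacement and recording the mean of every resample. The following is a minimal percentile-bootstrap sketch using NumPy only. Whether the package computes its interval exactly this way is an assumption, and `bootstrap_means` and the sample values are hypothetical.

```python
import numpy as np


def bootstrap_means(values, n_simulations=10000, ci=95, seed=None):
    """Return bootstrapped sample means and a percentile confidence interval."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)

    # Each row is one resample, drawn with replacement, the same size as the group
    samples = rng.choice(values, size=(n_simulations, values.size), replace=True)
    means = samples.mean(axis=1)

    # For ci=95 this takes the 2.5th and 97.5th percentiles of the means
    tail = (100 - ci) / 2
    lower, upper = np.percentile(means, [tail, 100 - tail])
    return means, (lower, upper)


# Made-up performance values for one group
group = [1.2, -0.4, 0.8, 2.1, -1.0, 0.3, 1.7, 0.5]
means, (low, high) = bootstrap_means(group, n_simulations=2000, seed=0)
```

Running this once per group and comparing the two intervals gives the kind of summary printed above; a histogram of `means` shows the distribution of sample means.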
  • corr(percentile, col_name)
In [25]: new_df.corr(95, 'bathrooms')

For the lower bound group: 
The correlation coefficent is 0.5437659981793554 and the p-value is 0.0
For the higher bound pitcher group: 
The correlation coefficent is 0.03580173165337624 and the p-value is 0.07722030909098343
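The coefficient and p-value pairs printed above have the same shape as what `scipy.stats.pearsonr` returns. The sketch below assumes the correlation is taken, within each group, between the chosen feature and the column used for the split. The source does not state this explicitly, and `group_correlations` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def group_correlations(df, col_name, col_to_separate, percentile):
    """Pearson r and p-value per group (hypothetical helper)."""
    cutoff = df[col_to_separate].quantile(percentile / 100)
    groups = {
        "upper": df[df[col_to_separate] >= cutoff],
        "lower": df[df[col_to_separate] < cutoff],
    }
    results = {}
    for name, group in groups.items():
        # Assumption: correlate the feature with the split column itself
        r, p = stats.pearsonr(group[col_name], group[col_to_separate])
        results[name] = (r, p)
    return results


# Synthetic data for illustration only
rng = np.random.default_rng(1)
bathrooms = rng.integers(1, 4, size=400)
price = 2000 + 500 * bathrooms + rng.normal(0, 600, size=400)
listings = pd.DataFrame({"bathrooms": bathrooms, "price": price})

results = group_correlations(listings, "bathrooms", "price", 70)
```

Splitting on the same column that is being correlated restricts its range within each group, so the per-group coefficients are usually weaker than the correlation over the whole data set.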
  • scatter(percentile, col_name)
In [39]: new_df.scatter(80, 'bedrooms')

[Figure: scatter]


Download files


Source Distribution

dfpercentile-pkg-chanrl-0.0.1.tar.gz (4.7 kB)

Uploaded Source

Built Distribution

dfpercentile_pkg_chanrl-0.0.1-py3-none-any.whl (6.4 kB)

Uploaded Python 3

