A set of utilities for running and evaluating experiments at Greenhouse. Primarily designed to be installed in a Mode Analytics Python notebook.
Greenhouse Data Utilities
The Greenhouse Data Utilities package includes a series of tools to streamline the evaluation of several common experimental designs. It is designed and maintained by the data science team at Greenhouse Software and is intended to be installed in a Mode Analytics Python notebook.
Please note that this package was designed for internal use by the data science team. You're welcome to use it, but we will prioritize the experimentation needs of our team when reading through issues/feature requests.
Sub-modules
- gh_data_utils.data_visualization
- gh_data_utils.stat_tests
data_visualization
Functions for generating visual representations of statistical tests.
get_overlapping_distributions
Generates a Seaborn plot with overlapping distributions for two or more groups, in order to visualize potential differences that may be detected with parametric or non-parametric statistical tests. These plots can be used in the final presentation of results.
Parameters
- data : pandas dataframe; required. Dataframe used to generate charts. Data should be in a tidy format.
- groups : str; required. The name of the column in data by which you are grouping results.
- data_col : str; required. The name of the column in data with continuous data to compare by group.
- order : list; optional; default = None. The order you want groups to appear in the graphs (e.g., Before and After). Must match the distinct values in the groups column.
- x-label : str; optional; default = '' (empty). x-axis label for the chart.
- y-label : str; optional; default = '' (empty). y-axis label for the chart.
- title : str; optional; default = '' (empty). Title for the chart.
- bins : int; default = 20. Number of bins for the distributions.
- tick_format : str; options = 'int', 'pct'; default = 'int'. A string indicating the tick format for the graph axes based on the type of data used.
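As a rough illustration of the tidy format these functions expect, the sketch below reshapes a small wide dictionary into one record per observation, then wraps a call to get_overlapping_distributions. The sample values, the column names group and value, and the helpers to_tidy and plot_distributions are made up for illustration. The call itself assumes the import path and keyword names listed above and has not been checked against the package source.

```python
# Wide layout: one key per group. The plotting functions expect tidy data
# instead, i.e. one row per observation with a group column and a value column.
wide = {
    "Before": [12.0, 15.5, 11.2],
    "After": [14.1, 17.3, 13.8],
}


def to_tidy(wide_data, groups="group", data_col="value"):
    """Reshape {group: [values]} into one record per observation."""
    return [
        {groups: name, data_col: value}
        for name, values in wide_data.items()
        for value in values
    ]


tidy = to_tidy(wide)


def plot_distributions(records):
    """Hypothetical call, assuming the parameter names documented above."""
    # Imported lazily so the reshaping step can run without the packages.
    import pandas as pd
    from gh_data_utils.data_visualization import get_overlapping_distributions

    return get_overlapping_distributions(
        data=pd.DataFrame(records),
        groups="group",
        data_col="value",
        order=["Before", "After"],
        bins=20,
        tick_format="int",
    )
```

In a notebook with the package installed, plot_distributions(tidy) would be the only call needed once the data is in this shape.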
get_barplot
Generates a Seaborn barplot with error bars in order to visualize mean differences and confidence intervals for those differences between two or more groups. These can be used in the final presentation of results.
Parameters
- data : pandas dataframe; required. Dataframe used to generate charts. Data should be in a tidy format.
- groups : str; required. The name of the column in data by which you are grouping results.
- data_col : str; required. The name of the column in data with continuous data to compare by group.
- ci : int; default = 95. Confidence level, in percent, for the error bars.
- order : list; optional; default = None. The order you want groups to appear in the graphs (e.g., Before and After). Must match the distinct values in the groups column.
- hue : str; optional; default = None. Seaborn plot hue, used to generate grouped comparisons.
- x-label : str; default = '' (empty). x-axis label for the chart.
- y-label : str; default = '' (empty). y-axis label for the chart.
- title : str; default = '' (empty). Title for the chart.
- palette : optional; default = None. Seaborn palette for the chart.
- tick_format : str; options = 'int', 'pct'; default = 'int'. A string indicating the tick format for the graph axes based on the type of data used.
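To give a sense of what the ci parameter controls, here is a minimal standard-library sketch of a percentile bootstrap confidence interval for a group mean, which is how Seaborn's barplot has traditionally drawn its error bars. It assumes get_barplot passes ci through to Seaborn unchanged, which the description above suggests but does not confirm. The function bootstrap_ci and the sample values are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, ci=95, n_boot=1000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    tail = (100 - ci) / 200  # fraction of resampled means cut from each tail
    lower = means[round(tail * n_boot)]
    upper = means[round((1 - tail) * n_boot) - 1]
    return lower, upper


after = [14.1, 17.3, 13.8, 16.0, 12.9, 15.2, 14.7, 16.4]
low, high = bootstrap_ci(after, ci=95)
```

A smaller ci value, such as 50, yields a narrower interval from the same resamples, so the error bars shrink accordingly.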
stat_tests
Functions for conducting the appropriate parametric and non-parametric statistical tests based on the specified experimental design and number of groups. We recommend conducting an a priori power analysis to ensure each group you're comparing has a sufficient sample size.
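The package does not appear to include a power analysis helper, so the following is only a sketch of one common approach: the normal approximation for a two-sided comparison of two independent group means. The function name n_per_group and the default alpha and power values are assumptions for illustration, and an exact t-based calculation would give slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided, two-group test.

    effect_size is Cohen's d: the expected mean difference divided by the
    pooled standard deviation.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)  # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


medium_effect_n = n_per_group(0.5)
small_effect_n = n_per_group(0.2)
```

Under this approximation a medium effect (d = 0.5) needs 63 observations per group, and smaller expected effects need considerably more.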
run_stat_test
Conducts a parametric and a non-parametric statistical test for two or more groups. The type of test is chosen automatically based on the number of groups.
Parameters
- data : pandas dataframe; required. Dataframe containing the data to test. Data should be in a tidy format.
- groups : str; required. The name of the column in data by which you are grouping results.
- data_col : str; required. The name of the column in data with continuous data to compare by group.
- index : str; required. The dataframe column to use as an index when shaping data for a specific test. This should be your unit of comparison (e.g., user, day, or event).
- dimensions : list; optional; default = None. A subset of groups to use in statistical comparisons. If this isn't specified, dimensions for statistical tests will be a list of distinct values in groups.
- comparison : str; options = 'ind', 'rep'; default = 'ind'. The type of experimental design (independent or repeated measures).
- description : str; optional; default = ''. A description of the statistical test to include at the top of the summary of results. Defaults to an empty string, in which case the summary uses the standard output from each test.
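The exact tests that run_stat_test selects are not listed here. The sketch below shows the conventional pairing of parametric and non-parametric tests by design and number of groups, as one plausible reading of "determined automatically". The table TESTS and the function choose_tests are hypothetical and are not taken from the package source.

```python
# Conventional (parametric, non-parametric) test pairs. This mapping is an
# assumption for illustration; the package may choose differently.
TESTS = {
    ("ind", "two"): ("independent-samples t-test", "Mann-Whitney U test"),
    ("ind", "many"): ("one-way ANOVA", "Kruskal-Wallis H test"),
    ("rep", "two"): ("paired-samples t-test", "Wilcoxon signed-rank test"),
    ("rep", "many"): ("repeated-measures ANOVA", "Friedman test"),
}


def choose_tests(group_values, comparison="ind", dimensions=None):
    """Pick a test pair from the design and the number of groups compared."""
    if comparison not in ("ind", "rep"):
        raise ValueError("comparison must be 'ind' or 'rep'")
    # Without an explicit subset, fall back to the distinct group values,
    # mirroring the documented default for `dimensions`.
    if dimensions is not None:
        levels = list(dimensions)
    else:
        levels = sorted(set(group_values))
    if len(levels) < 2:
        raise ValueError("at least two groups are needed for a comparison")
    size = "two" if len(levels) == 2 else "many"
    return TESTS[(comparison, size)]


two_group_tests = choose_tests(["Before", "After", "Before"], comparison="rep")
three_group_tests = choose_tests(["A", "B", "C", "A"])
```

Passing dimensions=["A", "B"] with three groups present would narrow the comparison to two groups and therefore switch to the two-group tests.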