A set of utilities for running and evaluating experiments at Greenhouse. Primarily designed to be installed in a Mode Analytics Python notebook.

Greenhouse Data Utilities

The Greenhouse Data Utilities package provides tools that streamline the evaluation of several common experimental designs. It is designed and maintained by the data science team at Greenhouse Software and is intended to be installed in a Mode Analytics Python notebook.

Please note that this package was designed for internal use by the data science team. You're welcome to use it, but we will prioritize the experimentation needs of our team when reading through issues/feature requests.

Sub-modules

  • gh_data_utils.data_visualization
  • gh_data_utils.stat_tests

data_visualization

Functions for generating visual representations of statistical tests.

get_overlapping_distributions

Generates a Seaborn plot of overlapping distributions for two or more groups, to visualize differences that parametric or non-parametric statistical tests may detect. These plots can be used in the final presentation of results.

Parameters
  • data: pandas dataframe, required

    Dataframe used to generate charts. Data should be in a tidy format.

  • groups: str, required

    The name of the column in data by which you are grouping results.

  • data_col: str, required

    The name of the column in data with continuous data to compare by group.

  • order: list, optional; default = None

    The order in which groups should appear in the graphs (e.g., Before and After). Must match the distinct values in the groups column.

  • x-label: str, optional; default = '' (empty)

    x-axis label for the chart.

  • y-label: str, optional; default = '' (empty)

    y-axis label for the chart.

  • title: str, optional; default = '' (empty)

    Title for the chart.

  • bins: int, optional; default = 20

    Number of bins for the distributions.

  • tick_format: str; options = 'int', 'pct'; default = 'int'

    A string indicating the tick format for the graph axes based on the type of data used.
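
"Tidy format" here means one row per observation, with one column holding the group label and another holding the measured value. The sketch below reshapes a wide table into that shape using only the standard library. The helper `to_tidy` and the column names (`user_id`, `before`, `after`, `period`, `value`) are hypothetical, chosen for illustration, and are not part of this package.

```python
def to_tidy(wide_rows, id_col, value_cols, group_name="period", value_name="value"):
    """Reshape wide rows (one column per group) into tidy rows (one row per observation)."""
    tidy_rows = []
    for row in wide_rows:
        for col in value_cols:
            # Each wide column becomes its own row, labeled with the column name.
            tidy_rows.append({id_col: row[id_col], group_name: col, value_name: row[col]})
    return tidy_rows


# Hypothetical wide data: one row per user, one column per measurement period.
wide = [
    {"user_id": 1, "before": 10.0, "after": 12.5},
    {"user_id": 2, "before": 8.0, "after": 9.0},
]

tidy = to_tidy(wide, id_col="user_id", value_cols=["before", "after"])
for row in tidy:
    print(row)
```

Passing `tidy` to `pandas.DataFrame` yields a dataframe in which `period` would play the role of groups and `value` the role of data_col.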

get_barplot

Generates a Seaborn barplot with error bars to visualize group means and their confidence intervals for two or more groups. These plots can be used in the final presentation of results.

Parameters
  • data: pandas dataframe; required

    Dataframe used to generate charts. Data should be in a tidy format.

  • groups: str; required

    The name of the column in data by which you are grouping results.

  • data_col: str; required

    The name of the column in data with continuous data to compare by group.

  • ci: int; default = 95

    Confidence level, in percent, of the confidence intervals drawn as error bars.

  • order: list; optional; default = None

    The order in which groups should appear in the graphs (e.g., Before and After). Must match the distinct values in the groups column.

  • hue: str; optional; default = None

    The name of the column in data to pass as the Seaborn hue, used to generate grouped comparisons.

  • x-label: str; default = '' (empty)

    x-axis label for the chart.

  • y-label: str; default = '' (empty)

    y-axis label for the chart.

  • title: str; default = '' (empty)

    Title for the chart.

  • palette: optional; default = None

    Seaborn palette for the chart.

  • tick_format: str; options = 'int', 'pct'; default = 'int'

    A string indicating the tick format for the graph axes based on the type of data used.
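
To make the role of ci concrete, here is a minimal sketch of a per-group mean with a confidence interval. It uses a normal approximation for simplicity, whereas Seaborn computes its error bars by bootstrapping by default, so the plotted bars may differ slightly. The helper `mean_ci` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(values, ci=95):
    """Return (mean, lower, upper) using a normal-approximation confidence interval."""
    m = mean(values)
    # Two-sided critical value, e.g. about 1.96 for ci=95.
    z = NormalDist().inv_cdf(0.5 + ci / 200)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


# Hypothetical measurements for two groups.
groups = {
    "Before": [10.0, 12.0, 11.0, 13.0, 9.0],
    "After": [14.0, 15.0, 13.0, 16.0, 12.0],
}

summary = {name: mean_ci(vals, ci=95) for name, vals in groups.items()}
for name, (m, lower, upper) in summary.items():
    print(f"{name}: mean={m:.2f}, {lower:.2f} to {upper:.2f}")
```

Raising ci widens the interval, and larger groups narrow it.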

stat_tests

Functions for conducting the appropriate parametric and non-parametric statistical test based on the specified experimental design and number of groups. We recommend conducting an a priori power analysis to ensure each group you're comparing has a sufficient sample size.
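
As an illustration of an a priori power analysis, the sketch below estimates the sample size per group for a two-sided comparison of two independent groups, using the standard normal approximation. The helper `sample_size_per_group` is hypothetical and not part of this package, and the effect size is a standardized mean difference (Cohen's d) that you would choose yourself.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_medium = sample_size_per_group(0.5)  # medium effect
n_small = sample_size_per_group(0.2)   # small effect
print(n_medium, n_small)
```

An exact calculation based on the t distribution gives slightly larger values, so treat these numbers as a lower bound.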

run_stat_test

Conducts a parametric and a non-parametric statistical test for two or more groups. The specific tests are chosen automatically based on the number of groups.

Parameters
  • data: pandas dataframe; required

    Dataframe used to run the statistical tests. Data should be in a tidy format.

  • groups: str; required

    The name of the column in data by which you are grouping results.

  • data_col: str; required

    The name of the column in data with continuous data to compare by group.

  • index: str; required

    The dataframe column to use as an index when shaping data for a specific test. This should be your unit of comparison (e.g., user, day, or event).

  • dimensions: list; optional; default = None

    A subset of groups to use in statistical comparisons. If this isn't specified, dimensions for statistical tests will be a list of distinct values in groups.

  • comparison: str; optional; options = 'ind', 'rep'; default = 'ind'

    The type of experimental design (independent or repeated measures).

  • description: str; optional; default = ''

    A description of the statistical test to include at the top of the summary of results. Defaults to an empty string, in which case only the standard output from each test is shown.
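
The automatic test selection can be pictured with the sketch below. The source does not name the tests that run_stat_test runs, so the pairs here are an assumption based on conventional choices for each design, and the helper `choose_tests` is hypothetical.

```python
def choose_tests(n_groups, comparison="ind"):
    """Return an assumed (parametric, non-parametric) test pair for a design."""
    if n_groups < 2:
        raise ValueError("at least two groups are needed for a comparison")
    if comparison not in ("ind", "rep"):
        raise ValueError("comparison must be 'ind' or 'rep'")

    # Conventional pairings; not confirmed as what the package actually runs.
    conventional = {
        ("ind", "two"): ("independent-samples t-test", "Mann-Whitney U test"),
        ("rep", "two"): ("paired-samples t-test", "Wilcoxon signed-rank test"),
        ("ind", "many"): ("one-way ANOVA", "Kruskal-Wallis H test"),
        ("rep", "many"): ("repeated-measures ANOVA", "Friedman test"),
    }
    size = "two" if n_groups == 2 else "many"
    return conventional[(comparison, size)]


print(choose_tests(2, "ind"))
print(choose_tests(3, "rep"))
```

When dimensions is supplied, the number of groups would be the length of that list rather than the number of distinct values in groups.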

