Skip to main content

Statistical analysis package in Python

Project description

ROBUSTA

Author: Eitan Hemed

robusta is a statistics package in Python3 providing an interface to many common statistical analyses, performed using through R and RPY2.

PLEASE NOTE robusta is under active development and is supplied as-is with no guarantees.

Installation

Install with pip using pip install robusta-stats, see also Installation.

Documentation

See here.

Usage

Importing the library and loading data

This could take ~15 seconds as many R libraries are imported under the hood. If you begin with an empty R environment the first you import robusta should take at least a couple of minutes, as R dependencies will be installed.

import robusta as rst
Initializing robusta. Please wait.


100%|██████████| 16/16 [00:14<00:00,  1.10it/s]
# Define a helper function, to pretty-print output of DFs when converting the notebook to .md
# https://gist.github.com/rgerkin/af5b27a0e30531c30f2bf628aa41a553
from tabulate import tabulate
import IPython.display as d

def md_print_df(df):
    md = tabulate(df, headers='keys', tablefmt='pipe')
    md = md.replace('|    |','| %s |' % (df.index.name if df.index.name else ''))
    result = d.Markdown(md)
    return result

First off, we need data. Using robusta we can import R built-in and some imported datasets. You can get a full list of the datasets, similarly to calling to data() with no input arguments in R.

md_print_df(rst.get_available_datasets().tail())
Package Item Description
285 ARTool Higgins1990Table5 Split-plot Experiment Examining Effect of Moisture and Fertilizer on Dry Matter in Peat Pots
286 ARTool Higgins1990Table1.art Aligned Rank Transformed Version of Higgins1990Table1
287 ARTool Higgins1990Table1 Synthetic 3x3 Factorial Randomized Experiment
288 ARTool ElkinABC Synthetic 2x2x2 Within-Subjects Experiment
289 ARTool ElkinAB Synthetic 2x2 Within-Subjects Experiment

We can import a dataset using rst.load_dataset

iris = rst.load_dataset('iris')
md_print_df(iris.head())
dataset_rownames Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 1 5.1 3.5 1.4 0.2 setosa
1 2 4.9 3 1.4 0.2 setosa
2 3 4.7 3.2 1.3 0.2 setosa
3 4 4.6 3.1 1.5 0.2 setosa
4 5 5 3.6 1.4 0.2 setosa

Running statistical analyses

Analyses are performed through using designated model objects that also store the . The model objects are returned through calls to the function API. In this example we create a model (m) object by calling t2samples. m will be used to fit the statistical model, returning the results object.

Here is a paired-samples t-test using the Students' sleep dataset previously loaded:

# Create the model
m = rst.groupwise.T2Samples(
    data=rst.load_dataset('sleep'), independent='group', 
    dependent='extra', subject='ID', paired=True, tail='less')

# Dataframe format of the results
md_print_df(m.report_table())
t df p-value Cohen-d Low Cohen-d Cohen-d High
0 -4.06213 9 0.00141645 -4.57304 -2.70809 -0.758771
# Textual report of the results - copy and paste into your results section!
m.report_text()
't(9) = -4.06, p = 0.001'

We can reset the models in order to update the model parameters and re-fit it. In this example, we run the same model an an independent samples t-test:

m.reset(paired=False, assume_equal_variance=True)
md_print_df(m.report_table())
t df p-value Cohen-d Low Cohen-d Cohen-d High
0 -1.86081 18 0.0395934 -1.83678 -0.877196 0.106988

Bayesian t-tests

bayes_t2samples and bayes_t1sample allow you to calculate Bayes factors or sample from the posterior distribution:

m = rst.groupwise.BayesT2Samples(
        data=rst.load_dataset('mtcars'), subject='dataset_rownames',
        dependent='mpg', independent='am', prior_scale=0.5,
        paired=False)

md_print_df(m.report_table())
model bf error
0 Alt., r=0.5 71.3861 3.96597e-09
# Test different null intervals and prior values:
m.reset(prior_scale=0.1, null_interval=[0, 0.5])
print(f'{m.report_text()}\n\n')
md_print_df(m.report_table())
Alt., r=0.1 [BF1:0 = 18.64, Error = 0.001%]
model bf error
0 Alt., r=0.1 18.6411 1.79878e-07

Analysis of variance

use Anova to run between, within or mixed-design ANOVA, we load the anxiety dataset for the next demonstrations.

For non-parametric ANOVAs see KruskalWallisTest, FriedmanTest and AlignedRanksTest

# Load the dataset and modify it from a 'wide' to 'long' format dataframe
anxiety = rst.load_dataset('anxiety').set_index(['id', 'group']
                                           ).filter(regex='^t[1-3]$').stack().reset_index().rename(
    columns={0: 'score',
             'level_2': 'time'})
md_print_df(anxiety.head())
id group time score
0 1 grp1 t1 14.1
1 1 grp1 t2 14.4
2 1 grp1 t3 14.1
3 2 grp1 t1 14.5
4 2 grp1 t2 14.6
m = rst.groupwise.Anova(
        data=anxiety, subject='id',
        dependent='score', between='group', within='time')
md_print_df(m.report_table())
R[write to console]: Contrasts set to contr.sum for the following variables: group
Term p-value Partial Eta-Squared F df1 df2
0 group 0.019 0.172 4.35 2 42
1 time 0.001 0.904 394.91 1.79 75.24
2 group:time 0.001 0.84 110.19 3.58 75.24

Similarly, we run the model usign only the between subject term (group). As the model was already generated we can simpyl drop the within-subject term:

m.reset(within=None)
md_print_df(m.report_table())
R[write to console]: Contrasts set to contr.sum for the following variables: group
Term p-value Partial Eta-Squared F df1 df2
0 group 0.019 0.172 4.35 2 42

R and many other statistical packages (e.g., statsmodels support a formula interface to fit statistical models. Here it is shown that a model can also be specified by the formula kwargs rather than specifying dependent, between etc. The formula indicates that the score column is regressed by the time variable, with observations nested within the id column.

m.reset(formula='score~time|id')
md_print_df(m.report_table())
Term p-value Partial Eta-Squared F df1 df2
0 time 0.001 0.601 66.23 1.15 50.55

We can also run a similar, bayesian ANOVA using BayesAnova comparing the specified terms to the null model:

m = rst.groupwise.BayesAnova(data=anxiety, within='time',
                        dependent='score', subject='id')
md_print_df(m.report_table())
model bf error
0 time 496.129 7.82496e-05

Work in progress and planned features

robusta includes several other features that are either under development or planned for the future.

Currently under work

  • Regressions and correlations modules

Planned

  • Sequential analysis plots (inspired by JASP)

How to contribute

All help is welcome, but currently there are no specific guidelines. Please contact Eitan Hemed

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

robusta_stats-0.0.3.tar.gz (57.9 kB view hashes)

Uploaded Source

Built Distribution

robusta_stats-0.0.3-py3-none-any.whl (62.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page