
descriptive statistics using Pebay results

Project description

Information about repository and package maintenance actions can be found on the Wiki.

Install the package from PyPI using pip:

bash> pip install pebaystats

pebaystats

Provides single-pass generation of statistical moments. This package is based on the formulas described in the document Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments, Philippe Pébay, Sandia National Laboratories.

Read “The Full Manual” for a more detailed description of this package.

The current implementation of this package allows computation of statistical moments for more than one data set (column) at a time. Currently only the first four moments are computed, and the general-purpose algorithm from the source paper is not yet implemented.
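To illustrate what "single pass" means here, the following is a minimal sketch of the well-known streaming update for the mean and the second central moment (the sum of squared deviations, n * Var). The function name `update` and its signature are hypothetical and are not part of pebaystats; the package's internal code may differ.

```python
def update(n, mean, m2, x):
    """Fold one value x into a running (count, mean, M2) triple.

    M2 is the sum of squared deviations from the mean (n * variance).
    Hypothetical helper for illustration; not the pebaystats API.
    """
    n += 1
    delta = x - mean          # deviation from the old mean
    mean += delta / n         # new mean
    m2 += delta * (x - mean)  # uses both the old and the new mean
    return n, mean, m2


# Accumulate the values 2, 6, 0, 4 one at a time, never storing the data.
n, mean, m2 = 0, 0.0, 0.0
for x in [2.0, 6.0, 0.0, 4.0]:
    n, mean, m2 = update(n, mean, m2, x)

print(n, mean, m2)
```

Each value is seen once and then discarded, so memory use stays constant regardless of how many rows are added.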

Note

The aggregation of kurtosis values appears to suffer some loss of accuracy when compared with the SciPy generated values. This does not occur when accumulating simple data values, and the cause is not currently known. I suspect this is an arithmetic precision effect, but have not investigated.

This Python implementation evolved from my C++ code which includes the ability to remove/disaggregate data from the accumulators as well. That feature will eventually be migrated here.

Quick Start

from __future__ import print_function

Import the aggregation object from the module.

from pebaystats import dstats

Create a few objects with various depths (number of moments) and widths (number of columns to compute statistics for). Here the stats1 and stats3 objects each accumulate two moments for a single column of data, and the stats2 object collects 4 statistical moments for 4 columns of data.

stats1 = dstats(2,1)
stats2 = dstats(4,4)
stats3 = dstats(2,1)

Add individual data values to the single column accumulation of the stats1 object. Print the object to view its state, which includes the moment values so far accumulated. Also, print the list of lists returned from the statistics() method call. Here you can see that the mean is 2.0 and the variance is 0.0.

stats1.add(2)
stats1.add(2)
stats1.add(2)
print('stats1: %s' % stats1)
print('statistics: %s' % stats1.statistics())
stats1: dstats: 2 moments, 1 columns, 3 rows
[[ 2.]
 [ 0.]]
statistics: [[ 2.]
 [ 0.]]

Add entire rows (multiple columns) of values to the stats2 object and view the accumulated results. Judging by the second row of the output, passing True to statistics() reports the standard deviation in place of the variance. Note that when the second moment (n * Var) is 0, equivalent to a deviation of 0, the higher moments are left in their initial 0 state. The higher statistics are set to NaN in this case.

stats2.add([1.2,2,3,9])
stats2.add([4.5,6,7,9])
stats2.add([8.9,0,1,9])
stats2.add([2.3,4,5,9])
print('stats2: %s' % stats2)
print('statistics: %s' % stats2.statistics(True))
stats2: dstats: 4 moments, 4 columns, 4 rows
[[  4.22500000e+00   3.00000000e+00   4.00000000e+00   9.00000000e+00]
 [  3.47875000e+01   2.00000000e+01   2.00000000e+01   0.00000000e+00]
 [  6.73818750e+01   7.10542736e-15   7.10542736e-15   0.00000000e+00]
 [  5.75139658e+02   1.64000000e+02   1.64000000e+02   0.00000000e+00]]
statistics: [[  4.22500000e+00   3.00000000e+00   4.00000000e+00   9.00000000e+00]
 [  2.94904646e+00   2.23606798e+00   2.23606798e+00   0.00000000e+00]
 [  6.56807734e-01   1.58882186e-16   1.58882186e-16              nan]
 [ -1.09897921e+00  -1.36000000e+00  -1.36000000e+00              nan]]
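The relationship between the accumulated moments and the reported statistics can be sketched as follows. This assumes the usual population definitions (standard deviation sqrt(M2/n), skewness sqrt(n) * M3 / M2^1.5, excess kurtosis n * M4 / M2^2 - 3), which reproduce the values printed above; the function `statistics_from_moments` is a hypothetical helper, not part of pebaystats.

```python
import math


def statistics_from_moments(n, mean, m2, m3, m4):
    """Convert central moment sums into (mean, std, skew, excess kurtosis).

    Hypothetical helper assuming population definitions; when M2 is 0 the
    higher statistics are undefined and reported as NaN.
    """
    if m2 == 0:
        return mean, 0.0, float('nan'), float('nan')
    std = math.sqrt(m2 / n)
    skew = math.sqrt(n) * m3 / m2 ** 1.5
    kurt = n * m4 / (m2 * m2) - 3.0
    return mean, std, skew, kurt


# Second column of stats2 above (M3 rounded to 0 for the illustration).
col = statistics_from_moments(4, 3.0, 20.0, 0.0, 164.0)
# Fourth column of stats2 above: constant data, so M2 is 0.
const = statistics_from_moments(4, 9.0, 0.0, 0.0, 0.0)

print(col)
print(const)
```

Guarding on M2 before dividing mirrors the per-column divide-by-zero handling described in the release notes.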

Remove data (UNIMPLEMENTED) from the stats2 object.

# stats2.remove(1.2,2,3,9)

Load the stats3 object with data and view the results.

stats3.add(4)
stats3.add(4)
stats3.add(4)
print('stats3: %s' % stats3)
print('statistics: %s' % stats3.statistics())
stats3: dstats: 2 moments, 1 columns, 3 rows
[[ 4.]
 [ 0.]]
statistics: [[ 4.]
 [ 0.]]

Now aggregate that object onto the first. This only works when the shapes are the same.

stats1.aggregate(stats3)
print('stats1: %s' % stats1)
print('statistics: %s' % stats1.statistics(True))
stats1: dstats: 2 moments, 1 columns, 6 rows
[[ 3.]
 [ 6.]]
statistics: [[ 3.]
 [ 1.]]
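The aggregated values above can be checked by hand with the standard pairwise combination formula for the mean and second moment. This is a minimal sketch under that assumption; `combine` is a hypothetical function and not the package's aggregate() implementation.

```python
def combine(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge two (count, mean, M2) summaries into one.

    Hypothetical helper for illustration; handles only the first two moments.
    """
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the distance between the two means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# stats1 held three 2s (mean 2, M2 0); stats3 held three 4s (mean 4, M2 0).
n, mean, m2 = combine(3, 2.0, 0.0, 3, 4.0, 0.0)
print(n, mean, m2)
```

The result is 6 rows with mean 3 and second moment 6, matching the printed state of stats1 after aggregation.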

History

0.1 (2016-11-13)

  • First release on PyPI

0.2 (2016-11-13)

  • Corrected some setup configuration issues

0.3 (2016-11-14)

  • Added support and tests for serialization

0.4 (2017-01-04)

  • Added repr() and str() support
  • Added exceptions for unsupported methods and unsupported moments
  • Handle divide by zero on a per column basis
  • Improved setup processing
  • Extended testing
    • started migrating to factored test dependencies
    • test columns with 0 variance
    • added SciPy for evaluating expected skew and kurtosis values
    • raise exceptions for unsupported moments
  • Extensive documentation updates
    • added Makefile to generate documentation and create README
    • removed optional files
    • changed to classic theme
    • extended content and examples

