Online parallel statistics calculator.

opstats

Python implementation of an online parallel statistics calculator. The library calculates the total, mean, variance, standard deviation, skewness and kurtosis. It can also calculate the covariance and correlation between two sequences of data points.

Online calculation is appropriate when you don't yet have the entire dataset from which to calculate the statistics (e.g. in a streaming environment). However, it is more processor-intensive than the traditional methods.

Combined with parallel computation, it is also useful for very large datasets, because it works in a single pass and the work can be distributed.
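To illustrate what "online" means here, the following is a minimal sketch of a single-pass mean and variance update (Welford's method). The RunningStats class is a hypothetical helper written for this example and is not part of opstats. It uses the population variance; which convention opstats itself uses is not stated here.

```python
class RunningStats:
    """Hypothetical helper (not the opstats API): Welford's online update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Update the state from one new value without storing earlier values.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance; divide by (count - 1) for the sample variance.
        return self.m2 / self.count if self.count else float("nan")


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(x)
```

Each call to add() does a few more arithmetic operations than simply accumulating a sum, which is the extra processing cost mentioned above.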

Installation

pip install opstats

Usage

Moment Calculator

For calculating the mean, variance (and standard deviation), skewness and kurtosis, use the MomentCalculator.

import random
from opstats import MomentCalculator
data_points = random.sample(range(1, 100), 20)
calc = MomentCalculator()
for d in data_points:
    calc.add(d)

result = calc.get()

The result will be a NamedTuple containing the computed moments up until this point. More data can subsequently be added and the result can be retrieved again.

Parallel Processing

Data can be split into multiple parts and processed in parallel. The resulting statistics can be combined using the aggregate_moments function.

from opstats import aggregate_moments
# Divide the sample data in half.
left_data = data_points[:len(data_points)//2]
right_data = data_points[len(data_points)//2:]
# Create stats for each half. 
left = MomentCalculator()
for d in left_data:
    left.add(d)

right = MomentCalculator()
for d in right_data:
    right.add(d)

# Combine the results.
result = aggregate_moments([left.get(), right.get()])
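As a rough illustration of why partial results can be combined, the sketch below merges two (count, mean, M2) summaries using the pairwise update described on the Wikipedia page listed in the credits. The summarize and merge functions are hypothetical names for this example and cover only the mean and variance; this is not the code behind aggregate_moments.

```python
def summarize(values):
    """Return (count, mean, M2) for a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge(a, b):
    """Combine two (count, mean, M2) summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the gap between the two partial means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n, mean, m2 = merge(summarize(data[:3]), summarize(data[3:]))
population_variance = m2 / n
```

Because merge only needs the three summary numbers from each part, the parts can be computed on different workers and combined afterwards.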

Covariance and Correlation

The CovarianceCalculator class and aggregate_covariance function work in the same manner as above for calculating the covariance and correlation between two sequences of data points.
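The idea behind an online covariance can be sketched in plain Python as follows. The RunningCovariance class is a hypothetical helper for illustration only; it does not show the CovarianceCalculator interface, whose method signatures are not documented above. It computes the population covariance.

```python
class RunningCovariance:
    """Hypothetical helper (not the opstats API): online covariance."""

    def __init__(self):
        self.count = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.co_moment = 0.0  # running sum of (x - mean_x) * (y - mean_y)

    def add(self, x, y):
        self.count += 1
        dx = x - self.mean_x  # deviation from the previous mean of x
        self.mean_x += dx / self.count
        self.mean_y += (y - self.mean_y) / self.count
        # Pair the old-mean deviation of x with the new-mean deviation of y.
        self.co_moment += dx * (y - self.mean_y)

    def covariance(self):
        # Population covariance; divide by (count - 1) for the sample version.
        return self.co_moment / self.count if self.count else float("nan")


cov = RunningCovariance()
for x, y in zip([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]):
    cov.add(x, y)
```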

Extended Statistics

When installed with pip install opstats[extended], cardinality and percentiles can also be estimated. Cardinality will be estimated with HyperLogLog and percentiles with T-Digest.

from opstats.extended import ExtendedCalculator
calc = ExtendedCalculator()
for d in data_points:
    calc.add(d)

result = calc.get()

This can also be calculated in parallel. Note the change in usage: call get_parallel(), which returns an intermediate object, on each calculator, then call calculate() on the aggregated object to compute the final values.

from opstats.extended import aggregate_extended
# Divide the sample data in half.
left_data = data_points[:len(data_points)//2]
right_data = data_points[len(data_points)//2:]
# Create stats for each half. 
left = ExtendedCalculator()
for d in left_data:
    left.add(d)

right = ExtendedCalculator()
for d in right_data:
    right.add(d)

# Combine the results.
result = aggregate_extended([left.get_parallel(), right.get_parallel()]).calculate()

Credits

Online calculator adapted from: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance (Terriberry, Timothy B)

Aggregation translated from: https://rdrr.io/cran/utilities/src/R/sample.decomp.R

Python HyperLogLog implementation: https://github.com/svpcom/hyperloglog

Python T-Digest implementation: https://github.com/CamDavidsonPilon/tdigest

