Streaming statistical moments

statmoments

Fast streaming univariate and bivariate moments and t-statistics.

statmoments is a high-performance library for computing univariate and bivariate statistical moments in a single pass over large waveform datasets with thousands of sample points. It can produce Welch's t-test statistics for hypothesis testing on arbitrary data partitions.

Features

  • Streaming processing for both univariate and bivariate analysis
  • Efficient memory usage through dense matrix representation
  • High numerical accuracy
  • Command-line interface for analysis of existing datasets

How is it different?

When differences in the input data are subtle, millions of waveforms may have to be processed to find a statistically significant difference, which requires efficient algorithms. In addition, conventional high-order moment computation needs multiple passes over the data and may have to start over when new data appear. With thousands of sample points per waveform, the problem becomes more complex.

A streaming algorithm processes sequences of inputs in a single pass as they are collected. When fast enough, it's suitable for real-time sources like oscilloscopes, sensors, and financial markets, as well as for large datasets that don't fit in memory. The dense matrix representation of an intermediate accumulator reduces memory requirements. The accumulator can be converted to co-moments and Welch's t-test statistics on demand. Data batches can be iteratively processed to increase precision and then discarded. The library handles significant input streams, processing hundreds of megabytes per second.
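As an illustration of the single-pass idea, the following is a minimal pure-Python sketch that keeps a running mean and variance per sample point and folds in batches as they arrive. It uses the well-known Welford update and is not the statmoments implementation; the class name `StreamingMoments` and its methods are hypothetical.

```python
class StreamingMoments:
    """Per-point running mean and variance, updated one batch at a time."""

    def __init__(self, n_points):
        self.count = 0
        self.mean = [0.0] * n_points
        self.m2 = [0.0] * n_points   # running sum of squared deviations

    def update(self, batch):
        """Fold in a batch of waveforms, each a sequence of n_points values."""
        for wform in batch:
            self.count += 1
            for j, x in enumerate(wform):
                delta = x - self.mean[j]
                self.mean[j] += delta / self.count
                self.m2[j] += delta * (x - self.mean[j])

    def variance(self):
        """Sample variance per point; needs at least two waveforms."""
        return [m / (self.count - 1) for m in self.m2]


acc = StreamingMoments(n_points=2)
acc.update([[1.0, 10.0], [2.0, 20.0]])   # first batch; can be discarded afterwards
acc.update([[3.0, 30.0]])                # a later batch refines the estimates
print(acc.mean)        # [2.0, 20.0]
print(acc.variance())  # [1.0, 100.0]
```

Only the accumulator state survives between batches, which is what lets the raw data be dropped after each update.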

Another dimension is added when the data split is unknown, that is, when it is not known in advance which bucket an input waveform belongs to. The library handles this by accepting a pre-classification of the input data and computing moments for all the requested data splits.
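One way to picture this is sketched below in pure Python, under the assumption that every waveform arrives with one binary label per split hypothesis. Per-class sums are accumulated for all hypotheses in the same pass. The function `update_split_means` and the data layout are illustrative assumptions, not the library's API.

```python
def update_split_means(sums, counts, wforms, classification):
    """Accumulate per-class sums for every split hypothesis in one pass.

    sums[h][c]   -- per-point sums for hypothesis h, class c (0 or 1)
    counts[h][c] -- number of waveforms seen for that hypothesis and class
    """
    for wform, labels in zip(wforms, classification):
        for h, c in enumerate(labels):
            counts[h][c] += 1
            row = sums[h][c]
            for j, x in enumerate(wform):
                row[j] += x


tr_len, cl_len = 3, 2
sums = [[[0.0] * tr_len for _ in range(2)] for _ in range(cl_len)]
counts = [[0, 0] for _ in range(cl_len)]

wforms = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [5.0, 5.0, 5.0]]
classification = [[0, 1], [1, 1], [0, 0]]   # one label per hypothesis
update_split_means(sums, counts, wforms, classification)

# Per-hypothesis, per-class means at every sample point
means = [[[s / counts[h][c] for s in sums[h][c]] for c in range(2)]
         for h in range(cl_len)]
print(means[0])  # [[3.0, 3.5, 4.0], [3.0, 2.0, 1.0]]
print(means[1])  # [[5.0, 5.0, 5.0], [2.0, 2.0, 2.0]]
```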

Some of the benefits of streaming computation include:

  • Real-time insights for trend identification and anomaly detection
  • Reduced data processing latency, crucial for time-sensitive applications
  • Scalability to handle large data volumes, essential for data-intensive research in fields like astrophysics and financial analysis

Numeric accuracy

The numeric accuracy of results depends on the coefficient of variation (COV) of a sample point in the input waveforms. With a COV of about 5%, the computed (co-)kurtosis has about 10 correct significant digits for 10,000 waveforms, sufficient for Welch's t-test. Increasing data by 100x loses one more significant digit.
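To see why the coefficient of variation matters, consider data whose spread is tiny compared to its mean. The sketch below, in pure Python, compares the textbook sum-of-squares variance formula with a Welford-style update on such data. It illustrates the general numerical effect only and says nothing about the algorithm statmoments uses; both function names are hypothetical.

```python
import math


def naive_variance(xs):
    """Textbook one-pass formula: (sum(x^2) - sum(x)^2 / n) / (n - 1)."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    return (s2 - s1 * s1 / n) / (n - 1)


def welford_variance(xs):
    """Numerically stable running mean and sample variance."""
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / (len(xs) - 1)


# Large mean, small spread: the exact sample variance is 30
xs = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
mean, var = welford_variance(xs)
cov = math.sqrt(var) / mean   # coefficient of variation, very small here

print("COV:", cov)
print("stable variance:", var)
print("naive variance:", naive_variance(xs))
```

With a COV this small, the naive formula subtracts two nearly equal large numbers and can lose most of its significant digits, while the stable update recovers the variance.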

Examples

Performing univariate data analysis

  import numpy as np
  import statmoments

  # Input data parameters
  tr_count = 100   # M input waveforms
  tr_len   = 5     # N features or points in the input waveforms
  cl_len   = 2     # L hypotheses on how to split input waveforms

  # wforms1..3 and classification1..3 are the caller's input batches
  # (waveforms and their split hypotheses); they are not defined here.

  # Create engine, which can compute up to kurtosis
  uveng = statmoments.Univar(tr_len, cl_len, moment=4)

  # Process input data and split hypotheses
  uveng.update(wforms1, classification1)

  # Process more input data and split hypotheses
  uveng.update(wforms2, classification2)

  # Get statistical moments
  mean      = [cm.copy() for cm in uveng.moments(moments=1)]  # E(X)
  skewness  = [cm.copy() for cm in uveng.moments(moments=3)]  # E(X^3)

  # Detect statistical differences in the first-order t-test
  for i, tt in enumerate(statmoments.stattests.ttests(uveng, moment=1)):
    if np.any(np.abs(tt) > 5):
      print(f"Data split {i} has different means")

  # Process more input data and split hypotheses
  uveng.update(wforms3, classification3)

  # Get updated statistical moments and t-tests
  # with statmoments.stattests.ttests(uveng, moment=1)
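For first-order tests, Welch's t-statistic can be written in terms of the per-class mean, variance, and count at each sample point. The sketch below is a plain-Python illustration of that formula, not the library's implementation; `welch_t` and the summary values are made up for the example, and the threshold of 5 is the one used in the snippet above.

```python
import math


def welch_t(mean0, var0, n0, mean1, var1, n1):
    """Welch's t-statistic for two groups with possibly unequal variances."""
    return (mean0 - mean1) / math.sqrt(var0 / n0 + var1 / n1)


# Hypothetical per-class summaries for three sample points
means0 = [10.0, 10.0, 10.0]
means1 = [10.0, 9.0, 8.0]
var0, var1, n0, n1 = 4.0, 9.0, 100, 100

tstats = [welch_t(m0, var0, n0, m1, var1, n1)
          for m0, m1 in zip(means0, means1)]

# Sample points whose |t| exceeds the threshold
flagged = [j for j, t in enumerate(tstats) if abs(t) > 5]
print(flagged)  # [2]
```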

Performing bivariate data analysis

  import numpy as np
  import statmoments

  # Input data parameters
  tr_count = 100   # M input waveforms
  tr_len   = 5     # N features or points in the input waveforms
  cl_len   = 2     # L hypotheses on how to split input waveforms

  # wforms1..3 and classification1..3 are the caller's input batches
  # (waveforms and their split hypotheses); they are not defined here.

  # Create bivariate engine, which can compute up to co-kurtosis
  bveng = statmoments.Bivar(tr_len, cl_len, moment=4)

  # Process input data and split hypotheses
  bveng.update(wforms1, classification1)

  # Process more input data and split hypotheses
  bveng.update(wforms2, classification2)

  # Get bivariate moments
  covariance    = [cm.copy() for cm in bveng.comoments(moments=(1, 1))]  # E(X Y)
  cokurtosis22  = [cm.copy() for cm in bveng.comoments(moments=(2, 2))]  # E(X^2 Y^2)
  cokurtosis13  = [cm.copy() for cm in bveng.comoments(moments=(1, 3))]  # E(X^1 Y^3)

  # Univariate statistical moments can also be obtained
  variance      = [cm.copy() for cm in bveng.moments(moments=2)]  # E(X^2)

  # Detect statistical differences in the second-order t-test (covariances)
  for i, tt in enumerate(statmoments.stattests.ttests(bveng, moment=(1, 1))):
    if np.any(np.abs(tt) > 5):
      print(f"Found stat diff in the split {i}")

  # Process more input data and split hypotheses
  bveng.update(wforms3, classification3)

  # Get updated statistical moments and t-tests
  # with statmoments.stattests.ttests(bveng, moment=(1, 1))
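To give a feel for what a single-pass co-moment update can look like, here is a minimal sketch of a streaming covariance between two sample points. It is the textbook online update and not necessarily the algorithm statmoments uses; `update_comoment` is a hypothetical helper.

```python
def update_comoment(state, x, y):
    """Fold one (x, y) observation into (n, mean_x, mean_y, c)."""
    n, mean_x, mean_y, c = state
    n += 1
    dx = x - mean_x            # deviation from the old mean of x
    mean_x += dx / n
    mean_y += (y - mean_y) / n
    c += dx * (y - mean_y)     # uses the already updated mean of y
    return n, mean_x, mean_y, c


state = (0, 0.0, 0.0, 0.0)
for x, y in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    state = update_comoment(state, x, y)

n, mean_x, mean_y, c = state
covariance = c / (n - 1)       # sample covariance
print(covariance)  # 2.0
```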

Performing data analysis from the command line

# Find univariate t-test statistics of skewness for the first
# 5000 waveform sample points, stored in an HDF5 dataset
python -m statmoments.univar -i data.h5 -m 3 -r 0:5000

# Find bivariate t-test statistics of covariance for the first
# 1000 waveform sample points, stored in an HDF5 dataset
python -m statmoments.bivar -i data.h5 -r 0:1000

More examples can be found in the examples and tests directories.

Implementation Notes

statmoments uses top BLAS implementations, including GPU acceleration based on nvmath-python if available, for the best performance on Windows, Linux, and macOS.

Due to RAM limits, results are produced one at a time for each input classifier as a set of statistical moments. Each classifier's output moment has dimensions 2 x M x L, where M is an index of the requested classifier and L is the region length.

The bivariate results, co-moments and t-tests, are represented by the upper triangle of the symmetric matrix, stored as a 1D array for each classifier.
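The sketch below shows one common way to pack the upper triangle of a symmetric matrix into a 1D array and to look up a pair of sample points in it. It assumes row-major order with the diagonal included; the layout actually used by statmoments may differ, and both helper names are hypothetical.

```python
def triu_index(i, j, n):
    """Flat index of element (i, j) in a row-major packed upper triangle."""
    if i > j:
        i, j = j, i            # symmetric: (i, j) and (j, i) share a slot
    return i * n - i * (i - 1) // 2 + (j - i)


def pack_upper(matrix):
    """Flatten the upper triangle (diagonal included) of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


sym = [[1, 2, 3],
       [2, 4, 5],
       [3, 5, 6]]
packed = pack_upper(sym)
print(packed)                        # [1, 2, 3, 4, 5, 6]
print(packed[triu_index(2, 1, 3)])   # 5
```

For N sample points the packed array holds N * (N + 1) / 2 values instead of N * N, roughly halving the storage.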

Installation

pip install statmoments

References

Anton Kochepasov, Ilya Stupakov, "An Efficient Single-pass Online Computation of Higher-Order Bivariate Statistics", 2024 IEEE International Conference on Big Data (BigData), 2024, pp. 123-129, IEEE Xplore.

@INPROCEEDINGS{10825659,
  author={Stupakov, Ilya and Kochepasov, Anton},
  booktitle={2024 IEEE International Conference on Big Data (BigData)},
  title={An Efficient Single-pass Online Computation of Higher-Order Bivariate Statistics},
  year={2024},
  pages={123-129},
  doi={10.1109/BigData62323.2024.10825659}
}
