Skip to main content

Python package for *fast* TDigest calculation.

Project description

pytest doc

PyTDigest: Fast streaming calculation of approximate quantiles

Python package for fast TDigest calculation.

  • C implementation and thin Python wrapper

  • suitable for big data and streaming and distributed settings

  • developed for smooth compatibility with Pandas and numpy

Based on previous work of Ted Dunning and Andrew Werner.

Basic example

from pytdigest import TDigest
import numpy as np
import pandas as pd

rng = np.random.default_rng(12354)
n = 100_000
x = rng.normal(loc=0, scale=10, size=n)
w = rng.exponential(scale=1, size=n)

# estimation from data is simple:
td = TDigest.compute(x, w, compression=1_000)
# td now contains "compressed" distribution: centroids with their means and weights

# TDigest can be used to provide mean or approximate quantiles (i.e. inverse CDF):
td.mean
quantiles = [0., 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.]
td.inverse_cdf(quantiles)

# these results are close to numpy values (note that numpy provides only unweighted quantiles)
np.average(x, weights=w)  # mean should be exact
np.quantile(x, quantiles)

# TDigest can be copied
td2 = td.copy()

# and multiple TDigests can be added together to provide approximate quantiles for the overall dataset
td + td2

Performance

Ted Dunning’s original algorithm in Java takes about ~140 ns per addition on average. This package needs about ~200 ns per addition when called from Python on numpy arrays, so the performance is fairly comparable with the original implementation. All other tested TDigest implementations in Python are much slower.

import numpy as np
from pytdigest import TDigest
import time

rng = np.random.Generator(np.random.PCG64(12345))

for n in [100_000, 1_000_000, 10_000_000]:
    x = rng.normal(size=n)
    w = rng.exponential(size=n)

    start = time.time()
    td = TDigest.compute(x, w)
    end = time.time()
    print(f'PyTDigest, n = {n:,}: {end - start:.3g} seconds')

# PyTDigest, n = 100,000: 0.0222 seconds
# PyTDigest, n = 1,000,000: 0.21 seconds
# PyTDigest, n = 10,000,000: 2.02 seconds

Similar packages

Several Python packages or wrappers exist for the TDigest algorithm.

tdigest

The most popular on GitHub is a pure Python tdigest package. Pure Python implementation is indeed very slow – more than 100x slower than this package:

import numpy as np
from pytdigest import TDigest
from tdigest import TDigest as TDigestPython
import time

rng = np.random.Generator(np.random.PCG64(12345))
n = 100_000
x = rng.normal(size=n)
w = rng.exponential(size=n)

start = time.time()
td = TDigest.compute(x, w)
end = time.time()
print(f'PyTDigest: {end - start:.3g} seconds')
# PyTDigest: 0.0246 seconds

tdp = TDigestPython()
start = time.time()
tdp.batch_update(x)
end = time.time()
print(f'TDigest: {end - start:.3g} seconds')
# TDigest: 7.26 seconds

Different weights for can be used in tdigest only with update method for adding a single observation.

t-digest CFFI

Other package is t-digest CFFI, a thin Python wrapper over C implementation. It does not pass batch updates into the C layer, so the iteration has to be done in python:

import numpy as np
from tdigest import TDigest as TDigestCFFI
import time

rng = np.random.Generator(np.random.PCG64(12345))
n = 100_000
x = rng.normal(size=n)

tdcffi = TDigestCFFI()
start = time.time()
for xx in x:
    tdcffi.insert(xx)
end = time.time()
print(f'TDigest-CFFI: {end - start:.3g} seconds')

Hence this package is still almost 20x slower than this package when used over numpy arrays. In addition, t-digest CFFI package allows only for integer weights.

qtdigest

qtdigest’s own benchmarking states that 100 000 additions take about 1.7 s, so it is again almost 100x slower than this package.

tdigestc

tdigestc by ajwerner is a simple C implementation with wrappers for different languages. The Python wrapper is very basic, it is not published on PyPI and some functionality was missing in the underlying C implementation (for instance support for batch updates based on numpy arrays), so I took this package as the starting point and added several useful features for use as a standalone Python package.

Future plans

There are several improvements that can be done in the future:

  • TDigest can calculate exact variance in addition to mean.

  • Alternating merging procedure (the centroids are always merged left to right in the C implementation, however Ted Dunning states that alternating merging improves the precision).

  • Scaling function for merging centroids is hard-coded at the moment. Ted Dunning mentions several possible functions that can be used in merging.

  • Centroids can store information about their variance - the resulting TDigest should be still composable and fast and it can work much better for discrete distributions.

Documentation

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytdigest-0.1.4.tar.gz (18.6 kB view details)

Uploaded Source

Built Distribution

pytdigest-0.1.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (28.8 kB view details)

Uploaded CPython 3.10 manylinux: glibc 2.5+ x86-64

File details

Details for the file pytdigest-0.1.4.tar.gz.

File metadata

  • Download URL: pytdigest-0.1.4.tar.gz
  • Upload date:
  • Size: 18.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.6

File hashes

Hashes for pytdigest-0.1.4.tar.gz
Algorithm Hash digest
SHA256 573dfee19bb0f26fc1f43d8a571b513f9d8f09ff1add98ce99826357e0070f0d
MD5 b9b6dd16613bb3aa30290ee09f97e612
BLAKE2b-256 fbaf6c94278db5800bb24366955426fd23b7f90576edb14dfd01dffd3fe8fcec

See more details on using hashes here.

File details

Details for the file pytdigest-0.1.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for pytdigest-0.1.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 e29e083ed4564fff0484ad8a9ba2e95d613bdef4d8c04caf05dbf455da76f59d
MD5 62e22d5a728a65da2c71edee5b60c6a1
BLAKE2b-256 22557a1bd79521e99226a004665c6ef4370cbd7c4e5b17fccb2617be02eff664

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page