Skip to main content

Use the CVM algorithm to quickly estimate the number of distinct elements in a stream

Project description

A Python implementation of the CVM Algorithm for Counting Distinct Elements

This library implements the algorithm described in

Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). Distinct Elements in Streams: An Algorithm for the (Text) Book. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34

The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

This Python module leverages the cvmcount Rust library: https://github.com/urschrei/cvmcount

What does that mean

The count-distinct problem, or cardinality-estimation problem refers to counting the number of distinct elements in a data stream with repeated elements. As a concrete example, imagine that you want to count the unique words in a book. If you have enough memory, you can keep track of every unique element you encounter. However, you may not have enough working memory due to resource constraints, or the number of potential elements may be enormous. This constraint is referred to as the bounded-storage constraint in the literature.

In order to overcome this constraint, streaming algorithms have been developed: Flajolet-Martin, LogLog, HyperLogLog. The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an unbiased estimate of the cardinality.

What is an Element

Any hashable Python object or primitive. Not f32 / f64, however, as they don't form a total ordering due to the presence of NaN.

Further Details

Don Knuth has written about the algorithm (he refers to it as Algorithm D) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure – called a treap – as a binary tree

"that’s capable of holding up to s ordered pairs (a, u), where a is an element of the stream and u is a real number, 0 ≤ u < 1."

where s >= 1. Our implementation doesn't use a treap as a buffer; it uses a fast HashSet with the FxHash algorithm: we pay the hash cost when inserting, but search in step D4 is O(1). The library may switch to a treap implementation eventually.

Installation

pip install count_distinct

Usage

from count_distinct import CVM

# values for epsilon, delta, and stream size are described in the docstring.
counter = CVM(0.8, 0.1, 1000)
counter.add(2)
counter.add(5)
# keep adding elements
count = counter.calculate_final_result()
# you can keep adding elements if you wish

Perf

Memory, 10e8 random 7-digit positive integers

Allocation of integer array: 763 MiB

count_distinct: 768 MiB

np.unique: 1.7 GiB

Note

If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

count_distinct-0.1.14.tar.gz (11.5 kB view hashes)

Uploaded Source

Built Distributions

count_distinct-0.1.14-cp310-abi3-win_amd64.whl (137.4 kB view hashes)

Uploaded CPython 3.10+ Windows x86-64

count_distinct-0.1.14-cp310-abi3-win32.whl (128.0 kB view hashes)

Uploaded CPython 3.10+ Windows x86

count_distinct-0.1.14-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (242.6 kB view hashes)

Uploaded CPython 3.10+ manylinux: glibc 2.17+ x86-64

count_distinct-0.1.14-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (247.4 kB view hashes)

Uploaded CPython 3.10+ manylinux: glibc 2.17+ ARM64

count_distinct-0.1.14-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl (248.5 kB view hashes)

Uploaded CPython 3.10+ manylinux: glibc 2.5+ i686

count_distinct-0.1.14-cp310-abi3-macosx_11_0_arm64.whl (212.4 kB view hashes)

Uploaded CPython 3.10+ macOS 11.0+ ARM64

count_distinct-0.1.14-cp310-abi3-macosx_10_12_x86_64.whl (217.8 kB view hashes)

Uploaded CPython 3.10+ macOS 10.12+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page