Skip to main content

Use the CVM algorithm to quickly estimate the number of distinct elements in a stream

Project description

A Python implementation of the CVM Algorithm for Counting Distinct Elements

This library implements the algorithm described in

Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). Distinct Elements in Streams: An Algorithm for the (Text) Book. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34

The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

This Python module leverages the cvmcount Rust library: https://github.com/urschrei/cvmcount

What does that mean

The count-distinct problem, or cardinality-estimation problem refers to counting the number of distinct elements in a data stream with repeated elements. As a concrete example, imagine that you want to count the unique words in a book. If you have enough memory, you can keep track of every unique element you encounter. However, you may not have enough working memory due to resource constraints, or the number of potential elements may be enormous. This constraint is referred to as the bounded-storage constraint in the literature.

In order to overcome this constraint, streaming algorithms have been developed: Flajolet-Martin, LogLog, HyperLogLog. The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an unbiased estimate of the cardinality.

What is an Element

Any hashable Python object or primitive. Not f32 / f64, however, as they don't form a total ordering due to the presence of NaN.

Further Details

Don Knuth has written about the algorithm (he refers to it as Algorithm D) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure – called a treap – as a binary tree

"that’s capable of holding up to s ordered pairs (a, u), where a is an element of the stream and u is a real number, 0 ≤ u < 1."

where s >= 1. Our implementation doesn't use a treap as a buffer; it uses a fast HashSet with the FxHash algorithm: we pay the hash cost when inserting, but search in step D4 is O(1). The library may switch to a treap implementation eventually.

Installation

pip install count_distinct

Usage

from count_distinct import CVM

# values for epsilon, delta, and stream size are described in the docstring.
counter = CVM(0.8, 0.1, 1000)
counter.add(2)
counter.add(5)
# keep adding elements
count = counter.calculate_final_result()
# you can keep adding elements if you wish

Perf

Memory, 10e8 random 7-digit positive integers

Allocation of integer array: 763 MiB

count_distinct: 768 MiB

np.unique: 1.7 GiB

Note

If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

count_distinct-0.1.16.tar.gz (10.0 kB view details)

Uploaded Source

Built Distributions

count_distinct-0.1.16-cp310-abi3-win_amd64.whl (137.4 kB view details)

Uploaded CPython 3.10+ Windows x86-64

count_distinct-0.1.16-cp310-abi3-win32.whl (128.1 kB view details)

Uploaded CPython 3.10+ Windows x86

count_distinct-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (242.6 kB view details)

Uploaded CPython 3.10+ manylinux: glibc 2.17+ x86-64

count_distinct-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (247.4 kB view details)

Uploaded CPython 3.10+ manylinux: glibc 2.17+ ARM64

count_distinct-0.1.16-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl (248.6 kB view details)

Uploaded CPython 3.10+ manylinux: glibc 2.5+ i686

count_distinct-0.1.16-cp310-abi3-macosx_11_0_arm64.whl (212.5 kB view details)

Uploaded CPython 3.10+ macOS 11.0+ ARM64

count_distinct-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl (217.9 kB view details)

Uploaded CPython 3.10+ macOS 10.12+ x86-64

File details

Details for the file count_distinct-0.1.16.tar.gz.

File metadata

  • Download URL: count_distinct-0.1.16.tar.gz
  • Upload date:
  • Size: 10.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for count_distinct-0.1.16.tar.gz
Algorithm Hash digest
SHA256 6de705706762758271b79dc536bd717f929cb3b686036aab850a24f5a66cca85
MD5 c4f48e06fdca6a972546b3b0f74b8147
BLAKE2b-256 0d1ef7b8858139de46647f4a1c8775407b5e242d240eb01775c7239998e55c7f

See more details on using hashes here.

File details

Details for the file count_distinct-0.1.16-cp310-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for count_distinct-0.1.16-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 7dbdea76bc46c29d5ae8f0594d1b4c388ee00492eb8304f7138d76304270f4cb
MD5 80e1fbd5d4dd8ea4bbfc673d9cf1c42b
BLAKE2b-256 c4230ed145b71eb45e94ebbd4b9646760cdc6277ac60eb0c9935167fdf02e4ec

See more details on using hashes here.

File details

Details for the file count_distinct-0.1.16-cp310-abi3-win32.whl.

File metadata

File hashes

Hashes for count_distinct-0.1.16-cp310-abi3-win32.whl
Algorithm Hash digest
SHA256 875f709d508c7e85a867ac32e7ccc397eef7f63b5351b4dcb97848ba5524b931
MD5 a658b51da3ce5eb853f84aa7c70790d6
BLAKE2b-256 f988b769098de15e35dc63be859c0a4efb5332a15208195f7e0d0240ad0b30dc

See more details on using hashes here.

File details

Details for the file count_distinct-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for count_distinct-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fc146f1e8da087b7a47b15229943de9a6c67ceb4dd7dc50720e51a7a006015a4
MD5 d8da289b10b2712081303e1d4306196d
BLAKE2b-256 5460d0f9961cc1e7dc22a04ba7b1617b5faa8e5b7d4f9ef064df1a6d378d946b

See more details on using hashes here.

File details

Details for the file count_distinct-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for count_distinct-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 f0f9bb4b1bed68501cd743a8f54647af85b85283bf084b6fd956d776267e0b4d
MD5 843bb058634d5b1142fa3104f3565491
BLAKE2b-256 5352eb36e363a5ec3635ca2db53dccb280b8c0228ab733e86f46b2e4b3266349

See more details on using hashes here.

File details

Details for the file count_distinct-0.1.16-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl.

File metadata

File hashes

Hashes for count_distinct-0.1.16-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl
Algorithm Hash digest
SHA256 a4d1895648e7d85cf015a5cb2efae577073106f8268825a7b6cd1edb78750c34
MD5 bc547fa7103b1cebeb67d2d8d665033b
BLAKE2b-256 cb37ed8c173243a31eae78d89dcff8c539976778f9dad3a518aa878ffabd81f2

See more details on using hashes here.

File details

Details for the file count_distinct-0.1.16-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for count_distinct-0.1.16-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 cf0cf9218ec45b45c9f7967904bc527511719fabd6cc4ffc5f3252da785d4168
MD5 ab534ed7022e8053904424cbfb33a730
BLAKE2b-256 0dade121477028cb9f29133599275a35bc56462a280df17c150fc4d64bb8d5a4

See more details on using hashes here.

File details

Details for the file count_distinct-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for count_distinct-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 6195123c6ea62b9e60f7f89218a140a4e5b0af1d0c893b3f7939d9189758d5e3
MD5 0a65067dcc2342d7b2f881b83ae2c985
BLAKE2b-256 3b75152578c29c27c3e4f6e57ce115adedb4b0603d7ebc1f00ab8870b763ed7c

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page