Skip to main content

No project description provided

Project description

CVM

Python implementation of the CVM algorithm which estimates the number of unique items in a stream. The algorithm is memory efficient and can estimate unique number of elements quite accurately depending on the memory allocated to the algorithm.

A deterministic algorithm which counts the number of unique elements will need to store all unique elements seen so far. If the number of unique elements is high, then a lot of memory will be used.

The CVM algorithm however lets you set a limit to how many elements you want to store in the memory (known as the threshold in the paper) and is still able to estimate the number of unique elements seen so for. The more elements you let the algorithm store, the better the estimate becomes.

For a list with ~150K unique elements, this algorithm is able to estimate the number of unique elements to about 2% error while storing only 1K elements. In other words, the algorithm achieved only ~2% error in estimating the number of unique elements by using only the memomry required to store 1K elements rather than 150K elements. Keep in mind the algorithm is probabilistic, you will see a different estimate each time.

For more details -

Installation

The package only requires python = "^3.6". There are no other dependencies that require installation. Install the package using the following command

pip install cvm_count

Usage

The following code shows how to use the package.

import cvm_count

def test_estimate(word_list):
    
    #create a CVM class object, specifying the threshold - the number of elemenets that can be stored in a set.
    counter = cvm_count.CVM(threshold=1000)

    for word in word_list:
        #use the method `record` to go through each element
        counter.record(word)

    actual_num_unique = len(set(word_list))
    #use the method `estimate` to estimate the number unique elements seen so far.
    estimated_num_unique = counter.estimate()

    print('Actual =', actual_num_unique, ', Estimated =', estimated_num_unique)
    print('Difference =', actual_num_unique - estimated_num_unique)
    print('Difference perc=', round(100*(actual_num_unique - estimated_num_unique)/actual_num_unique, 2))
    print('Num rounds =', counter._num_rounds)

The methods record and estimate can be used multiple times in a code in any order. Any datatype which can be handled by the Set() data structure in Python can be used here. For example, in the record method, we can pass int, float and str datatypes at the same time.

The package can also help estimate the count elements in a situation where record is being called from different threads/processes. Refer to the example below.

from joblib import Parallel, delayed
import cvm_count 

def test_estimate_multi_thread(word_list):

    counter = cvm_count.CVM(1000, multi_threading=True)
    Parallel(n_jobs=10, prefer="threads")(delayed(counter.record)(word) for word in word_list)

    actual_num_unique = len(set(word_list))
    estimated_num_unique = counter.estimate()

    print('Actual =', actual_num_unique, ', Estimated =', estimated_num_unique)
    print('Difference =', actual_num_unique - estimated_num_unique)
    print('Difference perc=', round(100*(actual_num_unique - estimated_num_unique)/actual_num_unique, 2))
    print('Num rounds =', counter._num_rounds)

The code above will take some time to execute because a lot of threads will be created and discarded while processing.

Issues

The package has not been stress tested. Use at your own risk. However, given the simplicity of the algorithm and the fact that python float has a 53 bit precision, I personally don't see any issue with numerical stability or any other thing for that matter.

Github

Link to the repo - CVM Python Implementation. The core implementation is present in this file and is quite small and easy to read. Please feel free to raise issues, PRs etc in the repo.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cvm_count-0.1.1.tar.gz (3.7 kB view details)

Uploaded Source

Built Distribution

cvm_count-0.1.1-py3-none-any.whl (4.2 kB view details)

Uploaded Python 3

File details

Details for the file cvm_count-0.1.1.tar.gz.

File metadata

  • Download URL: cvm_count-0.1.1.tar.gz
  • Upload date:
  • Size: 3.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.12

File hashes

Hashes for cvm_count-0.1.1.tar.gz
Algorithm Hash digest
SHA256 2e04ea4ff7cce4b02c931c259dbe49e61bdbdd7903d23e65f3635b43c5104dcf
MD5 5bf4b6e3f358f1365e775a697e8364b2
BLAKE2b-256 f9e4e93fe97b78115f76a54519ad9c440e77195236c62bc406eb677a7c7368fd

See more details on using hashes here.

File details

Details for the file cvm_count-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: cvm_count-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 4.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.12

File hashes

Hashes for cvm_count-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4b102d2552363b9775051cfe8c781a822cbb3151e01b25fd7bac90e48a3b0b5c
MD5 fce0513983b80c8fd4e1f22c35ddc606
BLAKE2b-256 0839ce36a0e1ec2a96c834eee38e39e5a81bccc52442461649186b8a04af2d8d

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page