Skip to main content

A cardinality estimator using the Flajolet-Martin algorithm

Project description

Cardinality Estimation using the Flajolet-Martin algorithm


Imagine if you have data and you wish to know its cardinality (the number of unique elements present).Calcalating the cardinality can have multiple uses, e.g to count the number of unique visitors on a website.

You could code a simple algorithm to loop through the dataset and check whether each element only occurs once or not, but this isn't feasible if you have a gigantic dataset with terabytes worth of data which can not even fit in your computer's memory. To solve this problem, one can use Cardinality Estimators which give a very close estimate of a dataset's cardinality at the cost of minimal memory usage.

The Idea Behind it


Provided that you have a large dataset, the probability of seeing a hashed item (converted to its binary form) which ends with an x amount of zeroes is 2^x. For example, if you need 3 zeroes at the end, the probability is 0.5 * 0.5 * 0.5 = 0.125 because each bit is either a 0 or a 1. Hence, on average, you would get one number with 3 zeroes at the end out of 1/0.5^x numbers which is the equivalent of 2^x. This is another simple method of estimating the cardinality, but it can give drastically inaccurate results if a hashed value has too many zeroes, and its estimates are in the power of 2 (256, 512, 1024...). One improvement on this method is to use various hash functions and average the estimates given to us, but a variety of hash functions is computationally expensive. To bypass this computational constraint, we can use a method called stochastic averaging where we divide the output of a single hash function into two parts. The number of buckets where we store the maximum number of 0's denoted by m and number of bits used to calculate which bucket we store our maximum number of 0's denoted by k. The accuracy of this algorithm, according to the paper it is based on, boils down to 1.3/sqrt(m) where m is the number of buckets so depending on the accuracy you want, you can vary the value of m but not to extremely large values. The reason being that your value of k (which is the number of bits used in the hashed value to calculate the bucket index) is determined by log(m), and you don't want to use a large number of bits for k because it reduces accuracy. For example, if you have a binary value of 10000100000 and you use a value of 1024 for m, then you only get 1 as the maximum number of 0's instead of 5.

An example of values for m/k for a binary input of 10010000 using k as 2 would result in using the left 2 most bits to calculate the bucket number (10 corresponds to bucket no.2) and using the remaining bits to count the number of ending 0's (which are 4 in this case), and then storing the number of ending 0's in that bucket.

Note: This algorithm produces predictably larger estimates; hence, to correct for the bias, the final output is multiplied by a constant of 0.79402 which was derived through statistical analyses by Mariannae Durand and Phillipie Flajolet

Usage


import cardinality_cs110
data = ['sample data here']
print (cardinality_cs110.flajolet_martin(data,k))
'output is the estimated number of unique elements'

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cardinality_cs110-0.0.9.tar.gz (3.1 kB view details)

Uploaded Source

Built Distribution

cardinality_cs110-0.0.9-py3-none-any.whl (5.1 kB view details)

Uploaded Python 3

File details

Details for the file cardinality_cs110-0.0.9.tar.gz.

File metadata

  • Download URL: cardinality_cs110-0.0.9.tar.gz
  • Upload date:
  • Size: 3.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.1

File hashes

Hashes for cardinality_cs110-0.0.9.tar.gz
Algorithm Hash digest
SHA256 503b00837e5fdfd2b8d7816e1fda82c3b6b71932d12978d8de25248c623ae79b
MD5 b202e61d5eb6388e237a65e862ee33aa
BLAKE2b-256 b379d95ce2f4da833c6acc3f9691c769a35c935d785d9bdcb43ad244ae1cd6aa

See more details on using hashes here.

File details

Details for the file cardinality_cs110-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: cardinality_cs110-0.0.9-py3-none-any.whl
  • Upload date:
  • Size: 5.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.7.1

File hashes

Hashes for cardinality_cs110-0.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 3fd9b9a5f49dee542c5e5f738ce4ae261d22f662f51e9b4c209f53f277559cc3
MD5 017bde0837f343798dbbee3733b4d9e7
BLAKE2b-256 e6e179b1628152bbfbb4e9074233036192adc452fcdead2a0ffd923ef8489ff7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page