Skip to main content

Learned sample-based estimator for number of distinct values.

Project description

Learned NDV estimator

Learned model to estimate number of distinct values (NDV) of a population using a small sample. The model approximates the maximum likelihood estimation of NDV, which is difficult to obtain analytically. See our VLDB 2022 paper Learning to be a Statistician: Learned Estimator for Number of Distinct Values for more details.

How to use

  1. Install the package

    pip install estndv

  2. Import and create an instance

   from estndv import ndvEstimator
   estimator = ndvEstimator()
  1. Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate population ndv by:

    ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)

  2. If you have the sample profile e.g. f=[2,1,1], you can estimate population NDV by:

    ndv = estimator.profile_predict(f=[2,1,1], N=100000)

  3. If you have multiple samples/profiles from multiple populations, you can estimate population NDV for all of them in a batch by method estimator.sample_predict_batch() or estimator.profile_predict_batch().

How to train the ndv estimator

You can directly use our package on PyPI for your datasets, as the pre-trained model is agnostic to any workloads. However, if you want to train the model from scratch anyway, do the following:

  1. Go to the model_training folder cd model_training

  2. Install requirements

    pip install requirements.txt

  3. Generate training data. (This uses a lot of memory.)

    python training_data_generation.py

  4. Train model

    python model_training.py

  5. Save trained pytorch model parameters to numpy, this generates a file model_paras.npy

    python torch2npy.py

  6. Test with your model parameters by specifying a path to your model_paras.npy

    estimator = ndvEstimator(para_path=your path to model_paras.npy)

Citation

If you use our work or found it useful, please cite our paper:

@article{wu2022learning,
   author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
   title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
   year = {2021},
   issue_date = {October 2021},
   publisher = {VLDB Endowment},
   volume = {15},
   number = {2},
   issn = {2150-8097},
   url = {https://doi.org/10.14778/3489496.3489508},
   doi = {10.14778/3489496.3489508},
   journal = {Proc. VLDB Endow.},
   month = {oct},
   pages = {272–284},
   numpages = {13}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

estndv-0.0.3.tar.gz (188.6 kB view details)

Uploaded Source

Built Distribution

estndv-0.0.3-py3-none-any.whl (188.0 kB view details)

Uploaded Python 3

File details

Details for the file estndv-0.0.3.tar.gz.

File metadata

  • Download URL: estndv-0.0.3.tar.gz
  • Upload date:
  • Size: 188.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.6

File hashes

Hashes for estndv-0.0.3.tar.gz
Algorithm Hash digest
SHA256 72305925ac8516f227971bd312760ce84de9a2853ccc6a89b727cd2e28fed0c8
MD5 8845d56764c313c5607e8d34ea45997e
BLAKE2b-256 6bdce6cf8628c0bba89a2dc7887dccf27cf6ae8a0cb0217905b7671cdfb8f152

See more details on using hashes here.

File details

Details for the file estndv-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: estndv-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 188.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.6

File hashes

Hashes for estndv-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 44f00b89844b55a03eaff6eeac04a38812e71bc76ff041d5a55964c174d2b2d6
MD5 8660e691888aafda1dd7f7a01f110ba7
BLAKE2b-256 de9f2a0599b9a085eb0f527993a504037931754f320e62b60446670c60769b46

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page