Learned sample-based estimator for number of distinct values.
Learned NDV estimator
A learned model that estimates the number of distinct values (NDV) of a population from a small sample. The model approximates the maximum likelihood estimate of NDV, which is difficult to obtain analytically. See our VLDB 2022 paper, "Learning to Be a Statistician: Learned Estimator for Number of Distinct Values", for more details.
How to use
- Install the package
pip install estndv
- Import and create an instance
from estndv import ndvEstimator
estimator = ndvEstimator()
- Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate the population NDV with:
ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)
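- If you already have the sample profile, e.g. f=[2,1,1] (two values occur once, one value occurs twice, and one value occurs three times in the sample above), you can estimate the population NDV with: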
ndv = estimator.profile_predict(f=[2,1,1], N=100000)
- If you have multiple samples or profiles from multiple populations, you can estimate the population NDV for all of them in one batch with estimator.sample_predict_batch() or estimator.profile_predict_batch().
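The sample profile used above can be derived from a raw sample by counting how many distinct values occur once, twice, and so on. The sketch below is not part of estndv: `sample_profile` is a hypothetical helper, and the layout of `f` (index 0 holds the number of values seen exactly once) is inferred from the example S and f given above.

```python
from collections import Counter


def sample_profile(sample):
    """Return f where f[j-1] is the number of distinct values seen exactly j times.

    Hypothetical helper for illustration; the layout of f is an assumption
    inferred from the README example, not taken from the estndv source.
    """
    value_counts = Counter(sample)                 # value -> occurrences in the sample
    freq_of_freq = Counter(value_counts.values())  # occurrences -> number of distinct values
    max_count = max(freq_of_freq, default=0)       # 0 for an empty sample
    return [freq_of_freq.get(j, 0) for j in range(1, max_count + 1)]


S = [1, 1, 1, 3, 5, 5, 12]
f = sample_profile(S)
print(f)  # [2, 1, 1]
```

Under that assumption, the result could then be passed on as `estimator.profile_predict(f=f, N=100000)`.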
How to train the ndv estimator
You can use our package from PyPI directly on your datasets, because the pre-trained model is workload-agnostic. If you still want to train the model from scratch, do the following:
- Go to the model_training folder
cd model_training
- Install requirements
pip install -r requirements.txt
- Generate training data. (This uses a lot of memory.)
python training_data_generation.py
- Train model
python model_training.py
- Save the trained PyTorch model parameters in NumPy format. This generates the file model_paras.npy.
python torch2npy.py
- Test with your model parameters by specifying the path to your model_paras.npy
estimator = ndvEstimator(para_path="path/to/model_paras.npy")
Citation
If you use our work or found it useful, please cite our paper:
@article{wu2022learning,
author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
year = {2021},
issue_date = {October 2021},
publisher = {VLDB Endowment},
volume = {15},
number = {2},
issn = {2150-8097},
url = {https://doi.org/10.14778/3489496.3489508},
doi = {10.14778/3489496.3489508},
journal = {Proc. VLDB Endow.},
month = {oct},
pages = {272–284},
numpages = {13}
}