
Dataset size reduction using KNN Sampling algorithm

Project description

KNNSampler

KNNSampler is an implementation of the k-NN sampling algorithm from the research paper cited in the Acknowledgement section. It helps developers reduce the size of their datasets by sampling the "representatives" from them. NN_SCORES and MNN_SCORES, as described in that paper, are used to find these representatives. KNNSampler works in both the dynamic and the static way discussed by the authors in the paper.

Setup

  • Python 3.10

  • Requirements: numpy, pandas, scikit-learn (imported as sklearn)
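A quick way to confirm the setup before using the package is to check that the listed modules can be imported. The sketch below uses only the standard library; the helper names `missing_modules` and `python_is_recent_enough` are hypothetical, and treating Python 3.10 as a minimum version (rather than the only supported version) is an assumption.

```python
import sys
from importlib.util import find_spec

# Import names of the requirements listed above (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ("numpy", "pandas", "sklearn")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def python_is_recent_enough(minimum=(3, 10)):
    """Return True if the running interpreter is at least `minimum` (assumed lower bound)."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_is_recent_enough())
    print("Missing modules:", missing_modules())
```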

Enhancements

  • In the algorithm suggested in the research paper, MNN_SCORES are recalculated for the entire dataset after every iteration, which leads to redundant calculations. This package calculates MNN_SCORES only for the rows shortlisted using NN_SCORES, producing the same result as the original algorithm with less computation.

  • An error was found in the line "train sample = train sample ∪ X[index]" of the algorithm given in the research paper. This package replaces X[index] with X[train_index] to get the correct outcome.

  • An error was found in the Until loop condition of the algorithm in the research paper: (NN-score(X) = 0) ∨ (|train sample| ≤ k). The second condition must be |X| ≤ k, and this package uses the corrected condition.

  • The values of t, m, and s for the (t,m,s)-nets were not provided in the paper. Users are free to choose their own t, m, and s values or to use the default values provided.
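The first enhancement above can be illustrated with a small, pure-Python sketch. It is not the package's code, and the score definitions are assumptions made for illustration: here the NN score of a point is the number of other points that list it among their k nearest neighbours, and its MNN score is the number of its own neighbours that list it back. The paper's exact definitions may differ. All function names are hypothetical.

```python
import math


def knn_lists(points, k):
    """Brute-force k-NN: for each point, indices of its k nearest other points."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by distance; ties are broken by the smaller index.
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        neighbours.append(others[:k])
    return neighbours


def nn_scores(neighbours):
    """Assumed NN score: how many other points list point i as a neighbour."""
    scores = [0] * len(neighbours)
    for nbrs in neighbours:
        for j in nbrs:
            scores[j] += 1
    return scores


def mnn_score(i, neighbours):
    """Assumed MNN score: how many of i's neighbours also list i (mutual pairs)."""
    return sum(1 for j in neighbours[i] if i in neighbours[j])


def pick_representative(points, k):
    """Pick one representative, computing MNN scores only for the shortlist."""
    neighbours = knn_lists(points, k)
    scores = nn_scores(neighbours)
    best = max(scores)
    # Shortlist: rows with the highest NN score.
    shortlist = [i for i, s in enumerate(scores) if s == best]
    # MNN scores are evaluated for shortlisted rows only, not the whole dataset.
    return max(shortlist, key=lambda i: (mnn_score(i, neighbours), -i))


points = [(0, 0), (1, 0), (0, 1), (10, 12)]
print(nn_scores(knn_lists(points, 2)))  # NN score per point
print(pick_representative(points, 2))   # index of the chosen representative
```

The saving comes from the last step: the mutual-neighbour check runs over the shortlist, which is usually much smaller than the dataset.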

Important

  • The dataset passed to the sample() function must NOT CONTAIN A COLUMN NAMED "idx".

  • Warnings produced by the "drop()" function of pandas.DataFrame can be IGNORED, since they were added for debugging purposes.
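Both notes above can be handled with a small guard before sampling. The sketch below uses only the standard library; `check_no_reserved_column` and `run_quietly` are hypothetical helpers, not part of this package, and the warning suppression only has an effect if the warnings are emitted through Python's `warnings` module, which is an assumption.

```python
import warnings

RESERVED_COLUMN = "idx"  # column name that must not appear in the dataset


def check_no_reserved_column(columns, reserved=RESERVED_COLUMN):
    """Raise ValueError if the reserved column name is among `columns`."""
    if reserved in list(columns):
        raise ValueError(
            f'Dataset must not contain a column named "{reserved}"; rename it first.'
        )


def run_quietly(func, *args, **kwargs):
    """Call `func` with warnings suppressed, then restore the previous filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)
```

With a pandas DataFrame `df`, the check would be called as `check_no_reserved_column(df.columns)` before `df` is passed to sample(), and the sampling call itself could be wrapped in `run_quietly`.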


Acknowledgement

I have implemented, and added optimizations to, the original research work done by Bheekya Dharamsotu, K. Swarupa Rani, Salman Abdul Moiz, and C. Raghavendra Rao in the research paper:

B. Dharamsotu, K. S. Rani, S. Abdul Moiz and C. R. Rao, "k-NN Sampling for Visualization of Dynamic Data Using LION-tSNE," 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), Hyderabad, India, 2019, pp. 63-72, doi: 10.1109/HiPC.2019.00019.

