Dataset size reduction using KNN Sampling algorithm

KNNSampler

KNNSampler is an implementation of the k-NN sampling algorithm from the research paper cited in the Acknowledgement section. It helps developers reduce the size of their datasets by sampling "representatives" from them. These representatives are found using NN_SCORES and MNN_SCORES, as described in that paper. KNNSampler works in both the dynamic and the static way discussed by the authors.

Setup

  • Python 3.10

  • Requirements: numpy, pandas, scikit-learn (imported as sklearn)

Enhancements

  • In the algorithm proposed in the research paper, MNN_SCORES are recalculated for the entire dataset after every iteration, which leads to redundant calculations. This package calculates MNN_SCORES only for the rows shortlisted using NN_SCORES, producing the same result as the original algorithm with less work.

  • The line "train sample = train sample ∪ X[index]" in the paper's algorithm contains an error. This package replaces X[index] with X[train_index] to obtain the correct outcome.

  • The Until loop condition in the paper's algorithm, (NN − score(X) = 0) ∨ (|train sample| ≤ k), contains an error. The second condition must be |X| <= k, and the package has been changed accordingly.

  • The paper does not provide values of t, m, and s for the (t,m,s)-nets. Users are free to choose their own t, m, and s values or to use the default values provided.
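The first enhancement above can be illustrated with a minimal sketch. It is not the package's code and does not claim to reproduce the paper's exact definitions. It assumes that the NN score of a row counts how many other rows list it among their k nearest neighbours, that the MNN score counts its mutual nearest neighbours, and that the shortlist is the set of rows with the highest NN score. All function names here are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_index_lists(X, k):
    """Return an (n, k) array whose row i holds the k nearest neighbours of X[i]."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Calling kneighbors() without a query excludes each point from its own list.
    return nn.kneighbors(return_distance=False)


def nn_scores(neighbors, n):
    """Assumed NN score: how many rows list row i among their k nearest neighbours."""
    return np.bincount(neighbors.ravel(), minlength=n)


def shortlist(scores):
    """Assumed shortlist: the rows that share the highest NN score."""
    return [int(i) for i in np.flatnonzero(scores == scores.max())]


def mnn_scores(neighbors, rows):
    """Assumed MNN score, computed only for `rows` instead of the whole dataset."""
    neighbor_sets = [set(r.tolist()) for r in neighbors]
    return {
        i: sum(1 for j in neighbors[i].tolist() if i in neighbor_sets[j])
        for i in rows
    }


# Small illustrative dataset (made up for this sketch).
X = np.array([[0.0], [1.0], [2.5], [10.0]])
neighbors = knn_index_lists(X, k=1)
scores = nn_scores(neighbors, len(X))
candidates = shortlist(scores)
print(mnn_scores(neighbors, candidates))
```

Restricting the mutual-neighbour count to the shortlisted rows avoids scoring rows that cannot be selected in the current iteration, which is the saving the enhancement describes.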

Important

  • The dataset passed to the sample() function must NOT contain a column named "idx".

  • Warnings produced by the "drop()" function of pandas.DataFrame can be ignored; they were added for debugging purposes.
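One way to satisfy the "idx" restriction is to rename any clashing column before sampling. The helper below is a hypothetical sketch that uses only pandas; it is not part of the package, and the replacement column name is an arbitrary choice.

```python
import pandas as pd

RESERVED_COLUMN = "idx"  # column name the dataset must not contain


def rename_reserved_column(df, reserved=RESERVED_COLUMN):
    """Return df unchanged if `reserved` is absent, else a copy with it renamed."""
    if reserved not in df.columns:
        return df
    new_name = f"{reserved}_original"
    # Keep appending underscores until the new name does not clash either.
    while new_name in df.columns:
        new_name += "_"
    return df.rename(columns={reserved: new_name})


data = pd.DataFrame({"idx": [1, 2], "x": [0.1, 0.2]})
safe = rename_reserved_column(data)
print(list(safe.columns))
```

The original DataFrame is left untouched, so the old column name can be restored on the sampled result if needed.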

Acknowledgement

I have implemented, and added optimizations to, the original research work done by Bheekya Dharamsotu, K. Swarupa Rani, Salman Abdul Moiz, and C. Raghavendra Rao in the research paper:

B. Dharamsotu, K. S. Rani, S. Abdul Moiz and C. R. Rao, "k-NN Sampling for Visualization of Dynamic Data Using LION-tSNE," 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), Hyderabad, India, 2019, pp. 63-72, doi: 10.1109/HiPC.2019.00019.
