Missing Data Imputation for Python
missingpy
missingpy is a library for missing data imputation in Python. Its API is
consistent with scikit-learn, so users already comfortable with that
interface will find themselves in familiar terrain. Currently, the library
supports only k-Nearest Neighbors based imputation, but we plan to add other
imputation tools in the future, so please stay tuned!
Installation
pip install missingpy
Example
from missingpy import KNNImputer
# X is a 2-D array-like whose missing entries are encoded as np.nan
imputer = KNNImputer()
X_imputed = imputer.fit_transform(X)
Note: Please check out the imputer's docstring for more information.
k-Nearest Neighbors (kNN) Imputation
The KNNImputer class provides imputation for completing missing values
using the k-Nearest Neighbors approach. Each sample's missing values are
imputed using values from the n_neighbors nearest neighbors found in the
training set. Note that if a sample has more than one feature missing, it
can have a different set of n_neighbors donors for each feature being
imputed.
Each missing feature is then imputed as the average, either weighted or
unweighted, of these neighbors. Where the number of donor neighbors is less
than n_neighbors, the training set average for that feature is used for
imputation. The number of nearest neighbors available for imputation can
never exceed the total number of samples in the training set; it depends on
both the overall sample size and the number of samples excluded from the
nearest neighbor calculation because they have too many missing features
(as controlled by row_max_missing).
For more information on the methodology, see [1].
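To illustrate the difference between the unweighted and weighted averages mentioned above, here is a small NumPy sketch. The donor values and distances are made-up illustrative numbers, and it assumes that distance weighting means inverse-distance weights (the convention used by scikit-learn's neighbor estimators); check the imputer's docstring for the exact behavior of the weights parameter.

```python
import numpy as np

# Illustrative donors for one missing entry (not produced by missingpy):
# their observed values for the feature being imputed, and their
# distances to the incomplete sample.
donor_values = np.array([3.0, 5.0])
donor_distances = np.array([np.sqrt(12.0), np.sqrt(48.0)])

# Unweighted ("uniform") average: every donor counts equally.
uniform_estimate = donor_values.mean()

# Weighted average, ASSUMING inverse-distance weights: closer donors
# contribute more to the imputed value.
inverse_distance = 1.0 / donor_distances
weighted_estimate = np.average(donor_values, weights=inverse_distance)

print(uniform_estimate)   # 4.0
print(weighted_estimate)  # about 3.667, pulled toward the closer donor
```

With uniform weights the two donors contribute equally; with inverse-distance weights the closer donor (value 3) dominates, so the estimate moves below 4.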
The following snippet demonstrates how to replace missing values, encoded
as np.nan, using the mean feature value of the two nearest neighbors of the
rows that contain the missing values:
>>> import numpy as np
>>> from missingpy import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
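To make the mechanics behind this output easier to follow, the sketch below re-implements the procedure described in this section using only NumPy. It is not the library's code: the function names are hypothetical, it assumes a NaN-aware Euclidean distance that is computed over the coordinates two rows share and rescaled by the fraction of coordinates used, it supports only the unweighted average, and it ignores row_max_missing.

```python
import numpy as np


def nan_euclidean_sq(a, b):
    """Squared Euclidean distance over coordinates present in both rows,
    rescaled by (total coordinates / shared coordinates).
    ASSUMPTION: this mirrors the distance used for neighbor search."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf  # no common coordinates: treat as infinitely far
    diff = a[shared] - b[shared]
    return a.size / shared.sum() * np.dot(diff, diff)


def knn_impute_sketch(X, n_neighbors=2):
    """Fill each NaN with the unweighted mean of the feature over the
    n_neighbors closest rows that have that feature observed."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)  # fallback: training set average
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Donors: rows with feature j observed and a finite distance to row i.
        donors = [
            r for r in range(X.shape[0])
            if not np.isnan(X[r, j])
            and np.isfinite(nan_euclidean_sq(X[i], X[r]))
        ]
        if len(donors) < n_neighbors:
            out[i, j] = col_means[j]  # too few donors: use the column mean
            continue
        donors.sort(key=lambda r: nan_euclidean_sq(X[i], X[r]))
        out[i, j] = X[donors[:n_neighbors], j].mean()
    return out


nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
print(knn_impute_sketch(X, n_neighbors=2))
```

For the first row, the two closest donors for the third feature are the second and third rows (values 3 and 5), giving 4. For the third row, the closest donors for the first feature are the second and fourth rows (values 3 and 8), giving 5.5. Both match the documented output.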
References
[1] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics, Vol. 17, No. 6, 2001, pp. 520-525.