A pandas extension for cleaning datasets.
Pandas-cleaner
Pandas-cleaner is a Python package, built on top of pandas, that provides methods to detect, analyze, and clean errors in datasets with different types of data (numerical, categorical, text, datetimes...).
Features
Pandas-cleaner offers functionality to automatically:
- detect different kinds of potential errors in datasets, such as outliers, inconsistencies, typos, or wrongly typed values, given predefined rules or statistical estimations, via an easy-to-use API extending pandas,
- analyze these errors, via reports and plots, to check the validity of the set and/or decide if any correction is needed,
- clean the datasets, either by dropping the lines with errors, emptying, correcting or replacing bad values,
- reapply the same rules to any other incoming fresh data.
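To make the last point concrete, here is a minimal sketch in plain pandas of what reusing a rule on fresh data can look like: bounds are estimated once on a reference series and then applied unchanged to new values. The helpers `fit_iqr_bounds` and `flag_errors` are hypothetical names introduced for illustration; they are not part of the pandas-cleaner API, and the package may implement this differently.

```python
import pandas as pd


def fit_iqr_bounds(reference: pd.Series, k: float = 1.5) -> dict:
    """Estimate outlier bounds once, from a reference dataset (hypothetical helper)."""
    q1, q3 = reference.quantile(0.25), reference.quantile(0.75)
    iqr = q3 - q1
    return {"lower": q1 - k * iqr, "upper": q3 + k * iqr}


def flag_errors(data: pd.Series, bounds: dict) -> pd.Series:
    """Flag values outside previously estimated bounds (hypothetical helper)."""
    return ~data.between(bounds["lower"], bounds["upper"])


# Bounds are learned from the reference data only.
reference = pd.Series([1, 5, -6, 100, 10])
bounds = fit_iqr_bounds(reference)

# The same bounds are then applied to incoming data, without re-estimating them.
fresh = pd.Series([3, 50, -20, 8])
fresh_errors = flag_errors(fresh, bounds)

print(bounds)
print(fresh[fresh_errors].tolist())
```

Keeping the estimated parameters separate from the data they are applied to is what makes the rule reusable: the fresh values are judged against the reference distribution, not against themselves.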
Usage
Import the package
import pandas as pd
import pdcleaner
Create an example data series
series = pd.Series([1, 5, -6, 100, 10])
Detect the errors in the series with a given method (such as bounded, iqr, zscore, and many more, depending on the type of data):
detector = series.cleaner.detect('bounded', lower=0, upper=10)
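For readers unfamiliar with the three methods named above, the following sketch shows the textbook rules they usually refer to, written in plain pandas. It is an illustration under assumptions, not the package's implementation: the 1.5 multiplier for the IQR fences and the cutoff of 3 for the z-score are conventional defaults chosen here, and pandas-cleaner may use other values or definitions.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])

# "bounded": flag values outside a fixed closed interval [lower, upper].
bounded_errors = ~series.between(0, 10)

# "iqr": flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] (Tukey's fences).
q1, q3 = series.quantile(0.25), series.quantile(0.75)
iqr = q3 - q1
iqr_errors = ~series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# "zscore": flag values more than `threshold` standard deviations from the mean.
threshold = 3.0  # assumed cutoff, chosen for illustration
zscores = (series - series.mean()) / series.std()
zscore_errors = zscores.abs() > threshold

print(series[bounded_errors].tolist())
print(series[iqr_errors].tolist())
print(series[zscore_errors].tolist())
```

On this five-value series the three rules disagree: the fixed bounds flag both -6 and 100, the IQR fences flag only 100, and the z-score rule flags nothing, because a single extreme value in a very small sample inflates the standard deviation. The choice of method therefore matters as much as its parameters.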
Inspect the result:
detector.report()
                               Detection report
==============================================================================
Method:      bounded                  Nb samples:          5
Date:        January 24,2022          Nb errors:           2
Time:        16:06:08                 Nb rows with NaN:    0
------------------------------------------------------------------------------
lower        0                        upper                10
inclusive    both                     sided                both
==============================================================================
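As a cross-check, the three counts in the report can be reproduced in plain pandas. This is only a sketch of what the figures mean for the 'bounded' rule with both bounds included; it does not use the pandas-cleaner API.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])

# Values outside the closed interval [0, 10] count as errors.
errors = ~series.between(0, 10)

summary = {
    "Nb samples": len(series),
    "Nb errors": int(errors.sum()),
    "Nb rows with NaN": int(series.isna().sum()),
}
print(summary)
```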
Check the potential errors that have been detected
detector.detected()
2     -6
3    100
dtype: int64
Clean the detected errors from the series with a method chosen among drop, to_na, clip, replace, ...
series.cleaner.clean("drop", detector, inplace=True)
series
0     1
1     5
4    10
dtype: int64
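The cleaning methods listed above differ in what they do with a flagged value. The sketch below shows one plausible plain-pandas reading of each name; it is an assumption about their general effect, not the package's code. In particular, filling with the median of the valid values is a choice made here for illustration, and the actual replace method may take a different fill value.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])
errors = ~series.between(0, 10)  # same rule as the 'bounded' example

# "drop": remove the flagged rows; the original index labels are kept.
dropped = series[~errors]

# "to_na": keep the rows but empty the flagged values (dtype becomes float).
emptied = series.mask(errors)

# "clip": pull out-of-range values back to the nearest bound.
clipped = series.clip(lower=0, upper=10)

# "replace": substitute a fill value; the median of valid values is assumed here.
fill_value = int(series[~errors].median())
replaced = series.mask(errors, fill_value)

print(dropped.tolist())
print(emptied.tolist())
print(clipped.tolist())
print(replaced.tolist())
```

Dropping shortens the series, which matters when rows must stay aligned with other columns; the other three strategies preserve its length.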
Contributing to pandas-cleaner
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
Issues and bugs can be reported at https://github.com/eurodecision/pandas-cleaner/issues