Skip to main content

A pandas extension for cleaning datasets.

Project description

Pandas-cleaner

Documentation Status

Pandas-cleaner is a Python package, built on top of pandas, that provides methods detect, analyze and clean errors in datasets with different types of data (numerical, categorical, text, datetimes...).

Features

Pandas-cleaner offers functionnalities to automatically :

  • detect different kind of potential errors in datasets such as outliers, inconsistencies, typos, wrong-typed ..., given predefined rules or statistiscal estimations, via an easy-to-use API extending pandas,
  • analyze these errors, via reports and plots, to check the validity of the set and/or decide if any correction is needed,
  • clean the datasets, either by dropping the lines with errors, emptying, correcting or replacing bad values,
  • reapply the same rules to any other incoming fresh data.

Usage

Import the package

import pandas as pd
import pdcleaner

Create an example data series

series = pd.Series([1, 5, -6, 100, 10])

Detect the errors in the series with a given method (such as bounded, iqr, zscore and many more depending the type of data...)

detector = series.cleaner.detect('bounded', lower=0, upper=10)

Inspect the result:

detector.report()
                                 Detection report                               
==============================================================================
Method:                      bounded      Nb samples:                        5
Date:                January 24,2022      Nb errors:                         2
Time:                       16:06:08      Nb rows with NaN:                  0
------------------------------------------------------------------------------
lower                              0      upper                             10
inclusive                       both      sided                           both
==============================================================================

Check the potential errors that have been detected

detector.detected()
 2     -6
 3    100
 dtype: int64

Clean the detected errors from the series using the chosen method among drop, to_na, clip , replace...

series.cleaner.clean("drop", detector, inplace=True)
   series
 0      1
 1      5
 4     10
 dtype: int64

Contributing to pandas-cleaner

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Issues and bugs can be reported at https://github.com/eurodecision/pandas-cleaner/issues

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandas-cleaner-0.0.3.tar.gz (34.7 kB view hashes)

Uploaded Source

Built Distribution

pandas_cleaner-0.0.3-py3-none-any.whl (44.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page