A pandas extension for cleaning datasets.

Pandas-cleaner

Pandas-cleaner is a Python package, built on top of pandas, that provides methods to detect, analyze, and clean errors in datasets with different types of data (numerical, categorical, text, datetimes...).

Features

Pandas-cleaner offers functionality to automatically:

  • detect different kinds of potential errors in datasets, such as outliers, inconsistencies, typos, and wrongly typed values, using predefined rules or statistical estimates, via an easy-to-use API that extends pandas,
  • analyze these errors, via reports and plots, to check the validity of the dataset and/or decide whether any correction is needed,
  • clean the datasets, either by dropping the rows with errors or by emptying, correcting, or replacing bad values,
  • reapply the same rules to any other incoming fresh data.
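
The last point, reapplying rules to fresh data, can be illustrated in plain pandas. The sketch below does not use the pandas-cleaner API: `fit_iqr_bounds` and `flag_out_of_bounds` are hypothetical helper names introduced only for illustration, and the 1.5 × IQR fence is a common convention assumed here, not a documented default of the package.

```python
import pandas as pd


def fit_iqr_bounds(values, k=1.5):
    """Estimate lower/upper fences from reference data (hypothetical helper)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_out_of_bounds(values, bounds):
    """Return a boolean mask that is True where a value lies outside the bounds."""
    lower, upper = bounds
    return ~values.between(lower, upper)


# Learn the rule once, on reference data...
reference = pd.Series([1, 5, 6, 7, 10])
bounds = fit_iqr_bounds(reference)

# ...then reuse the stored bounds on new data without re-estimating them.
fresh = pd.Series([4, 50, 8])
flags = flag_out_of_bounds(fresh, bounds)
print(fresh[flags])
```

Keeping the estimated bounds separate from the data they were learned on is what makes the rule reusable: the fresh batch is judged against the reference statistics, not its own.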

Usage

Import the package

import pandas as pd
import pdcleaner

Create an example data series

series = pd.Series([1, 5, -6, 100, 10])

Detect the errors in the series with a given method (such as bounded, iqr, zscore, and many more, depending on the type of data):

detector = series.cleaner.detect('bounded', lower=0, upper=10)
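
For intuition about what statistical methods such as iqr and zscore estimate, here is a plain-pandas sketch of the textbook definitions applied to the same series. It is not pandas-cleaner's implementation, and the thresholds (1.5 × IQR, |z| > 3) are conventional values assumed here rather than taken from the package.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])

# IQR rule: flag values beyond 1.5 * IQR from the first and third quartiles.
q1, q3 = series.quantile(0.25), series.quantile(0.75)
iqr = q3 - q1
iqr_flags = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

# z-score rule: flag values more than 3 sample standard deviations from the mean.
z = (series - series.mean()) / series.std()
z_flags = z.abs() > 3

print(series[iqr_flags])  # only the value 100 is flagged
print(series[z_flags])    # nothing is flagged on this tiny sample
```

The two rules disagree here: with only five values, the outlier inflates the standard deviation so much that no z-score can reach 3, whereas the quartile-based rule still isolates 100. That is one reason to choose the method according to the data at hand.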

Inspect the result:

detector.report()
                                 Detection report                               
==============================================================================
Method:                      bounded      Nb samples:                        5
Date:                January 24,2022      Nb errors:                         2
Time:                       16:06:08      Nb rows with NaN:                  0
------------------------------------------------------------------------------
lower                              0      upper                             10
inclusive                       both      sided                           both
==============================================================================

Check the potential errors that have been detected

detector.detected()
2     -6
3    100
dtype: int64
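
The same two values can be recovered with plain pandas, which may help when sanity-checking a rule. This sketch assumes that the bounded method flags values outside the closed interval [lower, upper], as the "inclusive: both" entry of the report suggests; it does not call pandas-cleaner.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])

# Series.between includes both endpoints by default, so 0 and 10 count as valid.
outside = ~series.between(0, 10)
flagged = series[outside]
print(flagged)
```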

Clean the detected errors from the series using a method chosen among drop, to_na, clip, replace...

series.cleaner.clean("drop", detector, inplace=True)
series
0      1
1      5
4     10
dtype: int64
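
To compare the cleaning strategies named above, the following sketch shows rough plain-pandas analogues of drop, to_na, clip, and replace. The exact behaviour of each pandas-cleaner method is not confirmed here; in particular, replacing with the median of the valid values is an assumption chosen only for illustration.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])
errors = ~series.between(0, 10)  # same bounded rule as above

# drop: remove the flagged rows entirely.
dropped = series[~errors]

# to_na: keep the rows but blank the flagged values (dtype becomes float).
as_na = series.mask(errors)

# clip: pull out-of-range values back to the nearest bound.
clipped = series.clip(lower=0, upper=10)

# replace: substitute flagged values, here with the median of the valid ones.
replaced = series.mask(errors, series[~errors].median())

for name, result in [("drop", dropped), ("to_na", as_na),
                     ("clip", clipped), ("replace", replaced)]:
    print(name, result.tolist())
```

Dropping shortens the series, while the other three preserve its length and index, which matters when the series is a column that must stay aligned with the rest of a DataFrame.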

Contributing to pandas-cleaner

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Issues and bugs can be reported at https://github.com/eurodecision/pandas-cleaner/issues
