Python module to report, clean, and optimize Pandas Dataframes effectively

Project description

clean_df

Python module to report, clean, and optimize Pandas Dataframes effectively.

Full Documentation Here.

Description and Features

The first step of any data analysis project is to check and clean the data. In this module I implemented efficient code that can:

  • Automatically drop columns that contain only a single unique value (such columns carry no information).

  • Report on your Pandas DataFrame to help you decide on actions. The report shows:

    1. Duplicated rows report.

    2. Memory optimization report (columns’ data types).

    3. Columns to convert to categorical report.

    4. Outliers report.

    5. Missing values report.

  • Clean the dataframe by dropping columns that have a high ratio of missing values, rows with missing values, and duplicated rows in the dataframe.

  • Optimize the dataframe by converting columns to the desired data type and converting categorical columns to ‘category’ data type.
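
To make the first feature concrete, here is a minimal sketch in plain pandas of what dropping single-value columns can look like. It is not the module's actual implementation: the sample data and the choice to count NaN as a value (`dropna=False`) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data: "country" holds the same value in every row.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["JO", "JO", "JO", "JO"],
    "score": [0.5, 0.7, 0.7, 0.9],
})

# Count distinct values per column; dropna=False counts NaN as a value too
# (an assumption, the module may treat NaN differently).
n_unique = df.nunique(dropna=False)

# Columns with exactly one distinct value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()
reduced = df.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```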

Installation

To install clean_df, run this command in your terminal:

$ pip install clean_df

For more information on installation details for this project, please see the docs/installation.rst file.

Usage

This module is easy to use. For a full, detailed example, please see the docs/usage.rst file.

Importing the module

from clean_df import CleanDataFrame

Defining the class

Pass your pandas dataframe to the CleanDataFrame class:

cdf = CleanDataFrame(
        df=df,             # the dataframe to be cleaned
        max_num_cat=5      # maximum number of unique values in a column for it to be
                           # converted to categorical datatype, default is 5
        )
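
As a rough illustration of what a threshold like max_num_cat expresses, the sketch below finds low-cardinality columns with plain pandas and converts them to the category data type. This is not the module's code: the sample data, the restriction to non-numeric columns, and the `<=` comparison are all assumptions.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Amman", "Irbid", "Amman", "Zarqa", "Amman", "Irbid"],
    "name": ["a", "b", "c", "d", "e", "f"],
    "age": [31, 45, 27, 52, 38, 40],
})

max_num_cat = 3  # illustrative threshold, not the module default

# Assumption: only non-numeric columns are considered for conversion.
text_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Assumption: a column qualifies when its unique count is <= the threshold.
candidates = [c for c in text_cols if df[c].nunique() <= max_num_cat]

# Convert the qualifying columns to the memory-friendly 'category' dtype.
converted = df.astype({c: "category" for c in candidates})

print(candidates)
print(converted.dtypes)
```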

Reporting

Call the report method to see a full report about the dataframe (duplications, columns whose data types can be optimized, categorical columns, outliers, and missing values):

cdf.report(
        show_matrix=True,   # show matrix missing values (from missingno package), default is True
        show_heat=True,     # show heat missing values (from missingno package), default is True
        matrix_kws={},      # if need to pass any arguments to matrix plot, default is {}
        heat_kws={}         # if need to pass any arguments to heat plot, default is {}
        )
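
For readers who want to see the kind of quantities such a report summarizes, here is a small plain-pandas sketch that counts duplicated rows, missing values, and outliers. It does not reproduce the module's output. In particular, the 1.5 × IQR rule used for outliers is an assumption, since this README does not say how outliers are detected.

```python
import pandas as pd

# Hypothetical sample data with one duplicate row, one NaN, and one extreme value.
df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, 3.0, 100.0, None],
    "b": ["x", "y", "y", "z", "x", "x"],
})

# Rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Missing values per column.
missing = df.isna().sum()

# Outliers per numeric column using the 1.5 * IQR rule (an assumed method).
numeric = df.select_dtypes(include="number")
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
n_outliers = is_outlier.sum()

print(n_duplicates)
print(missing)
print(n_outliers)
```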

Cleaning

Call the clean method to drop columns with a high ratio of missing values, duplicated rows, and rows with missing values:

cdf.clean(
        min_missing_ratio=0.05,    # the minimum ratio of missing values to drop a column, default is 0.05
        drop_nan=True,             # if True, drop the rows with missing values after dropping columns
                                   # with missing values above min_missing_ratio
        drop_kws={},               # if need to pass any arguments to pd.DataFrame.drop(), default is {}
        drop_duplicates_kws={}     # same as drop_kws, but for the drop_duplicates function
        )
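
The cleaning steps described above can be sketched in plain pandas as follows. This is only an approximation of the idea, not the module's implementation: the sample data, the strict `>` comparison against the threshold, and dropping duplicates last are assumptions, and the threshold of 0.5 is chosen just to keep the example small.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "mostly_empty": [None, None, None, 4.0, None],
    "x": [1, 2, 2, 3, 4],
    "y": ["a", "b", "b", None, "d"],
})

min_missing_ratio = 0.5  # illustrative value, not the module default

# Fraction of missing values in each column.
missing_ratio = df.isna().mean()

# Assumption: a column is dropped when its ratio is strictly above the threshold.
to_drop = missing_ratio[missing_ratio > min_missing_ratio].index

# Drop those columns first, then rows with missing values, then duplicated rows.
cleaned = df.drop(columns=to_drop).dropna().drop_duplicates()

print(cleaned)
```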

Optimizing

Call the optimize method to optimize the dataframe by changing columns’ data types based on their values for maximum memory savings:

cdf.optimize()
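
To show why changing data types saves memory, the sketch below downcasts numeric columns with pandas' own `pd.to_numeric` and compares memory usage before and after. It illustrates the general technique only. The module's optimize method may choose data types differently, and the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: pandas stores these as 64-bit types by default.
df = pd.DataFrame({
    "count": [1, 2, 3, 4],
    "price": [1.5, 2.25, 3.0, 4.75],
})
before = df.memory_usage(deep=True).sum()

optimized = df.copy()
# Downcast to the smallest integer / float type that can hold the values.
optimized["count"] = pd.to_numeric(optimized["count"], downcast="integer")
optimized["price"] = pd.to_numeric(optimized["price"], downcast="float")
after = optimized.memory_usage(deep=True).sum()

print(optimized.dtypes)
print(before, after)
```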

Accessing the Cleaned DataFrame

cdf.df

Contributing

See CONTRIBUTING.rst for contribution details. Feel free to contact me about any subject through my:

Also, you are welcome to visit my personal blog.

License

Free software: MIT license.

Documentation

Credits

History

0.3.0 (2023-08-23)

  • Improve performance when calling the report method.

  • The pytest suite now covers all methods in the module.

0.2.3 (2023-03-04)

  • Improve memory consumption and module performance.

0.2.2 (2023-03-03)

  • Fix a bug that caused a “dict_keys” error in some special cases.

0.2.1 (2023-03-03)

  • Improve module performance.

0.2.0 (2023-03-02)

  • Add a new report for categorical columns.

  • Make the module more efficient.

0.1.1 (2023-02-27)

  • Rectify and organize documentation.

  • Add tests to GitHub Actions.

0.1.0 (2023-02-27)

  • First release on PyPI.

Download files

Source Distribution

clean_df-0.3.0.tar.gz (411.0 kB view details)

Uploaded Source

Built Distribution

clean_df-0.3.0-py2.py3-none-any.whl (11.7 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file clean_df-0.3.0.tar.gz.

File metadata

  • Download URL: clean_df-0.3.0.tar.gz
  • Upload date:
  • Size: 411.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for clean_df-0.3.0.tar.gz
Algorithm Hash digest
SHA256 defe0284ddf9352d6d6ced16e6e9408337561f018bd8bf3b365a63511028360b
MD5 cffeb5992f714964a78fe6f14e615c33
BLAKE2b-256 ea09522afe48a2f2bc41dedce50a7fcd00398114ade3bec40fd2a0285179cf41

File details

Details for the file clean_df-0.3.0-py2.py3-none-any.whl.

File metadata

  • Download URL: clean_df-0.3.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 11.7 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for clean_df-0.3.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 26085edad095995e96f12c6e9e4ee523ebf5477103e22f474dc2a3f731bb682d
MD5 67a8f37f6be096e99af1d28ce3e8aeea
BLAKE2b-256 61324712907b66148e9977ccfa76efb1b441a2664d759e99ed3e47f5994ad786
