clean-df

Python module to report, clean, and optimize Pandas Dataframes effectively

clean_df

Full Documentation Here.

Description and Features

The first step of any data analysis project is to check and clean the data. This module implements efficient code that can:

Report on your pandas DataFrame to help you decide which actions to take. The report shows:
1. The columns that contain a single unique value.
2. The duplicated rows.
3. The columns whose data types can be changed to save memory (based on the columns' values).
4. The outliers.
5. The missing values (table, matrix, and heatmap).
Clean the dataframe by dropping columns that have a high ratio of missing values, rows with missing values, and duplicated rows.
Optimize the dataframe by converting columns to the desired data type and converting categorical columns to the 'category' data type.
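
To make the report checks more concrete, here is a minimal sketch of three of them (single-value columns, duplicated rows, and missing-value ratios) written with plain pandas. It is not the code that clean_df runs internally; the function name quick_report and the sample data are hypothetical and only illustrate the kind of information the report surfaces.

```python
import numpy as np
import pandas as pd


def quick_report(df: pd.DataFrame) -> dict:
    """Collect a few basic data-quality checks using plain pandas.

    Hypothetical helper for illustration; not part of clean_df.
    """
    # Columns where every row holds the same value (NaN counts as a value).
    single_value_cols = [
        col for col in df.columns if df[col].nunique(dropna=False) <= 1
    ]
    # Number of rows that exactly repeat an earlier row.
    n_duplicated = int(df.duplicated().sum())
    # Fraction of missing values per column, keeping only columns with gaps.
    missing_ratio = df.isna().mean()
    missing_ratio = missing_ratio[missing_ratio > 0].to_dict()
    return {
        "single_value_columns": single_value_cols,
        "duplicated_rows": n_duplicated,
        "missing_ratio": missing_ratio,
    }


# Small made-up dataframe with one constant column, one repeated row,
# and one missing value.
df = pd.DataFrame(
    {
        "constant": [1, 1, 1, 1],
        "city": ["a", "b", "a", "a"],
        "score": [0.5, np.nan, 0.5, 0.7],
    }
)
summary = quick_report(df)
print(summary)
```

The outlier and plotting parts of the report are left out here because the description above does not say how they are computed.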

Installation

To install clean_df, run this command in your terminal:

$ pip install clean_df

For more information on installation details for this project, please see the docs/installation.rst file.

Usage

This module is easy to use. For a full, detailed example, please see the docs/usage.rst file.

Importing the module

from clean_df import CleanDataFrame

Defining the class

Pass your pandas DataFrame to the CleanDataFrame class:

cdf = CleanDataFrame(
        df=df,             # the dataframe to be cleaned
        max_num_cat=5      # maximum number of unique values in a column to be
        )                  # converted to categorical datatype, default is 5

Reporting

Call the report method to see a full report about the dataframe (single-value columns, duplicated rows, columns whose data types can be optimized, outliers, and missing values):

cdf.report(
        show_matrix=True,   # show matrix missing values (from missingno package), default is True
        show_heat=True,     # show heat missing values (from missingno package), default is True
        matrix_kws={},      # if need to pass any arguments to matrix plot, default is {}
        heat_kws={}         # if need to pass any arguments to heat plot, default is {}
        )

Cleaning

Call the clean method to drop single-value columns, columns with a high ratio of missing values, duplicated rows, and rows with missing values:

cdf.clean(
        min_missing_ratio=0.05,    # the minimum ratio of missing values to drop a column, default is 0.05
        drop_nan=True,             # if True, drop the rows with missing values after dropping columns
                                   # with a missing ratio above min_missing_ratio
        drop_kws={},               # if need to pass any arguments to pd.DataFrame.drop(), default is {}
        drop_duplicates_kws={}     # same as drop_kws, but for the drop_duplicates function
        )
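
For readers who want to see what these cleaning steps amount to, the following is a rough pandas-only sketch. It assumes a particular order of steps and a strict "greater than" comparison against min_missing_ratio; neither is confirmed by this README, and the function clean_sketch is hypothetical rather than the library's implementation.

```python
import numpy as np
import pandas as pd


def clean_sketch(
    df: pd.DataFrame, min_missing_ratio: float = 0.05, drop_nan: bool = True
) -> pd.DataFrame:
    """Approximate the cleaning steps with plain pandas (illustrative only)."""
    # 1. Drop columns that hold a single value.
    single = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    out = df.drop(columns=single)
    # 2. Drop columns whose missing ratio is above the threshold (assumed '>').
    ratio = out.isna().mean()
    out = out.drop(columns=ratio[ratio > min_missing_ratio].index)
    # 3. Drop rows that exactly repeat an earlier row.
    out = out.drop_duplicates()
    # 4. Optionally drop the rows that still contain missing values.
    if drop_nan:
        out = out.dropna()
    return out.reset_index(drop=True)


# Made-up data: one constant column, one mostly empty column,
# one repeated row, and one row with a missing value.
df = pd.DataFrame(
    {
        "constant": ["x"] * 6,
        "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0],
        "value": [1.0, 1.0, 2.0, np.nan, 3.0, 4.0],
        "label": ["a", "a", "b", "c", "d", "e"],
    }
)
cleaned = clean_sketch(df, min_missing_ratio=0.5)
print(cleaned)
```

A threshold of 0.5 is used in the example so that only the mostly empty column is dropped; with the default of 0.05, the "value" column (one missing value in six rows) would be dropped as well.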

Optimizing

Call the optimize method to optimize the dataframe by changing each column's data type based on its values, for maximum memory savings:

cdf.optimize()
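
The idea behind this kind of optimization can be sketched with plain pandas: text columns with few unique values become 'category', and numeric columns are downcast to the smallest type that fits. This is an illustration under assumptions (only text columns are treated as categorical candidates, and optimize_sketch is a hypothetical name), not the logic clean_df actually applies.

```python
import pandas as pd
from pandas.api import types as ptypes


def optimize_sketch(df: pd.DataFrame, max_num_cat: int = 5) -> pd.DataFrame:
    """Shrink a dataframe's memory footprint (illustrative only)."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        is_text = ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series)
        if is_text and series.nunique() <= max_num_cat:
            # Few distinct text values: store them once as categories.
            out[col] = series.astype("category")
        elif ptypes.is_integer_dtype(series):
            # Use the smallest integer type that can hold the values.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif ptypes.is_float_dtype(series):
            # Use float32 when the values fit.
            out[col] = pd.to_numeric(series, downcast="float")
    return out


# Made-up data: a low-cardinality text column and a small-range integer column.
df = pd.DataFrame(
    {
        "color": ["red", "blue", "red", "blue"] * 250,
        "count": list(range(1000)),
    }
)
optimized = optimize_sketch(df)

before = df.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(optimized.dtypes)
print(f"memory before: {before} bytes, after: {after} bytes")
```

Comparing memory_usage(deep=True) before and after is a simple way to check how much a given dataframe benefits from the conversion.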

Accessing the Cleaned DataFrame

cdf.df

Contributing

See the CONTRIBUTING.rst for contribution details. Feel free to contact me about any subject.

You are also welcome to visit my personal blog.

License

Free software: MIT license.

Documentation

The full documentation is hosted on my website and on ReadTheDocs.
The source code is available on GitHub.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
I also learned a lot from several additional resources.

History

0.2.0 (2023-03-02)

Add a new report for categorical columns.
Make the module more efficient.

0.1.1 (2023-02-27)

Rectify and organize documentation.
Add tests to GitHub Actions.

0.1.0 (2023-02-27)

First release on PyPI.
