Python module to report, clean, and optimize Pandas Dataframes effectively
clean_df
Python module to report, clean, and optimize Pandas Dataframes effectively.
Full Documentation Here.
Description and Features
The first step of any data analysis project is to check and clean the data. This module implements efficient code that can:
Automatically drop columns that contain only one unique value (such columns carry no information, so they are dropped).
Report on your pandas DataFrame so you can decide which actions to take. The report shows:
Duplicated rows report.
Column data types (memory optimization) report.
Categorical conversion candidates report.
Outliers report.
Missing values report.
Clean the dataframe by dropping columns that have a high ratio of missing values, rows with missing values, and duplicated rows in the dataframe.
Optimize the dataframe by converting columns to the desired data type and converting categorical columns to ‘category’ data type.
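The first and fourth checks above can be approximated with plain pandas. The following is only a sketch of the idea and not the module's actual code: the helper names single_value_columns and categorical_candidates are hypothetical, and the rule "at most max_num_cat distinct values" is an assumption based on the max_num_cat parameter described under Usage.

```python
import pandas as pd


def single_value_columns(df: pd.DataFrame) -> list:
    """Return columns that hold at most one distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


def categorical_candidates(df: pd.DataFrame, max_num_cat: int = 5) -> list:
    """Return columns with at most `max_num_cat` distinct non-null values."""
    return [col for col in df.columns if df[col].nunique() <= max_num_cat]


df = pd.DataFrame(
    {
        "country": ["US", "US", "US", "US"],  # constant column, carries no information
        "grade": ["a", "b", "a", "b"],        # low-cardinality column
        "score": [1.5, 2.5, 3.5, 4.5],        # every value is distinct
    }
)

constant = single_value_columns(df)
trimmed = df.drop(columns=constant)
candidates = categorical_candidates(trimmed, max_num_cat=2)

print("dropped:", constant)
print("categorical candidates:", candidates)
```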
Installation
To install clean_df, run this command in your terminal:
$ pip install clean_df
For more information on installation details for this project, please see the docs/installation.rst file.
Usage
This module is very easy to use. For a fully detailed example, please see the docs/usage.rst file.
Importing the module
from clean_df import CleanDataFrame
Defining the class
Pass your pandas DataFrame to the CleanDataFrame class:
cdf = CleanDataFrame(
    df=df,          # the dataframe to be cleaned
    max_num_cat=5,  # maximum number of unique values in a column to be
                    # converted to categorical datatype, default is 5
)
Reporting
Call the report method to see a full report about the dataframe (duplicated rows, columns whose data types can be optimized, categorical columns, outliers, and missing values):
cdf.report(
    show_matrix=True,  # show the missing values matrix (from the missingno package), default is True
    show_heat=True,    # show the missing values heatmap (from the missingno package), default is True
    matrix_kws={},     # extra arguments to pass to the matrix plot, default is {}
    heat_kws={},       # extra arguments to pass to the heat plot, default is {}
)
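To give a feel for what such a report contains, here is a minimal sketch built on plain pandas. It is not the module's implementation: the function quick_report is hypothetical, and the 1.5 * IQR rule used for outliers is an assumption, since this README does not say which outlier method clean_df uses.

```python
import pandas as pd


def quick_report(df: pd.DataFrame) -> dict:
    """Collect duplicate, missing-value, and outlier counts with plain pandas."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Assumed rule: a value is an outlier when it lies more than
    # 1.5 * IQR below the first quartile or above the third quartile.
    outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return {
        "duplicated_rows": int(df.duplicated().sum()),
        "missing_ratio": df.isna().mean().to_dict(),
        "outlier_counts": outlier_mask.sum().to_dict(),
    }


df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 100, 1],
        "y": [1.0, None, 3.0, 4.0, 5.0, 1.0],
    }
)

report = quick_report(df)
for name, value in report.items():
    print(name, value)
```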
Cleaning
Call the clean method to drop columns with a high ratio of missing values, duplicated rows, and rows with missing values:
cdf.clean(
    min_missing_ratio=0.05,  # the minimum ratio of missing values to drop a column, default is 0.05
    drop_nan=True,           # if True, drop the rows with missing values after dropping columns
                             # with missing ratios above min_missing_ratio
    drop_kws={},             # extra arguments to pass to pd.DataFrame.drop(), default is {}
    drop_duplicates_kws={},  # same as drop_kws, but for the drop_duplicates function
)
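The cleaning steps described here map onto a few standard pandas calls. The sketch below shows that mapping under stated assumptions: clean_sketch is a hypothetical helper, and whether the module compares the missing ratio with > or >= is not documented here, so the strict comparison is a guess.

```python
import pandas as pd


def clean_sketch(
    df: pd.DataFrame,
    min_missing_ratio: float = 0.05,
    drop_nan: bool = True,
) -> pd.DataFrame:
    """Drop sparse columns, then rows with NaN, then duplicated rows."""
    # Share of missing values per column, between 0.0 and 1.0.
    missing_ratio = df.isna().mean()
    # Assumption: a column is dropped when its ratio is strictly above the threshold.
    sparse_columns = missing_ratio[missing_ratio > min_missing_ratio].index
    out = df.drop(columns=sparse_columns)
    if drop_nan:
        out = out.dropna()
    return out.drop_duplicates().reset_index(drop=True)


df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3],
        "b": [None, None, None, 4.0],  # 75% missing
        "c": ["x", "x", "y", None],    # 25% missing
    }
)

cleaned = clean_sketch(df, min_missing_ratio=0.5)
print(cleaned)
```

With a threshold of 0.5, column b is removed, the row with a missing c value is dropped, and the two identical rows collapse into one.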
Optimizing
Call the optimize method to optimize the dataframe by changing each column's data type based on its values for maximum memory savings:
cdf.optimize()
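For readers curious about where the memory savings come from, the following sketch downcasts numeric columns and converts low-cardinality text columns to the category dtype using plain pandas. It illustrates the general technique only: optimize_sketch is a hypothetical helper, and the module's own rules for choosing data types may differ.

```python
import pandas as pd
from pandas.api import types as ptypes


def optimize_sketch(df: pd.DataFrame, max_num_cat: int = 5) -> pd.DataFrame:
    """Return a copy of `df` with smaller dtypes where the values allow it."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if ptypes.is_integer_dtype(series):
            # Pick the smallest integer type that can hold every value.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif ptypes.is_float_dtype(series):
            # Downcast to float32 when the values survive the conversion.
            out[col] = pd.to_numeric(series, downcast="float")
        elif (
            ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series)
        ) and series.nunique() <= max_num_cat:
            # Text columns with few distinct values are stored once per category.
            out[col] = series.astype("category")
    return out


df = pd.DataFrame(
    {
        "id": range(1000),
        "price": [i / 2 for i in range(1000)],
        "color": ["red", "green", "blue", "red"] * 250,
    }
)

optimized = optimize_sketch(df)
before = df.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(optimized.dtypes)
print(f"memory: {before} bytes -> {after} bytes")
```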
Accessing the Cleaned DataFrame
cdf.df
Contributing
See the CONTRIBUTING.rst file for contribution details. Feel free to contact me about any subject.
You are also welcome to visit my personal blog.
License
Free software: MIT license.
Documentation
The full documentation is hosted on my website, and on ReadTheDocs.
The source code is available on GitHub.
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
I also learned a lot from several additional resources.
History
0.2.1 (2023-03-03)
Improve module performance.
0.2.0 (2023-03-02)
Add a new report for categorical columns.
Make the module more efficient.
0.1.1 (2023-02-27)
Rectify and organize documentation.
Provide test to GitHub Actions.
0.1.0 (2023-02-27)
First release on PyPI.
Hashes for clean_df-0.2.1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ff11cb31a0fc43d3e6dd1ea6e13a1039e58461827feded4c7a4173e54098a647
MD5 | 733bc2cfe3a8247c10b1e4cf282a8b07
BLAKE2b-256 | 10ff6f3f584c3f2712cbca2324c93b8f6f3165178903b361002e9544f7c56443
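If you download the wheel manually, you can compare its SHA256 digest with the value listed above. The sketch below uses only the Python standard library; the helper names sha256_of and matches_published_hash are hypothetical, and the file path in the usage comment is an assumption about where you saved the file.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example usage (the local path is an assumption about your download location):
# ok = matches_published_hash(
#     "clean_df-0.2.1-py2.py3-none-any.whl",
#     "ff11cb31a0fc43d3e6dd1ea6e13a1039e58461827feded4c7a4173e54098a647",
# )
```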