A Python package for exploring and cleaning Pandas DataFrames
PandasExplorer
Overview
The pandasdataexplorer.py module is part of the PandasExplorer package. It provides the PandasDataExplorer class, which bundles data preprocessing, exploration, and visualization utilities for Pandas DataFrames. Its methods cover common cleaning, transformation, and analysis tasks such as renaming columns, handling missing values, and finding outliers, as well as more advanced ones such as generating profile reports and plotting data distributions.
Table of Contents
Methods
Column Operations
- clean_columns():
  - Cleans column names by making them lowercase and replacing spaces with underscores.
- rename_columns(cols: list, new_names: list):
  - Renames specified columns by their indices.
  - Parameters:
    - cols: A list of column indices to rename.
    - new_names: A list of new column names.
- remove_columns(col_indices):
  - Removes columns from the DataFrame by their indices.
  - Parameters:
    - col_indices: A list of column indices to remove.
- change_column_dtype(col_number, type='int64'):
  - Changes the data type of a specified column by its index.
  - Parameters:
    - col_number: The index of the column.
    - type: The target data type (default is int64).
- copy():
  - Creates a copy of the DataFrame.
- save_copy(filename: str):
  - Saves the DataFrame copy to a CSV file.
  - Parameters:
    - filename: The path to the CSV file where the DataFrame will be saved.
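The column operations above all address columns by position rather than by label. The following is a minimal sketch of the equivalent steps in plain pandas. It is not the package's own implementation, and the sample DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "First Name": ["Ada", "Alan"],
    "Total Sales": [10.0, 20.0],
    "Notes": ["a", "b"],
})

# Lowercase the column names and replace spaces with underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Rename columns by position: look up the current label at each index.
cols, new_names = [0, 1], ["name", "sales"]
df = df.rename(columns={df.columns[i]: new for i, new in zip(cols, new_names)})

# Remove columns by position.
df = df.drop(columns=df.columns[[2]])

# Change the dtype of a single column, again selected by position.
sales_label = df.columns[1]
df[sales_label] = df[sales_label].astype("int64")
```

Resolving positions to labels through df.columns keeps every step compatible with the usual label-based pandas calls.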
Data Cleaning
- clean_string_columns():
  - Trims and converts all string (object) columns to lowercase.
- clean_float_columns():
  - Rounds all float columns to two decimal places.
- parse_date_columns():
  - Attempts to convert string columns to datetime based on several common formats.
- parse_int_columns():
  - Attempts to convert string columns to integers or floats based on their contents.
- drop_duplicate_rows():
  - Removes duplicate rows, keeping only the first occurrence.
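As a rough illustration of what these cleaning steps amount to, here is a sketch in plain pandas. It is an approximation under stated assumptions, not the package's code: the DATE_FORMATS list, the text_columns helper, and the sample data are all hypothetical, and the source does not say which date formats the package actually tries.

```python
import pandas as pd

# Assumed formats; the package's actual list is not documented here.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]

def text_columns(frame: pd.DataFrame) -> list:
    """Return the labels of columns that currently hold strings."""
    return [c for c in frame.columns if pd.api.types.is_string_dtype(frame[c])]

df = pd.DataFrame({
    "city": ["  Paris ", "ROME", "  Paris "],
    "price": [1.23456, 2.0, 1.23456],
    "qty": ["1", "2", "1"],
    "joined": ["2024-01-05", "2024-02-10", "2024-01-05"],
})

# Trim whitespace and lowercase every string column.
for col in text_columns(df):
    df[col] = df[col].str.strip().str.lower()

# Round float columns to two decimal places.
float_cols = df.select_dtypes(include="float").columns
df[float_cols] = df[float_cols].round(2)

# Try each date format in turn; keep the first one that parses the whole column.
for col in text_columns(df):
    for fmt in DATE_FORMATS:
        try:
            df[col] = pd.to_datetime(df[col], format=fmt)
            break
        except (ValueError, TypeError):
            continue

# Convert the remaining string columns to numbers where every value parses.
for col in text_columns(df):
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # leave non-numeric text untouched

# Drop duplicate rows, keeping the first occurrence.
df = df.drop_duplicates(keep="first").reset_index(drop=True)
```

The string columns are recomputed before the numeric pass so that columns already converted to datetime are not touched again.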
Data Exploration
- show(rows=5):
  - Displays the first n rows of the DataFrame.
  - Parameters:
    - rows: Number of rows (n) to display (default is 5).
- get_info():
  - Returns basic information about the DataFrame, including column types and non-null counts.
- find_outliers(column_number):
  - Finds outliers in the specified column using the IQR (Interquartile Range) method.
  - Parameters:
    - column_number: The index of the column to check for outliers.
Outlier Handling
- drop_outliers(column_number):
  - Removes outliers in a specified column using the IQR method.
  - Parameters:
    - column_number: The index of the column where outliers should be dropped.
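Both find_outliers and drop_outliers rely on the IQR method. The sketch below shows the conventional form of that rule in plain pandas: a value is flagged when it lies more than 1.5 times the interquartile range below the first quartile or above the third. The 1.5 multiplier and the helper name iqr_outlier_mask are assumptions for illustration; the source does not state which multiplier the package uses.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value is an IQR outlier."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical sample data.
df = pd.DataFrame({"id": range(6), "value": [10, 12, 11, 13, 12, 100]})

column_number = 1  # columns are addressed by position
mask = iqr_outlier_mask(df.iloc[:, column_number])

outliers = df[mask]           # rows that would be reported as outliers
without_outliers = df[~mask]  # rows that would remain after dropping them
```

Here the quartiles are 11.25 and 12.75, so the fences are 9.0 and 15.0 and only the value 100 is flagged.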
Missing Values
- find_missing_values(pct=False):
  - Returns the count (or percentage) of missing values in each column.
  - Parameters:
    - pct: If True, returns missing values as a percentage; otherwise returns counts.
- drop_missing_values(cols=None):
  - Drops rows with missing values. Can drop rows with missing values only in specified columns.
  - Parameters:
    - cols: A list of column indices. If None, rows with any missing values are dropped.
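The missing-value helpers map closely onto standard pandas calls. The following sketch shows one way to get the same results without the package. The sample data is hypothetical, and the package's exact return types are not documented here.

```python
import pandas as pd

# Hypothetical sample data with one missing value in each of two columns.
df = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Count of missing values per column.
missing_counts = df.isna().sum()

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Drop rows with missing values only in the columns at the given positions.
cols = [0]
subset = df.columns[cols].tolist()
dropped_subset = df.dropna(subset=subset)

# Drop rows with a missing value in any column.
dropped_any = df.dropna()
```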
Grouping and Aggregation
- groupby_categorical(groupby, col, func='sum', sort_descending=True):
  - Groups the DataFrame by a specified column and applies an aggregation function to another column.
  - Parameters:
    - groupby: Index of the column to group by.
    - col: Index of the column to aggregate.
    - func: Aggregation function (sum, min, max, count, avg).
    - sort_descending: Whether to sort the result in descending order (default is True).
- count_distinct(groupby, col):
  - Counts distinct values of a column within each group.
  - Parameters:
    - groupby: Index of the column to group by.
    - col: Index of the column for which distinct values will be counted.
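To make the index-based grouping concrete, here is a small sketch of comparable logic in plain pandas. The function names groupby_by_index and count_distinct_by_index, the AGG_ALIASES mapping, and the sample data are hypothetical. Translating avg to mean is an assumption about how the package handles that option, since pandas itself has no aggregation named avg.

```python
import pandas as pd

# pandas uses "mean" for averages, so "avg" is translated here (assumption).
AGG_ALIASES = {"avg": "mean"}

def groupby_by_index(df, groupby, col, func="sum", sort_descending=True):
    """Group by the column at position `groupby` and aggregate the column at `col`."""
    key, target = df.columns[groupby], df.columns[col]
    result = df.groupby(key)[target].agg(AGG_ALIASES.get(func, func))
    return result.sort_values(ascending=not sort_descending)

def count_distinct_by_index(df, groupby, col):
    """Count distinct values of the column at `col` within each group."""
    key, target = df.columns[groupby], df.columns[col]
    return df.groupby(key)[target].nunique()

# Hypothetical sample data.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east"],
    "product": ["a", "a", "b", "a", "a"],
    "sales": [10, 5, 20, 15, 30],
})

totals = groupby_by_index(df, 0, 2)                # sum of sales per region
averages = groupby_by_index(df, 0, 2, func="avg")  # mean of sales per region
distinct = count_distinct_by_index(df, 0, 1)       # distinct products per region
```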
Visualization
- show_numerical_distribution():
  - Plots histograms for all numerical columns using Plotly.
Reports
- generate_profile_report():
  - Generates a profile report of the DataFrame using the pandas_profiling library and saves it as profile-report.html.