
A Python package for exploring and cleaning Pandas DataFrames

Project description

PandasExplorer

Overview

The pandasdataexplorer.py module in the PandasExplorer package provides a PandasDataExplorer class that bundles data preprocessing, exploration, and visualization utilities for Pandas DataFrames. Its methods help users efficiently clean, transform, and analyze data: common tasks such as renaming columns, handling missing values, and finding outliers, plus more advanced features such as generating profile reports and plotting data distributions.

Methods

Column Operations

  • clean_columns():

    • Cleans column names by making them lowercase and replacing spaces with underscores.
  • rename_columns(cols: list, new_names: list):

    • Renames specified columns by their indices.
    • Parameters:
      • cols: A list of column indices to rename.
      • new_names: A list of new column names.
  • remove_columns(col_indices):

    • Removes columns from the DataFrame by their indices.
    • Parameters:
      • col_indices: A list of column indices to remove.
  • change_column_dtype(col_number, type='int64'):

    • Changes the data type of a specified column by its index.
    • Parameters:
      • col_number: The index of the column.
      • type: The target data type (default is int64).
  • copy():

    • Creates a copy of the DataFrame.
  • save_copy(filename: str):

    • Saves the DataFrame copy to a CSV file.
    • Parameters:
      • filename: The path to the CSV file where the DataFrame will be saved.
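
The column helpers above presumably wrap plain pandas calls. Below is a minimal sketch of the likely underlying logic; only the method semantics come from this package's docs, while the example DataFrame and the pandas internals are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"First Name": ["Ada", "Grace"],
                   "Score": [91, 88],
                   "Unused Col": [0, 0]})

# clean_columns(): lowercase names, replace spaces with underscores
df.columns = [c.lower().replace(" ", "_") for c in df.columns]

# remove_columns([2]): drop a column by its positional index
df = df.drop(columns=[df.columns[i] for i in [2]])

# rename_columns([1], ["points"]): rename by positional index
df = df.rename(columns={df.columns[i]: name
                        for i, name in zip([1], ["points"])})

# change_column_dtype(1, type='int64'): cast a column by index
df[df.columns[1]] = df[df.columns[1]].astype("int64")
```

Working by positional index (rather than column name) matches the signatures documented above.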

Data Cleaning

  • clean_string_columns():

    • Trims and converts all string (object) columns to lowercase.
  • clean_float_columns():

    • Rounds all float columns to two decimal places.
  • parse_date_columns():

    • Attempts to convert string columns to datetime based on several common formats.
  • parse_int_columns():

    • Attempts to convert string columns to integers or floats based on their contents.
  • drop_duplicate_rows():

    • Removes duplicate rows, keeping only the first occurrence.
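
The cleaning methods above correspond to straightforward pandas idioms. A sketch of what three of them presumably do under the hood (the sample data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["  Paris", "LONDON ", "  Paris"],
    "temp": [21.456, 18.279, 21.456],
})

# clean_string_columns(): trim and lowercase every object column
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip().str.lower()

# clean_float_columns(): round every float column to two decimals
for col in df.select_dtypes(include="float").columns:
    df[col] = df[col].round(2)

# drop_duplicate_rows(): keep only the first occurrence
df = df.drop_duplicates(keep="first").reset_index(drop=True)
```

Note that cleaning before deduplicating matters here: the first and third rows only become duplicates once whitespace and case are normalized.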

Data Exploration

  • show(rows=5):

    • Displays the first n rows of the DataFrame.
    • Parameters:
      • rows: Number of rows to display (default is 5).
  • get_info():

    • Returns basic information about the DataFrame, including column types and non-null counts.
  • find_outliers(column_number):

    • Finds outliers in the specified column using the IQR (Interquartile Range) method.
    • Parameters:
      • column_number: The index of the column to check for outliers.
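
The IQR method flags any value outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR], where IQR = Q3 − Q1. A sketch of the computation find_outliers() presumably performs on a single column (the data is illustrative):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 98])  # 98 is the obvious outlier

# Quartiles and interquartile range
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Values outside the 1.5 * IQR fences are flagged as outliers
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```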

Outlier Handling

  • drop_outliers(column_number):
    • Removes outliers in a specified column using the IQR method.
    • Parameters:
      • column_number: The index of the column where outliers should be dropped.
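
drop_outliers() presumably inverts the same IQR test and keeps only the rows inside the fences. A sketch, with an assumed example column:

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 105, 102, 103, 900]})

col = df["price"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows whose value lies within the IQR fences
df = df[(col >= lower) & (col <= upper)].reset_index(drop=True)
```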

Missing Values

  • find_missing_values(pct=False):

    • Returns the count (or percentage) of missing values in each column.
    • Parameters:
      • pct: If True, returns missing values as a percentage, otherwise returns as counts.
  • drop_missing_values(cols=None):

    • Drops rows with missing values. Can drop rows with missing values only in specified columns.
    • Parameters:
      • cols: A list of column indices. If None, rows with any missing values are dropped.
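
In pandas terms, the two methods above likely reduce to isna() aggregations and dropna() with a subset. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, None, 3],
                   "b": [4, 5, None],
                   "c": [7, 8, 9]})

# find_missing_values(): counts per column; pct=True -> percentages
counts = df.isna().sum()
pct = df.isna().mean() * 100

# drop_missing_values(cols=[0]): drop rows with NaN only in column index 0
df = df.dropna(subset=[df.columns[0]]).reset_index(drop=True)
```

Restricting the subset means the NaN in column "b" survives; only rows missing a value in the named column are removed.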

Grouping and Aggregation

  • groupby_categorical(groupby, col, func='sum', sort_descending=True):

    • Groups the DataFrame by a specified column and applies an aggregation function to another column.
    • Parameters:
      • groupby: Index of the column to group by.
      • col: Index of the column to aggregate.
      • func: Aggregation function (sum, min, max, count, avg).
      • sort_descending: Whether to sort the result in descending order (default is True).
  • count_distinct(groupby, col):

    • Counts distinct values of a column within each group.
    • Parameters:
      • groupby: Index of the column to group by.
      • col: Index of the column for which distinct values will be counted.
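
Both grouping helpers map onto pandas groupby operations; a sketch of the presumed equivalents, using column names in place of the package's positional indices:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east"],
    "sales": [10, 20, 30, 40, 50],
    "rep": ["a", "b", "a", "c", "c"],
})

# groupby_categorical(groupby, col, func='sum', sort_descending=True)
totals = df.groupby("region")["sales"].sum().sort_values(ascending=False)

# count_distinct(groupby, col): distinct reps per region
distinct = df.groupby("region")["rep"].nunique()
```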

Visualization

  • show_numerical_distribution():
    • Plots histograms for all numerical columns using Plotly.

Reports

  • generate_profile_report():
    • Generates a profile report of the DataFrame using the pandas_profiling library and saves it as profile-report.html.

Download files

  • Source distribution: pandasdataexplorer-0.1.0.tar.gz (6.0 kB)
  • Built distribution: PandasDataExplorer-0.1.0-py3-none-any.whl (6.2 kB, Python 3)

File details

pandasdataexplorer-0.1.0.tar.gz

  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.0

Hashes:

  • SHA256: 84e19f0a8544b9b46c2a35755063dc161f59a247809853038a2ff7816350341c
  • MD5: 7facc1b3221693be3b400f3b3197c914
  • BLAKE2b-256: f22a31a5d66a00795bd791b4acb589231dfc53b1455024c075890619343b11ee

File details

PandasDataExplorer-0.1.0-py3-none-any.whl

Hashes:

  • SHA256: 6d7292dfd73a5e20a0b976a0b73fdc4156beefe4852964b4769d749e78adaad7
  • MD5: 041b6337bc2e0887c087e18b1c57ccc0
  • BLAKE2b-256: 3990bdf2bd01a253ae81126503111dd1fe2d8f54d9f3f39357b7366a42a3cba9
