A Python package for cleaning and preprocessing data in pandas DataFrames

Project description

DataScrub

DataScrub provides the DataClean class, a set of methods for cleaning and processing data in a pandas DataFrame. It includes functions for cleaning text data, handling missing values, performing scaling normalization, exploding delimited data, parsing date columns, and translating text columns.

Class Initialization

To create an instance of the DataClean class, pass the path of the data file (CSV or Excel) to the constructor. The class reads the file into a pandas DataFrame based on its extension.

Example:

cleaner = DataClean('data.csv')
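
For reference, the constructor's behavior amounts to choosing a pandas reader by file extension. A minimal sketch of that dispatch (read_by_extension is a hypothetical stand-in, not part of the package):

import pandas as pd

def read_by_extension(filepath):
    # Pick the pandas reader based on the file extension,
    # mirroring what DataClean does on initialization.
    if filepath.endswith('.csv'):
        return pd.read_csv(filepath)
    return pd.read_excel(filepath)  # .xls / .xlsx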

Method: clean_data

The clean_data method cleans text data in the DataFrame. Its columns parameter specifies which columns to clean: pass 'all' to clean every column, or a list of specific column names.

Example:

cleaned_data = cleaner.clean_data(['text_column1', 'text_column2'])
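
The exact cleaning steps are defined inside the package; purely as an illustration, typical text cleaning in plain pandas looks like this:

# Illustrative only; the package's actual cleaning rules may differ.
df['text_column1'] = df['text_column1'].str.strip().str.lower()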

Method: handle_missing_values

The handle_missing_values method handles missing values in the DataFrame. It takes a parameter missing_values, which is a dictionary specifying the actions to be taken for each column with missing values. The keys of the dictionary are the column names, and the values are the operations to be performed.

Example:

missing_values = {'column1': 'replace missing value with 0', 'column2': 'drop'}
processed_data = cleaner.handle_missing_values(missing_values)
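
Assuming 'replace missing value with 0' fills missing entries with 0 and 'drop' removes rows where the column is missing, the example above corresponds to this plain-pandas sketch:

df['column1'] = df['column1'].fillna(0)    # replace missing values with 0
df = df.dropna(subset=['column2'])         # drop rows where column2 is missing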

Method: perform_scaling_normalization

The perform_scaling_normalization method is intended to scale and normalize numerical columns in the DataFrame using the Box-Cox transformation. It is currently marked 'NOT COMPLETE' in the code and is not fully implemented.
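
If you want to complete the method yourself, a minimal sketch using scipy's boxcox could serve as a starting point (boxcox_column is a hypothetical helper; Box-Cox requires strictly positive input, hence the shift):

from scipy.stats import boxcox

def boxcox_column(df, column):
    values = df[column].astype(float)
    # Shift the data if needed so every value is strictly positive.
    shift = 1 - values.min() if values.min() <= 0 else 0
    transformed, lmbda = boxcox(values + shift)
    df[column] = transformed
    return df, lmbda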

Method: explode_data

The explode_data method splits and expands data in specified columns of the DataFrame. It takes a dictionary explode where the keys are column names, and the values are the separators for splitting.

Example:

explode_columns = {'column1': ',', 'column2': ';'}
exploded_data = cleaner.explode_data(explode_columns)
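
Per column, this corresponds to the standard pandas split-and-explode pattern (a sketch for column1):

# Split column1 on ',' into lists, then give each element its own row.
df = df.assign(column1=df['column1'].str.split(',')).explode('column1')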

Method: dupli

The dupli method removes duplicate rows from the DataFrame.

Example:

unique_data = cleaner.dupli()
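
This is presumably equivalent to pandas' built-in deduplication:

df = df.drop_duplicates()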

Method: parse_date_column

The parse_date_column method converts specified columns in the DataFrame to datetime format and formats them as 'YYYY-MM-DD'. It takes a list date_columns containing the names of the columns to be converted.

Example:

date_columns = ['date_column1', 'date_column2']
parsed_data = cleaner.parse_date_column(date_columns)
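
In plain pandas, the same conversion and formatting is (note that strftime yields strings rather than datetime values):

df['date_column1'] = pd.to_datetime(df['date_column1']).dt.strftime('%Y-%m-%d')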

Method: translate_columns

The translate_columns method translates text in specified columns of the DataFrame to English using Google Translate. It takes a dictionary translations where the keys are column names and the values are overwrite booleans: if True, the original column is overwritten; otherwise, a new column with the translated text is added.

Example:

column_translations = {'text_column1': True, 'text_column2': False}
translated_data = cleaner.translate_columns(column_translations)
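
The documentation does not say which translation client is used internally; a sketch of the equivalent using the googletrans package (an assumed choice, not confirmed by the package; install it separately, and note its API has changed between releases):

from googletrans import Translator

translator = Translator()
to_english = lambda text: translator.translate(text, dest='en').text

df['text_column1'] = df['text_column1'].apply(to_english)      # overwrite in place
df['text_column2_en'] = df['text_column2'].apply(to_english)   # add a new column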

Method: prep

The prep method is the main function to prepare and clean the DataFrame. It provides a convenient way to perform multiple cleaning and processing operations in a specific order. You can specify the operations using the following parameters:

  • clean: Columns to clean. Pass 'all' to clean all columns or provide a list of specific column names.
  • missing_values: Actions to be taken on missing values. Pass a dictionary with column names as keys and operations as values.
  • perform_scaling_normalization_bool: Boolean value indicating whether to perform scaling normalization on numerical columns.
  • explode: Columns to be exploded. Pass a dictionary with column names as keys and separators for splitting as values.
  • parse_date: List of column names to be converted to datetime format.
  • translate_column_names: Dictionary mapping column names to overwrite boolean values for translation.

Example:

cleaned_data = cleaner.prep(
    clean='all',
    missing_values={'column1': 'drop'},
    perform_scaling_normalization_bool=True,
    explode={'column2': ','},
    parse_date=['date_column1'],
    translate_column_names={'text_column1': True},
)

Getting the Cleaned DataFrame

To obtain the cleaned and processed DataFrame, you can call the prep method and assign the returned DataFrame to a variable.

Example:

cleaned_data = cleaner.prep(clean='all', missing_values={'column1': 'drop'})

The variable cleaned_data will contain the final cleaned and processed DataFrame.
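
Since the result is an ordinary pandas DataFrame, you can inspect or save it with the usual pandas API:

print(cleaned_data.head())
cleaned_data.to_csv('cleaned_data.csv', index=False)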

Please note that some methods in the code are marked as 'NOT COMPLETE' and require further implementation to work properly. You can modify and complete those methods as per your requirements.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datascrub-1.0.1b0.tar.gz (5.6 kB)

Uploaded Source

Built Distribution

datascrub-1.0.1b0-py3-none-any.whl (8.4 kB)

Uploaded Python 3

File details

Details for the file datascrub-1.0.1b0.tar.gz.

File metadata

  • Download URL: datascrub-1.0.1b0.tar.gz
  • Upload date:
  • Size: 5.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for datascrub-1.0.1b0.tar.gz
  • SHA256: 6e2f0460ad78acf05577179d40c3bfdc71c84ca28a27a185b3fd7cb478e1d2bd
  • MD5: 9695604942cec3712605d1458629ae96
  • BLAKE2b-256: b3774a7941d2601f55165478f0f8ee7cfbc7216abf0d07ad1d5a40aa70bf4dff

File details

Details for the file datascrub-1.0.1b0-py3-none-any.whl.

File metadata

  • Download URL: datascrub-1.0.1b0-py3-none-any.whl
  • Upload date:
  • Size: 8.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for datascrub-1.0.1b0-py3-none-any.whl
  • SHA256: cf00a3ae57c5ec03fb90c9d6145760cdf14482fd6a6f05f7f27240cec525a664
  • MD5: 58bef520066108aedff39a47db3220c1
  • BLAKE2b-256: 5d13e3e6ff6323f182fabbed8017a26f3b451ba32e39a4664646c055f3236787
