A package for aid with data cleaning using pandas.

These details have not been verified by PyPI

Project links

Homepage

Project description

Pandas Data Cleaner

This package is a data cleaning tool for Pandas DataFrames and other objects with a similar structure.

The tool is designed to help clean data by providing a function onto which you can apply various cleaning methods.

The main cleaning function can be found in pandas_data_cleaner.base.clean_data.

The app also provides an abstract base class pandas_data_cleaner.base.CleaningStrategy which can be used to implement custom cleaning strategies.

Installation

To install the application, run the following command:

pip install pandas-data-cleaner

Cleaning Data

In order to clean data, you need:

Pandas DataFrame
List of strategies to apply
Any additional arguments that you may need to pass to the cleaning function.

Let's suppose we have the following DataFrame:

import pandas as pd

dataframe = pd.DataFrame({
    "id": [1, 2, 1],
    "structure_value": ["a", "a", "a"],
    "status": ["ENABLED", "ENABLED", "DISABLED"],
})

As a table, this would look like this:

id	structure_value	status
1	a	ENABLED
2	a	ENABLED
1	a	DISABLED

In this data frame, we can see that there are two rows with the same id but different values for status.

As part of our cleaning exercise, we want to keep the latest row of data as this is the most up-to-date.

Let's try to apply the RemoveDuplicates cleaning strategy to the data frame:

import pandas as pd
from pandas_data_cleaner.base import clean_data
from pandas_data_cleaner.strategies import RemoveDuplicates

dataframe = pd.DataFrame({
    "id": [1, 2, 1],
    "structure_value": ["a", "a", "a"],
    "status": ["ENABLED", "ENABLED", "DISABLED"],
})

dataframe = clean_data(dataframe, [RemoveDuplicates])

Running this will result in the following error:

pandas_data_cleaner.exceptions.MissingOptionsError: Missing kwargs:
remove_duplicates_subset_fields
remove_duplicates_keep

This lets us that we need to provide additional arguments when calling the cleaning function, these are:

remove_duplicates_subset_fields
remove_dupplicates_keep

To find out more information about the additional arguments required, you can run:

RemoveDuplicates.info()

This will return some information on how the strategy works as well as additional information on the arguments that are required.

For the RemoveDuplicates cleaning strategy, remove_duplicates_subset_fields is the fields we should perform the duplicate removal on and remove_duplicates_keep indicates given some duplicates are, which row should we keep.

If we now tweak our earlier code:

import pandas as pd
from pandas_data_cleaner.base import clean_data
from pandas_data_cleaner.strategies import RemoveDuplicates

dataframe = pd.DataFrame({
    "id": [1, 2, 1],
    "structure_value": ["a", "a", "a"],
    "status": ["ENABLED", "ENABLED", "DISABLED"],
})

dataframe = clean_data(
    dataframe,
    [RemoveDuplicates],
    remove_duplicates_subset_fields=["id"],
    remove_duplicates_keep="last"
)

We will now get the following data frame:

pd.DataFrame({
    "id": [2, 1],
    "structure_value": ["a", "a"],
    "status": ["ENABLED", "DISABLED"],
})

As a table:

id	structure_value	status
2	a	ENABLED
1	a	DISABLED

As we had set remove_duplicates_subset_fields=["id"], it found that there were two rows with the same ID. As we set remove_duplicates_keep="last", it kept the last row only.

In our example, we used only one cleaning strategy, but we are free to use as many as we like, we simply need to add all the strategies to the list of cleaning strategies to apply.

Creating Custom Cleaning Strategies

Let's suppose we intend to create a new cleaning strategy that removes certain columns.

We would create a new class inheriting from base.CleaningStrategy:

from pandas_data_cleaner.base import CleaningStrategy


class RemoveColumns(CleaningStrategy):
    pass

When using this strategy, we need to know which column names to remove. We will therefore decide that, when using this class in the clean_data method, we need to provide a remove_columns argument.

To do this, we simply create a class attribute called required_options and set it to ["remove_columns"].

We also will add some documentation to allow the end-user to receive some useful information when they run RemoveColumns.info().

Our new strategy will now look like this:

class RemoveColumns(CleaningStrategy):
    """Removes columns from a dataframe.

    Required options:
        `remove_columns` - (_t.List[str]) A list of columns to remove.
    """

    required_options = ["remove_columns"]

Now, we need to create our cleaning method. Once the cleaning method has been added, the class will look like the following:

class RemoveColumns(CleaningStrategy):
    """Removes columns from a dataframe.

    Required options:
        `remove_columns` - (List[str]) A list of columns to remove.
    """

    required_options = ["remove_columns"]

    def clean(self):
        """Executes the cleaning task."""
        self.dataframe.drop(
            self.remove_columns, axis=1, inplace=True
        )

Let's discuss how this cleaning method works. Firstly, whenever a user would use this strategy may run the following:

clean_data(dataframe, [RemoveColumns], remove_columns=["id", "status"])

clean_data will instantiate each cleaning strategy, in this case, just RemoveColumns providing the data frame as a required initial parameter as well as passing any keyword arguments to the function.

Each strategy would then set both the dataframe and each keyword argument to the self object.

This means that within the clean method, we would have access:

self.dataframe
self.remove_columns.

If the command the user ran was instead:

clean_data(dataframe, [RemoveColumns], remove_columns=["id", "status"], foo="bar")

Then within the clean method would have access:

self.dataframe
self.remove_columns
self.foo

By adding remove_columns to the required_options list, once this class is instantiated, we will be able to access self.remove_columns.

Now that we have built our cleaning strategy let's run it:

dataframe = pd.DataFrame({
    "id": [1, 2, 3],
    "col1": [1, 2, 3],
    "col2": [1, 2, 3],
    "col3": [1, 2, 3],
})

dataframe = clean_data(
    dataframe,
    [RemoveColumns],
    remove_columns=["col1", "col2"]
)

print(dataframe)

>>> pd.DataFrame({
    "id": [1, 2, 3],
    "col3": [1, 2, 3],
})

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

0.0.1

Feb 2, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pandas-data-cleaner-0.0.1.tar.gz (8.2 kB view details)

Uploaded Feb 2, 2022 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pandas_data_cleaner-0.0.1-py3-none-any.whl (8.4 kB view details)

Uploaded Feb 2, 2022 Python 3

File details

Details for the file pandas-data-cleaner-0.0.1.tar.gz.

File metadata

Download URL: pandas-data-cleaner-0.0.1.tar.gz
Upload date: Feb 2, 2022
Size: 8.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.10

File hashes

Hashes for pandas-data-cleaner-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`477e87548754dab11d73c90bbee8288441ac19d4d21d6be559b028845e1ccaee`
MD5	`49c0156ac8b0878d4261fe00e38b9d87`
BLAKE2b-256	`e3858662545faef127a2293b4616540dcd2dbb452242c793da238f0f2e6bd6a6`

See more details on using hashes here.

File details

Details for the file pandas_data_cleaner-0.0.1-py3-none-any.whl.

File metadata

Download URL: pandas_data_cleaner-0.0.1-py3-none-any.whl
Upload date: Feb 2, 2022
Size: 8.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.10

File hashes

Hashes for pandas_data_cleaner-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5ab57f327d77490ab2447abed772d3e431252ba63d97fb2d4af8dc94c68d2971`
MD5	`43d7b2a363d7fe510c0e17895bfe153a`
BLAKE2b-256	`9715b903a9fbb61eede76e674816449e18001575637505d4f44831be1d12a3f6`

See more details on using hashes here.

pandas-data-cleaner 0.0.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Pandas Data Cleaner

Installation

Cleaning Data

Creating Custom Cleaning Strategies

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes