Sanitipy is a user-friendly Python library designed for data cleaning and preprocessing. It provides essential utilities to streamline the process of preparing datasets for analysis or modeling. With features such as duplicate removal, handling missing values, and automatic data type inference, sanitipy simplifies the data cleaning workflow, making it a useful tool for data scientists and analysts.

Project description

SanitiPy - Automatic Data Cleaner

SanitiPy automates the data cleaning process for your data science projects using Python.

Overview

SanitiPy is a user-friendly Python library designed to streamline the data cleaning and preprocessing workflow. It provides essential utilities to prepare datasets for analysis or modeling by handling common data quality issues such as duplicate entries, missing values, and inconsistent data types.

Features

  • Remove Duplicates: Easily eliminate duplicate rows from your DataFrame to ensure data integrity.

  • Handle Missing Values: Automatically identify and remove rows containing NaN (Not a Number) values.

  • Infer Data Types: Intelligently detect and convert column data types, including:

    • Converting potential datetime columns based on a configurable ratio of valid dates.

    • Converting numeric-like values to proper numeric types.

    • Falling back to string type when type inference is unsuccessful.

  • Automated Cleaning Process: The DataCleaner class orchestrates the cleaning steps, ensuring your data is ready for further analysis.

Installation

You can install SanitiPy using pip:

pip install sanitipy

Usage

A quick example of using the package with a pandas DataFrame:

import pandas as pd
from sanitipy import DataCleaner

# Create a sample DataFrame with some common data issues
data = {
  'ID': [1, 2, 3, 1, 4, 5],
  'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve'],
  'Value': [100, 200, None, 100, 400, 500],
  'Date': ['2023/01/01', '2023/01/02', '2023/01/03', '2023/01/01', 'invalid-date', '2023/01/05'],
  'Category': ['A', 'B', 'C', 'A', 'D', 'E']
}
df = pd.DataFrame(data)

# Initialize the DataCleaner
cleaner = DataCleaner(df)

# Clean the data
cleaned_df = cleaner.clean_data()
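
Before running the cleaner, it can help to see which issues the sample data actually contains. The sketch below uses plain pandas only (no SanitiPy calls), and the variable names are illustrative. It counts fully repeated rows, missing cells, and date strings that do not match the `YYYY/MM/DD` pattern used in the sample.

```python
import pandas as pd

# Same sample data as in the example above
data = {
    'ID': [1, 2, 3, 1, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve'],
    'Value': [100, 200, None, 100, 400, 500],
    'Date': ['2023/01/01', '2023/01/02', '2023/01/03', '2023/01/01', 'invalid-date', '2023/01/05'],
    'Category': ['A', 'B', 'C', 'A', 'D', 'E'],
}
df = pd.DataFrame(data)

# Rows that repeat an earlier row in every column
duplicate_rows = int(df.duplicated().sum())

# Cells holding NaN/None anywhere in the frame
missing_cells = int(df.isna().sum().sum())

# Date strings that fail to parse with the sample's explicit format
parsed = pd.to_datetime(df['Date'], format='%Y/%m/%d', errors='coerce')
unparseable_dates = int(parsed.isna().sum())

print(duplicate_rows, missing_cells, unparseable_dates)
```

The fourth row repeats the first, one `Value` is missing, and one `Date` is not a valid date, so each count should be 1 for this sample.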

API Reference

DataCleaner Class

The main class for orchestrating the data cleaning process.

  • __init__(self, data_frame: pd.DataFrame):

    • Initializes the DataCleaner with a pandas DataFrame.
  • clean_data(self) -> pd.DataFrame:

    • Performs a sequence of cleaning operations:
      • Removes duplicate rows.

      • Removes rows with missing values (if any are detected). Raises a ValueError if missing values persist after removal.

      • Infers and converts data types for columns with inconsistent types.

      • Resets the DataFrame index.

    • Returns the cleaned pandas DataFrame.
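
The sequence described above can be approximated with plain pandas. This is a minimal sketch under stated assumptions, not SanitiPy's implementation: `clean_frame` is a hypothetical name, the error message is invented, and the type inference step is left out to keep the example short.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the documented steps (not SanitiPy's code)."""
    # Step 1: remove duplicate rows
    out = df.drop_duplicates()

    # Step 2: remove rows with missing values, only if any are detected
    if out.isna().any().any():
        out = out.dropna()

    # Guard: fail loudly if missing values somehow persist
    if out.isna().any().any():
        raise ValueError("Missing values remain after removal")

    # Step 3 (type inference) is omitted in this sketch

    # Step 4: reset the index so it runs 0..n-1 again
    return out.reset_index(drop=True)


sample = pd.DataFrame({
    "ID": [1, 2, 3, 1],
    "Value": [100.0, None, 300.0, 100.0],
})
cleaned = clean_frame(sample)
print(cleaned)
```

Resetting the index last means the returned frame has contiguous row labels even though rows were dropped in the earlier steps.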

Preprocessor Class

Provides individual data transformation and cleaning utilities.

  • remove_duplicates(self, data: pd.DataFrame) -> pd.DataFrame:

    • Removes duplicate rows from the input DataFrame.
  • remove_na(self, data: pd.DataFrame) -> pd.DataFrame:

    • Removes rows containing any NaN values from the input DataFrame.
  • infer_data_types(self, data_frame: pd.DataFrame, date_time_ratio: float = 0.5) -> pd.DataFrame:

    • Infers and converts data types for columns.
    • date_time_ratio: The threshold (0-1) for treating an object column as datetime. Default is 0.5 (50% valid date values required).
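
To illustrate how a ratio threshold like `date_time_ratio` can drive type inference, here is a rough sketch in plain pandas. It is not SanitiPy's code: `infer_types_sketch` is a hypothetical name, and the order of checks (numeric first, then datetime, then string) and the use of non-null values as the ratio's denominator are assumptions made for this example.

```python
import warnings

import pandas as pd
from pandas.api import types as ptypes


def infer_types_sketch(df: pd.DataFrame, date_time_ratio: float = 0.5) -> pd.DataFrame:
    """Hypothetical ratio-based inference; assumptions are noted inline."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Only text-like columns are candidates for conversion
        if not (ptypes.is_object_dtype(col) or ptypes.is_string_dtype(col)):
            continue
        non_null = int(col.notna().sum())
        if non_null == 0:
            continue

        # Assumption: treat as numeric only if every non-null value converts
        as_number = pd.to_numeric(col, errors="coerce")
        if int(as_number.notna().sum()) == non_null:
            out[name] = as_number
            continue

        # Parse dates; values that fail become NaT. Format-inference
        # warnings are silenced because failures are expected here.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            as_date = pd.to_datetime(col, errors="coerce")
        if as_date.notna().sum() / non_null >= date_time_ratio:
            out[name] = as_date
            continue

        # Fallback: keep the column as strings
        out[name] = col.astype(str)
    return out


frame = pd.DataFrame({
    "when": ["2023/01/01", "2023/01/02", "bad", "2023/01/04"],
    "amount": ["1", "2", "3", "4"],
    "label": ["apple", "banana", "cherry", "kiwi"],
})
result = infer_types_sketch(frame)
strict = infer_types_sketch(frame, date_time_ratio=0.9)
print(result.dtypes)
```

With the default threshold, three of the four `when` values parse, so the column is converted and the bad value becomes `NaT`. Raising the threshold to 0.9 leaves the same column as strings, which is the trade-off the parameter controls.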

Validator Class

Used internally by DataCleaner to check data quality.

  • check_missing_values(self, data: pd.DataFrame) -> int:

    • Returns the total count of missing values in the DataFrame.
  • validate_data_types(self) -> bool:

    • Checks if all columns in the DataFrame have consistent data types. Returns True if all columns have the same data type or if the DataFrame is empty, False otherwise.
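
The two checks can be sketched with plain pandas as follows. These are hypothetical standalone functions, not SanitiPy's code, and the second one follows the literal wording above: it returns True when every column shares one dtype or the frame is empty.

```python
import pandas as pd


def count_missing(df: pd.DataFrame) -> int:
    """Total number of NaN/None cells (hypothetical helper)."""
    return int(df.isna().sum().sum())


def all_same_dtype(df: pd.DataFrame) -> bool:
    """True if the frame is empty or all columns share one dtype."""
    return df.empty or len(set(df.dtypes)) <= 1


gaps = pd.DataFrame({"a": [1, None], "b": [None, "x"]})
uniform = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

print(count_missing(gaps))
print(all_same_dtype(uniform), all_same_dtype(mixed), all_same_dtype(pd.DataFrame()))
```

Under this reading, a typical frame that mixes numeric and text columns returns False, so the check is best understood as a trigger for type inference rather than as a sign of bad data.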

Contributing

Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.

License

This project is licensed under the GNU License - see the LICENSE file for details.

Download files

Source Distribution

sanitipy-1.1.0.tar.gz (19.1 kB)

Uploaded Source

Built Distribution

sanitipy-1.1.0-py3-none-any.whl (18.7 kB)

Uploaded Python 3

File details

Details for the file sanitipy-1.1.0.tar.gz.

File metadata

  • Download URL: sanitipy-1.1.0.tar.gz
  • Upload date:
  • Size: 19.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for sanitipy-1.1.0.tar.gz
Algorithm Hash digest
SHA256 300fec2c23d64b2c1cf797e78563be9e7ded65a22c81099fcb392d7b17d27bbe
MD5 bb8b16e91da575c0631774ed85067bca
BLAKE2b-256 6a2e0a12d5450bbd9b65f576f526dd31bbea6917325117da0d8548e3bd60c841

File details

Details for the file sanitipy-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: sanitipy-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 18.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for sanitipy-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 249274eda0497b8b4c7545247d539d82aa5fe298146c64985e34989e6e4bcd94
MD5 70cc4fa90f1f4e6ae46f1624250e23c7
BLAKE2b-256 ddb48eae498fbe185222ebf391002cf8cd161327539026ad86e508d01b9771c4
