SanitiPy is a user-friendly Python library for data cleaning and preprocessing. It provides utilities that prepare datasets for analysis or modeling: duplicate removal, missing-value handling, and automatic data type inference.
SanitiPy - Automatic Data Cleaner
SanitiPy automates the data cleaning process for your data science projects using Python.
Overview
SanitiPy is a user-friendly Python library designed to streamline the data cleaning and preprocessing workflow. It provides essential utilities to prepare datasets for analysis or modeling by handling common data quality issues such as duplicate entries, missing values, and inconsistent data types.
Features
- Remove Duplicates: Easily eliminate duplicate rows from your DataFrame to ensure data integrity.
- Handle Missing Values: Automatically identify and remove rows containing NaN (Not a Number) values.
- Infer Data Types: Intelligently detect and convert column data types, including:
  - Converting potential datetime columns based on a configurable ratio of valid dates.
  - Converting numeric-like values to proper numeric types.
  - Falling back to string type when type inference is unsuccessful.
- Automated Cleaning Process: The DataCleaner class orchestrates the cleaning steps, ensuring your data is ready for further analysis.
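The numeric conversion and string fallback listed above can be illustrated with plain pandas. This is a minimal sketch, not SanitiPy's actual implementation: the helper name to_numeric_or_string and the rule "convert only when no non-missing value is lost" are assumptions made for illustration.

```python
import pandas as pd


def to_numeric_or_string(column: pd.Series) -> pd.Series:
    """Hypothetical helper: convert to numeric if lossless, else to string."""
    # Values that cannot be parsed as numbers become NaN instead of raising.
    converted = pd.to_numeric(column, errors="coerce")
    # Accept the conversion only if every non-missing value survived it.
    if converted.notna().sum() == column.notna().sum():
        return converted
    # Otherwise fall back to a string representation of the column.
    return column.astype(str)


numeric_like = to_numeric_or_string(pd.Series(["1", "2", "3"]))
mixed = to_numeric_or_string(pd.Series(["1", "two", "3"]))
```

In this sketch, numeric_like ends up with a numeric dtype, while mixed stays a column of strings because "two" cannot be parsed.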
Installation
You can install SanitiPy using pip:
pip install sanitipy
Usage
Quick example on using the package with a Pandas DataFrame:
import pandas as pd
from sanitipy import DataCleaner
# Create a sample DataFrame with some common data issues
data = {
    'ID': [1, 2, 3, 1, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Eve'],
    'Value': [100, 200, None, 100, 400, 500],
    'Date': ['2023/01/01', '2023/01/02', '2023/01/03', '2023/01/01', 'invalid-date', '2023/01/05'],
    'Category': ['A', 'B', 'C', 'A', 'D', 'E']
}
df = pd.DataFrame(data)
# Initialize the DataCleaner
cleaner = DataCleaner(df)
# Clean the data
cleaned_df = cleaner.clean_data()
API Reference
DataCleaner Class
The main class for orchestrating the data cleaning process.
- __init__(self, data_frame: pd.DataFrame):
  - Initializes the DataCleaner with a pandas DataFrame.
- clean_data(self) -> pd.DataFrame:
  - Performs a sequence of cleaning operations:
    - Removes duplicate rows.
    - Removes rows with missing values (if any are detected). Raises a ValueError if missing values persist after removal.
    - Infers and converts data types for columns with inconsistent types.
    - Resets the DataFrame index.
  - Returns the cleaned pandas DataFrame.
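To make the order of operations concrete, the sequence described for clean_data can be approximated with plain pandas calls. This is a rough sketch under stated assumptions, not the library's source: clean_with_pandas is a hypothetical name, and the type inference step is left out here.

```python
import pandas as pd


def clean_with_pandas(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the documented cleaning sequence."""
    # Step 1: remove duplicate rows.
    cleaned = data_frame.drop_duplicates()
    # Step 2: remove rows with missing values, only if any are detected.
    if cleaned.isna().any().any():
        cleaned = cleaned.dropna()
    # Guard: fail loudly if missing values somehow persist.
    if cleaned.isna().any().any():
        raise ValueError("Missing values remain after removal.")
    # (Type inference is omitted in this sketch.)
    # Final step: reset the index so it runs 0..n-1 again.
    return cleaned.reset_index(drop=True)


df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 4, 5],
    "Value": [100, 200, None, 100, 400, 500],
    "Category": ["A", "B", "C", "A", "D", "E"],
})
result = clean_with_pandas(df)
```

With this sample, the repeated row for ID 1 and the row with the missing Value are dropped, leaving four rows with a fresh index.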
Preprocessor Class
Provides individual data transformation and cleaning utilities.
- remove_duplicates(self, data: pd.DataFrame) -> pd.DataFrame:
  - Removes duplicate rows from the input DataFrame.
- remove_na(self, data: pd.DataFrame) -> pd.DataFrame:
  - Removes rows containing any NaN values from the input DataFrame.
- infer_data_types(self, data_frame: pd.DataFrame, date_time_ratio: float = 0.5) -> pd.DataFrame:
  - Infers and converts data types for columns.
  - date_time_ratio: The threshold (0-1) for treating an object column as datetime. Default is 0.5 (50% valid date values required).
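The role of date_time_ratio can be illustrated with a small pandas sketch. The helper names valid_date_ratio and should_convert_to_datetime, the optional date_format argument, and the use of ">=" for the comparison are all assumptions for illustration; how SanitiPy computes and compares the ratio internally is not stated here.

```python
import pandas as pd


def valid_date_ratio(column: pd.Series, date_format=None) -> float:
    """Hypothetical helper: share of values that parse as dates."""
    # Unparseable values become NaT instead of raising an error.
    parsed = pd.to_datetime(column, format=date_format, errors="coerce")
    if len(parsed) == 0:
        return 0.0
    return float(parsed.notna().mean())


def should_convert_to_datetime(column, date_time_ratio=0.5, date_format=None) -> bool:
    """Hypothetical rule: convert when enough values are valid dates."""
    return valid_date_ratio(column, date_format) >= date_time_ratio


dates = pd.Series(["2023/01/01", "2023/01/02", "2023/01/03", "invalid-date", "2023/01/05"])
names = pd.Series(["Alice", "Bob", "Charlie", "David", "Eve"])

dates_ratio = valid_date_ratio(dates, date_format="%Y/%m/%d")
names_ratio = valid_date_ratio(names, date_format="%Y/%m/%d")
```

Here four of the five date strings parse, so the column would pass the default threshold of 0.5 but fail a stricter threshold such as 0.9; the names column parses no values and would be left alone.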
Validator Class
Used internally by DataCleaner to check data quality.
- check_missing_values(self, data: pd.DataFrame) -> int:
  - Returns the total count of missing values in the DataFrame.
- validate_data_types(self) -> bool:
  - Checks if all columns in the DataFrame have consistent data types. Returns True if all columns have the same data type or if the DataFrame is empty, False otherwise.
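The two checks described above can be expressed in a few lines of pandas. This is a sketch of equivalent logic, assuming standalone functions with hypothetical names (count_missing_values, has_single_dtype); it does not reproduce the Validator class itself.

```python
import pandas as pd


def count_missing_values(data: pd.DataFrame) -> int:
    """Hypothetical equivalent: total number of missing cells."""
    # isna() marks missing cells; summing twice totals them across all columns.
    return int(data.isna().sum().sum())


def has_single_dtype(data: pd.DataFrame) -> bool:
    """Hypothetical equivalent: True if at most one dtype is present."""
    # An empty DataFrame has no dtypes at all, so it also passes.
    return bool(data.dtypes.nunique() <= 1)


mixed = pd.DataFrame({"ID": [1, 2], "Value": [1.5, None]})
uniform = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
empty = pd.DataFrame()
```

In this sketch, mixed has one missing cell and two distinct dtypes, while uniform and empty both pass the single-dtype check.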
Contributing
Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.
License
This project is licensed under the GNU License - see the LICENSE file for details.