data-sanitizer is a Python package for cleaning and preprocessing tabular data with pandas. It provides tools for handling missing values, removing duplicates and outliers, converting data types, encoding categorical variables, and scaling numerical features, so that your data is ready for analysis and machine learning.
Features
- Handle Missing Values: Fill missing values with the mean, median, or mode, or drop them entirely.
- Remove Duplicates: Identify and remove duplicate rows from your dataset.
- Remove Outliers: Detect and remove outliers using the interquartile range (IQR) or Z-score method.
- Convert Data Types: Convert specified columns to the data types you give, to keep the dataset consistent.
- Encode Categorical Variables: One-hot encode categorical features to prepare them for machine learning models.
- Scale Numerical Features: Standardize or normalize numerical features, which can improve the performance of many algorithms.
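The IQR and Z-score rules named in the outlier feature can be made concrete with a small standard-library sketch. This is not data-sanitizer's implementation: the function names, the 1.5 multiplier, the threshold of 3, and the quantile convention are all assumptions chosen for illustration.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    # "inclusive" uses linear interpolation between the sample min and max.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def remove_outliers_zscore(values, threshold=3.0):
    """Keep values whose absolute z-score is at most `threshold` (hypothetical helper)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return list(values)  # all values identical, nothing to flag
    return [v for v in values if abs(v - mean) / std <= threshold]


sample = [10, 20, 30, 1000]
print(remove_outliers_iqr(sample))     # [10, 20, 30]
print(remove_outliers_zscore(sample))  # [10, 20, 30, 1000]
```

The two rules disagree on this sample. With only four points, no value can lie more than about 1.73 population standard deviations from the mean, so a Z-score threshold of 3 never fires, while the IQR rule drops 1000. On very small datasets the choice of method matters.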
Installation
Install data-sanitizer using pip:
    pip install data-sanitizer
Usage
data-sanitizer integrates smoothly with pandas DataFrames, making it intuitive for users familiar with pandas.
Example
    import pandas as pd
    # Assumes the importable module is data_sanitizer; a hyphen is not valid in a Python module name.
    from data_sanitizer import (
        handle_missing_values, remove_duplicates, remove_outliers,
        convert_types, encode_categorical, scale_features,
    )

    # Sample DataFrame
    df = pd.DataFrame({
        'A': [1, 2, None, 4],
        'B': [None, 2, 3, 4],
        'C': ['cat', 'dog', 'cat', 'mouse'],
        'D': [10, 20, 30, 1000],
    })

    # Handling missing values
    df = handle_missing_values(df, strategy='mean')
    # Removing duplicates
    df = remove_duplicates(df)
    # Removing outliers
    df = remove_outliers(df, columns=['D'], method='IQR')
    # Converting data types
    df = convert_types(df, columns=['A'], dtypes=[float])
    # Encoding categorical variables
    df = encode_categorical(df, columns=['C'])
    # Scaling numerical features
    df = scale_features(df, columns=['D'], strategy='standard')

    print(df)
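For readers who want to see what each step in the example computes, here is a rough plain-pandas analogue of the same pipeline. It is a sketch under stated assumptions, not the package's code: the mean fill over numeric columns only, the 1.5 × IQR bounds, and the population standard deviation used for scaling are guesses at reasonable defaults, and data-sanitizer may behave differently.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [None, 2, 3, 4],
    "C": ["cat", "dog", "cat", "mouse"],
    "D": [10, 20, 30, 1000],
})

# Missing values: fill each numeric column with its own mean (assumed behavior).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Duplicates: drop rows that repeat an earlier row exactly.
df = df.drop_duplicates()

# Outliers: keep rows whose D lies within 1.5 * IQR of the quartiles (assumed bounds).
q1, q3 = df["D"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["D"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Data types: make sure column A is float.
df = df.astype({"A": float})

# Categorical encoding: one-hot encode column C.
df = pd.get_dummies(df, columns=["C"])

# Scaling: standardize D to zero mean and unit variance (population std, assumed).
df["D"] = (df["D"] - df["D"].mean()) / df["D"].std(ddof=0)

print(df)
```

In this sketch the row with D = 1000 is removed by the IQR step, so the encoded frame has no column for 'mouse'. The order of the steps changes the result.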
Contributing
We welcome contributions to improve data-sanitizer.
Contact
For any questions or issues, please contact the package maintainer at goradbj@gmail.com.
File details
Details for the file data_sanitizer-0.3.0.tar.gz.
File metadata
- Download URL: data_sanitizer-0.3.0.tar.gz
- Upload date:
- Size: 4.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | e4062ddacb762b9f16892b910d75f7cc4c3cdd87691f2c7868b9fd8d77347e06
MD5 | 5c3c5f26a5795f9c8ef32438dce738e0
BLAKE2b-256 | 30b4f5cca393dc124cbc5ac2638e3d5dcccd2e4d6a8ac403eb41244e91b05c74
File details
Details for the file data_sanitizer-0.3.0-py3-none-any.whl.
File metadata
- Download URL: data_sanitizer-0.3.0-py3-none-any.whl
- Upload date:
- Size: 4.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | f8d13c999fecd15fbcecfdbdc4fdb099d3cc132d25ff613126d3950ed925f1d1
MD5 | 91e75869fca1e252d544565a0c7cd4a5
BLAKE2b-256 | 97e6f57344995f393288127287b692d39890d166c7e6be338a911d6b895b2966