aydie-dataset-cleaner


A powerful Python toolkit for validating, cleaning, and preparing datasets for machine learning.


aydie-dataset-cleaner is a comprehensive library designed to streamline the data preprocessing pipeline. It provides a structured, automated, and repeatable workflow for identifying and fixing common data quality issues in CSV, Excel, and Parquet files, ensuring your data is reliable and ready for analysis.

This library is part of the Aydie family of developer tools.

Why aydie-dataset-cleaner?

Data preparation is often said to consume 80% of the work in any data-driven project. Messy, inconsistent data leads to unreliable models and flawed insights. aydie-dataset-cleaner automates the tedious parts of this process, allowing you to focus on building great models.

  • Reliability & Consistency: Apply a standardized set of validation rules to every dataset, ensuring consistent data quality across all your projects.
  • Increased Efficiency: Stop writing boilerplate data cleaning code. With a few lines of code, you can load, validate, clean, and report on any dataset.
  • Clear Insights: Generate beautiful HTML and machine-readable JSON reports to understand exactly what's wrong with your data and what was done to fix it.
  • Extensible & Controllable: You have full control over the cleaning process, from choosing outlier detection methods to defining how missing values are handled.

Core Features

  • Versatile File Loading: Seamlessly load data from CSV, Excel (.xlsx, .xls), and Parquet files into pandas DataFrames.
  • Comprehensive Validation: Run a full suite of checks for common issues like missing values, duplicate rows, and inconsistent data types.
  • Advanced Outlier Detection: Identify outliers in your numerical data using multiple statistical methods (the default iqr rule is sketched just after this list):
    • iqr (Interquartile Range)
    • zscore (Standard Score)
    • mad (Median Absolute Deviation)
    • percentile (Trimming extreme values)
  • Automated Cleaning: Systematically fix issues found during validation, including filling missing data, removing duplicates, and correcting data types.
  • Rich Reporting: Automatically generate a detailed HTML report for easy visual inspection or a JSON report for programmatic use.
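
To make the default iqr rule concrete, here is a minimal sketch of the same Tukey-fence calculation in plain pandas. It illustrates the statistic only; it is not the library's internals:

import pandas as pd

def iqr_outliers(values: pd.Series, multiplier: float = 1.5) -> pd.Series:
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - multiplier * iqr) | (values > q3 + multiplier * iqr)]

prices = pd.Series([75.0, 250.75, 1200.50, 1200.50, 9999.0])
print(iqr_outliers(prices))  # flags 9999.0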

Installation

You can install aydie-dataset-cleaner directly from PyPI:

pip install aydie-dataset-cleaner

Full Workflow Example

Here’s a quick example demonstrating the end-to-end workflow.

import pandas as pd
import numpy as np
from aydie_dataset_cleaner import file_loader, validator, cleaner, reporter

# --- 1. SETUP: Create a messy DataFrame for demonstration ---
data = {
    'product_id': ['A101', 'A102', 'A103', 'A104', 'A101'],
    'price': [1200.50, 75.00, np.nan, 250.75, 1200.50],
    'stock_quantity': [15, 200, 30, np.nan, 15],
    'rating': [4.5, 4.0, 3.5, 4.8, 4.5],
    'region': ['USA', 'EU', 'USA', 'USA', 'UK_typo']
}
dirty_df = pd.DataFrame(data)
print("--- Original Messy DataFrame ---")
print(dirty_df)

# --- 2. VALIDATE: Run all validation checks ---
print("\n--- Validating Dataset ---")
v = validator.DatasetValidator(dirty_df)
validation_results = v.run_all_checks()

# --- 3. REPORT: Generate a human-readable HTML report ---
print("\n--- Generating Report ---")
r = reporter.ReportGenerator(validation_results)
r.to_html('validation_report.html')
print("Validation report saved to 'validation_report.html'")

# --- 4. CLEAN: Clean the dataset based on the validation results ---
print("\n--- Cleaning Dataset ---")
c = cleaner.DatasetCleaner(dirty_df, validation_results)
cleaned_df = c.clean_dataset(missing_value_strategy='median')

# --- 5. VERIFY: Display the cleaned data ---
print("\n--- Cleaned DataFrame ---")
print(cleaned_df)

Module-by-Module Examples

file_loader: Loading Your Data

The load_dataset function automatically detects the file type and loads it into a DataFrame.

from aydie_dataset_cleaner import file_loader

# Load a CSV file with specific options
df_csv = file_loader.load_dataset('data.csv', sep=';')

# Load a specific sheet from an Excel file
df_excel = file_loader.load_dataset('data.xlsx', sheet_name='SalesData')

# Load a Parquet file
df_parquet = file_loader.load_dataset('data.parquet')
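
File-type detection of this kind is typically keyed off the file extension. Here is a minimal sketch of that dispatch pattern in plain pandas (an illustration of the idea, not the library's source):

from pathlib import Path
import pandas as pd

def load_by_extension(path, **kwargs):
    # Map each supported extension to its pandas reader.
    readers = {
        '.csv': pd.read_csv,
        '.xlsx': pd.read_excel,
        '.xls': pd.read_excel,
        '.parquet': pd.read_parquet,
    }
    ext = Path(path).suffix.lower()
    if ext not in readers:
        raise ValueError(f'Unsupported file type: {ext}')
    return readers[ext](path, **kwargs)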

validator: Finding Data Issues

The DatasetValidator is the core of the library, identifying all potential issues.

from aydie_dataset_cleaner import validator

# Assume 'df' is your loaded DataFrame
v = validator.DatasetValidator(df)

# Check for missing values
missing_report = v.check_missing_values()
# {'price': {'count': 1, 'percentage': 20.0}}

# Check for duplicate rows
duplicate_report = v.check_duplicate_rows()
# {'count': 1}

# Check for outliers using different methods
# The run_all_checks() method runs 'iqr' by default
outliers_iqr = v.check_outliers(method='iqr', multiplier=1.5)
outliers_zscore = v.check_outliers(method='zscore', threshold=3)
outliers_mad = v.check_outliers(method='mad', threshold=3.5)
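# The 'percentile' method from the feature list should follow the same
# pattern; its keyword arguments are not documented here, so this call
# (an assumption) passes only the method name.
outliers_percentile = v.check_outliers(method='percentile')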

# Run all checks at once
full_report = v.run_all_checks()
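
If run_all_checks() returns a dictionary shaped like the per-check results above, you can also act on it programmatically, for example to fail a pipeline when issues are found. The top-level key below ('missing_values') is an assumption inferred from those examples, so verify it against your own report:

# Hypothetical key -- inspect full_report in your run to confirm its structure.
missing = full_report.get('missing_values', {})
if missing:
    raise ValueError(f'Columns with missing data: {list(missing)}')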

cleaner: Fixing Data Issues

The DatasetCleaner uses the report from the validator to apply fixes.

from aydie_dataset_cleaner import cleaner

# Assume 'df' is your DataFrame and 'validation_results' is your report
c = cleaner.DatasetCleaner(df, validation_results)

# Handle missing values with different strategies
df_mean_filled = c.handle_missing_values(strategy='mean')
df_median_filled = c.handle_missing_values(strategy='median')
df_mode_filled = c.handle_missing_values(strategy='mode')

# Remove duplicate rows
df_no_duplicates = c.handle_duplicate_rows()

# Run the entire cleaning pipeline
cleaned_df = c.clean_dataset(missing_value_strategy='mean')
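
For reference, a 'median' fill amounts to the following in plain pandas (a one-line sketch of the strategy, not the library's implementation):

# Replace NaNs in each numeric column with that column's median.
df_filled = df.fillna(df.median(numeric_only=True))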

reporter: Generating Reports

The ReportGenerator creates clean, shareable reports from the validation results.

from aydie_dataset_cleaner import reporter

# Assume 'validation_results' is your report
r = reporter.ReportGenerator(validation_results)

# Create a machine-readable JSON file
r.to_json('report.json')

# Create a beautiful, human-readable HTML file
r.to_html('report.html')
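
Because the JSON report is a standard JSON file, consuming it downstream needs nothing beyond Python's standard library:

import json

with open('report.json') as f:
    report = json.load(f)
print(list(report))  # top-level sections of the validation report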

Contributing

Contributions are welcome! If you have an idea for a new feature, find a bug, or want to improve the documentation, please open an issue or submit a pull request on our GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for details.

