A package for automating data quality and integrity checks with optional GPU acceleration using cuDF

DataWhiz

DataWhiz is a Python package that automates data quality and integrity checks on a dataset. It checks for missing values, duplicate rows, and outliers, and validates data types and value ranges. The package uses cuDF for GPU acceleration when a compatible GPU is available and falls back to Dask for parallel processing otherwise.
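
To make the five checks concrete, here is a minimal sketch written directly against pandas. It does not use the DataWhiz API, which this page does not document. The function name `quality_report`, its parameters, and the IQR rule for outliers are all assumptions chosen for illustration, and DataWhiz may define each check differently.

```python
import pandas as pd


def quality_report(df, expected_dtypes, ranges, iqr_factor=1.5):
    """Hypothetical helper: run five basic checks and return the findings."""
    report = {}

    # 1. Missing values: number of nulls per column.
    report["missing"] = {col: int(n) for col, n in df.isna().sum().items()}

    # 2. Duplicate rows: rows identical to an earlier row.
    report["duplicates"] = int(df.duplicated().sum())

    # 3. Outliers (assumed IQR rule): values outside
    #    [Q1 - k*IQR, Q3 + k*IQR] for each numeric column.
    outliers = {}
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        spread = iqr_factor * (q3 - q1)
        mask = (df[col] < q1 - spread) | (df[col] > q3 + spread)
        outliers[col] = int(mask.sum())
    report["outliers"] = outliers

    # 4. Data type validation: columns whose dtype differs from the expected one.
    report["dtype_mismatches"] = {
        col: str(df[col].dtype)
        for col, expected in expected_dtypes.items()
        if str(df[col].dtype) != expected
    }

    # 5. Range validation: non-null values outside the inclusive [low, high] bounds.
    report["out_of_range"] = {
        col: int((~df[col].between(low, high) & df[col].notna()).sum())
        for col, (low, high) in ranges.items()
    }
    return report


# Small sample with one missing age, one duplicate row, and one implausible age.
df = pd.DataFrame({
    "age": [28, 31, 31, None, 29, 240],
    "city": ["Oslo", "Lima", "Lima", "Pune", None, "Kyiv"],
})
report = quality_report(
    df,
    expected_dtypes={"age": "int64"},
    ranges={"age": (0, 120)},
)
print(report)
```

On this sample the missing age forces the `age` column to `float64`, so the data type check reports a mismatch against the expected `int64`. That shows how one data quality problem can surface in more than one check.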

Installation

Basic Installation

You can install the package via pip:

pip install datawhiz

Installation with GPU Support

To use GPU acceleration with cuDF, you need to set up a compatible environment. Follow these steps:

Create a conda environment with RAPIDS:

conda create -n rapids-24.06 -c rapidsai -c conda-forge -c nvidia \
    rapids=24.06 python=3.11 cuda-version=12.2

Activate the conda environment:

conda activate rapids-24.06

Install DataWhiz in the conda environment:

pip install datawhiz

See the RAPIDS installation guide (https://docs.rapids.ai/install) for current cuDF installation instructions.
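
After installation, it can be useful to confirm which DataFrame backend is actually usable in the active environment. The sketch below is only an environment check and is not how DataWhiz itself selects a backend. The function name `pick_backend` is hypothetical, and the final fallback to pandas is an assumption added for this example.

```python
import importlib


def pick_backend():
    """Hypothetical helper: return the first backend module that imports cleanly."""
    # Preference order follows the description above: cuDF first, then Dask.
    # "pandas" as a last resort is an assumption made for this sketch.
    for name in ("cudf", "dask.dataframe", "pandas"):
        try:
            importlib.import_module(name)
        except Exception:
            # A missing package raises ImportError; cuDF may also fail with
            # other errors when no usable GPU or CUDA driver is present.
            continue
        return name
    return None


backend = pick_backend()
print(f"Selected backend: {backend}")
```

If this prints `cudf` inside the conda environment created above, the cuDF package imported successfully. Any other value suggests that cuDF is not installed or could not be imported in that environment.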
