A package for automating data quality and integrity checks with optional GPU acceleration using cuDF
DataPilotChecker
DataPilotChecker is a Python package that automates data quality and integrity checks on a dataset. It checks for missing values, duplicate rows, and outliers, and validates data types and value ranges. The package uses cuDF for GPU acceleration when a compatible GPU is available and falls back to Dask for parallel processing otherwise.
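To make the five kinds of checks concrete, here is a minimal sketch in plain Python using only the standard library. It is not the package's API: every function name below (missing_counts, duplicate_rows, iqr_outliers, out_of_range, wrong_type) and the sample rows are hypothetical, and the IQR rule is just one common way to flag outliers, assumed here for illustration.

```python
from statistics import quantiles

# Hypothetical sample data: one missing age, one duplicated row, one implausible age.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 29},
    {"id": 3, "age": 29},
    {"id": 4, "age": 31},
    {"id": 5, "age": 240},
]


def missing_counts(rows, columns):
    """Count None values per column."""
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}


def duplicate_rows(rows, columns):
    """Return the indices of rows that repeat an earlier row."""
    seen, dupes = set(), []
    for i, r in enumerate(rows):
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring None."""
    clean = [v for v in values if v is not None]
    q1, _, q3 = quantiles(clean, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in clean if v < low or v > high]


def out_of_range(values, low, high):
    """Return non-missing values outside the closed interval [low, high]."""
    return [v for v in values if v is not None and not low <= v <= high]


def wrong_type(values, expected):
    """Return non-missing values that are not instances of the expected type."""
    return [v for v in values if v is not None and not isinstance(v, expected)]


ages = [r["age"] for r in rows]
print(missing_counts(rows, ["id", "age"]))
print(duplicate_rows(rows, ["id", "age"]))
print(iqr_outliers(ages))
print(out_of_range(ages, 0, 120))
print(wrong_type(ages, int))
```

A dataframe library such as cuDF or Dask would run the same kinds of checks column-wise and in parallel; the sketch only shows what each check looks for.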
Installation
Basic Installation
You can install the package via pip:
pip install data_pilot_checker
Installation with GPU Support
To use GPU acceleration with cuDF, you need to set up a compatible environment. Follow these steps:
Create a conda environment with RAPIDS:
conda create -n rapids-24.06 -c rapidsai -c conda-forge -c nvidia \
rapids=24.06 python=3.11 cuda-version=12.2
Activate the conda environment:
conda activate rapids-24.06
Install the package in the conda environment:
pip install data_pilot_checker
For details on installing cuDF, see the RAPIDS installation guide (https://docs.rapids.ai/install).
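After setting up the environment, it can be useful to confirm which backend is importable. The sketch below is an assumption about how such a check could look, not the package's own detection logic: pick_backend is a hypothetical helper, and it only tests whether the cudf or dask modules can be found, not whether a compatible GPU is actually present.

```python
from importlib.util import find_spec


def pick_backend(is_installed=None):
    """Return "cudf" if importable, else "dask" if importable, else None.

    is_installed is an optional callable (module name -> bool), accepted
    so the preference order can be exercised without either library.
    """
    if is_installed is None:
        def is_installed(name):
            # find_spec returns None when a top-level module is not installed.
            return find_spec(name) is not None
    for name in ("cudf", "dask"):
        if is_installed(name):
            return name
    return None


print(pick_backend())
```

If this prints cudf inside the activated conda environment, the RAPIDS installation is at least visible to Python; a result of dask or None suggests the GPU environment is not the one in use.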
Hashes for data_quality_checker-1.7-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 686412c990a23fbde3166f7f7449934d1963113f09173a95e02c9282df73f369
MD5 | d8038dc0613fef2d5009dcc9691b7a22
BLAKE2b-256 | 4f985fd0f4374b67db35a4ca2a0e1f8608e3bd2ed52bd931146fe89d12384355