Spark data quality check tool

Project description

spark-profiling

Requirements

  • Java 8+
  • Apache Spark 3.0+
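
To confirm that an environment meets these minimums before running the profiler, a check along the following lines may help. This is a minimal sketch and not part of the package: `parse_version`, `java_major`, and `meets_minimum` are hypothetical helper names, and the sketch assumes version strings such as `3.0.1` for Spark and either `1.8.0_292` or `11.0.2` for Java (Java 8 reports itself as `1.8`).

```python
import re


def parse_version(text):
    """Turn a version string such as '3.0.1' or '1.8.0_292' into a tuple of ints."""
    # Keep only the leading dotted-number part; suffixes like '_292' are ignored.
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def java_major(text):
    """Return the Java major version; legacy '1.8.x' strings mean Java 8."""
    parts = parse_version(text)
    if parts[0] == 1 and len(parts) > 1:
        return parts[1]
    return parts[0]


def meets_minimum(text, minimum):
    """Return True when the parsed version is at least `minimum` (a tuple of ints)."""
    parsed = parse_version(text)
    # Pad with zeros so that '3' compares equal to (3, 0).
    padded = parsed + (0,) * (len(minimum) - len(parsed))
    return padded >= minimum
```

With PySpark installed, `meets_minimum(pyspark.__version__, (3, 0))` would cover the Spark requirement, and `java_major(...) >= 8` can be applied to the version string printed by `java -version`.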

Dependencies

Filename              Requirements
requirements.txt      Package requirements
requirements-dev.txt  Requirements for development

Usage

Use GeneralProfiler

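from pyspark.sql import SparkSession
from data_quality_check.profiler.general_profiler import GeneralProfiler

# Create (or reuse) a Spark session with Hive support enabled.
spark = SparkSession.builder.appName("SparkProfilingApp").enableHiveSupport().getOrCreate()

# Build a small sample DataFrame to profile.
data = [{'name': 'Alice', 'age': 1}]
df = spark.createDataFrame(data)

# Run the profiler and show the result, requested here as a DataFrame.
result_df = GeneralProfiler(spark, df).run(return_type='dataframe')
result_df.show()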
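
The README does not show what `result_df` contains. As a rough idea of what column-level profiling generally computes, the sketch below derives a few common statistics in plain Python over similar sample rows. It is an illustration only: the function name `profile_columns` and the chosen metrics (row count, null count, distinct count) are assumptions, not the output schema of `GeneralProfiler`.

```python
def profile_columns(rows):
    """Compute simple per-column statistics for a list of dict rows."""
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    total = len(rows)
    profile = {}
    for name in columns:
        # A column missing from a row is treated the same as a null value.
        values = [row.get(name) for row in rows]
        non_null = [v for v in values if v is not None]
        profile[name] = {
            "count": total,
            "null_count": total - len(non_null),
            "distinct_count": len(set(non_null)),
        }
    return profile


# Hypothetical sample rows, extended from the single-row example above.
data = [
    {'name': 'Alice', 'age': 1},
    {'name': 'Bob', 'age': None},
    {'name': 'Alice', 'age': 3},
]
for column, stats in profile_columns(data).items():
    print(column, stats)
```

On a real Spark DataFrame the same statistics would be computed with distributed aggregations rather than Python lists; the sketch only conveys the shape of a per-column report.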

Test

PYTHONPATH=./src pytest tests/*

Build

python setup.py sdist bdist_wheel
twine check dist/*

Publish

# Upload to TestPyPI first to verify the release.
twine upload --repository-url https://test.pypi.org/legacy/ dist/*
# Then upload to PyPI.
twine upload dist/*

Download files

Source Distribution

data-quality-check-0.0.2.tar.gz (6.3 kB)

Built Distribution

data_quality_check-0.0.2-py3-none-any.whl (7.1 kB, Python 3)
