# spark-profiling

A Spark data quality check tool.
## Requirements

- Java 8+
- Apache Spark 3.0+
## Dependencies

| Filename | Description |
|---|---|
| requirements.txt | Package requirements |
| requirements-dev.txt | Requirements for development |
## Usage

### GeneralProfiler

```python
from pyspark.sql import SparkSession
from data_quality_check.profiler.general_profiler import GeneralProfiler

# Create a Hive-enabled Spark session.
spark = SparkSession.builder.appName("SparkProfilingApp").enableHiveSupport().getOrCreate()

# Build a small sample DataFrame to profile.
data = [{'name': 'Alice', 'age': 1}]
df = spark.createDataFrame(data)

# Run the general profiler and collect the result as a DataFrame.
result_df = GeneralProfiler(spark, df).run(return_type='dataframe')
result_df.show()
```
## Test

```shell
PYTHONPATH=./src pytest tests/*
```
## Build

```shell
python setup.py sdist bdist_wheel
twine check dist/*
```
## Publish

```shell
# Upload to TestPyPI first to verify the release, then to PyPI.
twine upload --repository-url https://test.pypi.org/legacy/ dist/*
twine upload dist/*
```
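Repeating the repository URL on every upload can be avoided with a `~/.pypirc` file. A sketch of such a config (the token placeholders are illustrative; `testpypi` is twine's conventional alias for TestPyPI):

```ini
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
username = __token__
password = <your-pypi-api-token>

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = <your-testpypi-api-token>
```

With this in place, `twine upload --repository testpypi dist/*` selects the TestPyPI entry by name.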
## Hashes for data_quality_check-0.0.2-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | c88fa9b42f3abe068ea6d12bc542642656f4b905ae0dc7e7104c5f0e10c87822 |
| MD5 | 268362b6f04e1c6f5f59f4c30d29dbc8 |
| BLAKE2b-256 | b8aad0dafe8052740eda8e3f4fd8a9fa70eb659755b1b98b73e0060a308ce78e |
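As a sketch, the SHA-256 digest of a downloaded wheel can be checked locally against the published value using Python's standard `hashlib`; the file path in the trailing comment is an assumption about where the wheel was saved:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 8192) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the digest published for the release, e.g.:
# expected = "c88fa9b42f3abe068ea6d12bc542642656f4b905ae0dc7e7104c5f0e10c87822"
# assert sha256_of_file("dist/data_quality_check-0.0.2-py3-none-any.whl") == expected
```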