Soda SQL API for PySpark data frames
Soda Spark
Soda Spark is an open-source data quality tool for Spark data frames. It is an extension of soda-sql that allows you to run Soda SQL functionality against a Spark data frame.
Install Soda Spark
Install the package using pip.
pip install soda-spark
Use Soda Spark
After installing Soda Spark, run a scan by passing a YAML scan definition and a Spark data frame to scan.execute:
>>> from pyspark.sql import SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
...     {"id": id, "name": "Paula Landry", "size": 3006},
...     {"id": id, "name": "Kevin Crawford", "size": 7243}
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>>
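To make the scan definition above more concrete, here is a minimal plain-Python sketch of what its metrics and tests express, evaluated on the same two rows. It is an illustration only and not how Soda Spark computes anything: the helper names (`is_uuid`, `summarize`) are hypothetical, `uuid.UUID` is a looser stand-in for the `uuid` valid format, and applying `max` to the `size` column and `min_length` to the `name` column is an assumption about which columns those metrics target.

```python
import uuid

# Sample rows copied from the data frame in the example above.
ROW_ID = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
rows = [
    {"id": ROW_ID, "name": "Paula Landry", "size": 3006},
    {"id": ROW_ID, "name": "Kevin Crawford", "size": 7243},
]


def is_uuid(value):
    """Hypothetical stand-in for `valid_format: uuid`.

    uuid.UUID also accepts some variants (for example, no hyphens),
    so this is looser than a strict pattern check.
    """
    try:
        uuid.UUID(str(value))
    except ValueError:
        return False
    return True


def summarize(rows):
    """Compute plain-Python analogues of the metrics in the scan definition."""
    row_count = len(rows)
    invalid_ids = sum(1 for row in rows if not is_uuid(row["id"]))
    return {
        "row_count": row_count,
        # Assumption: `max` is taken over the numeric column.
        "max_size": max((row["size"] for row in rows), default=None),
        # Assumption: `min_length` is taken over the text column.
        "min_length_name": min((len(row["name"]) for row in rows), default=None),
        "invalid_percentage_id": (
            100.0 * invalid_ids / row_count if row_count else 0.0
        ),
    }


summary = summarize(rows)

# The two tests from the scan definition, written as ordinary expressions.
tests_passed = summary["row_count"] > 0 and summary["invalid_percentage_id"] == 0

print(summary)
print("tests passed:", tests_passed)
```

With Soda Spark itself, the equivalent values come from the measurements in the scan result rather than from hand-written code like this.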