Soda SQL API for PySpark data frame
Soda Spark
Soda Spark is an open-source data quality tool for Spark data frames. It is an extension of soda-sql that allows you to run Soda SQL functionality against a Spark data frame.
Install Soda Spark
Install the package using pip.
pip install soda-spark
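To confirm that the install worked before running a scan, a small check like the following may help. It is only a sketch: the helper name `is_importable` is hypothetical, and the import name `sodaspark` is taken from the usage example in this README.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# "sodaspark" is the import name used in the usage example of this README.
if is_importable("sodaspark"):
    print("soda-spark is available")
else:
    print("soda-spark is missing; run: pip install soda-spark")
```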
Use Soda Spark
Install Soda Spark, then execute a scan with:
>>> from pyspark.sql import SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
...     {"id": id, "name": "Paula Landry", "size": 3006},
...     {"id": id, "name": "Kevin Crawford", "size": 7243}
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements  # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
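The column test in the scan definition above, `invalid_percentage == 0` with `valid_format: uuid`, can be pictured with plain Python. The sketch below is not Soda's implementation: the regular expression, the helper name `invalid_percentage`, and the choice to divide by the number of non-null values are all assumptions made for illustration.

```python
import re

# Assumed pattern for a canonical hyphenated UUID; Soda's own "uuid"
# format may accept a different set of strings.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def invalid_percentage(values):
    """Return the percentage of non-null values that fail the UUID pattern."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    invalid = sum(1 for v in present if not UUID_PATTERN.match(str(v)))
    return 100.0 * invalid / len(present)


# Same id value as the two rows in the example data frame.
ids = ["a76824f0-50c0-11eb-8be8-88e9fe6293fd"] * 2
print(invalid_percentage(ids))            # both values match the pattern
print(invalid_percentage(ids + ["n/a"]))  # one of three values is invalid
```

With the two rows from the example, every `id` matches the pattern, so a test of the form `invalid_percentage == 0` would pass under these assumptions.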