Soda Spark
Soda Spark is an extension of Soda SQL that allows you to run Soda SQL functionality programmatically on a Spark data frame.
Requirements
Soda Spark has the same requirements as soda-sql-spark.
Install
From your command-line interface, run the following command.
$ pip install soda-spark
Use
From your Python prompt, execute the following commands.
>>> from pyspark.sql import DataFrame, SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
... {"id": id, "name": "Paula Landry", "size": 3006},
... {"id": id, "name": "Kevin Crawford", "size": 7243}
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... sql_metrics:
... - sql: |
...     SELECT sum(size) as total_size_us
...     FROM demodata
...     WHERE country = 'US'
...   tests:
...   - total_size_us > 5000
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>> scan_result.test_results # doctest: +ELLIPSIS
[TestResult(test=Test(..., expression='row_count > 0', ...), passed=True, skipped=False, ...)]
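The measurements and test results are plain objects, so you can act on them programmatically, for example to fail a pipeline when a test does not pass. A minimal sketch of that pattern, using a simplified stand-in class rather than sodasql's actual `TestResult` (whose exact fields may differ; see the scan result object for the real attributes):

```python
from dataclasses import dataclass


# Stand-in for sodasql's TestResult, for illustration only; the real
# class carries more fields than shown here.
@dataclass
class TestResultStub:
    expression: str
    passed: bool
    skipped: bool


# Simulated scan output: one passing and one failing test.
test_results = [
    TestResultStub("row_count > 0", passed=True, skipped=False),
    TestResultStub("invalid_percentage == 0", passed=False, skipped=False),
]

# Collect the expressions of tests that ran and did not pass.
failed = [t.expression for t in test_results if not t.passed and not t.skipped]
print(failed)  # ['invalid_percentage == 0']
```

In a real pipeline you would iterate over `scan_result.test_results` the same way and raise or alert when `failed` is non-empty.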
>>>
Or, use a scan YAML file:
>>> scan_yml = "static/demodata.yml"
>>> scan_result = scan.execute(scan_yml, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
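The contents of `static/demodata.yml` are not shown in this README; it is a regular Soda SQL scan YAML file. A plausible minimal version, mirroring the inline scan definition above (an illustration, not taken from the repository):

```yaml
# Hypothetical contents of static/demodata.yml, mirroring the
# inline scan definition shown earlier in this README.
table_name: demodata
metrics:
- row_count
tests:
- row_count > 0
columns:
  id:
    valid_format: uuid
    tests:
    - invalid_percentage == 0
```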
>>>
See the scan result object for all attributes and methods.
Send results to Soda Cloud
Send the scan result to Soda Cloud.
>>> import os
>>> from sodasql.soda_server_client.soda_server_client import SodaServerClient
>>>
>>> soda_server_client = SodaServerClient(
... host="cloud.soda.io",
... api_key_id=os.getenv("API_PUBLIC"),
... api_key_secret=os.getenv("API_PRIVATE"),
... )
>>> scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)
>>>
Understand
Under the hood, soda-spark does the following:
- Set up the scan
- Use the Spark dialect
- Use the Spark session as the warehouse connection
- Create (or replace) a global temporary view for the Spark data frame
- Execute the scan on the temporary view