Soda Spark
Soda Spark is an extension of Soda SQL that allows you to run Soda SQL functionality programmatically on a Spark data frame.
Requirements
Soda Spark has the same requirements as soda-sql-spark.
Install
From your command-line interface, run the following command.
$ pip install soda-spark
Use
From your Python prompt, execute the following commands.
>>> from pyspark.sql import DataFrame, SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
... {"id": id, "name": "Paula Landry", "size": 3006},
... {"id": id, "name": "Kevin Crawford", "size": 7243}
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... sql_metrics:
... - sql: |
...     SELECT sum(size) as total_size_us
...     FROM demodata
...     WHERE country = 'US'
...   tests:
...   - total_size_us > 5000
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>> scan_result.test_results # doctest: +ELLIPSIS
[TestResult(test=Test(..., expression='row_count > 0', ...), passed=True, skipped=False, ...)]
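The measurements and test results are plain objects, so you can act on them programmatically, for example to fail a pipeline when a test does not pass. A minimal sketch of that pattern, using a simplified stand-in class rather than sodasql's actual `TestResult` (whose exact fields may differ; see the scan result object for the real attributes):

```python
from dataclasses import dataclass


# Stand-in for sodasql's TestResult, for illustration only; the real
# class carries more fields than shown here.
@dataclass
class TestResultStub:
    expression: str
    passed: bool
    skipped: bool


# Simulated scan output: one passing and one failing test.
test_results = [
    TestResultStub("row_count > 0", passed=True, skipped=False),
    TestResultStub("invalid_percentage == 0", passed=False, skipped=False),
]

# Collect the expressions of tests that ran and did not pass.
failed = [t.expression for t in test_results if not t.passed and not t.skipped]
print(failed)  # ['invalid_percentage == 0']
```

In a real pipeline you would iterate over `scan_result.test_results` the same way and raise or alert when `failed` is non-empty.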
>>>
Or, use a scan YAML file:
>>> scan_yml = "static/demodata.yml"
>>> scan_result = scan.execute(scan_yml, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
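The contents of `static/demodata.yml` are not shown in this README; it is a regular Soda SQL scan YAML file. A plausible minimal version, mirroring the inline scan definition above (an illustration, not taken from the repository):

```yaml
# Hypothetical contents of static/demodata.yml, mirroring the
# inline scan definition shown earlier in this README.
table_name: demodata
metrics:
- row_count
tests:
- row_count > 0
columns:
  id:
    valid_format: uuid
    tests:
    - invalid_percentage == 0
```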
>>>
See the scan result object for all attributes and methods.
Send results to Soda Cloud
Send the scan result to Soda Cloud.
>>> import os
>>> from sodasql.soda_server_client.soda_server_client import SodaServerClient
>>>
>>> soda_server_client = SodaServerClient(
... host="cloud.soda.io",
... api_key_id=os.getenv("API_PUBLIC"),
... api_key_secret=os.getenv("API_PRIVATE"),
... )
>>> scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)
>>>
Understand
Under the hood, soda-spark does the following:
- Set up the scan
- Use the Spark dialect
- Use the Spark session as the warehouse connection
- Create (or replace) a global temporary view for the Spark data frame
- Execute the scan on the temporary view