Soda SQL API for PySpark data frame

Soda Spark

Data testing, monitoring, and profiling for Spark DataFrames.

Soda Spark is an extension of Soda SQL that allows you to run Soda SQL functionality programmatically on a Spark data frame.

Soda SQL is an open-source command-line tool. It utilizes user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface "bad" data that you can fix to ensure that downstream analysts are using "good" data to make decisions.

Requirements

Soda Spark has the same requirements as soda-sql-spark.

Install

From your shell, execute the following command.

$ pip install soda-spark

Use

From your Python prompt, execute the following commands.

>>> from pyspark.sql import DataFrame, SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
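>>> record_id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> # The SQL metric below filters on country, so the demo rows need that column.
>>> df = spark_session.createDataFrame([
...     {"id": record_id, "name": "Paula Landry", "size": 3006, "country": "US"},
...     {"id": record_id, "name": "Kevin Crawford", "size": 7243, "country": "US"},
... ])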
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... sql_metrics:
... - sql: |
...     SELECT sum(size) as total_size_us
...     FROM demodata
...     WHERE country = 'US'
...   tests:
...   - total_size_us > 5000
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements  # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>> scan_result.test_results  # doctest: +ELLIPSIS
[TestResult(test=Test(..., expression='row_count > 0', ...), passed=True, skipped=False, ...)]
>>>
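
In an automated pipeline you will usually want the job to stop when a test fails. The following is a minimal sketch, not part of soda-spark: failed_tests and assert_scan_passed are hypothetical helpers that rely only on the test.expression, passed, and skipped attributes visible in the output above.

```python
from typing import Any, Iterable, List


def failed_tests(test_results: Iterable[Any]) -> List[Any]:
    """Return the test results that ran (were not skipped) and did not pass."""
    return [r for r in test_results if not r.skipped and not r.passed]


def assert_scan_passed(test_results: Iterable[Any]) -> None:
    """Raise AssertionError naming the expression of every failed test."""
    failures = failed_tests(test_results)
    if failures:
        expressions = ", ".join(r.test.expression for r in failures)
        raise AssertionError(f"{len(failures)} Soda test(s) failed: {expressions}")
```

With the scan above, this would be called as assert_scan_passed(scan_result.test_results).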

Or, pass the path to a scan YAML file instead of an inline definition:

>>> scan_yml = "static/demodata.yml"
>>> scan_result = scan.execute(scan_yml, df)
>>>
>>> scan_result.measurements  # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>>

See the scan result object for all attributes and methods.

Or, return Spark data frames:

>>> measurements, test_results, errors = scan.execute(scan_yml, df, as_frames=True)
>>>
>>> measurements  # doctest: +ELLIPSIS
DataFrame[metric: string, column_name: string, value: string, ...]
>>> test_results  # doctest: +ELLIPSIS
DataFrame[test: struct<...>, passed: boolean, skipped: boolean, values: map<string,string>, ...]
>>>

See the _to_data_frame functions in scan.py for how the conversion is done.

Send results to Soda Cloud

To send the scan result to Soda Cloud, pass a SodaServerClient to scan.execute.

>>> import os
>>> from sodasql.soda_server_client.soda_server_client import SodaServerClient
>>>
>>> soda_server_client = SodaServerClient(
...     host="cloud.soda.io",
...     api_key_id=os.getenv("API_PUBLIC"),
...     api_key_secret=os.getenv("API_PRIVATE"),
... )
>>> scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)
>>>
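
Because os.getenv returns None for an unset variable, a missing key may only surface later, when the client tries to authenticate. A minimal sketch of a fail-fast wrapper follows. soda_cloud_credentials is a hypothetical helper, not part of soda-spark or soda-sql; the environment variable names and keyword arguments are the ones used above.

```python
import os
from typing import Dict, Mapping, Optional


def soda_cloud_credentials(
    environ: Optional[Mapping[str, str]] = None,
    host: str = "cloud.soda.io",
) -> Dict[str, str]:
    """Build the keyword arguments for SodaServerClient, failing early on missing keys."""
    env = os.environ if environ is None else environ
    # Treat unset and empty variables the same way: both are unusable as keys.
    missing = [name for name in ("API_PUBLIC", "API_PRIVATE") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variable(s): {', '.join(missing)}")
    return {
        "host": host,
        "api_key_id": env["API_PUBLIC"],
        "api_key_secret": env["API_PRIVATE"],
    }
```

The result can then be unpacked into the client: SodaServerClient(**soda_cloud_credentials()).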

Understand

Under the hood, soda-spark does the following:

1. Set up the scan
   - Use the Spark dialect
   - Use the Spark session as the warehouse connection
2. Create (or replace) a global temporary view for the Spark data frame
3. Execute the scan on the temporary view
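
The view-based part of these steps can be sketched with plain PySpark calls. This is an illustration under assumptions, not soda-spark's actual implementation: query_via_global_temp_view is a hypothetical helper, and it assumes Spark's default global temporary database name, global_temp.

```python
from typing import Any


def query_via_global_temp_view(spark: Any, df: Any, view_name: str, sql_template: str) -> Any:
    """Expose df as a global temporary view and run a SQL query against it.

    spark is expected to be a pyspark.sql.SparkSession and df a
    pyspark.sql.DataFrame; sql_template must contain a {table} placeholder.
    """
    # Create (or replace) the global temporary view for the data frame.
    df.createOrReplaceGlobalTempView(view_name)
    # Global temporary views live in a system database; "global_temp" is
    # Spark's default name for it (assumed here; it is configurable).
    table = f"global_temp.{view_name}"
    # Run the query through the Spark session, which acts as the connection.
    return spark.sql(sql_template.format(table=table))
```

For example, query_via_global_temp_view(spark_session, df, "demodata", "SELECT count(*) AS row_count FROM {table}") returns a one-row data frame holding the row count.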

Release history

- 0.3.3 (this version): May 11, 2022
- 0.3.2: Apr 8, 2022
- 0.3.1: Dec 27, 2021
- 0.3.0: Dec 22, 2021
- 0.2.3: Nov 18, 2021
- 0.2.2: Nov 16, 2021
- 0.2.1: Sep 23, 2021
- 0.2.0: Sep 9, 2021
- 0.1.1: Sep 7, 2021
- 0.1.0: Sep 6, 2021
- 0.0.0: Sep 6, 2021

Download files

Source distribution: soda-spark-0.3.3.tar.gz (41.8 kB)
- Upload date: May 11, 2022
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
- SHA256: `214eb8d930ef98dc487547636231e4f955613f622476a846f7fa49b073879571`
- MD5: `b2f9d218f382d39db9527cdffe319054`
- BLAKE2b-256: `120799d2bd62260c66cc7022a1fd490e3afb8da938913ebebcb9696f79996fb4`

Built distribution: soda_spark-0.3.3-py3-none-any.whl (10.6 kB)
- Upload date: May 11, 2022
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
- SHA256: `557c820f0f0a37c45996bae2186b0f94609c2a933313d6aeb606378fa8fd0511`
- MD5: `91a6f89fb174adb5a41412df69dec399`
- BLAKE2b-256: `7d42d81ec19101a563731a5518ffbc9e1c78186b80d065b1f7b7dd8d84dd2106`
