Skip to main content

Soda SQL API for PySpark data frame

Project description

Soda Spark


Data testing, monitoring, and profiling for Spark Dataframes.

License: Apache 2.0 Slack Pypi Soda PARK Build soda-spark

Soda Spark is an extension of Soda SQL that allows you to run Soda SQL functionality programmatically on a Spark data frame.

Soda SQL is an open-source command-line tool. It utilizes user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface "bad" data that you can fix to ensure that downstream analysts are using "good" data to make decisions.

Requirements

Soda Spark has the same requirements as soda-sql-spark.

Install

From your shell, execute the following command.

$ pip install soda-spark

Use

From your Python prompt, execute the following commands.

>>> from pyspark.sql import DataFrame, SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
...	   {"id": id, "name": "Paula Landry", "size": 3006},
...	   {"id": id, "name": "Kevin Crawford", "size": 7243}
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... sql_metrics:
... - sql: |
...     SELECT sum(size) as total_size_us
...     FROM demodata
...     WHERE country = 'US'
...   tests:
...   - total_size_us > 5000
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements  # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>> scan_result.test_results  # doctest: +ELLIPSIS
[TestResult(test=Test(..., expression='row_count > 0', ...), passed=True, skipped=False, ...)]
>>>

Or, use a scan YAML file

>>> scan_yml = "static/demodata.yml"
>>> scan_result = scan.execute(scan_yml, df)
>>>
>>> scan_result.measurements  # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>>

See the scan result object for all attributes and methods.

Or, return Spark data frames:

>>> measurements, test_results, errors = scan.execute(scan_yml, df, as_frames=True)
>>>
>>> measurements  # doctest: +ELLIPSIS
DataFrame[metric: string, column_name: string, value: string, ...]
>>> test_results  # doctest: +ELLIPSIS
DataFrame[test: struct<...>, passed: boolean, skipped: boolean, values: map<string,string>, ...]
>>>

See the _to_data_frame functions in the scan.py to see how the conversion is done.

Send results to Soda cloud

Send the scan result to Soda cloud.

>>> import os
>>> from sodasql.soda_server_client.soda_server_client import SodaServerClient
>>>
>>> soda_server_client = SodaServerClient(
...     host="cloud.soda.io",
...     api_key_id=os.getenv("API_PUBLIC"),
...     api_key_secret=os.getenv("API_PRIVATE"),
... )
>>> scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)
>>>

Understand

Under the hood soda-spark does the following.

  1. Setup the scan
  2. Create (or replace) global temporary view for the Spark data frame
  3. Execute the scan on the temporary view

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

soda-spark-0.3.3.tar.gz (41.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

soda_spark-0.3.3-py3-none-any.whl (10.6 kB view details)

Uploaded Python 3

File details

Details for the file soda-spark-0.3.3.tar.gz.

File metadata

  • Download URL: soda-spark-0.3.3.tar.gz
  • Upload date:
  • Size: 41.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.12

File hashes

Hashes for soda-spark-0.3.3.tar.gz
Algorithm Hash digest
SHA256 214eb8d930ef98dc487547636231e4f955613f622476a846f7fa49b073879571
MD5 b2f9d218f382d39db9527cdffe319054
BLAKE2b-256 120799d2bd62260c66cc7022a1fd490e3afb8da938913ebebcb9696f79996fb4

See more details on using hashes here.

File details

Details for the file soda_spark-0.3.3-py3-none-any.whl.

File metadata

  • Download URL: soda_spark-0.3.3-py3-none-any.whl
  • Upload date:
  • Size: 10.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.12

File hashes

Hashes for soda_spark-0.3.3-py3-none-any.whl
Algorithm Hash digest
SHA256 557c820f0f0a37c45996bae2186b0f94609c2a933313d6aeb606378fa8fd0511
MD5 91a6f89fb174adb5a41412df69dec399
BLAKE2b-256 7d42d81ec19101a563731a5518ffbc9e1c78186b80d065b1f7b7dd8d84dd2106

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page