Skip to main content

Soda SQL library & CLI

Project description

Soda logo

Soda SQL

Data testing, monitoring, and profiling for SQL-accessible data.

License: Apache 2.0 Slack Pypi Soda SQL Build soda-sql



Soda SQL is an open-source command-line tool. It utilizes user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface "bad" data that you can fix to ensure that downstream analysts are using "good" data to make decisions.

Quick start tutorial

Test your data

If your organization uses data to make decisions, you should always be testing your data.

  • When data comes into a system, you should test it.
  • When data is transformed or otherwise manipulated to fit into an app or other database, you should test it.
  • When data is about to be exported, you should test it.
  • Test to make sure data is unique.
  • Test that data is in an expected format, such as date or UUID.
  • Test that data doesn’t exceed limits or acceptable parameters.

Install Soda SQL

Requirements:

  • Python 3.7 or greater
  • Pip 21.0 or greater

Install:

$ pip install soda-sql

Full installation instructions

Use Soda SQL

Install Soda SQL, then complete three basic tasks to start testing your data:

  1. Create and configure a warehouse.yml file so that Soda SQL can connect to your data warehouse.
  2. Create and configure a scan.yml file to define tests for "bad" data. Choose from a list of predefined metrics to define simple tests – is the table empty? – to more complex tests that borrow from SQL query logic.
  3. Run a scan from the command-line to test for "bad" data. Where the tests return “true”, all is well; where a test returns “false”, Soda SQL presents the issues in the command-line output.

scan output

Full configuration instructions

Show me the metrics!

See for yourself! Follow the Quick start tutorial to see Soda SQL up and running in five minutes.

This example scan.yml file defines four tests that Soda SQL runs on data in a table in a data warehouse.

metrics:
    - row_count
    - missing_count
    - missing_percentage
    - values_count
    - values_percentage
    - valid_count
    - valid_percentage
    - invalid_count
    - invalid_percentage
    - min
    - max
    - avg
    - sum
    - min_length
    - max_length
    - avg_length
    - distinct
    - unique_count
    - duplicate_count
    - uniqueness
    - maxs
    - mins
    - frequent_values
    - histogram
columns:
    ID:
        metrics:
            - distinct
            - duplicate_count
        valid_format: uuid
        tests:
            duplicate_count == 0
    CATEGORY:
        missing_values:
            - N/A
            - No category
        tests:
            missing_percentage < 3
    SIZE:
        tests:
            max - min < 20
sql_metrics:
    - sql: |
        SELECT sum(volume) as total_volume_us
        FROM CUSTOMER_TRANSACTIONS
        WHERE country = 'US'
      tests:
        - total_volume_us > 5000
Test Description Outcome
tests: duplicate_count == 0 Tests that there are no duplicate values in the ID column of the table. The test fails if it finds duplicate values.
tests: missing_percentage < 3 Tests that less than 3% of the values in the CATEGORY column match the values set under missing_values. The test fails if more than 3% of the values in the column contain n/a or No category.
tests: max - min < 20 Tests that the difference between the highest value and the lowest value in the SIZE column is less than 20. The test fails if the difference exceeds 20.
tests: total_volume_us > 5000 Tests that the sum total of US transactions in the CUSTOMER_TRANSACTIONS column is greater than 5000. The test fails if the sum total is less than 5000.

When Soda SQL scans the table, it returns the following scan output in your command-line interface.

$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
  | SELECT column_name, data_type, is_nullable
  | FROM information_schema.columns
  | WHERE lower(table_name) = 'customers'
  |   AND table_catalog = 'datasource.database'
  |   AND table_schema = 'datasource.schema'
  - 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
  | SELECT
  |  COUNT(*),
  |  COUNT(CASE WHEN ID IS NULL THEN 1 END),
  |  COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
  |  MIN(LENGTH(ID)),
  |  AVG(LENGTH(ID)),
  |  MAX(LENGTH(ID)),
  | FROM customers
  - 0.557 seconds
row_count : 23543
missing   : 23
invalid   : 0
min_length: 9
avg_length: 9
max_length: 9

...more queries...

47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds

Go further

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

soda-sql-2.0.0b27.tar.gz (54.8 kB view details)

Uploaded Source

Built Distribution

soda_sql-2.0.0b27-py3-none-any.whl (84.9 kB view details)

Uploaded Python 3

File details

Details for the file soda-sql-2.0.0b27.tar.gz.

File metadata

  • Download URL: soda-sql-2.0.0b27.tar.gz
  • Upload date:
  • Size: 54.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.8

File hashes

Hashes for soda-sql-2.0.0b27.tar.gz
Algorithm Hash digest
SHA256 6cfb3e37e1db9dba7f68733a32738863dda5921de7d1613850d78d5db503c6f7
MD5 e83f374f7901054a87f88564c639b537
BLAKE2b-256 5d52d822c96275a6f25f250fa8f151dcfb9563bd65a24c471949e92838d5e12f

See more details on using hashes here.

File details

Details for the file soda_sql-2.0.0b27-py3-none-any.whl.

File metadata

  • Download URL: soda_sql-2.0.0b27-py3-none-any.whl
  • Upload date:
  • Size: 84.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.8

File hashes

Hashes for soda_sql-2.0.0b27-py3-none-any.whl
Algorithm Hash digest
SHA256 a52ec4caa856533ae89e4641d22bc291bee0da9dbf07e24aab02416d072d3079
MD5 3e859c3a14df690c43752f3eb7659a18
BLAKE2b-256 75d2fc15580b90adb97d07a556ca4ae9b2b6eb8bf95baf6754a70f5fbfe89e96

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page