Skip to main content

Soda SQL library & CLI

Project description

Soda SQL

Data testing and monitoring for SQL accessible data.

What does Soda SQL do?

  • Stops your pipeline if bad data is detected
  • Extracts metrics through SQL
  • Full control over metrics and queries

Why Soda SQL?

To protect against silent data issues for the consumers of your data, it's recommended to check your data before and after every data pipeline job. You will know when bad data enters your pipeline. And you will prevent delivery of bad data to downstream consumers.

How does Soda SQL work?

Soda SQL is a Command Line Interface (CLI) and a Python library to measure and test your data using SQL.

As input, Soda SQL uses Yaml configuration files that include:

  • SQL connection details
  • What metrics to compute
  • What tests to run on the measurements

Based on those configuration files, Soda SQL will perform scans. A scan performs all measurements and runs all tests associated with one table. Typically a scan is executed after new data has arrived. All soda-sql configuration files can be checked into your version control system as part of your pipeline code.

Want to try Soda SQL? Head over to our '5 minute tutorial' and get started straight away!

Show me the money

Simple metrics and tests can be configured in Yaml configuration files called scan.yml. An example of the contents of such a file:

metrics:
    - row_count
    - missing_count
    - missing_percentage
    - values_count
    - values_percentage
    - valid_count
    - valid_percentage
    - invalid_count
    - invalid_percentage
    - min
    - max
    - avg
    - sum
    - min_length
    - max_length
    - avg_length
columns:
    ID:
        metrics:
            - distinct
            - duplicate_count
        valid_format: uuid
        tests:
            duplicates: duplicate_count == 0
    CATEGORY:
        missing_values:
            - N/A
            - No category
        tests:
            missing: missing_percentage < 3
    SIZE:
        metrics:
            -
        tests:
            spread: max - min < 20

Metrics aren't limited to the ones defined by Soda SQL. You can create your own custom SQL metric definitions with a simple yml file.

metrics:
    - total_volume_us
sql: |
    SELECT sum(volume) as total_volume_us
    FROM CUSTOMER_TRANSACTIONS
    WHERE country = 'US'
tests:
    - total_volume_us > 5000

Based on these configuration files, Soda SQL will scan your data each time new data arrived like this:

$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
  | SELECT column_name, data_type, is_nullable
  | FROM information_schema.columns
  | WHERE lower(table_name) = 'customers'
  |   AND table_catalog = 'datasource.database'
  |   AND table_schema = 'datasource.schema'
  - 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
  | SELECT
  |  COUNT(*),
  |  COUNT(CASE WHEN ID IS NULL THEN 1 END),
  |  COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
  |  MIN(LENGTH(ID)),
  |  AVG(LENGTH(ID)),
  |  MAX(LENGTH(ID)),
  | FROM customers
  - 0.557 seconds
row_count : 23543
missing   : 23
invalid   : 0
min_length: 9
avg_length: 9
max_length: 9

...more queries...

47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds

The next step is to add Soda SQL scans in your favorite data pipeline orchestration solution like:

  • Airflow
  • AWS Glue
  • Prefect
  • Dagster
  • Fivetran
  • Matillion
  • Luigi

If you like the goals of this project, encourage us! Star soda-sql on GitHub

Next, head over to our '5 minute tutorial' and get your first project going!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

soda-sql-2.0.0b2.tar.gz (32.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

soda_sql-2.0.0b2-py3-none-any.whl (58.7 kB view details)

Uploaded Python 3

File details

Details for the file soda-sql-2.0.0b2.tar.gz.

File metadata

  • Download URL: soda-sql-2.0.0b2.tar.gz
  • Upload date:
  • Size: 32.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.2 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.7

File hashes

Hashes for soda-sql-2.0.0b2.tar.gz
Algorithm Hash digest
SHA256 17757f1c73a469e4d753b9418104aacab16648bcb4e02a1e599f0398b1f97175
MD5 c6ff9ea0b1ef942912481119604ccedb
BLAKE2b-256 c386b3d8c7b23c208febb713900740b8c2da676acc174788093ad236731f61cb

See more details on using hashes here.

File details

Details for the file soda_sql-2.0.0b2-py3-none-any.whl.

File metadata

  • Download URL: soda_sql-2.0.0b2-py3-none-any.whl
  • Upload date:
  • Size: 58.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.2 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.7

File hashes

Hashes for soda_sql-2.0.0b2-py3-none-any.whl
Algorithm Hash digest
SHA256 521ce39ab8e8fc075ea4504b3f61715a2f1c89d95b6aebe5564db3ce45f18270
MD5 8d0439df0c88dce2c39cca3603540d3a
BLAKE2b-256 e3393b7e7d79f1906fd89a50845b448422a9892e3dc2a33efe61f982baa9b330

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page