Skip to main content

Soda SQL library & CLI

Project description

Soda logo

Soda SQL

Data testing, monitoring and profiling for SQL accessible data.

License: Apache 2.0 Slack Pypi Soda SQL Build soda-sql

What does Soda SQL do?

Soda SQL allows you to

  • Stop your pipeline when bad data is detected
  • Extract metrics and column profiles through super efficient SQL
  • Full control over metrics and queries through declarative config files

Why Soda SQL?

To protect against silent data issues for the consumers of your data, it's best-practice to profile and test your data:

  • as it lands in your warehouse,
  • after every important data processing step
  • right before consumption.

This way you will prevent delivery of bad data to downstream consumers. You will spend less time firefighting and gain a better reputation.

How does Soda SQL work?

Soda SQL is a Command Line Interface (CLI) and a Python library to measure and test your data using SQL.

As input, Soda SQL uses YAML configuration files that include:

  • SQL connection details
  • What metrics to compute
  • What tests to run on the measurements

Based on those configuration files, Soda SQL will perform scans. A scan performs all measurements and runs all tests associated with one table. Typically a scan is executed after new data has arrived. All soda-sql configuration files can be checked into your version control system as part of your pipeline code.

Want to try Soda SQL? Head over to our 'Quick start tutorial' and get started straight away!

"Show me the metrics"

Let's walk through an example. Simple metrics and tests can be configured in scan YAML configuration files. An example of the contents of such a file:

metrics:
    - row_count
    - missing_count
    - missing_percentage
    - values_count
    - values_percentage
    - valid_count
    - valid_percentage
    - invalid_count
    - invalid_percentage
    - min
    - max
    - avg
    - sum
    - min_length
    - max_length
    - avg_length
    - distinct
    - unique_count
    - duplicate_count
    - uniqueness
    - maxs
    - mins
    - frequent_values
    - histogram
columns:
    ID:
        metrics:
            - distinct
            - duplicate_count
        valid_format: uuid
        tests:
            duplicate_count == 0
    CATEGORY:
        missing_values:
            - N/A
            - No category
        tests:
            missing_percentage < 3
    SIZE:
        tests:
            max - min < 20
sql_metrics:
    - sql: |
        SELECT sum(volume) as total_volume_us
        FROM CUSTOMER_TRANSACTIONS
        WHERE country = 'US'
      tests:
        - total_volume_us > 5000

Based on these configuration files, Soda SQL will scan your data each time new data arrived like this:

$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
  | SELECT column_name, data_type, is_nullable
  | FROM information_schema.columns
  | WHERE lower(table_name) = 'customers'
  |   AND table_catalog = 'datasource.database'
  |   AND table_schema = 'datasource.schema'
  - 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
  | SELECT
  |  COUNT(*),
  |  COUNT(CASE WHEN ID IS NULL THEN 1 END),
  |  COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
  |  MIN(LENGTH(ID)),
  |  AVG(LENGTH(ID)),
  |  MAX(LENGTH(ID)),
  | FROM customers
  - 0.557 seconds
row_count : 23543
missing   : 23
invalid   : 0
min_length: 9
avg_length: 9
max_length: 9

...more queries...

47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds

The next step is to add Soda SQL scans in your favorite data pipeline orchestration solution like:

  • Airflow
  • AWS Glue
  • Prefect
  • Dagster
  • Fivetran
  • Matillion
  • Luigi

If you like the goals of this project, encourage us! Star sodadata/soda-sql on Github.

Next, head over to our 'Quick start tutorial' and get your first project going!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

soda-sql-core-2.1.0b2.tar.gz (42.8 kB view details)

Uploaded Source

Built Distribution

soda_sql_core-2.1.0b2-py3-none-any.whl (62.3 kB view details)

Uploaded Python 3

File details

Details for the file soda-sql-core-2.1.0b2.tar.gz.

File metadata

  • Download URL: soda-sql-core-2.1.0b2.tar.gz
  • Upload date:
  • Size: 42.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.9

File hashes

Hashes for soda-sql-core-2.1.0b2.tar.gz
Algorithm Hash digest
SHA256 43ee0d4f9c63ed7dd734ce28066a938c996a095d231eea271af7a9dbafe52304
MD5 e85530f2f3fa2cfa83b3b047e1f31a6c
BLAKE2b-256 f8af6c4f57dc732c0194f0f18eb3a7e3ea4833be5f1f1d2376da1715af9d07ba

See more details on using hashes here.

File details

Details for the file soda_sql_core-2.1.0b2-py3-none-any.whl.

File metadata

  • Download URL: soda_sql_core-2.1.0b2-py3-none-any.whl
  • Upload date:
  • Size: 62.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.9

File hashes

Hashes for soda_sql_core-2.1.0b2-py3-none-any.whl
Algorithm Hash digest
SHA256 fde50e1bc41e27b92c04fe93aa14229261a44103c0d537fba3793687ea4d370a
MD5 44b47281cf39c3ea5a89b13816c52e3f
BLAKE2b-256 c7d7f48fc5e69ceb935c26708f38b06e410c3854e50aadecd3b93aba1c87a1f4

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page