Soda SQL library & CLI

These details have not been verified by PyPI

Project description

Soda SQL

Data testing, monitoring and profiling for SQL accessible data.

What does Soda SQL do?

Soda SQL allows you to:

Stop your pipeline when bad data is detected
Extract metrics and column profiles through efficient SQL
Take full control of metrics and queries through declarative config files

Why Soda SQL?

To protect the consumers of your data against silent data issues, it's best practice to profile and test your data:

as it lands in your warehouse,
after every important data processing step,
right before consumption.

This way you will prevent delivery of bad data to downstream consumers. You will spend less time firefighting and gain a better reputation.

How does Soda SQL work?

Soda SQL is a Command Line Interface (CLI) and a Python library to measure and test your data using SQL.

As input, Soda SQL uses YAML configuration files that include:

SQL connection details
What metrics to compute
What tests to run on the measurements

Based on those configuration files, Soda SQL will perform scans. A scan performs all measurements and runs all tests associated with one table. Typically a scan is executed after new data has arrived. All soda-sql configuration files can be checked into your version control system as part of your pipeline code.
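
As a rough illustration of the scan idea described above (and not Soda SQL's actual implementation), the sketch below runs one aggregate query against an in-memory SQLite table and then checks tests against the resulting measurements. The `run_scan` helper, the sample table and the test predicates are all hypothetical.

```python
import sqlite3


def run_scan(conn, table, column, tests):
    """Compute a few measurements for one column, then run tests on them."""
    # Identifiers are interpolated directly, which is acceptable for a
    # trusted config sketch but not for untrusted input.
    row = conn.execute(
        f"SELECT COUNT(*), COUNT(CASE WHEN {column} IS NULL THEN 1 END) "
        f"FROM {table}"
    ).fetchone()
    measurements = {"row_count": row[0], "missing_count": row[1]}
    # Each test is a (description, predicate) pair evaluated on the measurements.
    failures = [name for name, check in tests if not check(measurements)]
    return measurements, failures


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("a1", "US"), ("a2", None), ("a3", "BE")],
)

tests = [
    ("row_count > 0", lambda m: m["row_count"] > 0),
    ("missing_count == 0", lambda m: m["missing_count"] == 0),
]
measurements, failures = run_scan(conn, "customers", "country", tests)
print(measurements)  # {'row_count': 3, 'missing_count': 1}
print(failures)      # ['missing_count == 0']
```

A non-empty `failures` list is the signal a pipeline would use to stop before bad data moves downstream.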

Want to try Soda SQL? Head over to our 'Quick start tutorial' and get started straight away!

"Show me the metrics"

Let's walk through an example. Simple metrics and tests can be configured in scan YAML configuration files. An example of the contents of such a file:

metrics:
    - row_count
    - missing_count
    - missing_percentage
    - values_count
    - values_percentage
    - valid_count
    - valid_percentage
    - invalid_count
    - invalid_percentage
    - min
    - max
    - avg
    - sum
    - min_length
    - max_length
    - avg_length
    - distinct
    - unique_count
    - duplicate_count
    - uniqueness
    - maxs
    - mins
    - frequent_values
    - histogram
columns:
    ID:
        metrics:
            - distinct
            - duplicate_count
        valid_format: uuid
        tests:
            - duplicate_count == 0
    CATEGORY:
        missing_values:
            - N/A
            - No category
        tests:
            - missing_percentage < 3
    SIZE:
        tests:
            - max - min < 20
sql_metrics:
    - sql: |
        SELECT sum(volume) as total_volume_us
        FROM CUSTOMER_TRANSACTIONS
        WHERE country = 'US'
      tests:
        - total_volume_us > 5000
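
The tests in the configuration above are expressions over metric names. One way to picture how such an expression could be checked is to evaluate it with the measurements bound as variables. The sketch below does that with Python's `eval`; the `evaluate_test` helper and the measurement values are made up for illustration, and this is not a description of Soda SQL's own evaluator.

```python
def evaluate_test(expression, measurements):
    """Evaluate a test expression such as 'max - min < 20'.

    Metric names become variables. Builtins are disabled so that names like
    'max' and 'min' resolve to the measurements, not to Python functions.
    Note that this is not a security sandbox for untrusted expressions.
    """
    return bool(eval(expression, {"__builtins__": {}}, dict(measurements)))


# Made-up measurements for the SIZE and CATEGORY columns of the example.
size_measurements = {"min": 3, "max": 18}
category_measurements = {"missing_percentage": 4.2}

size_ok = evaluate_test("max - min < 20", size_measurements)
category_ok = evaluate_test("missing_percentage < 3", category_measurements)
print(size_ok, category_ok)  # True False
```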

Based on these configuration files, Soda SQL will scan your data each time new data arrives, like this:

$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
  | SELECT column_name, data_type, is_nullable
  | FROM information_schema.columns
  | WHERE lower(table_name) = 'customers'
  |   AND table_catalog = 'datasource.database'
  |   AND table_schema = 'datasource.schema'
  - 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
  | SELECT
  |  COUNT(*),
  |  COUNT(CASE WHEN ID IS NULL THEN 1 END),
  |  COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
  |  MIN(LENGTH(ID)),
  |  AVG(LENGTH(ID)),
  |  MAX(LENGTH(ID))
  | FROM customers
  - 0.557 seconds
row_count : 23543
missing   : 23
invalid   : 0
min_length: 9
avg_length: 9
max_length: 9

...more queries...

47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds
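
The log above shows several metrics for one column being computed in a single aggregate query. The sketch below reproduces that pattern against an in-memory SQLite table. The `build_column_query` helper and the sample rows are hypothetical, and the format validity check is left out because regular expression syntax differs between databases.

```python
import sqlite3


def build_column_query(table, column):
    """Build one SELECT that computes several column metrics in a single pass."""
    expressions = [
        "COUNT(*)",
        f"COUNT(CASE WHEN {column} IS NULL THEN 1 END)",
        f"MIN(LENGTH({column}))",
        f"AVG(LENGTH({column}))",
        f"MAX(LENGTH({column}))",
    ]
    # Joining with ',' keeps the select list free of a trailing comma.
    return "SELECT\n  " + ",\n  ".join(expressions) + f"\nFROM {table}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)", [("abc",), ("abcde",), (None,)]
)

names = ["row_count", "missing", "min_length", "avg_length", "max_length"]
values = conn.execute(build_column_query("customers", "id")).fetchone()
metrics = dict(zip(names, values))
print(metrics)
```

Computing all aggregates in one statement means the table is read once per column group instead of once per metric.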

The next step is to add Soda SQL scans in your favorite data pipeline orchestration solution like:

Airflow
AWS Glue
Prefect
Dagster
Fivetran
Matillion
Luigi
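
Whichever orchestrator you use, the integration usually comes down to running the scan as a step and halting when it reports a failure. The sketch below assumes the scan command signals failed tests through a non-zero exit code. The `run_scan_step` helper is hypothetical, and the stand-in commands only exist so the example runs without Soda SQL installed.

```python
import subprocess
import sys


def run_scan_step(command):
    """Run a scan command and raise if it exits with a non-zero status."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        raise RuntimeError(
            f"Scan step failed with exit code {completed.returncode}; "
            "stopping the pipeline."
        )
    return completed.stdout


# Stand-in commands. In a real pipeline the command would be the
# `soda scan ...` invocation for your warehouse and table.
passing = [sys.executable, "-c", "print('All is good')"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(run_scan_step(passing).strip())
try:
    run_scan_step(failing)
except RuntimeError as error:
    print(error)
```

Raising an exception lets the orchestrator mark the task as failed, so downstream tasks that depend on it do not run.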

If you like the goals of this project, encourage us! Star sodadata/soda-sql on GitHub.

Next, head over to our 'Quick start tutorial' and get your first project going!

Project details

Release history

Version	Release date	Notes
2.2.2	Aug 11, 2022	this version
2.2.1	May 26, 2022
2.1.9	Mar 31, 2022
2.1.8	Mar 25, 2022
2.1.6	Mar 7, 2022
2.1.5	Mar 1, 2022
2.1.4	Feb 24, 2022
2.1.3	Jan 14, 2022
2.1.2	Dec 28, 2021
2.1.1	Dec 14, 2021
2.1.0	Dec 2, 2021
2.1.0b22	Nov 23, 2021	pre-release
2.1.0b20	Nov 9, 2021	pre-release
2.1.0b18	Oct 5, 2021	pre-release
2.1.0b17	Sep 21, 2021	pre-release
2.1.0b16	Sep 7, 2021	pre-release
2.1.0b15	Sep 1, 2021	pre-release
2.1.0b14	Aug 18, 2021	pre-release
2.1.0b13	Aug 3, 2021	pre-release
2.1.0b12	Jul 23, 2021	pre-release
2.1.0b11	Jul 6, 2021	pre-release
2.1.0b10	Jun 15, 2021	pre-release
2.1.0b9	Jun 8, 2021	pre-release
2.1.0b8	Jun 2, 2021	pre-release
2.1.0b7	May 25, 2021	pre-release
2.1.0b6	May 18, 2021	pre-release
2.1.0b5	May 11, 2021	pre-release
2.1.0b4	May 4, 2021	pre-release
2.1.0b3	Apr 28, 2021	pre-release
2.1.0b2	Apr 20, 2021	pre-release
2.1.0b1	Apr 8, 2021	pre-release

Download files

Source Distribution

soda-sql-core-2.2.2.tar.gz (60.5 kB)

Uploaded Aug 11, 2022 Source

Built Distribution

soda_sql_core-2.2.2-py3-none-any.whl (88.8 kB)

Uploaded Aug 11, 2022 Python 3

File details

Details for the file soda-sql-core-2.2.2.tar.gz.

File metadata

Download URL: soda-sql-core-2.2.2.tar.gz
Upload date: Aug 11, 2022
Size: 60.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for soda-sql-core-2.2.2.tar.gz
Algorithm	Hash digest
SHA256	`5bf49ad7545c7d8173edb44e21af644c295e20e92613e423abda6cfe91337753`
MD5	`978d3398f6f03c9d505cda51c94db1c6`
BLAKE2b-256	`bb5b16a61b4e206e03f78ecea9e8bdff5f96601d119cb891777156f997b51587`
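
A published digest such as the SHA256 value above can be compared with a downloaded file using Python's standard `hashlib` module. This is a minimal sketch: the helper names are hypothetical, and the demo hashes a temporary file so it runs without the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo on a temporary file. For a real check, pass the path of the
# downloaded archive and the SHA256 value from the table above.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

expected = hashlib.sha256(b"example payload").hexdigest()
ok = matches_published_hash(demo_path, expected)
bad = matches_published_hash(demo_path, "0" * 64)
os.remove(demo_path)
print(ok, bad)  # True False
```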

File details

Details for the file soda_sql_core-2.2.2-py3-none-any.whl.

File metadata

Download URL: soda_sql_core-2.2.2-py3-none-any.whl
Upload date: Aug 11, 2022
Size: 88.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for soda_sql_core-2.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9a7a11866e4bc9ec07c3ecfe447cc67d070ee8279ac78958a0487c6108a45f10`
MD5	`788699746d058bb980cf38900a5d5362`
BLAKE2b-256	`1cff9b96e93e3f7db250c464c833329b39ffbf06633f5ef242d8e0aa431cd06b`

soda-sql-core 2.2.2
