Soda SQL library & CLI
Project description
Soda SQL
Data testing, monitoring, and profiling for SQL-accessible data.
Soda SQL is an open-source command-line tool. It utilizes user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface "bad" data that you can fix to ensure that downstream analysts are using "good" data to make decisions.
Test your data
If your organization uses data to make decisions, you should always be testing your data.
- When data comes into a system, you should test it.
- When data is transformed or otherwise manipulated to fit into an app or other database, you should test it.
- When data is about to be exported, you should test it.
- Test to make sure data is unique.
- Test that data is in an expected format, such as date or UUID.
- Test that data doesn’t exceed limits or acceptable parameters.
Install Soda SQL
Requirements:
- Python 3.7 or greater
- Pip 21.0 or greater
Install:
$ pip install soda-sql
Full installation instructions
Use Soda SQL
Install Soda SQL, then complete three basic tasks to start testing your data:
- Create and configure a
warehouse.yml
file so that Soda SQL can connect to your data warehouse. - Create and configure a
scan.yml
file to define tests for "bad" data. Choose from a list of predefined metrics to define simple tests – is the table empty? – to more complex tests that borrow from SQL query logic. - Run a scan from the command-line to test for "bad" data. Where the tests return “true”, all is well; where a test returns “false”, Soda SQL presents the issues in the command-line output.
Full configuration instructions
Show me the metrics!
See for yourself! Follow the Quick start tutorial to see Soda SQL up and running in five minutes.
This example scan.yml
file defines four tests that Soda SQL runs on data in a table in a data warehouse.
metrics:
- row_count
- missing_count
- missing_percentage
- values_count
- values_percentage
- valid_count
- valid_percentage
- invalid_count
- invalid_percentage
- min
- max
- avg
- sum
- min_length
- max_length
- avg_length
- distinct
- unique_count
- duplicate_count
- uniqueness
- maxs
- mins
- frequent_values
- histogram
columns:
ID:
metrics:
- distinct
- duplicate_count
valid_format: uuid
tests:
duplicate_count == 0
CATEGORY:
missing_values:
- N/A
- No category
tests:
missing_percentage < 3
SIZE:
tests:
max - min < 20
sql_metrics:
- sql: |
SELECT sum(volume) as total_volume_us
FROM CUSTOMER_TRANSACTIONS
WHERE country = 'US'
tests:
- total_volume_us > 5000
Test | Description | Outcome |
---|---|---|
tests: duplicate_count == 0 |
Tests that there are no duplicate values in the ID column of the table. |
The test fails if it finds duplicate values. |
tests: missing_percentage < 3 |
Tests that less than 3% of the values in the CATEGORY column match the values set under missing_values . |
The test fails if more than 3% of the values in the column contain n/a or No category . |
tests: max - min < 20 |
Tests that the difference between the highest value and the lowest value in the SIZE column is less than 20. |
The test fails if the difference exceeds 20. |
tests: total_volume_us > 5000 |
Tests that the sum total of US transactions in the CUSTOMER_TRANSACTIONS column is greater than 5000. |
The test fails if the sum total is less than 5000. |
When Soda SQL scans the table, it returns the following scan output in your command-line interface.
$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
| SELECT column_name, data_type, is_nullable
| FROM information_schema.columns
| WHERE lower(table_name) = 'customers'
| AND table_catalog = 'datasource.database'
| AND table_schema = 'datasource.schema'
- 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
| SELECT
| COUNT(*),
| COUNT(CASE WHEN ID IS NULL THEN 1 END),
| COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
| MIN(LENGTH(ID)),
| AVG(LENGTH(ID)),
| MAX(LENGTH(ID)),
| FROM customers
- 0.557 seconds
row_count : 23543
missing : 23
invalid : 0
min_length: 9
avg_length: 9
max_length: 9
...more queries...
47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds
Go further
- Learn how to automate Soda SQL scans using your data pipeline orchestration tool such as:
- Apache Airflow
- AWS Glue
- Prefect
- Dagster
- Fivetran
- Matillion
- Luigi
- If you like the goals of this project, we welcome your contribution!
- Read more about How Soda SQL works.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for soda_sql-2.0.0b27-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a52ec4caa856533ae89e4641d22bc291bee0da9dbf07e24aab02416d072d3079 |
|
MD5 | 3e859c3a14df690c43752f3eb7659a18 |
|
BLAKE2b-256 | 75d2fc15580b90adb97d07a556ca4ae9b2b6eb8bf95baf6754a70f5fbfe89e96 |