Skip to main content

A small package for data quality rules using SQL

Project description

SQL Guard - Guarding data rules through SQL

A simple package for validating data using SQL. Only support for GoogleSQL (BigQuery) and DuckDB (PostgreSQL compatible).

Problem to be Solved

I was using Pandera for validating data in many tables located in BigQuery.

The problem was that in order to validate data using Pandera, I first had to bring those tables into in-memory either Pandas, Polars or Dask DataFrames.

But even tools like Polars or Dask face problems handling large volumes of data, for example, above 100GB (my case).

So instead of bringing that data to local memory, I thought it was just better to validate data inside BigQuery using SQL Manipulations, but in an easy way so that I can create data quality rules using Python.

Conversion: SQL Guard x Pandera

So imagine I have a table like below.

The ideia is to have each row as the grade of a student.

name: Name of the student  
age: Age of the student  
major: Major of the student (for example, Computer Science, Electrical Engineering etc.)  
semester: Semester of university, for example 1S/2024 indicating the first semester of 2024  
course: Course, for example, Algorithms, Data Structures, Calculus I, Calculus II etc.  
grade: Grade for the course. for example: 0, 5, 10  
failed: Boolean to indicate if the student failed the course based on grade (>=5)  
Name Age Major Semester Course Grade Failed
Daniel Carter 19 Computer Science 1S/2024 Algorithms 8.5 False
Theo Hill 19 Computer Science 1S/2024 Algorithms 9.0 False
Jessica Hall 19 Computer Science 1S/2024 Algorithms 8.0 False
Liam Carter 19 Computer Science 1S/2024 Algorithms 8.5 False
Zackary Hill 19 Computer Science 1S/2024 Algorithms 7.0 False

I can either define data quality rules from scratch or use a pandera DataFrameSchema object:

From Scratch:

from sqlguard.validator.CheckBase import ValidationCheck


data_rules = {

    'name': [ValidationCheck(check_name='is_string',
                            params=None,
                            error_msg=None,
                            ignore_nulls=False),
            ValidationCheck(check_name='regex_contains',
                            params={'value': '^[A-Z].*'},
                            error_msg=None,
                            ignore_nulls=False)],
    'age': [ValidationCheck(check_name='is_integer',
                            params=None,
                            error_msg=None,
                            ignore_nulls=False),
            ValidationCheck(check_name='between',
                         params={'min': 15, 'max': 150},
                         error_msg=None,
                         ignore_nulls=False)]
    }

From Pandera:

import pandera as pa

pandera_schema = pa.DataFrameSchema({

    "name": pa.Column(str, checks=pa.Check.str_matches(r"^[A-Z].*")), # Starting with capital letter
    "age": pa.Column(int, checks=pa.Check.in_range(min_value=15, max_value=150)) # Age must be between 15 and 150
})

I can convert DataFrameSchema to a compatible dictionary of data rules

from sqlguard.translators import SchemaParsers

panderaParser = SchemaParsers.SchemaParser.get_parser("pandera")
data_rules = panderaParser.parse(pandera_schema)

Validating: SQL Guard x Pandera

As long as we have our data_rules dictionary, we can create a SQLValidator object that spits out a SQL query with your rules applied.

from sqlguard.validator.SQLValidator import SQLValidator

sql_schema = SQLValidator(data_rules)
validation_query = sql_schema.generate_sql_report(from_source=TABLE_PATH)

print(validation_query)

Given we have our query, we can just run it using BigQuery client for python.

from google.cloud import bigquery

query_job = client.query(validation_query)  # API request
query_result = query_job.result()  # Waits for query to finish

df = query_result.to_dataframe()

print("--------RUN_SQL_GUARD--------")
print(df.to_string())
print()

If result is too large, you can pass n_wrong_counts=True to group wrong values.

validation_query = sql_schema.generate_sql_report(from_source=TABLE_PATH, n_wrong_count=True)

Comparison of Pandera Lazy Validation and SQL Guard Generated Report

SQL GUARD

column_name check_name params error_msg ignore_nulls wrong_value
course is_in {'value': ['Algorithms', 'Data Structures', 'Calculus I']} False Calculus II
course is_in {'value': ['Algorithms', 'Data Structures', 'Calculus I']} False Circuit Analysis
major is_in {'value': ['Computer Science']} False Electrical Engineering
age between {'min': 15, 'max': 21} False 22

PANDERA

{
  "DATA": {
    "DATAFRAME_CHECK": [
      {
        "schema": null,
        "column": "age",
        "check": "in_range(15, 21)",
        "error": "Column 'age' failed element-wise validator number 0: in_range(15, 21) failure cases: 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22"
      },
      {
        "schema": null,
        "column": "major",
        "check": "isin(['Computer Science'])",
        "error": "Column 'major' failed element-wise validator number 0: isin(['Computer Science']) failure cases: Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering"
      },
      {
        "schema": null,
        "column": "course",
        "check": "isin(['Algorithms', 'Data Structures', 'Calculus I'])",
        "error": "Column 'course' failed element-wise validator number 0: isin(['Algorithms', 'Data Structures', 'Calculus I']) failure cases: Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Calculus II, Calculus II, Calculus II, Calculus II, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis"
      }
    ]
  }
}

Details

For detailed usage of package, visit docs/ folder and take a look at two notebooks in order:

  • demo.ipynb
  • demo_duckdb.ipynb

Install

Just create your virtual environment and run:
pip install sql-guard

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sql_guard-0.0.3.tar.gz (9.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sql_guard-0.0.3-py3-none-any.whl (11.5 kB view details)

Uploaded Python 3

File details

Details for the file sql_guard-0.0.3.tar.gz.

File metadata

  • Download URL: sql_guard-0.0.3.tar.gz
  • Upload date:
  • Size: 9.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for sql_guard-0.0.3.tar.gz
Algorithm Hash digest
SHA256 daca022b96212c8af13f3e0fc30ac69ed83cad3d0076c1706dac13c7d1746341
MD5 cb9ec6175c363ad93fcc8a76b11ff7fd
BLAKE2b-256 07d80af027913cec47e59f59e56d19202358266997549baf69488ac03a41ec59

See more details on using hashes here.

File details

Details for the file sql_guard-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: sql_guard-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 11.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for sql_guard-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 e139bc5ba10daa418759c58f4562f18f0b2fa028230a4179053244c3e2fc8c41
MD5 99438fa6611176b65d456172a9f5abfd
BLAKE2b-256 5e08e2fd7b1391a16f3bee64f320546df14a9f7d7c703a90f06d20abbe1836d8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page