Skip to main content

`Datajudge` allows to assess whether data from database complies with reference

Project description

datajudge

CI

Express and test specifications against data from database.

Why not great expectations?

The major selling point is to be able to conveniently express expectations between different DataSources. great expectations, in contrast, focuses on expectations against single DataSources.

Over time, we have found some people find the following aspects of datajudge useful

  • Lots of the 'query writing' is off-loaded to the library by having Constraints tailored to our needs
  • Easier/faster onboarding than with great expectations
  • Assertion messages with counterexamples and other context information, speeding up the data debugging process

Installation instructions

datajudge can either be installed via pypi with pip install datajudge or via conda-forge with conda install datajudge -c conda-forge.

You will likely want to use datajudge in conjuction with other packages - in particular pytest and database drivers relevant to your database. You might want to install snowflake-sqlalchemy when using snowflake, pyscopg when using postgres and platform-specific drivers (Windows, Linux, macOS) when using mssql.

For development on datajudge - in contrast to merely using it - you can get started as follows:

git clone https://github.com/Quantco/datajudge
cd datajudge
mamba env create
conda activate datajudge
pip install --no-build-isolation --disable-pip-version-check -e .

Example

To get started, we will create a sample database using sqlite that contains a list of companies.

import sqlalchemy as sa

eng = sa.create_engine('sqlite:///example.db')

with eng.connect() as con:
    con.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, num_employees INTEGER)")
    con.execute("INSERT INTO companies (name, num_employees) VALUES ('QuantCo', 100), ('Google', 150000), ('BMW', 120000), ('Apple', 145000)")
    con.execute("CREATE TABLE companies_archive (id INTEGER PRIMARY KEY, name TEXT, num_employees INTEGER)")
    con.execute("INSERT INTO companies_archive (name, num_employees) VALUES ('QuantCo', 90), ('Google', 140000), ('BMW', 110000)")

The table "companies_archive" contains three entries:

id name num_employees
1 QuantCo 90
2 Google 140,000
3 BMW 110,000

While "companies" contains an additional entry:

id name num_employees
1 QuantCo 100
2 Google 150,000
3 BMW 120,000
4 Apple 145,000

As an example, we will run 4 tests on this table:

  • Does the table "companies" contain a column named "name"?
  • Does the table "companies" contain at least 1 entry with the name "QuantCo"?
  • Does the column "num_employees" of the "companies" table have all positive values?
  • Does the column "name" of the table "companies" contain at least all the values of the corresponding column in "companies_archive"?
import pytest
import sqlalchemy as sa

from datajudge import (
    Condition,
    WithinRequirement,
    BetweenRequirement,
)
from datajudge.pytest_integration import collect_data_tests


# We create a Requirement (within table). This object will contain
# all the constraints we want to test on the specified table.
# To test another table or test the same table against another table,
# we would create another Requirement object.
companies_req = WithinRequirement.from_table(
    db_name="example", schema_name=None, table_name="companies"
)

# Adding a constraint: column "name" should exist in our table
companies_req.add_column_existence_constraint(columns=["name"])

# Adding a constraint with a condiction
condition = Condition(raw_string="name = 'QuantCo'")
companies_req.add_n_rows_min_constraint(n_rows_min=1, condition=condition)

companies_req.add_numeric_min_constraint(column="num_employees", min_value=1)

# Creating a requirement (set of constraints) to test constraint between
# the table "companies" and the table "companies_archive".
companies_between_req = BetweenRequirement.from_tables(
    db_name1="example",
    schema_name1=None,
    table_name1="companies",
    db_name2="example",
    schema_name2=None,
    table_name2="companies_archive",
)

companies_between_req.add_row_superset_constraint(
    columns1=['name'], columns2=['name'], constant_max_missing_fraction=0
)


@pytest.fixture(scope="module")
def datajudge_engine():
    return sa.create_engine("sqlite:///example.db")


requirements = [companies_req, companies_between_req]
# "collect_data_tests" expects a pytest fixture with the name "datajudge_engine" that is a SQLAlchemy engine.
test_constraint = collect_data_tests(requirements)

Saving this file as specification.py and calling pytest specification.py -v will verify that all constaints are satisfied. The output you see in the terminal should be similar to this:

=================================== test session starts ===================================
platform darwin -- Python 3.9.7, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/marc_qco/miniconda3/envs/datajudge/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.9.7', 'Platform': 'macOS-11.6-x86_64-i386-64bit', 'Packages': {'pytest': '6.2.5', 'py': '1.11.0', 'pluggy': '1.0.0'}, 'Plugins': {'html': '3.1.1', 'metadata': '1.11.0', 'cov': '3.0.0'}}
rootdir: /Users/marc_qco/repos/datajudge
plugins: html-3.1.1, metadata-1.11.0, cov-3.0.0
collected 4 items

specification.py::test_constraint[ColumnExistence::companies] PASSED                [ 25%]
specification.py::test_constraint[NRowsMin::companies] PASSED                       [ 50%]
specification.py::test_constraint[NumericMin::companies] PASSED                     [ 75%]
specification.py::test_constraint[RowSuperset::companies|companies_archive] PASSED  [100%]

==================================== 4 passed in 0.31s ====================================

You can also use a formatted html report using the --html=report.html flag.

Usage instructions

Creating a specification

In order to get going, you might want to copy skeleton.py to the location of your choice. The skeleton represents a specification. A specification contains Requirements, each expressing all expectations against either a single data source with WithinRequirement or all expectations against an ordered pair of data sources with BetweenRequirement. Requirements are composed of Constraints. See the section on Constraints below.

Testing a specification

Run pytest test_$placeholder.py --html=report.html.

You can inspect the test results either directly in your terminal or via the generated html report report.html.

As this relies on pytest, all of pytest's feature can be leveraged

Specifying constraints

In order to discover possible Constraints, please investigate the _add_*_constraint methods in requirements.py. The latter are meant to be mostly self-documenting through usage of expressive parameters.

Note that most Constraints will allow for at least one Condition. A Condition can be thought of as a conditional elevent in probability theory or filter/clause in a datbase query. Please consult the doc string of Condition for greater detail. For examples, please see tests/unit/test_condition.py.

Many Constraints have optional columns parameters. If no argument is given, all columns available from a table will be used.

Constraints on BetweenRequirements featuring a gain or loss limitation offer both a constant maximal reference value as well a comparison of date ranges. The latter aproach allows for some further manual adjustment. In case one intends to use a date range baseline, the date column must be passed during the instantiation of the Requirement and date_range_* must be passed in the respective add_*_constraint method. In said case, the constant_max_* argument can safely be ignored. E.g.:

#rows_1 < max_relative_gain * #rows_0

or

#rows_1 < ((max_date_1 - min_date_1)/(max_date_0 - min_date_0) - 1 + date_range_gain_deviation) #rows_0

Alternative DataSources

The 'Example' section above creates a Requirement. Such a Requirement object is instantiated with fixed DataSources. A DataSource expresses how the data to be tested is obtained from database. In the example above, each source is a plain and simple database table. Yet, at times, one might want to check expectations and impose constraints on objects which are derived from primitive tables of a database.

For that purpose, the WithinRequirement class supports two further constructors: from_expression and from_raw_query. Analogously, BetweenRequirement supports from_expressions and from_raw_queries.

These constructors allow to create a Requirement object based on a sqlalchemy expression or a select query in raw string format respectively. For greater detail, please refer to the respective Requirement class.

Note that in principle, several tables can be combined to make up for a single data source. Yet, most of the time when trying to compare two tables, it is more convenient to create a BetweenRequirement and use the from_tables constructor.

Column capitalization

Different database management systems handle capitalization of entities such as column names differently. For the time being, we define the following behavior:

  • Mssql: The user should provide the column name capitalization as is seen in database, either lower or uppercase.
  • Postgres: The user should always provide lower case column names.
  • Snowflake: Whatever capitalization the user provides, it'll be translated to lower case capitalization.

The snowflake behavior is due to an upstream bug in snowflake-sqlalchemy.

This behavior is subject to change.

Logging

The TestResult class comes with a logging_message property. The latter comprises information about the test failure, the constraint at hand as well as the underlying database queries.

Database management system support

While designed as dbms-agnostically as possible, datajudge likely doesn't work in its entirety for every dbms. We run integration tests against various versions of postgres, mssql and snowflake.

Nomenclature

Constraint < Requirement < Specification.

A Constraint captures a comparison between either two DataSources or a single DataSource and a reference value.

A Requirement captures all Constraints between two given DataSources or all Constraints within a DataSource.

A specification captures all requirements against a database.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datajudge-1.0.0.tar.gz (48.7 kB view hashes)

Uploaded Source

Built Distribution

datajudge-1.0.0-py3-none-any.whl (37.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page