Validate data contracts

Data Contract CLI

The datacontract CLI is an open source command-line tool for working with Data Contracts. It uses data contract YAML files to lint the data contract, connect to data sources and execute schema and quality tests, detect breaking changes, and export to different formats. The tool is written in Python. It can be used as a standalone CLI tool, in a CI/CD pipeline, or directly as a Python library.

NOTE: This project has been migrated from Go to Python. The migration makes it possible to use datacontract as a library within Python code, but it introduces some breaking changes. The Go version has been forked, in case you still rely on it.

Getting started

Let's use pip to install the CLI.

$ pip3 install datacontract-cli

Now, let's look at this data contract: https://datacontract.com/examples/covid-cases/datacontract.yaml

It has a servers section with endpoint details for the (public) S3 bucket, models that describe the structure of the data, and quality attributes that describe the expected freshness and number of rows.

This data contract contains all information to connect to S3 and check that the actual data meets the defined schema and quality requirements.

We run the tests:

$ datacontract test https://datacontract.com/examples/covid-cases/datacontract.yaml
# returns: 🟢 data contract is valid. Run 12 checks.

Voilà, the CLI tested that the datacontract.yaml itself is valid, all records comply with the schema, and all quality attributes are met.

Usage

# create a new data contract from example and write it to datacontract.yaml
$ datacontract init datacontract.yaml

# lint the datacontract.yaml
$ datacontract lint datacontract.yaml

# execute schema and quality checks
$ datacontract test datacontract.yaml

# find differences between two data contracts (Coming Soon)
$ datacontract diff datacontract-v1.yaml datacontract-v2.yaml

# fail pipeline on breaking changes  (Coming Soon)
$ datacontract breaking datacontract-v1.yaml datacontract-v2.yaml

# export model as jsonschema
$ datacontract export --format jsonschema datacontract.yaml

# export model as dbt  (Coming Soon)
$ datacontract export --format dbt datacontract.yaml

# import protobuf as model (Coming Soon)
$ datacontract import --format protobuf --source my_protobuf_file.proto datacontract.yaml

Programmatic (Python)

from datacontract.data_contract import DataContract

data_contract = DataContract(data_contract_file="datacontract.yaml")
run = data_contract.test()
if not run.has_passed():
    print("Data quality validation failed.")
    # Abort pipeline, alert, or take corrective actions...
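
To use the check as a pipeline gate, the snippet above can be wrapped in a function that raises on failure. This is a minimal sketch that relies only on the `DataContract(data_contract_file=...)`, `test()`, and `has_passed()` calls shown above. The function name `assert_contract_passes`, the `DataContractViolation` exception, and the `contract_factory` parameter are hypothetical; the factory exists only so the gate can be exercised without a live data source.

```python
from typing import Any, Callable, Optional


class DataContractViolation(RuntimeError):
    """Hypothetical exception raised when the contract tests do not pass."""


def assert_contract_passes(
    data_contract_file: str,
    contract_factory: Optional[Callable[..., Any]] = None,
) -> Any:
    """Run the data contract tests and raise if they fail.

    contract_factory defaults to datacontract's DataContract class. Passing a
    stand-in is an illustrative convenience, not part of the library.
    """
    if contract_factory is None:
        # Imported lazily so this module loads even where the CLI is not installed.
        from datacontract.data_contract import DataContract

        contract_factory = DataContract

    run = contract_factory(data_contract_file=data_contract_file).test()
    if not run.has_passed():
        raise DataContractViolation(
            f"Data contract tests failed for {data_contract_file}"
        )
    return run
```

A scheduler task could call `assert_contract_passes("datacontract.yaml")` as its first step, so that downstream steps are skipped when the exception propagates.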

Scenario: Integration with Data Mesh Manager

If you use Data Mesh Manager, you can use the data contract URL and append the --publish option to send and display the test results. Set an environment variable for your API key.

# Fetch current data contract, execute tests on production, and publish result to data mesh manager
$ export DATAMESH_MANAGER_API_KEY=xxx
$ datacontract test https://demo.datamesh-manager.com/demo279750347121/datacontracts/4df9d6ee-e55d-4088-9598-b635b2fdcbbc/datacontract.yaml --server production --publish
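
The same call can be issued from Python, for example from a scheduled job. The sketch below builds the command line from the `test` subcommand and the `--server` and `--publish` options shown above, and passes `DATAMESH_MANAGER_API_KEY` only to the child process. The helper names `build_test_command`, `build_env`, and `run_and_publish` are hypothetical. This README does not state how the CLI's exit code reflects test results, so the sketch returns the completed process and leaves its interpretation to the caller.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def build_test_command(
    contract: str, server: Optional[str] = None, publish: bool = False
) -> List[str]:
    """Assemble the argument list for `datacontract test`."""
    command = ["datacontract", "test", contract]
    if server:
        command += ["--server", server]
    if publish:
        command.append("--publish")
    return command


def build_env(api_key: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the base environment (os.environ by default) and add the API key."""
    env = dict(os.environ if base is None else base)
    env["DATAMESH_MANAGER_API_KEY"] = api_key
    return env


def run_and_publish(
    contract_url: str, api_key: str, server: str = "production"
) -> subprocess.CompletedProcess:
    """Run the tests against the given server and publish the result."""
    return subprocess.run(
        build_test_command(contract_url, server=server, publish=True),
        env=build_env(api_key),
        check=False,  # the caller decides how to treat the return code
    )
```

Passing the key through `env` keeps it out of the parent process environment and out of the command line.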

Installation

Choose the most appropriate installation method for your needs:

pip

Python 3.11 is recommended. Python 3.12 support is available as a pre-release release candidate for 0.9.3.

pip3 install datacontract-cli

pipx

pipx installs into an isolated environment.

pipx install datacontract-cli

Homebrew (coming soon)

brew install datacontract/brew/datacontract

Docker (coming soon)

docker pull datacontract/cli
docker run --rm -v ${PWD}:/datacontract datacontract/cli

Documentation

Tests

Data Contract CLI can connect to data sources and run schema and quality tests to verify that the data contract is valid.

$ datacontract test --server production datacontract.yaml

To connect to a data source, the CLI uses the server block in the datacontract.yaml to set up the connection. Credentials, such as usernames and passwords, may be supplied through environment variables.

The application uses different engines, based on the server type.

| Type | Format | Description | Status | Engines |
|------|--------|-------------|--------|---------|
| s3 | parquet | Works for any S3-compliant endpoint, e.g., AWS S3, GCS, MinIO, Ceph, ... | | soda-core-duckdb |
| s3 | json | Support for new_line delimited JSON files and one JSON record per file. | | fastjsonschema, soda-core-duckdb |
| s3 | csv | | | soda-core-duckdb |
| s3 | delta | | Coming soon | TBD |
| postgres | n/a | | Coming soon | TBD |
| snowflake | n/a | | | soda-core-snowflake |
| bigquery | n/a | | | soda-core-bigquery |
| redshift | n/a | | Coming soon | TBD |
| databricks | n/a | | Coming soon | TBD |
| kafka | json | | Coming soon | TBD |
| kafka | avro | | Coming soon | TBD |
| kafka | protobuf | | Coming soon | TBD |
| local | parquet | | | soda-core-duckdb |
| local | json | Support for new_line delimited JSON files and one JSON record per file. | | fastjsonschema, soda-core-duckdb |
| local | csv | | | soda-core-duckdb |

Feel free to create an issue if you need support for an additional type.
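
The engine table above can be read as a lookup from server type and format to the engines involved. The sketch below transcribes only the rows that list an engine. The `ENGINES` mapping and the `engines_for` function are hypothetical names for illustration and are not the CLI's internal dispatch code.

```python
from typing import Dict, Optional, Tuple

# (server type, format) -> engines, transcribed from the table above.
# Rows marked "Coming soon" are omitted. None stands for the "n/a" format.
ENGINES: Dict[Tuple[str, Optional[str]], Tuple[str, ...]] = {
    ("s3", "parquet"): ("soda-core-duckdb",),
    ("s3", "json"): ("fastjsonschema", "soda-core-duckdb"),
    ("s3", "csv"): ("soda-core-duckdb",),
    ("snowflake", None): ("soda-core-snowflake",),
    ("bigquery", None): ("soda-core-bigquery",),
    ("local", "parquet"): ("soda-core-duckdb",),
    ("local", "json"): ("fastjsonschema", "soda-core-duckdb"),
    ("local", "csv"): ("soda-core-duckdb",),
}


def engines_for(server_type: str, data_format: Optional[str] = None) -> Tuple[str, ...]:
    """Return the engines listed for a server type and format."""
    try:
        return ENGINES[(server_type, data_format)]
    except KeyError:
        raise ValueError(
            f"No engine listed for type={server_type!r}, format={data_format!r}"
        ) from None
```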

Server Type S3

Example:

datacontract.yaml

servers:
  production:
    type: s3
    endpointUrl: https://minio.example.com # not needed with AWS S3
    location: s3://bucket-name/path/*/*.json
    delimiter: new_line # new_line, array, or none
    format: json
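
The `delimiter` setting describes how JSON records are laid out in each file. The sketch below shows one plausible reading of the three values: `new_line` as one JSON document per line, `array` as a top-level JSON array of records, and `none` as a single record per file. Both this interpretation and the `read_records` function are assumptions for illustration, not the CLI's actual parser.

```python
import json
from typing import Any, List


def read_records(text: str, delimiter: str = "none") -> List[Any]:
    """Split the content of one JSON file into records (assumed semantics)."""
    if delimiter == "new_line":
        # One JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if delimiter == "array":
        # The file is a single JSON array whose items are the records.
        records = json.loads(text)
        if not isinstance(records, list):
            raise ValueError("expected a top-level JSON array")
        return records
    if delimiter == "none":
        # The whole file is one record.
        return [json.loads(text)]
    raise ValueError(f"unknown delimiter: {delimiter!r}")
```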

Environment variables

export DATACONTRACT_S3_REGION=eu-central-1
export DATACONTRACT_S3_ACCESS_KEY_ID=AKIAXV5Q5QABCDEFGH
export DATACONTRACT_S3_SECRET_ACCESS_KEY=93S7LRrJcqLkdb2/XXXXXXXXXXXXX
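
When the tests are started from Python, a pre-flight check can report missing credentials before any connection is attempted. This is a minimal sketch using the three variable names listed above. The `missing_env_vars` helper is hypothetical, and whether all three variables are required in every setup (for example, for a public bucket) is an assumption.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the S3 example above.
S3_ENV_VARS = (
    "DATACONTRACT_S3_REGION",
    "DATACONTRACT_S3_ACCESS_KEY_ID",
    "DATACONTRACT_S3_SECRET_ACCESS_KEY",
)


def missing_env_vars(
    names: Iterable[str] = S3_ENV_VARS,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

A caller might abort early with `if missing_env_vars(): ...` and print the returned names.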

Server Type BigQuery

Authentication to BigQuery is supported through a Service Account Key. The Service Account should include the following roles:

  • BigQuery Job User
  • BigQuery Data Viewer

Example:

datacontract.yaml

servers:
  production:
    type: bigquery
    project: datameshexample-product
    dataset: datacontract_cli_test_dataset
models:
  datacontract_cli_test_table: # corresponds to a BigQuery table
    type: table
    fields: ...

Required environment variable:

export DATACONTRACT_BIGQUERY_ACCOUNT_INFO_JSON_PATH=~/service-access-key.json # as saved on key creation by BigQuery
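
Before running the tests, it can help to confirm that the variable points at a readable JSON key file. The sketch below does only that. The `load_service_account_key` helper is hypothetical, and the check does not verify that the key is accepted by BigQuery or that the roles listed above are granted.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, Mapping, Optional

ENV_VAR = "DATACONTRACT_BIGQUERY_ACCOUNT_INFO_JSON_PATH"


def load_service_account_key(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Read the key file named by the environment variable and parse it."""
    env = os.environ if environ is None else environ
    raw_path = env.get(ENV_VAR)
    if not raw_path:
        raise RuntimeError(f"{ENV_VAR} is not set")

    # Expand a leading "~" in case the value was set without shell expansion.
    key_path = Path(raw_path).expanduser()
    if not key_path.is_file():
        raise FileNotFoundError(f"No key file at {key_path}")

    with key_path.open(encoding="utf-8") as handle:
        key = json.load(handle)
    if not isinstance(key, dict):
        raise ValueError("expected the key file to contain a JSON object")
    return key
```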

Development Setup

The Python base interpreter should be 3.11.x (unless you are working on the 3.12 release candidate).

# create venv
python3 -m venv venv
source venv/bin/activate

# install requirements
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'

# run the tests
cd tests/
pytest

Release

git tag v0.9.0
git push origin v0.9.0
python3 -m pip install --upgrade build twine
rm -r dist/
python3 -m build
# for now only test.pypi.org
python3 -m twine upload --repository testpypi dist/*

Docker Build

docker build -t datacontract/cli .
docker run --rm -v ${PWD}:/datacontract datacontract/cli

Contribution

We are happy to receive your contributions. Propose your change in an issue or directly create a pull request with your improvements.

License

MIT License

Credits

Created by Stefan Negele and Jochen Christ.
