
Validate data contracts


Data Contract CLI


The datacontract CLI lets you work with your datacontract.yaml files locally and in your CI pipeline. It uses the Data Contract Specification to validate the contract, connect to your data sources, and execute schema and quality tests. The CLI is open source and written in Python. It can be used as a CLI tool or directly as a Python library.

NOTE: This project has been migrated from Go to Python, which adds the possibility to use datacontract within Python code as a library, but it comes with some breaking changes. If you rely on the Go implementation, the Golang version has been forked.

Usage

datacontract usually works with a datacontract.yaml file in your current working directory. You can specify a different file or URL as an additional argument.

# create a new data contract
$ datacontract init

# execute schema and quality checks
$ datacontract test

Advanced Usage

# lint the data contract
$ datacontract lint datacontract.yaml

# find differences between two data contracts (Coming Soon)
$ datacontract diff datacontract-v1.yaml datacontract-v2.yaml

# fail pipeline on breaking changes (Coming Soon)
$ datacontract breaking datacontract-v1.yaml datacontract-v2.yaml

# export model as jsonschema
$ datacontract export --format jsonschema

# export model as dbt  (Coming Soon)
$ datacontract export --format dbt

# import protobuf as model (Coming Soon)
$ datacontract import --format protobuf --source my_protobuf_file.proto

Programmatic (Python)

from datacontract.data_contract import DataContract

data_contract = DataContract(data_contract_file="datacontract.yaml")
run = data_contract.test()
if not run.has_passed():
    print("Data quality validation failed.")
    # Abort pipeline, alert, or take corrective actions...

Scenario: Integration with Data Mesh Manager

# Fetch current data contract, execute tests on production, and publish result to data mesh manager
$ export DATAMESH_MANAGER_API_KEY=xxx
$ datacontract test https://demo.datamesh-manager.com/demo279750347121/datacontracts/4df9d6ee-e55d-4088-9598-b635b2fdcbbc/datacontract.yaml --server production --publish

Scenario: CI/CD testing for breaking changes

# fail pipeline on breaking changes in the data contract yaml (coming soon)
$ datacontract breaking datacontract.yaml https://raw.githubusercontent.com/datacontract/cli/main/examples/my-data-contract-id_v0.0.1.yaml
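Such a check typically runs as a pipeline step. As an illustration only (the workflow file name and action versions below are assumptions, not part of this project), a GitHub Actions job could install the CLI and lint the contract on every push:

```yaml
# .github/workflows/datacontract.yml -- hypothetical example
name: datacontract
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install datacontract-cli
      - run: datacontract lint datacontract.yaml
```

Once `datacontract breaking` is released, it could be added as a further step to fail the build on breaking changes.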

Installation

Pip

pip install datacontract-cli

Documentation

Tests

Data Contract CLI can connect to data sources and run schema and quality tests to verify that the data contract is valid.

datacontract test

To connect to the databases, the server block in the datacontract.yaml is used to set up the connection. In addition, credentials, such as usernames and passwords, may be defined as environment variables.

The application uses different engines, based on the server type.

| Type | Format | Description | Status | Engines |
|------|--------|-------------|--------|---------|
| s3 | parquet | Works for any S3-compliant endpoint, e.g., AWS S3, GCS, MinIO, Ceph, ... | Supported | soda-core-duckdb |
| s3 | json | Support for newline-delimited JSON files and one JSON record per file. | Supported | fastjsonschema, soda-core-duckdb |
| s3 | csv | | Supported | soda-core-duckdb |
| s3 | delta | | Coming soon | TBD |
| postgres | n/a | | Coming soon | TBD |
| snowflake | n/a | | Supported | soda-core-snowflake |
| bigquery | n/a | | Coming soon | TBD |
| redshift | n/a | | Coming soon | TBD |
| databricks | n/a | | Coming soon | TBD |
| kafka | json | | Coming soon | TBD |
| kafka | avro | | Coming soon | TBD |
| kafka | protobuf | | Coming soon | TBD |
| local | parquet | | Supported | soda-core-duckdb |
| local | json | Support for newline-delimited JSON files and one JSON record per file. | Supported | fastjsonschema, soda-core-duckdb |
| local | csv | | Supported | soda-core-duckdb |

Feel free to create an issue if you need support for an additional type.
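The json rows above distinguish newline-delimited JSON (one object per line) from one JSON record per file. A minimal standard-library sketch of parsing both layouts (the helper function is illustrative, not part of the CLI):

```python
import json

def parse_ndjson(text: str) -> list:
    """Parse newline-delimited JSON: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# newline-delimited: several records in one file
records = parse_ndjson('{"id": 1}\n{"id": 2}\n')

# one JSON record per file: the whole file is a single document
single = json.loads('{"id": 3}')
```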

Server Type S3

Example:

datacontract.yaml

servers:
  production:
    type: s3
    endpointUrl: https://minio.example.com # not needed with AWS S3
    location: s3://bucket-name/path/*/*.json
    delimiter: new_line # new_line, array, or none
    format: json

Environment variables

export DATACONTRACT_S3_REGION=eu-central-1
export DATACONTRACT_S3_ACCESS_KEY_ID=AKIAXV5Q5QABCDEFGH
export DATACONTRACT_S3_SECRET_ACCESS_KEY=93S7LRrJcqLkdb2/XXXXXXXXXXXXX
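To illustrate how such variables are typically consumed from the environment (the helper below is a hypothetical sketch, not part of the CLI):

```python
import os

def s3_credentials_from_env(prefix: str = "DATACONTRACT_S3_") -> dict:
    """Collect the S3 settings shown above from environment variables.

    Returns lower-cased keys, e.g. {"region": ..., "access_key_id": ...}.
    Raises KeyError if a required variable is missing.
    """
    keys = ("REGION", "ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
    return {key.lower(): os.environ[prefix + key] for key in keys}
```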

Development Setup

Python base interpreter should be 3.11.x

# create venv
python3 -m venv venv
source venv/bin/activate

# Install Requirements
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'
cd tests/
pytest

Release

git tag v0.9.0
git push origin v0.9.0
python3 -m pip install --upgrade build twine
rm -r dist/
python3 -m build
# for now only test.pypi.org
python3 -m twine upload --repository testpypi dist/*

Docker Build

docker build -t datacontract .
docker run --rm -v ${PWD}:/app datacontract

Contribution

We are happy to receive your contributions. Propose your change in an issue or directly create a pull request with your improvements.

License

MIT License

Credits

Created by Stefan Negele and Jochen Christ.
