Skip to main content

TetraScience IDS artifact validator

Project description

TetraScience IDS Artifact Validator

Table of Contents

Overview

The TetraScience IDS Artifact Validator checks that IDS artifacts follow a set of rules which make them compatible with the Tetra Data Platform, and optionally validates that they are compatible with additional IDS design conventions. The validator either passes or fails with a list of the checks which led to the failure.

The validator checks these files in an IDS folder:

  • schema.json
  • expected.json
  • elasticsearch.json
  • athena.json

Note that there is a distinction between IDS Artifact validation and IDS instance validation. This validator checks that the IDS Artifact files listed above are valid; an IDS instance validator would check that a data instance is valid against the corresponding IDS - this package does not do IDS instance validation.

Version

v1.0.1

Installation

pip install ts-ids-validator

Alternatively, to install with pipx, which makes the CLI tool available globally with the benefit of isolated installation environments:

pipx install ts-ids-validator

Alternatively, install with a package manager such as poetry:

poetry add --dev ts-ids-validator

After installation, the validate-ids-artifact CLI script can be used, see the usage instructions below.

Usage

This section explains the validate-ids-artifact command and its options. For more detail about what is being validated, see the Validation section.

Run validate-ids-artifact -h to see the help for this command.

Validate an IDS

In the root of an IDS folder, run:

validate-ids-artifact

Or to validate an IDS which is in a different folder, run:

validate-ids-artifact --ids-dir path/to/ids/folder

This will validate whether the entire IDS artifact is compatible with the Tetra Data Platform.

An important part of IDS validation is checking for compatibility with existing versions of the same IDS, which can be done using either the --download or --git flags described in the next section.

Validate IDS compatibility with previous versions from TDP

Use the --download (-d) flag to validate a local IDS against other IDS versions downloaded from the Tetra Data Platform.

To use this, the first step is to configure TDP API authentication, see https://developers.tetrascience.com/reference/authentication for instructions.

There are two options for storing API configuration: environment variables, or a JSON config file.

For environment variables, set TS_API_URL, TS_ORG and TS_AUTH_TOKEN environment variables using any method, then run the validator with the --download flag:

# Omit these environment variables if they are already set elsewhere
export TS_ORG=your-org
export TS_API_URL=https://api.tetrascience.com/v1
export TS_AUTH_TOKEN=your-token

# Command to run once environment variables are set
validate-ids-artifact --download

# Alternative with shorter syntax
validate-ids-artifact -d

To use a JSON config file, create a JSON file with the following structure, named for example cfg.json (the name can be anything):

{
    "api_url": "https://api.tetrascience.com/v1",
    "auth_token": "your-token",
    "org": "your-org"
}

Then use both the --download flag and the --config option:

validate-ids-artifact --download --config cfg.json

# Alternative with shorter syntax
validate-ids-artifact -d -c cfg.json

Downloaded IDS Artifact files are cached so that repeatedly running this command doesn't re-download them each time.

Ignore SSL certificate verification for TDP API usage

It is possible to ignore verifying the SSL certificate for the TDP API requests used to identify and download the previous IDS artifact. The default functionality is to verify SSL certificates.

To do this, add "ignore_ssl": true to the JSON config file. It is false by default, which can also be set explicitly with "ignore_ssl": false.

Or set the environment variable TS_IGNORE_SSL to any of the following values (case insensitive): true, True, 1. It is false by default, which can also be set explicitly with one of the following values: false, False, 0.

When ignore SSL is True, API requests will accept any TLS certificate presented by the server, and will ignore hostname mismatches or expired certificates. This makes requests vulnerable to man-in-the-middle (MitM) attacks. Setting it to True may be useful during local development or testing. This is handled by the requests package, with ignore_ssl=True corresponding to verify=False, documented here.

Validate IDS compatibility with previous versions from git tags

Use the --git (-g) flag to validate a local IDS against other local IDS versions fetched from git tags.

validate-ids-artifact --git

# Alternative with shorter syntax
validate-ids-artifact -g

This is an alternative to the --download flag and it doesn't require configuring TDP credentials.

Using git tags requires the following to avoid incorrect validation results:

  • Tags must have the format vM.m.p or vM.m.p.B where M (major), m (minor), p (patch) and the optional B (build) are integers, for example v1.10.2.
  • The version in the tag must correspond to the version in schema.json's @idsVersion property, ignoring the build number.
  • All IDS artifact files at each tag must match the artifact available in TDP.

For these reasons, it is recommended to use --git only for local development or CI/CD checks when there is another system in place to make sure the version tags are in sync with the IDS artifacts in TDP.

Validation

This is an overview of the validation which is run by this IDS Artifact validator.

Tetra Data Platform compatibility

schema.json, expected.json, elasticsearch.json and athena.json must be present in the IDS artifact. The validation of each of them is described below.

schema.json

schema.json is validated against the JSON Schema draft 7 specification using jsonschema's Validator.check_schema method. This ensures that the JSON Schema vocabulary is being used correctly, including some format validation like the root $id and $schema being valid URIs.

Additional validation makes sure the schema meets other TDP requirements:

  • The top-level IDS object must contain properties "@idsType", "@idsVersion" and "@idsNamespace".
    • "@idsType" must be a constant with a value consisting of a "v" followed by a valid semantic version, such as "v1.0.0".
  • "$id" at the root of the schema must follow the format https://ids.tetrascience.com/<namespace>/<type>/<version>/schema.json where namespace, type and version are the constant values of the @idsNamespace, @idsType and @idsVersion properties.
  • "$schema" at the root of the schema must be the URI "http://json-schema.org/draft-07/schema#": draft 7 is the version of JSON Schema supported by TDP.
  • All objects must have additional_properties set to false
  • All properties must have a valid JSON Schema type
  • An object's required properties must be defined in the object's properties definition.

datacubes must have a schema which is compatible with the platform's requirements:

  • datacubes type must be an array of objects.
  • The properties name, dimensions and measures are present and required.
  • minItems == maxItems for dimensions and measures.
  • measures.value:
    • Contains nested arrays so that it is an N-dimensional array, with N being the number of dimensions.
    • The innermost type is either "number", ["number", "null"], "string", or ["string", "null"] (or equivalent).
  • dimensions.scale must be an array of numbers.

All properties in the schema must have valid names for mapping to Tetra SQL tables:

  • No leading underscores
  • No more than 1 consecutive underscore anywhere in the property
  • No special characters (allowed characters follow the regular expression [a-zA-Z0-9_]).
    • Exceptions to this rule are @idsNamespace, @idsType, @idsVersion, and @idsConventionVersion at the root level of the schema, and @link anywhere in the schema.
  • No two properties in schema.json may normalize to the same Athena column name.
    • For example, a property called name inside an object person will correspond to an Athena column of person_name. This means there cannot be another property called person_name defined at the same level as person, because person_name would clash with person.name when mapping the data to Athena.
  • No property may have the name uuid or parent_uuid because these are reserved for use as Athena column names in TDP.
  • When properties are normalized to Athena column names, no column name can exceed 255 characters

elasticsearch.json

All fields defined under mapping.properties in elasticsearch.json must exist within the IDS. There can only be up to 50 nested fields defined in the IDS.

athena.json

  • Partition paths must correspond to valid properties in schema.json
  • Partition paths cannot point to properties anywhere inside an array
  • partition.name cannot clash with any normalized property name from schema.json. For example, when schema.json properties are mapped to Athena, a property name inside an object person gets a normalized Athena column name of person_name, meaning partition.name cannot be person_name because it would clash with the person_name Athena column.

expected.json

expected.json must be a valid instance of schema.json using a JSON Schema draft 7 validator.

Versioning validation

Schema evolution - merge compatibility

Schema evolution refers to backwards-compatible updates to a schema, where data following the new schema can be merged with data following the old schema as long as certain "merge compatibility" conditions are met. A common case is adding a new property anywhere in a schema to capture additional data (at the root level, in an object, in an array of objects, etc.).

The new schema will be checked for merge compatibility with all existing IDSs with the same namespace, type and major version. For example, if versions v1.0.0 and v1.1.0 already exist, then when validating an update to v1.2.0, the new schema must be merge compatible with the existing merged schemas from v1.0.0 and v1.1.0.

This validation tool checks for merge compatibility according to Delta.io's schema evolution logic. Read more about Delta’s schema validation rules here.

The existing versions can be retrieved from TDP using the --download (-d) flag, or from git tags using the --git (-g) flag. One of these must be used in order for merge compatibility checks to run.

Instead of making a breaking change, typically it is possible to make schema updates which are merge compatible by adding or removing fields without changing the type of any existing field. This brings the benefit that new and old data can be written to the same Lakehouse table, meaning existing integrations may be unaffected if they don't depend on the new or removed fields.

In cases where a merge incompatible change is necessary, bump the IDS major version.

Athena configuration breaking changes

Any change to the configuration in athena.json is a breaking change requiring a major version bump. This is because changes will only be applied when new CSV-backed Athena tables are created, which only happens for new major versions. To make a change to the Athena configuration file, bump the major version of the IDS.

Tetra Data validation

These checks validate that Tetra Data conventions are being followed in this IDS's schema.json. These checks are enabled by including the top-level metadata field is_tetra_data_schema with a value of true or by adding the property @idsConventionVersion with a const value of v1.0.0 in schema.json.

WARNING: @idsConventionVersion will be deprecated in a future release. Please use "is_tetra_data_schema": true instead.

Note that this Tetra Data specific validation will be removed from this package in a future version, so that it will only validate Tetra Data Platform requirements for IDS Artifacts.

  • Property names should use snake case: entirely lower-case or numeric characters separated by single consecutive underscores.
    • related_files.pointer's fileId and fileKey properties are excluded from this check. So are any properties whose name starts with @.
  • For standard Tetra Data components, there is validation that the schema matches the expected component structure, including property names, types and required properties. This applies to samples, users, systems and related_files. The validator output will explain any differences from the expected component structure if this check fails.

Changelog

v1.0.1

  • Fix the interface for ts-sdk and ts-cli to download TDP artifacts for validation when publishing an IDS

v1.0.0

  • Add schema merge compatibility validation for Lakehouse schema evolution
    • Schema evolution allows for a limited set of schema changes like adding or removing fields without bumping the IDS major version, with data from the combined previous versions and the new version all being available in the same table in Lakehouse under the same major version.
    • This is enabled for all IDSs and requires using either the --download (-d) flag or --git (-g) flag to compare against existing IDS versions. See the updated ts-ids-validator Readme for complete details.
    • The previous CSV-backed Athena versioning validation has been changed to emit warning logs without failing validation. For users of CSV-backed Athena, the information in these logs is still relevant when updating an IDS to determine whether a major version bump is needed, otherwise it can be ignored.
  • Update Tetra Data validation to remove prefix check which validates that properties of an object do not start with the name of the field whose definition is the object (e.g. the field method of type object cannot contain a property named method_name)
  • Update Tetra Data validation to be triggered off of the top-level schema.json metadata "is_tetra_data_schema": true
    • Add deprecation warning for the @idsConventionVersion property which is no longer necessary
  • Allow dashes as well as underscores in CLI argument names, for example both --ids-dir and --ids_dir may now be used equivalently.

v0.10.5

  • Add elasticsearch.json validation to enforce a maximum of 50 nested fields
  • Add column resolution validation to ensure normalized Athena columns do not exceed 255 characters

v0.10.4

  • Update the --download option to accept an ignore_ssl configuration (either an "ignore_ssl" key in the JSON config, or the TS_IGNORE_SSL environment variable). This makes the API configuration the same as what ts-sdk accepts. See "Ignore SSL certificate verification" in the Readme for more information.

v0.10.3

  • Update validation so it is mandatory to include samples and users fields along with specific definitions of each at the top level of Tetra Data IDSs to match documented conventions. This enables downstream use cases which depend on these fields always being present, even if they are not populated by any Tetra Data protocol.
  • Add ability to download previous IDS artifacts from the Tetra Data Platform for versioning validation, as an alternative to the --previous-ids-dir CLI option. This adds a --download/-d flag to the CLI, to specify that the previous IDS artifact should be downloaded from TDP. API configuration can be set using environment variables, or the --config/-c CLI option which takes a path to an API configuration JSON file. See the README for details.

v0.10.2

  • Improve readability of versioning requirements and breaking change validation in validation output.

v0.10.1

  • Remove unused dependencies which caused installation to fail in some Python versions.

v0.10.0

  • Update how to use this package: add a CLI script validate-ids-artifact which is equivalent to the previous approach of running python -m ids_validator. This previous approach still works as before but may be deprecated in a future version, so switching to the CLI script is recommended.
  • Add validation that property names do not contain special characters.
  • Add validation for breaking changes between IDS versions which would be incompatible with the Tetra Data Platform
    • Add the new CLI argument --previous-ids-dir (which may be omitted) which is the folder containing the previous version of the same IDS namespace and type. Without this CLI argument, breaking change validation does not run.
  • Add validation that expected.json is a valid instance of schema.json using a JSON Schema validator.
  • Remove --version CLI argument because this information can be retrieved from the IDS artifact being validated.

v0.9.16

  • Remove the upper bound of what properties samples may contain for Tetra Data validation. This means the samples schema can now include properties other than the ones in the samples Tetra Data component, such as primary and foreign key fields.

v0.9.15

  • Limit version of typing-extensions in dependencies to avoid a bug which causes the validator to always fail in Python 3.10 or later.

v0.9.14

  • Update samples[*] check to optionally allow for it to contain a property pk_samples of type "string".

v0.9.13

  • related_files is no longer checked against annotation fields like "description".

v0.9.12

  • Update check for samples[*].labels[*].source.name type: previously the type was required to be "string", now it is required to be either ["string", "null"] or "string", with "string" leading to a deprecation warning. This change makes this source definition the same as samples[*].properties[*].source in a backward-compatible way.

v0.9.11

  • Fix bug in AthenaChecker to allow root level IDS properties as partition paths.
  • Update TypeChecker to catch errors related to undefined/misspelled type key.
  • Update jsonschema version to fix package installation error

v0.9.10

  • Modify V1SnakeCaseChecker to ignore checks for keys present in definitions object.
  • Add temporary allowance for @link in *.properties

v0.9.9

  • Lock jsonschema version in requirements.txt

v0.9.8

  • Modify RulesChecker to log missing and extra properties

v0.9.7

  • Allow properties with const values to have non-nullable type

v0.9.6

  • Add checker classes for generic validation
  • Add checker classes for v1.0.0 convention validation

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ts_ids_validator-1.0.1.tar.gz (69.1 kB view details)

Uploaded Source

Built Distribution

ts_ids_validator-1.0.1-py3-none-any.whl (91.0 kB view details)

Uploaded Python 3

File details

Details for the file ts_ids_validator-1.0.1.tar.gz.

File metadata

  • Download URL: ts_ids_validator-1.0.1.tar.gz
  • Upload date:
  • Size: 69.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for ts_ids_validator-1.0.1.tar.gz
Algorithm Hash digest
SHA256 6a474c61ced58ab7d9efc362b578f01302335c323d120065301fa2cce5ab4abd
MD5 e078c105818bd4a4c8977315dbff73fe
BLAKE2b-256 ecc5db08aab105113915f1a127f593d1e9a6fa6445df912567779c2f28d24379

See more details on using hashes here.

File details

Details for the file ts_ids_validator-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for ts_ids_validator-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 7b97532173873254b30a1dd58dce4b0dfca6897a51a83dc6539343f07f9f9c0f
MD5 bfd87249a086aa9a334135c1d441383c
BLAKE2b-256 e8d486fb02b96c2b6898dac5156e2b3f166fdbb972611f870cd1b9e0c00b9862

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page