TetraScience IDS artifact validator
Overview
The TetraScience IDS Artifact Validator checks that IDS artifacts follow a set of rules which make them compatible with the Tetra Data Platform, and optionally validates that they are compatible with additional IDS design conventions. The validator either passes or fails with a list of the checks which led to the failure.
The validator checks these files in an IDS folder:
- schema.json
- elasticsearch.json
- athena.json
Note that there is a distinction between IDS Artifact validation and IDS instance validation. This validator checks that the IDS Artifact files listed above are valid. An IDS instance validator would instead check that a data instance is valid against the corresponding IDS schema using a JSON Schema validator. This package does not do IDS instance validation.
Version
v0.10.5
Installation
Installing this package requires access to the TetraScience JFrog Artifactory package repository.
Run pip install ts-ids-validator in the environment where you want to install the validator.
Or install using a package manager, for example with poetry: poetry add --dev ts-ids-validator
Installing with pipx makes the CLI script available globally while still keeping dependencies in an isolated environment, which may be useful when working with multiple IDSs: pipx install ts-ids-validator
Any of these will install a script which can be run from the command line, validate-ids-artifact, which is equivalent to running python -m ids_validator (see below for usage).
Usage
Run validate-ids-artifact -h to see the help for this command.
Validate an IDS
With the CLI interface:
validate-ids-artifact --ids_dir=path/to/ids/folder
This will validate that the IDS is compatible with the Tetra Data Platform.
If the schema contains properties.@idsConventionVersion with a const value of v1.0.0, then additional Tetra Data checks will run, validating that the IDS follows certain Tetra Data conventions such as using the standard samples component.
Note: in a future version of the validator, these @idsConventionVersion checks will be removed.
Validate an IDS and check for breaking changes from a previous version
Validate with a local copy of the previous IDS
validate-ids-artifact --ids_dir=path/to/ids/folder --previous_ids_dir=path/to/previous_ids/folder
# Alternative with shorter syntax
validate-ids-artifact -i path/to/ids/folder -p path/to/previous_ids/folder
In addition to validating the IDS in ids_dir, this runs breaking change validation against the previous version of the same IDS passed to previous_ids_dir.
Validate with the previous published version downloaded from the Tetra Data Platform
Use this feature to validate a local IDS against its previous version downloaded directly from the Tetra Data Platform.
This will download the closest preceding version with the same namespace and slug from TDP, and use it for breaking change validation. If no matching IDS is found, then validation will still run without breaking change validation.
For example, if v1.0.0 and v2.0.0 are already published in TDP, and you are working on v2.0.1, then v2.0.0 will be used as the previous version in validation. If you are working on v1.0.1, then v1.0.0 will be used. If you are working on v0.1.0, then no breaking change validation will run. As an edge case, if you are working on v2.0.0 locally, then v2.0.0 from TDP will be used as the "previous" version.
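The selection rule in the example above can be sketched as "pick the highest published version that is not newer than the local one". This is only an illustration of that rule, not the validator's actual code: parse_version and pick_previous_version are hypothetical names, and the sketch assumes versions always have the plain vMAJOR.MINOR.PATCH form.

```python
import re
from typing import Iterable, Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse a string like "v1.2.3" into a tuple of integers."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"Not a 'v'-prefixed semantic version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def pick_previous_version(current: str, published: Iterable[str]) -> Optional[str]:
    """Return the highest published version that is not newer than `current`.

    Returns None when every published version is newer, in which case
    breaking change validation would be skipped.
    """
    current_key = parse_version(current)
    candidates = [v for v in published if parse_version(v) <= current_key]
    return max(candidates, key=parse_version, default=None)


published = ["v1.0.0", "v2.0.0"]
for local in ["v2.0.1", "v1.0.1", "v0.1.0", "v2.0.0"]:
    print(local, "->", pick_previous_version(local, published))
```

Comparing versions as integer tuples avoids the string-ordering trap where "v10.0.0" would sort before "v2.0.0".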
To use this, first configure TDP API authentication; see https://developers.tetrascience.com/reference/authentication for instructions.
There are two options for storing API configuration: environment variables, or a JSON config file.
For environment variables, set the TS_API_URL, TS_ORG and TS_AUTH_TOKEN environment variables using any method, then run the validator with the --download flag:
# Omit these environment variables if they are already set elsewhere
export TS_ORG=your-org
export TS_API_URL=https://api.tetrascience.com/v1
export TS_AUTH_TOKEN=your-token
# Command to run once environment variables are set
validate-ids-artifact --download --ids_dir path/to/ids/folder
# Alternative with shorter syntax
validate-ids-artifact -d -i path/to/ids/folder
To use a JSON config file, create a JSON file with the following structure, named for example cfg.json (the name can be anything):
{
  "api_url": "https://api.tetrascience.com/v1",
  "auth_token": "your-token",
  "org": "your-org"
}
Then use both the --download flag and the --config option:
validate-ids-artifact --download --ids_dir path/to/ids/folder --config cfg.json
# Alternative with shorter syntax
validate-ids-artifact -d -i path/to/ids/folder -c cfg.json
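As a rough sketch of how the two configuration sources described above could be read into one structure: the function name load_api_config is hypothetical, and the precedence shown (use the JSON file when a path is given, otherwise fall back to environment variables) is an assumption, since the precedence is not documented here.

```python
import json
import os
from typing import Dict, Mapping, Optional

# Environment variable name -> key used in the JSON config file.
ENV_TO_KEY = {
    "TS_API_URL": "api_url",
    "TS_ORG": "org",
    "TS_AUTH_TOKEN": "auth_token",
}


def load_api_config(
    config_path: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Load API settings from a JSON file if given, else from the environment."""
    if config_path is not None:
        with open(config_path, encoding="utf-8") as config_file:
            config = json.load(config_file)
    else:
        config = {key: env[name] for name, key in ENV_TO_KEY.items() if name in env}

    # Fail early with a clear message rather than at the first API request.
    missing = sorted(set(ENV_TO_KEY.values()) - set(config))
    if missing:
        raise ValueError(f"Missing API configuration values: {missing}")
    return config


example_env = {
    "TS_API_URL": "https://api.tetrascience.com/v1",
    "TS_ORG": "your-org",
    "TS_AUTH_TOKEN": "your-token",
}
print(load_api_config(env=example_env))
```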
Ignore SSL certificate verification for TDP API usage
SSL certificate verification can be disabled for the TDP API requests used to identify and download the previous IDS artifact. By default, SSL certificates are verified.
To disable verification, add "ignore_ssl": true to the JSON config file. It is false by default, which can also be set explicitly with "ignore_ssl": false.
Or set the environment variable TS_IGNORE_SSL to any of the following values (case insensitive): true, True, 1. It is false by default, which can also be set explicitly with one of the following values: false, False, 0.
When ignore SSL is True, API requests will accept any TLS certificate presented by the server, and will ignore hostname mismatches or expired certificates.
This makes requests vulnerable to man-in-the-middle (MitM) attacks.
Setting it to True may be useful during local development or testing.
This is handled by the requests package, with ignore_ssl=True corresponding to verify=False, as described in the requests documentation on SSL certificate verification.
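The accepted TS_IGNORE_SSL values and their mapping onto the requests verify argument can be illustrated with a short sketch. The names parse_ignore_ssl and requests_verify_argument are hypothetical, and raising an error for unrecognised values is an assumption: the behaviour for other values is not described above.

```python
from typing import Optional

TRUE_VALUES = {"true", "1"}
FALSE_VALUES = {"false", "0"}


def parse_ignore_ssl(raw: Optional[str]) -> bool:
    """Interpret a TS_IGNORE_SSL value, case-insensitively. Unset means False."""
    if raw is None:
        return False
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Assumption: reject anything else instead of silently disabling checks.
    raise ValueError(f"Unrecognised TS_IGNORE_SSL value: {raw!r}")


def requests_verify_argument(ignore_ssl: bool) -> bool:
    """ignore_ssl=True corresponds to verify=False in the requests package."""
    return not ignore_ssl


for raw in [None, "true", "True", "1", "false", "0"]:
    ignore = parse_ignore_ssl(raw)
    print(repr(raw), "-> ignore_ssl =", ignore, ", verify =", requests_verify_argument(ignore))
```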
Validation
This is an overview of the validation which is run by this IDS Artifact validator.
Generic
schema.json, expected.json, elasticsearch.json and athena.json must be present in the IDS artifact.
The validation of each of them is described below.
schema.json
schema.json is validated against the JSON Schema draft 7 specification using jsonschema's Validator.check_schema method.
This ensures that the JSON Schema vocabulary is being used correctly, including some format validation like the root $id and $schema being valid URIs.
Additional validation makes sure the schema meets other TDP requirements:
- The top-level IDS object must contain properties "@idsType", "@idsVersion" and "@idsNamespace".
- "@idsVersion" must be a constant with a value consisting of a "v" followed by a valid semantic version, such as "v1.0.0".
- "$id" at the root of the schema must follow the format https://ids.tetrascience.com/<namespace>/<type>/<version>/schema.json where namespace, type and version are the constant values of the @idsNamespace, @idsType and @idsVersion properties.
- "$schema" at the root of the schema must be the URI "http://json-schema.org/draft-07/schema#": draft 7 is the version of JSON Schema supported by TDP.
- All objects must have additionalProperties set to false
- All properties must have a valid JSON Schema type
- An object's required properties must be defined in the object's properties definition.
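Two of the rules above lend themselves to a short illustration: deriving the expected "$id" from the constant values, and finding required names that are not defined under properties. This is a minimal sketch rather than the validator's implementation; expected_schema_id and undefined_required are hypothetical names, and the example schema (namespace "common", type "example") is made up.

```python
from typing import List


def expected_schema_id(schema: dict) -> str:
    """Build the "$id" implied by the root-level const values."""
    props = schema["properties"]
    namespace = props["@idsNamespace"]["const"]
    ids_type = props["@idsType"]["const"]
    version = props["@idsVersion"]["const"]
    return f"https://ids.tetrascience.com/{namespace}/{ids_type}/{version}/schema.json"


def undefined_required(node: dict, path: str = "root") -> List[str]:
    """Recursively list required names that are missing from properties."""
    problems = []
    properties = node.get("properties", {})
    for name in node.get("required", []):
        if name not in properties:
            problems.append(f"{path}: required property {name!r} is not defined")
    for name, child in properties.items():
        if isinstance(child, dict):
            problems.extend(undefined_required(child, f"{path}.{name}"))
    items = node.get("items")
    if isinstance(items, dict):
        problems.extend(undefined_required(items, f"{path}[*]"))
    return problems


schema = {
    "$id": "https://ids.tetrascience.com/common/example/v1.0.0/schema.json",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "additionalProperties": False,
    "required": ["@idsNamespace", "@idsType", "@idsVersion", "missing_field"],
    "properties": {
        "@idsNamespace": {"type": "string", "const": "common"},
        "@idsType": {"type": "string", "const": "example"},
        "@idsVersion": {"type": "string", "const": "v1.0.0"},
    },
}

print(schema["$id"] == expected_schema_id(schema))
print(undefined_required(schema))
```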
datacubes must have a schema which is compatible with the platform's requirements:
- datacubes type must be an array of objects.
- The properties name, dimensions and measures are present and required.
- minItems == maxItems for dimensions and measures.
- measures.value:
  - Contains nested arrays so that it is an N-dimensional array, with N being the number of dimensions.
  - The innermost type is either "number", ["number", "null"], "string", or ["string", "null"] (or equivalent).
- dimensions.scale must be an array of numbers.
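The measures.value rule can be pictured as counting how many "array" levels wrap the innermost type. The sketch below takes the value schema and the number of dimensions as direct arguments to avoid assuming the exact layout of a datacubes schema; the function names are hypothetical, and treating a reordered type list such as ["null", "number"] as "equivalent" is an assumption.

```python
from typing import List, Optional, Tuple

ALLOWED_LEAF_TYPES = {
    frozenset(["number"]),
    frozenset(["number", "null"]),
    frozenset(["string"]),
    frozenset(["string", "null"]),
}


def nesting_depth_and_leaf_type(value_schema: dict) -> Tuple[int, Optional[object]]:
    """Count nested "array" levels and return the innermost "type" value."""
    depth = 0
    node = value_schema
    while isinstance(node, dict) and node.get("type") == "array":
        depth += 1
        node = node.get("items", {})
    leaf_type = node.get("type") if isinstance(node, dict) else None
    return depth, leaf_type


def check_measure_value(value_schema: dict, n_dimensions: int) -> List[str]:
    """Return a list of problems; an empty list means the schema looks valid."""
    problems = []
    depth, leaf_type = nesting_depth_and_leaf_type(value_schema)
    if depth != n_dimensions:
        problems.append(f"expected {n_dimensions} nested arrays, found {depth}")
    # Normalise "number" and ["number"] to the same set before comparing.
    leaf_set = frozenset([leaf_type]) if isinstance(leaf_type, str) else frozenset(leaf_type or [])
    if leaf_set not in ALLOWED_LEAF_TYPES:
        problems.append(f"innermost type {leaf_type!r} is not allowed")
    return problems


two_dimensional = {
    "type": "array",
    "items": {"type": "array", "items": {"type": ["number", "null"]}},
}
print(check_measure_value(two_dimensional, n_dimensions=2))
print(check_measure_value(two_dimensional, n_dimensions=3))
```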
All properties in the schema must have valid names for mapping to Athena:
- No leading underscores
- No consecutive underscores (two or more in a row) anywhere in the property name
- No special characters (allowed characters follow the regular expression [a-zA-Z0-9_]).
  - Exceptions to this rule are @idsNamespace, @idsType, @idsVersion, and @idsConventionVersion at the root level of the schema, and @link anywhere in the schema.
- No two properties in schema.json may normalize to the same Athena column name.
  - For example, a property called name inside an object person will correspond to an Athena column of person_name. This means there cannot be another property called person_name defined at the same level as person, because person_name would clash with person.name when mapping the data to Athena.
- No property may have the name uuid or parent_uuid because these are reserved for use as Athena column names in TDP.
- When properties are normalized to Athena column names, no column name can exceed 255 characters
elasticsearch.json
All fields defined under mapping.properties in elasticsearch.json must exist within the IDS.
At most 50 nested fields can be defined in the IDS.
athena.json
- Partition paths must correspond to valid properties in schema.json
- Partition paths cannot point to properties anywhere inside an array
- partition.name cannot clash with any normalized property name from schema.json. For example, when schema.json properties are mapped to Athena, a property name inside an object person gets a normalized Athena column name of person_name, meaning partition.name cannot be person_name because it would clash with the person_name Athena column.
expected.json
expected.json must be a valid instance of schema.json, as checked by a JSON Schema draft 7 validator.
Breaking change validation
Breaking change validation runs if a previous version of the IDS is passed using --previous_ids_dir or --download; see "Validate an IDS and check for breaking changes from a previous version" above.
A change to an IDS artifact is a breaking change if:
- Athena tables would be changed.
- IDS instances which were valid against the previous IDS version are not valid against the new version (excluding the top-level @idsVersion property, whose const value changes between every IDS version).
The breaking change checks run by the validator are:
- Check that the two IDS artifacts have the same namespace and type, fail if they don't.
- The versions must either be equal (for a documentation-only change), or the new version must be a major/minor/patch bump of the previous version.
- Determine whether the two versions may have breaking changes, according to Semantic Versioning. For example, if the previous IDS was v1.0.0 and the current IDS is v1.1.0, then there should be no breaking changes. If the current IDS were v2.0.0 instead, then there may be breaking changes.
- If breaking changes are not allowed according to the version change, validate that no breaking changes are included in the current version of the IDS:
  - schema.json:
    - All property names and paths must be the same: no properties added, removed or renamed.
    - The type of each property must be the same, or have "null" added if it is a primitive type (any type other than "array" or "object"). For example, a change from "type": "string" to "type": ["string", "null"] is not considered a breaking change. Note that the opposite, removing "null" from the type, is a breaking change.
    - The list of required fields for each object must either stay the same or have items removed. For example, a change from "required": ["name", "kind"] to "required": ["name"] is not a breaking change. Adding properties to required is a breaking change.
  - athena.json:
    - Any change to athena.json is a breaking change (not including file formatting changes such as changing whitespace).
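The type and required rules for schema.json can be expressed as two small predicates. This is an illustrative sketch with hypothetical function names, not the validator's code. The version rule in may_have_breaking_changes is an assumption: it treats only a major version increase as permitting breaking changes, matching the v1.0.0 to v2.0.0 example, and it ignores any special handling of 0.x versions.

```python
from typing import FrozenSet, List, Union

JsonType = Union[str, List[str]]


def may_have_breaking_changes(previous: str, current: str) -> bool:
    """Assumption: only a major version increase permits breaking changes."""
    previous_major = int(previous.lstrip("v").split(".")[0])
    current_major = int(current.lstrip("v").split(".")[0])
    return current_major > previous_major


def _as_set(json_type: JsonType) -> FrozenSet[str]:
    """Treat "string" and ["string"] as the same type."""
    return frozenset([json_type]) if isinstance(json_type, str) else frozenset(json_type)


def type_change_is_compatible(old: JsonType, new: JsonType) -> bool:
    """Same type, or "null" added to a primitive (non-array, non-object) type."""
    old_set, new_set = _as_set(old), _as_set(new)
    if old_set == new_set:
        return True
    is_primitive = not (old_set & {"array", "object"})
    return is_primitive and new_set == old_set | {"null"}


def required_change_is_compatible(old: List[str], new: List[str]) -> bool:
    """The required list may stay the same or shrink, but never grow."""
    return set(new) <= set(old)


print(may_have_breaking_changes("v1.0.0", "v1.1.0"))
print(type_change_is_compatible("string", ["string", "null"]))
print(type_change_is_compatible(["string", "null"], "string"))
print(required_change_is_compatible(["name", "kind"], ["name"]))
```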
Note: When updating an invalid IDS, such as one which is missing a required artifact file, this breaking change validation may lead to an error because it expects the previous IDS to be valid. In this case, do not use the --previous_ids_dir or --download options, which both enable breaking change validation: just validate the newly updated version of the IDS and consider using a major version bump so that any problems caused by the previous IDS being invalid will not affect the updated IDS.
For example, if the previous version of the IDS is missing elasticsearch.json, an exception will be raised during validation.
Tetra Data validation
These checks validate that Tetra Data conventions are being followed in this IDS's schema.json.
These checks are enabled by including a property @idsConventionVersion with a const value of v1.0.0 in schema.json.
Note that this Tetra Data specific validation will be removed from this package in a future version, so that it will only validate Tetra Data Platform requirements for IDS Artifacts.
- Properties of objects shouldn't begin with the same string as the object's name, for example an object called method shouldn't contain a property called method_name.
- Property names should use snake case: lower-case letters or digits, with words separated by single underscores.
  - related_files.pointer's fileId and fileKey properties are excluded from this check. So are any properties whose name starts with @.
- For standard Tetra Data components, there is validation that the schema matches the expected component structure, including property names, types and required properties. This applies to samples, users, systems and related_files. The validator output will explain any differences from the expected component structure if this check fails.
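The first two conventions above can be approximated with a regular expression and a prefix test. This is a sketch with hypothetical function names. It skips names starting with @ as described, but does not implement the related_files.pointer exclusion, and the exact pattern used by the validator may differ.

```python
import re

# Lower-case letters or digits, in groups separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """Check a property name; names starting with "@" are exempt."""
    if name.startswith("@"):
        return True
    return SNAKE_CASE.fullmatch(name) is not None


def repeats_parent_name(parent: str, prop: str) -> bool:
    """True if the property name begins with the name of its parent object."""
    return prop.startswith(parent)


for name in ["sample_id", "sampleId", "sample__id", "_sample", "@link"]:
    print(name, is_snake_case(name))
print(repeats_parent_name("method", "method_name"))
```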
Changelog
v0.10.5
- Add elasticsearch.json validation to enforce a maximum of 50 nested fields
- Add column resolution validation to ensure normalized Athena columns do not exceed 255 characters
v0.10.4
- Update the --download option to accept an ignore_ssl configuration (either an "ignore_ssl" key in the JSON config, or the TS_IGNORE_SSL environment variable). This makes the API configuration the same as what ts-sdk accepts. See "Ignore SSL certificate verification" in the README for more information.
v0.10.3
- Update validation so it is mandatory to include samples and users fields along with specific definitions of each at the top level of Tetra Data IDSs to match documented conventions. This enables downstream use cases which depend on these fields always being present, even if they are not populated by any Tetra Data protocol.
- Add ability to download previous IDS artifacts from the Tetra Data Platform for versioning validation, as an alternative to the --previous_ids_dir CLI option. This adds a --download / -d flag to the CLI, to specify that the previous IDS artifact should be downloaded from TDP. API configuration can be set using environment variables, or the --config / -c CLI option which takes a path to an API configuration JSON file. See the README for details.
v0.10.2
- Improve readability of versioning requirements and breaking change validation in validation output.
v0.10.1
- Remove unused dependencies which caused installation to fail in some Python versions.
v0.10.0
- Update how to use this package: add a CLI script validate-ids-artifact which is equivalent to the previous approach of running python -m ids_validator. This previous approach still works as before but may be deprecated in a future version, so switching to the CLI script is recommended.
- Add validation that property names do not contain special characters.
- Add validation for breaking changes between IDS versions which would be incompatible with the Tetra Data Platform
  - Add the new CLI argument --previous_ids_dir (which may be omitted) which is the folder containing the previous version of the same IDS namespace and type. Without this CLI argument, breaking change validation does not run.
- Add validation that expected.json is a valid instance of schema.json using a JSON Schema validator.
- Remove --version CLI argument because this information can be retrieved from the IDS artifact being validated.
v0.9.16
- Remove the upper bound of what properties samples may contain for Tetra Data validation. This means the samples schema can now include properties other than the ones in the samples Tetra Data component, such as primary and foreign key fields.
v0.9.15
- Limit version of typing-extensions in dependencies to avoid a bug which causes the validator to always fail in Python 3.10 or later.
v0.9.14
- Update samples[*] check to optionally allow for it to contain a property pk_samples of type "string".
v0.9.13
- related_files is no longer checked against annotation fields like "description".
v0.9.12
- Update check for samples[*].labels[*].source.name type: previously the type was required to be "string", now it is required to be either ["string", "null"] or "string", with "string" leading to a deprecation warning. This change makes this source definition the same as samples[*].properties[*].source in a backward-compatible way.
v0.9.11
- Fix bug in AthenaChecker to allow root level IDS properties as partition paths.
- Update TypeChecker to catch errors related to undefined/misspelled type key.
- Update jsonschema version to fix package installation error
v0.9.10
- Modify V1SnakeCaseChecker to ignore checks for keys present in definitions object.
- Add temporary allowance for @link in *.properties
v0.9.9
- Lock jsonschema version in requirements.txt
v0.9.8
- Modify RulesChecker to log missing and extra properties
v0.9.7
- Allow properties with const values to have non-nullable type
v0.9.6
- Add checker classes for generic validation
- Add checker classes for v1.0.0 convention validation