
Package for schema assurance.


Schema Validation

Validate your RDF data against an ontology and SHACL files - even when the data instance lacks datatype definitions.


1. Intro

The Resource Description Framework (RDF) is a method to describe and exchange graph data.

An RDF data instance can be complemented with two additional artefacts that enable RDF schema validation:

  • Ontology: describes the concepts and resources of a data instance using RDF language and allows for the detection of logically impossible assertions in the model.
  • SHACL (Shapes Constraint Language): a W3C standard that describes and validates the contents of RDF graph instances.

Sometimes RDF data may lack datatype definitions. Under these circumstances, the W3C states that when no datatype is specified, the value is by default treated as a plain literal (effectively a string).

This default makes validation tricky: in that scenario, a standard validation will not produce the desired outcome. It is especially challenging when the data instance cannot be altered to comply with the W3C standard, whether because the standard itself forbids it (e.g., CIM) or because the source system cannot be changed to add the datatypes.

This is where Schema Validation comes in. It dynamically injects the datatypes into the data instance, inferring them by navigating the schema hierarchy, and then validates the instance against the desired ontologies and SHACL files.

1.1. Ontologies vs SHACLs

Ontologies provide inference rules, but they do not enforce constraints the way SHACL shapes do, which means the validation is only as good as the SHACL provided.

Taking this into account, the package always prioritizes the SHACL files for datatype definitions; however, when the user selects it, these datatypes are inferred from a cross-check between the ontologies and the SHACL files.

This is why the package has two hunters: one for the ontologies and another for the SHACL files. When their results conflict, the ones produced by the SHACL hunter take precedence.
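The precedence rule amounts to a dict merge in which the SHACL hunter's entries overwrite the ontology hunter's. A minimal sketch (the function and property names are illustrative, not the package's API):

```python
def reconcile_datatypes(ont_dtypes: dict, shacl_dtypes: dict) -> dict:
    """Merge both hunters' results; on conflicts, SHACL entries win."""
    return {**ont_dtypes, **shacl_dtypes}

# The ontology hunter saw cim:voltage as a string; the SHACL hunter
# disagrees, so its xsd:float wins. cim:name has no conflict.
ont_dtypes = {"cim:voltage": "xsd:string", "cim:name": "xsd:string"}
shacl_dtypes = {"cim:voltage": "xsd:float"}

merged = reconcile_datatypes(ont_dtypes, shacl_dtypes)
print(merged)  # {'cim:voltage': 'xsd:float', 'cim:name': 'xsd:string'}
```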

2. How to Use - high level

There are four ways to use the Schema Validation package: (1) default queries, (2) a custom query, (3) a list of custom queries executed sequentially, and (4) queries shipped with the utilities module.

Option 1 - Default queries

The package has a set of default queries used to capture and extract the datatypes that will be leveraged if no query is provided by the user.

from rdflib import Graph

from dsi_schema_assurance import SchemaCertifier

data_graph = Graph()
data_graph.parse(data_path, format='xml')

shacl_graph = Graph()
shacl_graph.parse(shacl_path, format='turtle')

ont_graph = Graph()
ont_graph.parse(ont_path, format='xml')

validation_result = SchemaCertifier(
    data_graph=data_graph,
    shacl_graph=shacl_graph,
    ont_graph=ont_graph,
    inference_type="both"
).run()

Option 2 - Custom query

from rdflib import Graph

from dsi_schema_assurance import SchemaCertifier

data_graph = Graph()
data_graph.parse(data_path, format='xml')

shacl_graph = Graph()
shacl_graph.parse(shacl_path, format='turtle')

ont_graph = Graph()
ont_graph.parse(ont_path, format='xml')

ont_query = """
[CUSTOM QUERY]
"""

validation_result = SchemaCertifier(
    data_graph=data_graph,
    shacl_graph=shacl_graph,
    ont_graph=ont_graph,
    ont_query=ont_query,
    inference_type="both"
).run()

Option 3 - List of custom queries

from rdflib import Graph

from dsi_schema_assurance import SchemaCertifier

data_graph = Graph()
data_graph.parse(data_path, format='xml')

shacl_graph = Graph()
shacl_graph.parse(shacl_path, format='turtle')

ont_graph = Graph()
ont_graph.parse(ont_path, format='xml')

ont_query = [
    """
    [CUSTOM QUERY 1]
    """,
    """
    [CUSTOM QUERY 2]
    """,
]

validation_result = SchemaCertifier(
    data_graph=data_graph,
    shacl_graph=shacl_graph,
    ont_graph=ont_graph,
    ont_query=ont_query,
    inference_type="both"
).run()

Option 4 - Use queries from the utilities

from rdflib import Graph

from dsi_schema_assurance import SchemaCertifier

from dsi_schema_assurance.utils import default_dtypes_query
from dsi_schema_assurance.utils import default_primitive_query

data_graph = Graph()
data_graph.parse(data_path, format='xml')

shacl_graph = Graph()
shacl_graph.parse(shacl_path, format='turtle')

ont_graph = Graph()
ont_graph.parse(ont_path, format='xml')

validation_result = SchemaCertifier(
    data_graph=data_graph,
    shacl_graph=shacl_graph,
    ont_graph=ont_graph,
    ont_query=default_dtypes_query,
    ont_primitive_query=default_primitive_query,
    inference_type="both"
).run()

If you're using the CIM ontology, you can use the following utilities to get the default queries:

from dsi_schema_assurance.utils import get_cim_datatype_query
from dsi_schema_assurance.utils import get_cim_primitive_datatype_query

For further detail on these, check the utilities folder in the repository.

Notes

a. The queries in the list are executed sequentially - if any query returns an error, the whole process is aborted;

b. The queries must always return two fields, property and datatype, to be considered valid; otherwise the Schema Validation tool will raise multiple errors, since its modus operandi relies on that assumption.

Example of a valid custom query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cims: <http://iec.ch/TC57/1999/rdf-schema-extensions-19990926#>

SELECT DISTINCT ?property ?datatype ?stereotype
WHERE {
    ?property cims:dataType ?datatype .
    OPTIONAL {
        ?datatype cims:stereotype ?stereotype .
    }
}

Example of an invalid custom query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cims: <http://iec.ch/TC57/1999/rdf-schema-extensions-19990926#>

SELECT DISTINCT ?field ?dtype ?stereotype
WHERE {
    ?field cims:dataType ?dtype .
    OPTIONAL {
        ?dtype cims:stereotype ?stereotype .
    }
}
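A lightweight guard for that naming convention could look like the following sketch (not part of the package's API):

```python
import re

REQUIRED_VARS = {"property", "datatype"}

def has_required_fields(query: str) -> bool:
    """Check that the SELECT clause projects ?property and ?datatype."""
    match = re.search(
        r"SELECT\s+(?:DISTINCT\s+)?(.+?)\s*WHERE",
        query,
        re.IGNORECASE | re.DOTALL,
    )
    if not match:
        return False
    projected = set(re.findall(r"\?(\w+)", match.group(1)))
    return REQUIRED_VARS.issubset(projected)

valid = "SELECT DISTINCT ?property ?datatype ?stereotype WHERE { ?p ?o ?s }"
invalid = "SELECT DISTINCT ?field ?dtype ?stereotype WHERE { ?p ?o ?s }"
print(has_required_fields(valid))    # True
print(has_required_fields(invalid))  # False
```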

For more information on how this validation can be used, check the demo folder. There you'll find a notebook that showcases the validation of CIM and other types of data used as a proof of concept.

Note: some uncertainty remains around how all the modules will be packaged together, which means the way to use Schema Validation might change in the future.

3. How it works under the hood

The solution is divided into three main parts:

  • The Datatypes Hunter: extracts the datatypes from ontologies or SHACLs.
  • The Datatypes Injector: injects the datatypes into the RDF data instance.
  • The Validator: validates the RDF data instance against the ontologies and SHACL files.

Let's dive into each one of them.

3.1. Datatypes Hunter

As stated before, the Datatypes Hunter is the module responsible for the following:

  • extracting the datatypes by parsing either an ontology or a SHACL file;
  • performing that parsing via SPARQL queries;
  • navigating a nested hierarchy with those queries to resolve each datatype.

Additionally, it's important to note one thing about its implementation: there is one hunter fully dedicated to ontologies and another for SHACL files.

This is because ontologies and SHACL files have different structures and therefore require different strategies to extract the datatypes.

3.2. Datatypes Injector

A single function that injects the datatypes, collected by the Datatypes Hunter, into the RDF data instance.

3.3. Validator

Last but not least, the validator script contains the SchemaCertifier class, which is responsible for the validation of the data.

Going back to the hunters mentioned earlier, this module is also responsible for reconciling their outcomes (when the user provides both ontologies and SHACL files).

By default, the SHACL files are more authoritative when it comes to datatype definitions; therefore, the datatypes obtained from them dictate the final datatypes.

This module leverages the pyshacl library to perform the validation.

The user can choose which specifications (ontologies, SHACL files, or both) are leveraged in the validation process.


A1. Tests and Coverage

To run the tests and coverage, use the following command:

 coverage run -m unittest discover tests/

The current coverage report is the following:

Name                                         Stmts   Miss  Cover
----------------------------------------------------------------
dsi_schema_assurance/detectors/ontology.py      38      0   100%
dsi_schema_assurance/detectors/shacl.py         12      0   100%
dsi_schema_assurance/injector.py                17      0   100%
dsi_schema_assurance/utils/ddl/cim.py            2      0   100%
dsi_schema_assurance/utils/ddl/default.py        3      0   100%
dsi_schema_assurance/utils/ddl/dl.py             1      0   100%
dsi_schema_assurance/utils/pandas.py            25      0   100%
dsi_schema_assurance/validator.py              148      1    99%
----------------------------------------------------------------
TOTAL                                          246      1    99%
