
Package for schema assurance.


Schema Validation

Validate your RDF data against an ontology and SHACL files - even when the data instance lacks datatype definitions.


1. Intro

The Resource Description Framework (RDF) is a method to describe and exchange graph data.

An RDF data instance can be complemented with two additional artefacts that enable RDF schema validation:

  • Ontology: describes the concepts and resources of a data instance using RDF language and allows for the detection of logically impossible assertions in the model.
  • SHACL (Shapes Constraint Language): a W3C standard that describes and validates the contents of RDF graph instances.

Sometimes RDF data may lack datatype definitions. Under these circumstances, the W3C states that when no datatype is specified, the value is by default treated as a plain literal (effectively a string).

This default makes validation tricky: in that scenario, a standard validation will not produce the desired outcome. It is especially challenging when the data instance cannot be altered to comply with the W3C standard, whether because the standard itself forbids it (e.g., CIM) or because the source system cannot be changed to add the datatypes.

This is where Schema Validation comes in. It dynamically injects the datatypes into the data instance, inferring them by navigating the schema hierarchy, and then validates the instance against the desired ontologies and SHACL files.

1.1. Ontologies vs SHACLs

Ontologies provide inference rules, but they do not enforce constraints the way SHACL shapes do, which means the validation is only as good as the SHACL provided.

Taking this into account, the package always prioritizes the SHACL files for datatype definitions; however, when the user selects it, these datatypes are inferred from a cross-check between the ontologies and the SHACL files.

This is why the package has two hunters: one for the ontologies and another for the SHACL files. When their results conflict, the ones produced by the SHACL hunter take precedence.
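The precedence rule amounts to a dict merge in which the SHACL hunter's entries overwrite the ontology hunter's. A minimal sketch (the function and property names are illustrative, not the package's API):

```python
def reconcile_datatypes(ont_dtypes: dict, shacl_dtypes: dict) -> dict:
    """Merge both hunters' results; on conflicts, SHACL entries win."""
    return {**ont_dtypes, **shacl_dtypes}

# The ontology hunter saw cim:voltage as a string; the SHACL hunter
# disagrees, so its xsd:float wins. cim:name has no conflict.
ont_dtypes = {"cim:voltage": "xsd:string", "cim:name": "xsd:string"}
shacl_dtypes = {"cim:voltage": "xsd:float"}

merged = reconcile_datatypes(ont_dtypes, shacl_dtypes)
print(merged)  # {'cim:voltage': 'xsd:float', 'cim:name': 'xsd:string'}
```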

2. How to Use - high level

There are four ways to use the Schema Validation package: (1) default queries, (2) a custom query, (3) a list of custom queries executed sequentially, and (4) queries shipped with the utilities module.

Option 1 - Default queries

The package has a set of default queries used to capture and extract the datatypes that will be leveraged if no query is provided by the user.

from rdflib import Graph

from dsi_schema_assurance import SchemaCertifier

data_graph = Graph()
data_graph.parse(data_path, format='xml')

shacl_graph = Graph()
shacl_graph.parse(shacl_path, format='turtle')

ont_graph = Graph()
ont_graph.parse(ont_path, format='xml')

validation_result = SchemaCertifier(
    data_graph=data_graph,
    shacl_graph=shacl_graph,
    ont_graph=ont_graph,
    inference_type="both"
).run()

Option 2 - Custom query

from rdflib import Graph

from dsi_schema_assurance import SchemaCertifier

data_graph = Graph()
data_graph.parse(data_path, format='xml')

shacl_graph = Graph()
shacl_graph.parse(shacl_path, format='turtle')

ont_graph = Graph()
ont_graph.parse(ont_path, format='xml')

ont_query = """
[CUSTOM QUERY]
"""

validation_result = SchemaCertifier(
    data_graph=data_graph,
    shacl_graph=shacl_graph,
    ont_graph=ont_graph,
    ont_query=ont_query,
    inference_type="both"
).run()

Option 3 - List of custom queries

from rdflib import Graph

from dsi_schema_assurance import SchemaCertifier

data_graph = Graph()
data_graph.parse(data_path, format='xml')

shacl_graph = Graph()
shacl_graph.parse(shacl_path, format='turtle')

ont_graph = Graph()
ont_graph.parse(ont_path, format='xml')

ont_query = [
    """
    [CUSTOM QUERY 1]
    """,
    """
    [CUSTOM QUERY 2]
    """,
]

validation_result = SchemaCertifier(
    data_graph=data_graph,
    shacl_graph=shacl_graph,
    ont_graph=ont_graph,
    ont_query=ont_query,
    inference_type="both"
).run()

Option 4 - Use queries from the utilities

from rdflib import Graph

from dsi_schema_assurance import SchemaCertifier

from dsi_schema_assurance.utils import default_dtypes_query
from dsi_schema_assurance.utils import default_primitive_query

data_graph = Graph()
data_graph.parse(data_path, format='xml')

shacl_graph = Graph()
shacl_graph.parse(shacl_path, format='turtle')

ont_graph = Graph()
ont_graph.parse(ont_path, format='xml')

validation_result = SchemaCertifier(
    data_graph=data_graph,
    shacl_graph=shacl_graph,
    ont_graph=ont_graph,
    ont_query=default_dtypes_query,
    ont_primitive_query=default_primitive_query,
    inference_type="both"
).run()

If you're using the CIM ontology, you can use the following utilities to get the default queries:

from dsi_schema_assurance.utils import get_cim_datatype_query
from dsi_schema_assurance.utils import get_cim_primitive_datatype_query

For further detail on these, check the utilities folder in the repository.

Notes

a. The queries in the list are executed sequentially - if any query returns an error, the whole process is aborted;

b. The queries must always return two fields, property and datatype, to be considered valid; otherwise the Schema Validation tool will raise multiple errors, since its modus operandi relies on that assumption.

Example of a valid custom query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cims: <http://iec.ch/TC57/1999/rdf-schema-extensions-19990926#>

SELECT DISTINCT ?property ?datatype ?stereotype
WHERE {
    ?property cims:dataType ?datatype .
    OPTIONAL {
        ?datatype cims:stereotype ?stereotype .
    }
}

Example of an invalid custom query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cims: <http://iec.ch/TC57/1999/rdf-schema-extensions-19990926#>

SELECT DISTINCT ?field ?dtype ?stereotype
WHERE {
    ?field cims:dataType ?dtype .
    OPTIONAL {
        ?dtype cims:stereotype ?stereotype .
    }
}
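A lightweight guard for that naming convention could look like the following sketch (not part of the package's API):

```python
import re

REQUIRED_VARS = {"property", "datatype"}

def has_required_fields(query: str) -> bool:
    """Check that the SELECT clause projects ?property and ?datatype."""
    match = re.search(
        r"SELECT\s+(?:DISTINCT\s+)?(.+?)\s*WHERE",
        query,
        re.IGNORECASE | re.DOTALL,
    )
    if not match:
        return False
    projected = set(re.findall(r"\?(\w+)", match.group(1)))
    return REQUIRED_VARS.issubset(projected)

valid = "SELECT DISTINCT ?property ?datatype ?stereotype WHERE { ?p ?o ?s }"
invalid = "SELECT DISTINCT ?field ?dtype ?stereotype WHERE { ?p ?o ?s }"
print(has_required_fields(valid))    # True
print(has_required_fields(invalid))  # False
```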

For more information on how this validation can be used, check the demo folder. There you'll find a notebook that showcases the validation of CIM and other types of data used as a proof of concept.

Note: some uncertainty remains around how all the modules will be packaged together, which means the way to use Schema Validation might change in the future.

3. How it works under the hood

The solution is divided into three main parts:

  • The Datatypes Hunter: extracts the datatypes from ontologies or SHACLs.
  • The Datatypes Injector: injects the datatypes into the RDF data instance.
  • The Validator: validates the RDF data instance against the ontologies and SHACL files.

Let's dive into each one of them.

3.1. Datatypes Hunter

As stated before, the Datatypes Hunter is the module responsible for the following:

  • extracting the datatypes by parsing either an ontology or a SHACL file;
  • performing that parsing via SPARQL queries;
  • navigating a nested hierarchy with those queries to resolve each datatype.

Additionally, it's important to note one thing about its implementation: there is one hunter fully dedicated to ontologies and another for SHACL files.

This is because ontologies and SHACL files have different structures and therefore require different strategies to extract the datatypes.

3.2. Datatypes Injector

A single function that injects the datatypes, collected by the Datatypes Hunter, into the RDF data instance.

3.3. Validator

Last but not least, the validator script contains the SchemaCertifier class, which is responsible for the validation of the data.

Going back to the hunters mentioned earlier, this module is also responsible for reconciling their outcomes (when the user provides both ontologies and SHACL files).

By default, the SHACL files are more authoritative when it comes to datatype definitions; therefore, the datatypes obtained from them dictate the final datatypes.

This module leverages the pyshacl library to perform the validation.

The user can choose which specifications (ontologies, SHACL files, or both) are leveraged in the validation process.


A1. Tests and Coverage

To run the tests and coverage, use the following command:

 coverage run -m unittest discover tests/

The current coverage report is the following:

Name                                         Stmts   Miss  Cover
----------------------------------------------------------------
dsi_schema_assurance/detectors/ontology.py      38      0   100%
dsi_schema_assurance/detectors/shacl.py         12      0   100%
dsi_schema_assurance/injector.py                17      0   100%
dsi_schema_assurance/utils/ddl/cim.py            2      0   100%
dsi_schema_assurance/utils/ddl/default.py        3      0   100%
dsi_schema_assurance/utils/ddl/dl.py             1      0   100%
dsi_schema_assurance/utils/pandas.py            25      0   100%
dsi_schema_assurance/validator.py              148      1    99%
----------------------------------------------------------------
TOTAL                                          246      1    99%
