SPARQL Data Quality Monitoring Tool

This software provides tools to monitor the quality of data in OpenCitations Meta and OpenCitations Index by querying their SPARQL endpoints. It includes two monitoring classes, MetaMonitor and IndexMonitor, which run a series of tests on the triplestores and generate reports on data quality issues.

The reports are generated in JSON format, which is then converted to HTML for easier visualization.

Overview

The tool uses SPARQL queries to check for potential issues in the data stored in the specified SPARQL endpoints:

MetaMonitor is used for the OpenCitations Meta endpoint.
IndexMonitor is used for the OpenCitations Index endpoint.

Each monitor runs a series of pre-configured tests defined in the related configuration file (in JSON format). The tests produce a report (JSON file) that details which issues are detected, including metadata such as runtime, whether the tests passed or failed, and any errors encountered during the process. The JSON report is then converted into HTML to be read more easily.

Two classes are responsible for interrogating the triplestores: the MetaMonitor class for OpenCitations Meta and the IndexMonitor class for OpenCitations Index. Both are found inside the data_monitor module. Upon instantiation, both require the path to the appropriate configuration file and the base path for the output files. In both classes, the run_tests() method queries the endpoint specified in the related config file and produces the JSON output.
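As a rough illustration of what interrogating an endpoint with a single ASK test involves, the sketch below builds a SPARQL 1.1 Protocol POST request with the standard library and interprets the JSON response. It is not the code behind run_tests(). The helper names (`build_ask_request`, `interpret_ask_response`, `run_ask_test`) are hypothetical, and the rule that a test passes when ASK returns false is an assumption made for this example.

```python
import json
import urllib.parse
import urllib.request


def build_ask_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol POST request (hypothetical helper)."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def interpret_ask_response(raw: bytes) -> bool:
    """Return True if the test passed.

    Assumption: the ASK query describes an issue, so a true answer
    means the issue exists and the test fails.
    """
    issue_found = json.loads(raw)["boolean"]
    return not issue_found


def run_ask_test(endpoint: str, query: str, timeout: float = 60.0) -> bool:
    """Send the query over the network and report whether the test passed."""
    request = build_ask_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return interpret_ask_response(response.read())
```

Only `run_ask_test` touches the network; the other two functions can be exercised offline, which keeps the request-building and result-parsing logic easy to check.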

The JSON output can be converted into an HTML page by using the generate_html() method of the ReportVisualiser class, inside the html_vis module.

Installation

Clone the repository:

git clone https://github.com/opencitations/oc_monitor.git

Install dependencies:

The project's dependencies and virtual environment are managed with Poetry. If you already use Poetry and have it installed on your machine, you can create the virtual environment by running:
```
poetry install
```
and then, to activate it:
```
poetry shell
```
If you're not using Poetry, you can install the required Python libraries in your preferred environment with pip and the requirements.txt file:
```
pip install -r requirements.txt
```
Ensure proper configuration files:

Make sure you have the necessary configuration files (e.g., meta_monitor_config.json and index_monitor_config.json) in the project folder. See the section on Configuration Files for details. The configuration files provided in this repository should work out of the box.

Usage

To run the process with the default configuration:

cd oc_monitor
python -m main

Configuration Files

The configuration files for both MetaMonitor and IndexMonitor are in JSON format and contain details about the endpoint to be queried and the tests to run. The endpoint field stores the URL of the endpoint to interrogate. The fields for each test include:

label: A short name for the tested issue.
description: A brief description of the issue.
query: The SPARQL query used to perform the check.
to_run: A boolean flag (true or false) indicating whether to run this specific test.

Example configuration (custom_meta_monitor_config.json):

{
    "endpoint": "https://k8s.opencitations.net/meta/sparql",
    "tests": [
        {
            "label": "duplicate_br",
            "to_run": true,
            "description": "A single value for a given external ID scheme (e.g. DOI value) is associated with more than one BR.",
            "query": "PREFIX datacite: <http://purl.org/spar/datacite/>\nPREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>\nPREFIX fabio: <http://purl.org/spar/fabio/>\n\nASK {\n    ?br1 datacite:hasIdentifier/literal:hasLiteralValue ?lit ;\n    a fabio:Expression .\n    ?br2 datacite:hasIdentifier/literal:hasLiteralValue ?lit ;\n    a fabio:Expression .\n    FILTER(?br1 != ?br2)\n}"
        }
    ]
}
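Before pointing a monitor at a custom configuration file, it can help to check that the file has the fields described above. The following is a minimal sketch of such a check, not part of the package: `validate_config` and `enabled_tests` are hypothetical helpers, and the second test entry (`skipped_check`) is invented purely to show the effect of `to_run`.

```python
import json

REQUIRED_TEST_FIELDS = {"label", "description", "query", "to_run"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a monitor configuration (empty if none)."""
    problems = []
    if not isinstance(config.get("endpoint"), str):
        problems.append("missing or non-string 'endpoint'")
    tests = config.get("tests")
    if not isinstance(tests, list):
        problems.append("missing or non-list 'tests'")
        return problems
    for position, test in enumerate(tests):
        missing = REQUIRED_TEST_FIELDS - set(test)
        if missing:
            problems.append(f"test {position}: missing {sorted(missing)}")
        elif not isinstance(test["to_run"], bool):
            problems.append(f"test {position}: 'to_run' must be true or false")
    return problems


def enabled_tests(config: dict) -> list[dict]:
    """Keep only the tests whose to_run flag is true."""
    return [test for test in config["tests"] if test["to_run"]]


# Shortened stand-in for a real configuration file; "skipped_check" is made up.
example = json.loads("""
{
    "endpoint": "https://k8s.opencitations.net/meta/sparql",
    "tests": [
        {"label": "duplicate_br", "to_run": true,
         "description": "Example entry", "query": "ASK { ?s ?p ?o }"},
        {"label": "skipped_check", "to_run": false,
         "description": "Example entry", "query": "ASK { ?s ?p ?o }"}
    ]
}
""")
print(validate_config(example))
print([test["label"] for test in enabled_tests(example)])
```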

Command-Line Arguments

The script allows users to further customise the default behaviour of the software via command-line arguments. Here are the available options:

| Argument | Description | Default Value |
|---|---|---|
| `--meta_config` | Filepath for the MetaMonitor configuration file. | `meta_monitor_config.json` |
| `--index_config` | Filepath for the IndexMonitor configuration file. | `index_monitor_config.json` |
| `--run` | Specify which monitor to run: `meta`, `index`, or `both`. | `both` |
| `--output_base_path` | Base folder for reports output. The folder structure follows `monitor_results/<meta_reports\|index_reports>/<YYYYMMDD>/`. | `monitor_results` |
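For readers who want to see how these options fit together, here is a sketch of an equivalent `argparse` setup. It mirrors the option names and defaults listed above but is an assumption about how such a parser could be written, not the project's actual main module; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Run the OpenCitations monitors.")
    parser.add_argument(
        "--meta_config",
        default="meta_monitor_config.json",
        help="Filepath for the MetaMonitor configuration file.",
    )
    parser.add_argument(
        "--index_config",
        default="index_monitor_config.json",
        help="Filepath for the IndexMonitor configuration file.",
    )
    parser.add_argument(
        "--run",
        choices=["meta", "index", "both"],
        default="both",
        help="Which monitor to run.",
    )
    parser.add_argument(
        "--output_base_path",
        default="monitor_results",
        help="Base folder for reports output.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["--run", "meta", "--meta_config", "my_meta_config.json"])
print(args.run, args.meta_config, args.output_base_path)
```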

Examples

Run Both MetaMonitor and IndexMonitor (Default)

To run both monitors using default configurations and output paths:

cd oc_monitor
python -m main

This will generate reports in:

monitor_results/meta_reports/YYYYMMDD/
monitor_results/index_reports/YYYYMMDD/

Run Only MetaMonitor

To run only the MetaMonitor and specify a custom configuration file:

cd oc_monitor
python -m main --run meta --meta_config my_meta_config.json

Custom Output Path

To run both monitors but specify a custom output base path:

cd oc_monitor
python -m main --output_base_path /my/custom/path

The reports will be saved in:

/my/custom/path/meta_reports/YYYYMMDD/
/my/custom/path/index_reports/YYYYMMDD/

Run Only IndexMonitor with Custom Paths

To run only the IndexMonitor and specify both the configuration file and a custom output path:

cd oc_monitor
python -m main --run index --index_config custom_index_config.json --output_base_path custom_reports

The reports will be saved in custom_reports/index_reports/YYYYMMDD/.

Output Structure

The output reports are stored in a folder structure that follows this pattern:

monitor_results/
  ├── meta_reports/
  │    └── YYYYMMDD/
  │         ├── output_meta_monitor_YYYYMMDD.json
  │         └── meta_monitor_vis_YYYYMMDD.html
  └── index_reports/
       └── YYYYMMDD/
            ├── output_index_monitor_YYYYMMDD.json
            └── index_monitor_vis_YYYYMMDD.html

The JSON output file stores the test results along with details on the execution process (associated configuration file, date and time of the execution, runtime, raised errors, etc.). Each test result in the output file is associated with the label and description of the issue and the SPARQL query that was run for the test.

Example JSON output (output_meta_monitor_20241020.json):

{
    "endpoint": "https://k8s.opencitations.net/meta/sparql",
    "collection": "OpenCitations Meta",
    "datetime": "20/10/2024, 17:29:10",
    "running_time": 1.0028636455535889,
    "config_fp": "custom_meta_monitor_config.json",
    "monitoring_results": [
        {
            "label": "duplicate_br",
            "description": "A single value for a given external ID scheme (e.g. DOI value) is associated with more than one BR.",
            "query": "PREFIX datacite: <http://purl.org/spar/datacite/>\nPREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>\nPREFIX fabio: <http://purl.org/spar/fabio/>\n\nASK {\n    ?br1 datacite:hasIdentifier/literal:hasLiteralValue ?lit ;\n    a fabio:Expression .\n    ?br2 datacite:hasIdentifier/literal:hasLiteralValue ?lit ;\n    a fabio:Expression .\n    FILTER(?br1 != ?br2)\n}",
            "run": {
                "got_result": true,
                "running_time": 1.0028636455535889,
                "error": null
            },
            "passed": false
        }
    ]
}
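A report with this shape is easy to post-process. The sketch below, which assumes only the field names visible in the example above, lists the tests that failed and those that raised an error; `summarise_report` is a hypothetical helper and the inline report is a trimmed copy of the example.

```python
def summarise_report(report: dict) -> dict:
    """Collect the labels of failed and errored tests from a report dict."""
    results = report["monitoring_results"]
    errored = [r["label"] for r in results if r["run"]["error"] is not None]
    failed = [
        r["label"]
        for r in results
        if r["run"]["error"] is None and not r.get("passed")
    ]
    return {
        "collection": report["collection"],
        "total": len(results),
        "failed": failed,
        "errored": errored,
    }


# Trimmed copy of the example report (description and query omitted).
report = {
    "collection": "OpenCitations Meta",
    "monitoring_results": [
        {
            "label": "duplicate_br",
            "run": {"got_result": True, "running_time": 1.0, "error": None},
            "passed": False,
        }
    ],
}
print(summarise_report(report))
```

To use it on a real file, load the report with `json.load` first and pass the resulting dict to the function.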

The above JSON report is then converted into an HTML document (although some information is left out, e.g. the SPARQL query for each test) and stored in the same directory.
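To give an idea of what such a conversion involves, here is a minimal sketch that renders a report dict as an HTML table and, like the real report, leaves out the SPARQL query. It is not the ReportVisualiser implementation; `render_report` and the page layout are assumptions made for this example.

```python
import html


def render_report(report: dict) -> str:
    """Render a minimal HTML table from a report dict (illustrative only)."""
    rows = []
    for result in report["monitoring_results"]:
        status = "passed" if result["passed"] else "failed"
        rows.append(
            "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                html.escape(result["label"]),
                html.escape(result["description"]),
                status,
            )
        )
    title = html.escape(report["collection"])
    return (
        f"<html><body><h1>{title}</h1>"
        "<table><tr><th>Test</th><th>Description</th><th>Status</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )
```

Escaping every value with `html.escape` matters here because test descriptions are free text and may contain characters such as `<` or `&`.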

Filename Details

JSON report: output_<monitor_type>_monitor_YYYYMMDD.json
HTML report: <monitor_type>_monitor_vis_YYYYMMDD.html

Here <monitor_type> is either meta or index.

If the script is run multiple times on the same day, the filenames of the files created after the first one will be versioned (e.g., output_meta_monitor_YYYYMMDD_1.json, meta_monitor_vis_YYYYMMDD_1.html, etc.).
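The naming and versioning scheme can be sketched as follows. This is one way to reproduce the documented pattern, not the package's own code: `report_paths` is a hypothetical helper, and the assumption is that the suffix counts up from `_1` until an unused pair of filenames is found.

```python
import tempfile
from datetime import date
from pathlib import Path


def report_paths(base: str, monitor_type: str, day: date) -> tuple[Path, Path]:
    """Return the first unused (JSON, HTML) report paths for the given day."""
    stamp = day.strftime("%Y%m%d")
    folder = Path(base) / f"{monitor_type}_reports" / stamp
    version = 0
    while True:
        # The first run of the day has no suffix; later runs get _1, _2, ...
        suffix = f"_{version}" if version else ""
        json_path = folder / f"output_{monitor_type}_monitor_{stamp}{suffix}.json"
        html_path = folder / f"{monitor_type}_monitor_vis_{stamp}{suffix}.html"
        if not json_path.exists() and not html_path.exists():
            return json_path, html_path
        version += 1


# Simulate two runs on the same day inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first_json, _ = report_paths(tmp, "meta", date(2024, 10, 20))
    first_json.parent.mkdir(parents=True)
    first_json.touch()
    second_json, second_html = report_paths(tmp, "meta", date(2024, 10, 20))
    print(first_json.name)
    print(second_json.name)
    print(second_html.name)
```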

License

This project is licensed under the ISC License. See the LICENSE.md file for details.

Release history

0.1.2: Jun 26, 2025
0.1.1: Apr 15, 2025
0.1.0: Apr 2, 2025 (this version)

