Skip to main content

A software to validate CSV documents storing citation data and bibliographic metadata according to the OpenCitations Data Model.

Project description

oc_validator

oc_validator is a Python (≥3.9) library to validate CSV documents storing citation data and bibliographic metadata. To be processed by the validator, the tables must be built as either CITS-CSV or META-CSV tables, defined in two specification documents[^1][^2].

[^1]: Massari, Arcangelo, and Ivan Heibi. 2022. ‘How to Structure Citations Data and Bibliographic Metadata in the OpenCitations Accepted Format’. https://doi.org/10.48550/arXiv.2206.03971.

[^2]: Massari, Arcangelo. 2022. ‘How to Produce Well-Formed CSV Files for OpenCitations’. https://doi.org/10.5281/zenodo.6597141.

Installation

The library can be installed from pip:

pip install oc_validator

Usage

The validation process can be executed from the CLI by running the following command:

python -m oc_validator.main -i <input csv file path> -o <output dir path> [-m] [-s]

Required Parameters

  • -i, --input: The path to the CSV file to validate.
  • -o, --output: The path to the directory where the output JSON file and .txt file will be stored.

Optional Parameters

  • -m, --use-meta: Enables the use of the OC Meta endpoint instead of external APIs to check if an ID exists (by checking if it is registered in OpenCitations Meta). If included, this option allows to fasten the whole process, since querying Meta is faster than querying external APIs, but results might not be the most up to date.
  • -s, --no-id-existence: Skips the check for ID existence altogether, ensuring that neither the Meta endpoint nor any external APIs are used during validation. This allows for a much shorter execution time, but does not make sure that all the submitted IDs actually refer to real-world entities.

Example Usage from CLI

To validate a CSV file and output the results to a specified directory (with optional parameters set to default values, i.e. checking for the existence of IDs via querying external APIs):

python -m oc_validator.main -i path/to/input.csv -o path/to/output_dir

To use OC Meta endpoint instead of external APIs to verify the existence of the IDs:

python -m oc_validator.main -i path/to/input.csv -o path/to/output_dir -m

To skip all ID existence verification:

python -m oc_validator.main -i path/to/input.csv -o path/to/output_dir -s

Programmatic Usage

An object of the Validator class is instantiated, passing as parameters the path to the input document to validate and the path to the directory where to store the output. By calling the validate() method on the instance of Validator, the validation process gets executed.

The process automatically detects which of the two tables has been passed as input (on condition that the input CSV document's header is formatted correctly for at least one of them). During the process, the whole document is always processed: if the document is invalid or contains anomalies, the errors/warnings are reported in detail in a JSON file and summarized in a .txt file, which will be automatically created in the output directory. validate also returns a list of dictionaries corresponding to the JSON validation report (empty if the document is valid).

from oc_validator.main import Validator

# Basic validation
v = Validator('path/to/table.csv', 'output/directory')
v.validate()

# Validation with Meta endpoint checking for ID existence
v = Validator('path/to/table.csv', 'output/directory', use_meta_endpoint=True)
v.validate()

# Validation skipping all ID existence checks
v = Validator('path/to/table.csv', 'output/directory', verify_id_existence=False)
v.validate()

Starting from version 0.3.3, it is possible to validate two tables at a time, one storing metadata and the other storing citations, in order to verify, besides all the other checks, that all the citations represented in a document have their metadata represented in the other document, and vice versa. This can be done by using the ClosureValidator class. The ClosureValidator class internally wraps two instances of Validator, one for metadata and one for citations, and requires to explicitly specify the table type for either document. Both the internal Validator instances can be separately customized by specifying the optional parameters for each of the two via the meta_kwargs and cits_kwargs arguments. ClosureValidator takes the following parameters:

  • meta_in: Path to the input CSV table storing metadata.
  • meta_out_dir: Directory for metadata validation results.
  • cits_in: Path to the input CSV table storing citations.
  • cits_out_dir: Directory for citation validation results.
  • strict_sequenciality: [deafaults to False] If True, checks the transitive closure if and only if all the other checks passed without detecting errors. With the default option (False), it is always checked that all the entities involved in citations have also their metadata represented in the other table, and vice versa, regardless of the presence of other errors in the tables.
  • meta_kwargs: (Optional) Dictionary of configuration options for the metadata table validator.
  • cits_kwargs: (Optional) Dictionary of configuration options for the citation table validator.

A usage example of how to validate metadata and citations with ClosureValidator is provided as follows:

from oc_validator.main import ClosureValidator

cv = ClosureValidator(
    meta_in='path/to/meta.csv', 
    meta_out_dir='path/to/meta_results',
    cits_in='path/to/cits.csv', 
    cits_out_dir='path/to/cits_results',
    meta_kwargs={'verify_id_existence': False},  # Skip ID existence checks for metadata
    cits_kwargs={'use_meta_endpoint': True}  # Use OC Meta before external APIs to verify the existence of PIDs
)

cv.validate() # validates the tables and saves output files in the specified (separate) directories

Output visualisation

oc_validator has an integrated tool for the interactive visualisation of the validation results, which helps users to locate the detected errors in the document in a more intuitive way, and facilitates human understanding of the underlying problems generating them. The documentation of the visualisation tool can be found in oc_validator/interface/README.md.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

oc_validator-0.3.8.tar.gz (30.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

oc_validator-0.3.8-py3-none-any.whl (38.0 kB view details)

Uploaded Python 3

File details

Details for the file oc_validator-0.3.8.tar.gz.

File metadata

  • Download URL: oc_validator-0.3.8.tar.gz
  • Upload date:
  • Size: 30.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10

File hashes

Hashes for oc_validator-0.3.8.tar.gz
Algorithm Hash digest
SHA256 406728d5a4d06998bcf20ab0f3e3ac5eacf1374fb029acc6b0531bf4795b065a
MD5 1f54fad53f93b64d437a5f62558ad8a9
BLAKE2b-256 c2a7c4fc139493a2a5f06c731b41fd3a6d1b80d09878fad49e07a96ff819d0d2

See more details on using hashes here.

File details

Details for the file oc_validator-0.3.8-py3-none-any.whl.

File metadata

  • Download URL: oc_validator-0.3.8-py3-none-any.whl
  • Upload date:
  • Size: 38.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.9 Windows/10

File hashes

Hashes for oc_validator-0.3.8-py3-none-any.whl
Algorithm Hash digest
SHA256 1fd27414a734fec399717932a557b8b0d9774250046926aae86abb331b0d103a
MD5 eead9b930bf9c00e1d535a63bcca7e82
BLAKE2b-256 a84059f1b16122398c69a93c9a92b8c8bb9142a2dd177027f3a92e86a1f8c8cb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page