
PyTripalSerializer

Documentation Status build & test

Serialize Tripal's JSON-LD API into RDF format

This package implements a recursive algorithm to parse the JSON-LD API of a Tripal genomic database web service and serialize the encountered terms into an RDF document. Output is saved as a Turtle file (.ttl).

Motivation

This work is a byproduct of a data integration project for multiomics data at the MPI for Evolutionary Biology. Among various other data sources, we run an instance of the Tripal genomic database website engine. This service provides a JSON-LD API, i.e., all data in the underlying relational database is accessible through appropriate HTTP GET requests against that API. So far so good. Now, in our project, we are working on integrating data based on Linked Data technology; in particular, all data sources should be accessible via (federated) SPARQL queries. Hence, the task is to convert the JSON-LD API into a SPARQL endpoint.

The challenge here is that the JSON-LD API only provides one document at a time. Querying a single document with, e.g., the arq utility (part of the Apache Jena package) is no problem. The problem starts when one then attempts to run queries against other JSON-LD documents referenced in the first document as object URIs. These object URIs are not part of the current document (graph); instead, each points to a separate graph. SPARQL in its current implementation does not support dynamic generation of graph URIs from, e.g., object URIs. Hence the need for code that recursively parses a JSON-LD document including all referenced documents.

Of course, this is a generic problem. This package implements a solution targeted at Tripal JSON-LD APIs, but with minimal changes it should be adaptable to other JSON-LD APIs.
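The core idea of the recursive traversal can be sketched as follows. This is a simplified illustration, not the package's actual code: the toy in-memory document store stands in for HTTP GET requests against the Tripal API, and all names are made up for the example.

```python
# Toy store of JSON-LD documents keyed by URI. In the real service,
# each lookup would be an HTTP GET against the Tripal web-services API.
DOCS = {
    "ex:root":  {"@id": "ex:root", "member": {"@id": "ex:child"}},
    "ex:child": {"@id": "ex:child", "label": "leaf document"},
}

def crawl(uri, seen=None):
    """Recursively collect a document's URI and those of all documents
    it references via object URIs, skipping any URI already visited."""
    seen = seen if seen is not None else set()
    if uri in seen or uri not in DOCS:
        return seen
    seen.add(uri)
    for value in DOCS[uri].values():
        # An object URI shows up as a nested node with an "@id" key;
        # recurse into it so its graph gets pulled in as well.
        if isinstance(value, dict) and "@id" in value:
            crawl(value["@id"], seen)
    return seen

print(sorted(crawl("ex:root")))  # → ['ex:child', 'ex:root']
```

The `seen` set is what keeps the recursion from looping forever when documents reference each other cyclically, which is common in Linked Data.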

Installation

PyPI Releases

This package is released via the Python Package Index (PyPI). To install it, run

$ pip install pytripalserializer

Github development snapshot

To install the latest development snapshot from GitHub, clone this repository:

git clone https://github.com/mpievolbio-scicomp/PyTripalSerializer

Navigate into the cloned directory and run a local pip install:

cd PyTripalSerializer
pip install [-e] .

The optional flag -e instructs pip to install symlinks to the source files instead of copying them; this is recommended for developers.

Usage

The simplest way to use the package is via the command line interface. The following example should be instructive enough to get started:

$ cd PyTripalSerializer
$ cd src
$ ./tripser http://pflu.evolbio.mpg.de/web-services/content/v0.1/CDS/11846 -o cds11846.ttl

Running this command should produce the RDF Turtle file cds11846.ttl in the src/ directory. cds11846.ttl contains only 42 triples.

Be aware that running the command on a top-level URL such as http://pflu.evolbio.mpg.de/web-services/content/v0.1/ would parse the entire tree of documents, which results in a graph of ~2 million triples and takes roughly 14 hours to complete on a reasonably well-equipped workstation with 48 CPUs.

Testing

Run the test suite with

pytest tests

Documentation

Click the documentation badge at the top of this README to access the online manual.

