dtool lookup server plugin for querying dataset dependency graphs

Features

  • Use a dataset UUID to look up all datasets within the same dependency graph

  • Build views on dataset dependency trees based on arbitrary connecting keys

Introduction

dtool is a command line tool for packaging data and metadata into a dataset. A dtool dataset manages data and metadata without the need for a central database.

However, if one has to manage more than a hundred datasets it can be helpful to have the datasets’ metadata stored in a central server to enable one to quickly find datasets of interest.

The dtool-lookup-server provides a web API for registering datasets’ metadata and provides functionality to look up, list and search for datasets.

This plugin enables the dtool-lookup-server to directly provide all datasets within a specific dependency graph.

Installation

Install the dtool lookup server dependency graph plugin

$ pip install dtool-lookup-server-dependency-graph-plugin

Setup and configuration

Configure plugin behavior

With

export DTOOL_LOOKUP_SERVER_ENABLE_DEPENDENCY_VIEW=True

the underlying database will offer a view on the default collection. This view offers an on-the-fly-generated collection of undirected per-dataset adjacency lists in order to facilitate searching dataset dependency graphs in both directions. With

export DTOOL_LOOKUP_SERVER_FORCE_REBUILD_DEPENDENCY_VIEW=True

this view is reestablished at every query. This is required to apply changes to related options, such as the JSON-formatted list

export DTOOL_LOOKUP_SERVER_DEPENDENCY_KEYS='["readme.derived_from.uuid", "annotations.source_dataset_uuid"]'

which indicates at which keys to look for source dataset(s) by UUID. The example above illustrates the default. All keys are treated equivalently and nested fields are separated by the dot (.). The actual nesting hierarchy does not matter. This means, for example, that the following two structures are evaluated equivalently:

{'readme': {'derived_from': {'uuid':
    ['8ecd8e05-558a-48e2-b563-0c9ea273e71e',
     'faa44606-cb86-4877-b9ea-643a3777e021']}}}

{'readme': {'derived_from':
    [{'uuid': '8ecd8e05-558a-48e2-b563-0c9ea273e71e'},
     {'uuid': 'faa44606-cb86-4877-b9ea-643a3777e021'}]}}
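
To illustrate why both layouts are equivalent, here is a minimal Python sketch of resolving a dotted key while treating lists as transparent. The function name collect_values is hypothetical and this is not the plugin's actual implementation, which operates on the database side; it only mirrors the behavior described above.

```python
def collect_values(document, dotted_key):
    """Return all leaf values reachable via dotted_key, looking through lists.

    Hypothetical helper for illustration only.
    """
    nodes = [document]
    for part in dotted_key.split("."):
        next_nodes = []
        for node in nodes:
            # Lists are transparent: descend into each of their elements.
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and part in candidate:
                    next_nodes.append(candidate[part])
        nodes = next_nodes
    # Flatten one final list level so both layouts yield plain values.
    values = []
    for node in nodes:
        values.extend(node if isinstance(node, list) else [node])
    return values


list_of_uuids = {'readme': {'derived_from': {'uuid': [
    '8ecd8e05-558a-48e2-b563-0c9ea273e71e',
    'faa44606-cb86-4877-b9ea-643a3777e021']}}}

list_of_records = {'readme': {'derived_from': [
    {'uuid': '8ecd8e05-558a-48e2-b563-0c9ea273e71e'},
    {'uuid': 'faa44606-cb86-4877-b9ea-643a3777e021'}]}}

# Both layouts resolve to the same list of source dataset UUIDs.
same = (collect_values(list_of_uuids, 'readme.derived_from.uuid')
        == collect_values(list_of_records, 'readme.derived_from.uuid'))
```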

Setting

export DTOOL_LOOKUP_SERVER_DYNAMIC_DEPENDENCY_KEYS=True

will allow the client side to request graphs spanned by arbitrary dependency keys (see below). The related options

export DTOOL_LOOKUP_SERVER_MONGO_DEPENDENCY_VIEW_PREFIX=dep
export DTOOL_LOOKUP_SERVER_MONGO_DEPENDENCY_VIEW_BOOKKEEPING=dep_views
export DTOOL_LOOKUP_SERVER_MONGO_DEPENDENCY_VIEW_CACHE_SIZE=10

control internal behavior. See source code and examples below.

Note that the above exports containing JSON syntax are formatted for usage in bash. The enclosing single quotes must not be part of the actual variable value when environment variables are configured elsewhere.
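
The point about quoting can be checked with a short Python sketch that parses the variable as JSON. The function read_dependency_keys and the constant DEFAULT_KEYS are hypothetical names; how the plugin itself reads its configuration is not shown here and may differ.

```python
import json
import os

# Default taken from the example above; the constant name is an assumption.
DEFAULT_KEYS = ["readme.derived_from.uuid", "annotations.source_dataset_uuid"]


def read_dependency_keys(environ=os.environ):
    """Parse DTOOL_LOOKUP_SERVER_DEPENDENCY_KEYS as a JSON list of strings."""
    raw = environ.get("DTOOL_LOOKUP_SERVER_DEPENDENCY_KEYS")
    if raw is None:
        return list(DEFAULT_KEYS)
    # json.loads raises a ValueError subclass if stray quotes are included.
    keys = json.loads(raw)
    if not isinstance(keys, list) or not all(isinstance(k, str) for k in keys):
        raise ValueError("expected a JSON-formatted list of strings")
    return keys
```

A value such as ["readme.derived_from.uuid"] parses cleanly, whereas the same value wrapped in literal single quotes is not valid JSON and is rejected.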

The dtool lookup server API

The dtool lookup server makes use of the Authorization header to pass through the JSON web token for authorization. Below we create shell variables for the token and the header used in the curl commands

$ TOKEN=$(flask user token olssont)
$ HEADER="Authorization: Bearer $TOKEN"

Standard user usage

Looking up dependency graphs based on a dataset’s UUID

A dataset can be derived from one or several source datasets. This relationship is usually recorded either by machine-generated annotations attached via the Python API at dataset creation time, or manually by entering the UUIDs of parent datasets in arbitrary fields within the README.yml. If the server is configured appropriately, all datasets within the same dependency graph can be queried directly by UUID, e.g.

$ UUID=41a2e3e2-0c01-444f-bd7d-f9bb45512373
$ curl -H "$HEADER" http://localhost:5000/graph/lookup/$UUID

Looking up a dependency graph by UUID returns one hit per UUID. As it is possible for a dataset to be registered under more than one base URI, the query will yield one arbitrary hit in such a case.
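
Conceptually, such a lookup returns the connected component that contains the queried UUID in an undirected graph. The following Python sketch shows that idea in memory, assuming a plain mapping from each dataset UUID to the UUIDs of its source datasets. Both function names are hypothetical, and the server performs the equivalent work within the database rather than like this.

```python
from collections import defaultdict, deque


def build_undirected_adjacency(derived_from):
    """Turn {child: [parents]} records into undirected adjacency sets."""
    adjacency = defaultdict(set)
    for child, parents in derived_from.items():
        adjacency[child]  # make sure datasets without sources appear as well
        for parent in parents:
            # Store the edge in both directions so that the graph can be
            # searched from a source dataset as well as from a derived one.
            adjacency[child].add(parent)
            adjacency[parent].add(child)
    return adjacency


def dependency_graph(adjacency, start):
    """Breadth-first search for every UUID connected to start."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

Because the result is a set of UUIDs, a dataset registered under several base URIs still appears only once, in line with the per-UUID uniqueness described above.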

Looking up graphs spanned by arbitrary dependency keys

If DTOOL_LOOKUP_SERVER_DYNAMIC_DEPENDENCY_KEYS=True, then the client may ask the server to explore a graph spanned by dependency keys differing from the server-side defaults in DTOOL_LOOKUP_SERVER_DEPENDENCY_KEYS. This works as above, but via a POST request with a JSON-formatted list of the desired dependency keys attached

$ curl -H "$HEADER" -H "Content-Type: application/json"  \
    -X POST -d  \
    '["annotations.source_dataset_uuid","readme.derived_from.uuid"]'  \
    http://localhost:5000/graph/lookup/$UUID
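
The same two request shapes can be assembled from Python with the standard library only. The sketch below is a hedged illustration: the function name build_lookup_request is hypothetical, and only the route, the Authorization header and the JSON body follow the curl commands above.

```python
import json
import urllib.request


def build_lookup_request(base_url, uuid, token, dependency_keys=None):
    """Build a GET request, or a POST request if dependency keys are given."""
    url = f"{base_url.rstrip('/')}/graph/lookup/{uuid}"
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if dependency_keys is not None:
        # Attach the desired keys as a JSON-formatted list in the body.
        headers["Content-Type"] = "application/json"
        data = json.dumps(dependency_keys).encode("utf-8")
    method = "POST" if data is not None else "GET"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


request = build_lookup_request(
    "http://localhost:5000",
    "41a2e3e2-0c01-444f-bd7d-f9bb45512373",
    "placeholder-token",  # stands in for the output of `flask user token`
    dependency_keys=["annotations.source_dataset_uuid",
                     "readme.derived_from.uuid"],
)
```

Passing the request to urllib.request.urlopen would send it to a running server; the sketch stops before that step so it can be inspected without network access.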

If a view for this particular set of keys does not exist yet, the server will generate and cache it on-the-fly. This can be observed in the mongo shell

$ mongo

> show dbs
admin       0.000GB
config      0.000GB
dtool_info  0.020GB
local       0.000GB

> use dtool_info
switched to db dtool_info

> show collections
datasets
dep:2020-10-05T01:22:39.581592
dep:2020-10-06T21:45:00.525410
dep:2020-10-06T21:45:28.495903
dep_views
dependencies
system.views

Here, all dep-prefixed collections are dependency views for distinct sets of keys. The bookkeeping collection dep_views holds records of all mappings between dependency views and key sets, together with the time of the latest access

> db.dep_views.find()
{ "_id" : ObjectId("5f7a755faea9fcf239f68dba"), "name" : "dep:2020-10-05T01:22:39.581592", "keys" : [ "annotations.source_dataset_uuid", "readme.derived_from.uuid" ], "accessed_on" : ISODate("2020-10-07T12:24:32.724Z") }
{ "_id" : ObjectId("5f7ce55caea9fcf239f68dbb"), "name" : "dep:2020-10-06T21:45:00.525410", "keys" : [ "readme.derived_from.uuid" ], "accessed_on" : ISODate("2020-10-06T21:45:00.538Z") }
{ "_id" : ObjectId("5f7ce578aea9fcf239f68dbc"), "name" : "dep:2020-10-06T21:45:28.495903", "keys" : [ "bla" ], "accessed_on" : ISODate("2020-10-06T21:45:28.498Z") }

and querying with a specific set of keys for the first time

$ curl -H "$HEADER" -H "Content-Type: application/json"  \
    -X POST -d  \
    '["another.possibly_nested.dependency_key"]'  \
    http://localhost:5000/graph/lookup/$UUID

will result in an additional view named uniquely by the current UTC time:

> show collections
datasets
dep:2020-10-05T01:22:39.581592
dep:2020-10-06T21:45:00.525410
dep:2020-10-06T21:45:28.495903
dep:2020-10-07T17:03:58.831223
dep_views
dependencies
system.views

and a corresponding entry within dep_views

> db.dep_views.find()
{ "_id" : ObjectId("5f7a755faea9fcf239f68dba"), "name" : "dep:2020-10-05T01:22:39.581592", "keys" : [ "annotations.source_dataset_uuid", "readme.derived_from.uuid" ], "accessed_on" : ISODate("2020-10-07T16:59:12.467Z") }
{ "_id" : ObjectId("5f7ce55caea9fcf239f68dbb"), "name" : "dep:2020-10-06T21:45:00.525410", "keys" : [ "readme.derived_from.uuid" ], "accessed_on" : ISODate("2020-10-06T21:45:00.538Z") }
{ "_id" : ObjectId("5f7ce578aea9fcf239f68dbc"), "name" : "dep:2020-10-06T21:45:28.495903", "keys" : [ "bla" ], "accessed_on" : ISODate("2020-10-06T21:45:28.498Z") }
{ "_id" : ObjectId("5f7df4feaea9fcf239f68dbd"), "name" : "dep:2020-10-07T17:03:58.831223", "keys" : [ "another.possibly_nested.dependency_key" ], "accessed_on" : ISODate("2020-10-07T17:03:58.833Z") }

If the total number of such cached views exceeds the allowed maximum value configured in DTOOL_LOOKUP_SERVER_MONGO_DEPENDENCY_VIEW_CACHE_SIZE, then the view accessed least recently is dropped.
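
The bookkeeping and eviction behavior described above amounts to a least-recently-used cache. The following Python sketch models it on a plain list of records instead of MongoDB collections. The function name touch_view is hypothetical, and comparing key sets independent of their order is an assumption, not something the examples above confirm.

```python
from datetime import datetime, timezone


def touch_view(bookkeeping, keys, cache_size, now=None):
    """Return (view name, names of dropped views) for a requested key set.

    bookkeeping is a list of dicts with "name", "keys" and "accessed_on",
    modelled after the records in dep_views.
    """
    now = now or datetime.now(timezone.utc)
    wanted = sorted(keys)  # assumption: the order of keys does not matter
    for record in bookkeeping:
        if sorted(record["keys"]) == wanted:
            record["accessed_on"] = now  # refresh the latest access time
            return record["name"], []
    # No view for this key set yet: name a new one by the current UTC time.
    name = "dep:" + now.replace(tzinfo=None).isoformat()
    bookkeeping.append({"name": name, "keys": wanted, "accessed_on": now})
    dropped = []
    while len(bookkeeping) > cache_size:
        # Drop the view that was accessed least recently.
        oldest = min(bookkeeping, key=lambda record: record["accessed_on"])
        bookkeeping.remove(oldest)
        dropped.append(oldest["name"])
    return name, dropped
```

With a cache size of 10, as in the example configuration, an eleventh distinct key set would evict whichever view has the oldest accessed_on entry.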

