dtool CLI utilities for working with per-item metadata

Project description

Installation

pip install dtool-overlay

Example usage

Get a dataset to play with:

LOCAL_DS_URI=$(dtool cp -q http://bit.ly/Ecoli-ref-genome .)

Show the existing overlays:

$ dtool overlays show $LOCAL_DS_URI
identifiers,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,reference.2.bt2

The output above shows that there are no overlays on this dataset. (The “identifiers” and “relpaths” columns are there for bookkeeping.)

Create an “is_fasta” boolean overlay template by using a glob pattern:

$ dtool overlays template glob $LOCAL_DS_URI is_fasta '*.fasta' > is_fasta.csv
$ cat is_fasta.csv
identifiers,is_fasta,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2
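
To make the idea behind the glob template concrete, here is a minimal Python sketch that rebuilds the same CSV from the identifiers and relpaths listed above. It assumes the glob pattern behaves like the standard library's fnmatch matching; the function name glob_overlay_csv is hypothetical and this is not dtool-overlay's own code.

```python
import csv
import fnmatch
import io

# Identifiers and relpaths copied from the `dtool overlays show` output above.
ITEMS = [
    ("23ebd7cd21a905d5f255919ca1d0491901cb8718", "reference.4.bt2"),
    ("37e2d68bb38271036d96b6979d24666e0d4fd814", "reference.rev.1.bt2"),
    ("41fb9ae5d4f6c37226ff324c701b84bc3110709e", "reference.1.bt2"),
    ("828ebf503926b7c1b8b07c1995b4ca818814b404", "reference.rev.2.bt2"),
    ("b445ff5a1e468ab48628a00a944cac2e007fb9bc", "U00096.3.fasta"),
    ("d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf", "reference.3.bt2"),
    ("dda8452b346d51b9cf60f0662ef3d6e3b6da2e74", "reference.2.bt2"),
]


def glob_overlay_csv(items, overlay_name, pattern):
    """Return CSV text with one boolean column set by a glob match on relpath.

    Hypothetical helper: assumes fnmatch-style matching, which may differ
    from what dtool-overlay does internally.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["identifiers", overlay_name, "relpaths"])
    for identifier, relpath in items:
        matched = fnmatch.fnmatchcase(relpath, pattern)
        writer.writerow([identifier, matched, relpath])
    return buffer.getvalue()


template = glob_overlay_csv(ITEMS, "is_fasta", "*.fasta")
print(template)
```

Only the relpath ending in “.fasta” matches the pattern, so a single row carries the value True.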

Write the overlay template to the dataset:

$ dtool overlays write $LOCAL_DS_URI is_fasta.csv

Show the newly created overlay:

$ dtool overlays show $LOCAL_DS_URI
identifiers,is_fasta,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2
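
Because the overlays are shown as plain CSV, they are easy to consume from a script. The sketch below parses a subset of the rows above with Python's csv module and picks out the relpaths flagged by a boolean overlay. The helper name select_relpaths is hypothetical, and the text is pasted in here rather than captured from the command.

```python
import csv
import io

# A subset of the rows from the `dtool overlays show` output above, pasted as
# text. In practice this would be the command's standard output.
SHOW_OUTPUT = """\
identifiers,is_fasta,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2
"""


def select_relpaths(csv_text, overlay_name):
    """Return the relpaths whose boolean overlay value is the string "True"."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["relpaths"] for row in rows if row[overlay_name] == "True"]


fasta_relpaths = select_relpaths(SHOW_OUTPUT, "is_fasta")
print(fasta_relpaths)
```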

To extract multiple pieces of metadata from the items’ relpath one can use the dtool overlays template parse command. This takes as input a dataset URI, a parse rule (see https://pypi.org/project/parse/ for more details) and a glob rule. The latter decides which relpaths to apply the parsing to.

Consider for example the dataset below:

$ dtool ls http://bit.ly/Ecoli-reads-minified
8bda245a8cd526673aab775f90206c8b67d196af  ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc  ERR022075_1.fastq.gz

The command below could be used to generate a template for the overlays “useful_name” and “read”:

$ dtool overlays template parse  \
    http://bit.ly/Ecoli-reads-minified  \
    '{useful_name}_{read:d}.fastq.gz'

This results in the CSV output below:

identifiers,read,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,2,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,1,ERR022075,ERR022075_1.fastq.gz

To ignore a variable element when parsing, one can use unnamed curly braces. The command below, for example, generates only the overlay “useful_name”:

$ dtool overlays template parse  \
    http://bit.ly/Ecoli-reads-minified  \
    '{useful_name}_{:d}.fastq.gz'
identifiers,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075,ERR022075_1.fastq.gz
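
The two parse rules above can be approximated with hand-written regular expressions, which may help when reasoning about what a rule will capture. This sketch uses only the standard library's re module as a stand-in for the parse library; the regular expressions and the function extract_overlays are illustrative assumptions, not what dtool-overlay runs.

```python
import re

# Hand-written approximations of the two parse rules above (assumptions):
#   {useful_name} -> a named group
#   {read:d}      -> a named group of digits, converted to int below
#   {:d}          -> digits that are matched but not captured
NAMED_RULE = re.compile(r"(?P<useful_name>.+?)_(?P<read>\d+)\.fastq\.gz")
UNNAMED_RULE = re.compile(r"(?P<useful_name>.+?)_(?:\d+)\.fastq\.gz")


def extract_overlays(rule, relpath):
    """Return a dict of overlay values, or None if the relpath does not match."""
    match = rule.fullmatch(relpath)
    if match is None:
        return None
    values = match.groupdict()
    if "read" in values:
        values["read"] = int(values["read"])
    return values


for relpath in ("ERR022075_2.fastq.gz", "ERR022075_1.fastq.gz"):
    print(relpath, extract_overlays(NAMED_RULE, relpath))
    print(relpath, extract_overlays(UNNAMED_RULE, relpath))
```

A relpath that does not fit the rule yields None in this sketch, so it contributes no overlay values.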

Sometimes it is useful to be able to find pairs of items, for example when dealing with genomic sequencing data that has forward and reverse reads.

One can create a “pair_id” overlay CSV template for this dataset using the command below:

$ dtool overlays template pairs http://bit.ly/Ecoli-reads-minified .fastq.gz
identifiers,pair_id,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,ERR022075_1.fastq.gz

In the above, the suffix “.fastq.gz” is used to extract the prefix ERR022075_, which is then used to find matching pairs.
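
One plausible way to picture the pairing is sketched below in Python: strip the suffix from each relpath, then pair every item with the other item that shares the longest common prefix. This is a guess at the matching logic based on the description above; the function pair_ids is hypothetical and dtool-overlay's actual implementation may differ.

```python
import os

# Identifiers and relpaths copied from the `dtool ls` output above.
ITEMS = {
    "8bda245a8cd526673aab775f90206c8b67d196af": "ERR022075_2.fastq.gz",
    "9760280dc6313d3bb598fa03c5931a7f037d7ffc": "ERR022075_1.fastq.gz",
}


def pair_ids(items, suffix):
    """Map each identifier to the identifier of its closest-named partner.

    Hypothetical helper: "closest" means the longest shared prefix once the
    suffix has been removed. Items without a partner map to None.
    """
    # Keep only items that end with the suffix, and strip it off.
    stems = {
        identifier: relpath[: -len(suffix)]
        for identifier, relpath in items.items()
        if relpath.endswith(suffix)
    }
    pairs = {}
    for identifier, stem in stems.items():
        best_id, best_len = None, 0
        for other_id, other_stem in stems.items():
            if other_id == identifier:
                continue
            shared = len(os.path.commonprefix([stem, other_stem]))
            if shared > best_len:
                best_id, best_len = other_id, shared
        pairs[identifier] = best_id
    return pairs


pairs = pair_ids(ITEMS, ".fastq.gz")
for identifier, partner in pairs.items():
    print(identifier, partner, ITEMS[identifier], sep=",")
```

For the two reads in this dataset the shared prefix is ERR022075_, so each item's pair_id is the identifier of the other.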
