Skip to main content

User friendly tools for downloading and manipulating gen3 metadata

Project description

gen3-metadata

User friendly tools for downloading and manipulating gen3 metadata.

Python

Installation

git clone https://github.com/AustralianBioCommons/gen3-metadata.git
bash build.sh

A full usage notebook is available in example_notebook.ipynb. Make sure to select .venv as the kernel.

Fetch all metadata

fetch_all_metadata is the primary entry point. It walks the data dictionary in dependency order and fetches data for every node, returning a dot-accessible object of JSON dicts. Call .to_df() on the result to get pandas DataFrames instead.

from gen3_metadata.gen3_metadata_parser import fetch_all_metadata

key_file = "path/to/credentials.json"
result = fetch_all_metadata(key_file, "program1", "project1")

# Access each node as raw JSON
result.subject          # dict
result.demographic      # dict

# Or get DataFrames
dfs = result.to_df()
dfs.subject             # pandas DataFrame
dfs.demographic         # pandas DataFrame

Filtering by data release

fetch_all_metadata accepts a data_release argument that filters each node's records by release. The default is "latest" — each node is inspected for a data_release_date field and only records matching the max ISO date are returned. The selected version and date are logged per node.

# Default: per node, keep records with the max data_release_date
result = fetch_all_metadata(key_file, "program1", "project1")
# node 'subject': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# node 'demographic': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# ...

# Pin to a specific release (exact, case-sensitive match on data_release field)
result = fetch_all_metadata(key_file, "program1", "project1", data_release="v2.3")

# Disable filtering — return every record, no filter logs
result = fetch_all_metadata(key_file, "program1", "project1", data_release=None)

Behavior per node:

data_release value Behavior
"latest" (default) Keep records with the max data_release_date (ISO 8601). Log the selected date and version.
any other string Keep records where data_release equals that string exactly. Log the selected version and date.
None No filtering. No filter log lines emitted.

Nodes that have neither a data_release nor a data_release_date field (for example lookup/link nodes like program or project) are passed through unchanged, with an info log noting they were not filtered. Unparseable ISO dates in "latest" mode are skipped with a warning.

The same data_release argument is also available on Gen3MetadataParser.fetch_data and fetch_data_json for single-node fetches.

List nodes

get_node_order returns a topologically sorted list of node names from the data dictionary (parents before children).

from gen3_metadata.gen3_metadata_parser import get_node_order

nodes = get_node_order("path/to/credentials.json")
# ['program', 'project', 'subject', 'sample', 'demographic', ...]

Fetch a single node

If you only need one node, use Gen3MetadataParser + fetch_data_json. It returns the raw JSON response as a dict. Convert to a DataFrame yourself if you need one.

import pandas as pd
from gen3_metadata.gen3_metadata_parser import Gen3MetadataParser

key_file = "path/to/credentials.json"
parser = Gen3MetadataParser(key_file)
parser.authenticate()

# Default: filters to latest data_release_date
json_data = parser.fetch_data_json("program1", "project1", node_label="medical_history")
json_data  # {'data': [...]}

# Pin to a specific release, or pass data_release=None to disable filtering
json_data = parser.fetch_data_json(
    "program1", "project1", node_label="medical_history", data_release="v2.3"
)

# Convert to DataFrame if desired:
df = pd.json_normalize(json_data["data"])

Running Tests

pytest -vv tests/

R

Installation

if (!require("devtools")) install.packages("devtools")
devtools::install_github("AustralianBioCommons/gen3-metadata", subdir = "gen3metadata-R")

The package depends on several other packages, which should be installed automatically. If not:

install.packages(c("httr", "jsonlite", "jose", "glue", "reticulate"))

The get_node_order and fetch_all_metadata functions require the Python gen3_metadata package to be installed (they use reticulate to call Python under the hood).

library("gen3metadata")

Fetch all metadata

fetch_all_metadata is the primary entry point. It walks the data dictionary in dependency order and fetches data for every node, returning a metadata_collection object where nodes are accessible via $. Call to_df() on it to get data.frames instead.

result <- fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")

# Access each node as raw JSON (nested list)
result$subject
result$demographic

# Or get data.frames
dfs <- to_df(result)
dfs$subject         # data.frame
dfs$demographic     # data.frame

Filtering by data release

fetch_all_metadata accepts a data_release argument that mirrors the Python API. The default is "latest" — each node is inspected for a data_release_date field and only records matching the max ISO date are kept. The selected version and date are emitted per node via message().

# Default: per node, keep records with the max data_release_date
result <- fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")
# node 'subject': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# node 'demographic': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# ...

# Pin to a specific release (exact, case-sensitive match on data_release field)
result <- fetch_all_metadata(
    "path/to/credentials.json", "program1", "AusDiab",
    data_release = "v2.3"
)

# Disable filtering — return every record, no filter messages
result <- fetch_all_metadata(
    "path/to/credentials.json", "program1", "AusDiab",
    data_release = NULL
)

# Suppress the per-node message output while keeping the filter active
result <- suppressMessages(
    fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")
)

Behavior per node:

data_release value Behavior
"latest" (default) Keep records with the max data_release_date (ISO 8601). Emit selected date and version.
any other string Keep records where data_release equals that string exactly. Emit selected version and date.
NULL No filtering. No filter messages emitted.

Nodes that have neither a data_release nor a data_release_date field are passed through unchanged with a message. Unparseable ISO dates in "latest" mode are skipped with a warning.

The same data_release argument is also available on fetch_data() for single-node fetches.

List nodes

get_node_order returns a topologically sorted character vector of node names from the data dictionary.

nodes <- get_node_order("path/to/credentials.json")
# [1] "program" "project" "subject" "sample" "demographic" ...

Fetch a single node

If you only need one node, use Gen3MetadataParser + fetch_data. It returns the raw JSON data as a nested list. Convert to a data.frame yourself if you need one.

key_file_path <- "path/to/credentials.json"

gen3 <- Gen3MetadataParser(key_file_path)
gen3 <- authenticate(gen3)

# Default: filters to latest data_release_date
data <- fetch_data(gen3,
                   program_name = "program1",
                   project_code = "AusDiab",
                   node_label = "subject")

# Pin to a specific release, or pass data_release = NULL to disable filtering
data <- fetch_data(gen3,
                   program_name = "program1",
                   project_code = "AusDiab",
                   node_label = "subject",
                   data_release = "v2.3")

# data is a list of records

# Convert to a data.frame if desired:
df <- do.call(rbind, lapply(data, as.data.frame))

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gen3_metadata-1.2.0.tar.gz (15.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gen3_metadata-1.2.0-py3-none-any.whl (15.4 kB view details)

Uploaded Python 3

File details

Details for the file gen3_metadata-1.2.0.tar.gz.

File metadata

  • Download URL: gen3_metadata-1.2.0.tar.gz
  • Upload date:
  • Size: 15.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.12 Darwin/25.4.0

File hashes

Hashes for gen3_metadata-1.2.0.tar.gz
Algorithm Hash digest
SHA256 fe3e57286ac5a28703e2a88ae449afdac243d18862fa2b8e60bdfeeef3eb0408
MD5 bd09e148b2e9c913fd07ad265d618fe3
BLAKE2b-256 9c1ae9003a250af8bd70cf1125d384fc0313214080ce10906fcf7daba4944591

See more details on using hashes here.

File details

Details for the file gen3_metadata-1.2.0-py3-none-any.whl.

File metadata

  • Download URL: gen3_metadata-1.2.0-py3-none-any.whl
  • Upload date:
  • Size: 15.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.12 Darwin/25.4.0

File hashes

Hashes for gen3_metadata-1.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 edbe0a7d75097ebd7ac03a3b792bcfa899366c8d84c0bf3c553a8406bbb9f6e4
MD5 88dac3a4f6650ad612e231be32e44ea2
BLAKE2b-256 17ed50425a72cf3d7ae4d75e57ccd823a4214d5d2b51d56de8def1465d171fb2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page