
gen3-metadata

User friendly tools for downloading and manipulating gen3 metadata.

Python

Installation

git clone https://github.com/AustralianBioCommons/gen3-metadata.git
bash build.sh

A full usage notebook is available in example_notebook.ipynb. Make sure to select .venv as the kernel.

Fetch all metadata

fetch_all_metadata is the primary entry point. It walks the data dictionary in dependency order and fetches data for every node, returning a dot-accessible object of JSON dicts. Call .to_df() on the result to get pandas DataFrames instead.

from gen3_metadata.gen3_metadata_parser import fetch_all_metadata

key_file = "path/to/credentials.json"
result = fetch_all_metadata(key_file, "program1", "project1")

# Access each node as raw JSON
result.subject          # dict
result.demographic      # dict

# Or get DataFrames
dfs = result.to_df()
dfs.subject             # pandas DataFrame
dfs.demographic         # pandas DataFrame

Filtering by data release

fetch_all_metadata accepts a data_release argument that filters each node's records by release. The default is "latest" — each node is inspected for a data_release_date field and only records matching the max ISO date are returned. The selected version and date are logged per node.

# Default: per node, keep records with the max data_release_date
result = fetch_all_metadata(key_file, "program1", "project1")
# node 'subject': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# node 'demographic': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# ...

# Pin to a specific release (exact, case-sensitive match on data_release field)
result = fetch_all_metadata(key_file, "program1", "project1", data_release="v2.3")

# Disable filtering — return every record, no filter logs
result = fetch_all_metadata(key_file, "program1", "project1", data_release=None)

Behavior per node:

| data_release value   | Behavior                                                                                      |
|----------------------|-----------------------------------------------------------------------------------------------|
| "latest" (default)   | Keep records with the max data_release_date (ISO 8601). Log the selected date and version.    |
| any other string     | Keep records where data_release equals that string exactly. Log the selected version and date. |
| None                 | No filtering. No filter log lines emitted.                                                    |

Nodes that have neither a data_release nor a data_release_date field (for example lookup/link nodes like program or project) are passed through unchanged, with an info log noting they were not filtered. Unparseable ISO dates in "latest" mode are skipped with a warning.

The same data_release argument is also available on Gen3MetadataParser.fetch_data and fetch_data_json for single-node fetches.
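The actual filtering lives inside the package; purely as an illustration, the "latest" rule described above can be sketched in plain Python (filter_latest is a hypothetical helper, not part of the API):

```python
import warnings
from datetime import date

def filter_latest(records):
    """Keep only records whose data_release_date equals the max ISO date.
    Records with missing or unparseable dates are skipped with a warning;
    if no record has a usable date, pass everything through unchanged."""
    dated = []
    for rec in records:
        raw = rec.get("data_release_date")
        try:
            dated.append((date.fromisoformat(raw), rec))
        except (TypeError, ValueError):
            warnings.warn(f"skipping record with unparseable data_release_date: {raw!r}")
    if not dated:
        return records  # lookup/link-style node: no dates, no filtering
    latest = max(d for d, _ in dated)
    return [rec for d, rec in dated if d == latest]

records = [
    {"id": 1, "data_release": "v2.2", "data_release_date": "2024-01-15"},
    {"id": 2, "data_release": "v2.3", "data_release_date": "2024-06-01"},
    {"id": 3, "data_release": "v2.3", "data_release_date": "2024-06-01"},
    {"id": 4, "data_release": "?",    "data_release_date": "not-a-date"},
]
print([r["id"] for r in filter_latest(records)])  # [2, 3]
```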

Logging and error handling

fetch_all_metadata prints per-node progress to stdout by default (verbose=True). If you want richer diagnostic output — filter decisions, authentication steps, debug messages — call configure_logging() once in your REPL or notebook to attach a stderr handler to the gen3_metadata logger:

import logging
import gen3_metadata

gen3_metadata.configure_logging()              # INFO by default
# gen3_metadata.configure_logging(logging.DEBUG)  # or more verbose

Each module logs under its own gen3_metadata.<module> name, so you can filter with standard logging machinery.
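For example, standard per-logger level configuration applies as usual (illustrative; gen3_metadata.gen3_metadata_parser is the module name used in the imports above):

```python
import logging

# Root logger stays at WARNING, but the package logs at INFO,
# and one submodule is turned up to DEBUG for troubleshooting.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("gen3_metadata").setLevel(logging.INFO)
logging.getLogger("gen3_metadata.gen3_metadata_parser").setLevel(logging.DEBUG)
```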

Network calls inside fetch_all_metadata are timeout-guarded (30s per request) and wrapped with friendly error messages. A VPN/connectivity failure produces a clean one-liner instead of a urllib3 traceback:

fetch_all_metadata: starting for program1/CDAH
fetch_all_metadata: fetching data dictionary from cad.staging.biocommons.org.au...
RuntimeError: Could not reach cad.staging.biocommons.org.au to fetch the Gen3
data dictionary. Check VPN / network connectivity. Underlying error: ...

Individual nodes that time out or return an HTTP error during the fetch loop are logged and skipped; the overall call completes with a final summary of which nodes succeeded and which failed.
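The pattern behind those friendly errors can be sketched in a few lines (this is an illustration, not the package's actual code; fetch_json is a made-up helper):

```python
import json
import urllib.error
import urllib.request

def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """GET a JSON document with a hard timeout, re-raising connectivity
    failures as a single friendly RuntimeError instead of a deep traceback."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, TimeoutError) as exc:
        raise RuntimeError(
            f"Could not reach {url}. Check VPN / network connectivity. "
            f"Underlying error: {exc}"
        ) from exc
```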

List nodes

get_node_order returns a topologically sorted list of node names from the data dictionary (parents before children).

from gen3_metadata.gen3_metadata_parser import get_node_order

nodes = get_node_order("path/to/credentials.json")
# ['program', 'project', 'subject', 'sample', 'demographic', ...]
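Under the hood this is a standard topological sort over parent links; a minimal sketch with the standard library's graphlib (the links mapping below is invented for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical child -> parents links extracted from a Gen3 data dictionary
links = {
    "project": {"program"},
    "subject": {"project"},
    "sample": {"subject"},
    "demographic": {"subject"},
}

# static_order() yields predecessors (parents) before their children
order = list(TopologicalSorter(links).static_order())
```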

Fetch a single node

If you only need one node, use Gen3MetadataParser + fetch_data_json. It returns the raw JSON response as a dict. Convert to a DataFrame yourself if you need one.

import pandas as pd
from gen3_metadata.gen3_metadata_parser import Gen3MetadataParser

key_file = "path/to/credentials.json"
parser = Gen3MetadataParser(key_file)
parser.authenticate()

# Default: filters to latest data_release_date
json_data = parser.fetch_data_json("program1", "project1", node_label="medical_history")
json_data  # {'data': [...]}

# Pin to a specific release, or pass data_release=None to disable filtering
json_data = parser.fetch_data_json(
    "program1", "project1", node_label="medical_history", data_release="v2.3"
)

# Convert to DataFrame if desired:
df = pd.json_normalize(json_data["data"])

Running Tests

pytest -vv tests/

R

As of v1.3.0 the R package is fully standalone — no Python interpreter, no reticulate, and no Python gen3_metadata package required. A single devtools::install_github(...) is all you need, which makes containerized RStudio deployments significantly simpler.

Installation

Always-latest from main:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("AustralianBioCommons/gen3-metadata", subdir = "gen3metadata-R")

Pinned to a specific release (recommended for reproducible environments such as Docker images):

if (!require("devtools")) install.packages("devtools")
devtools::install_github(
    "AustralianBioCommons/gen3-metadata",
    subdir = "gen3metadata-R",
    ref    = "v1.4.0"
)

CRAN dependencies are installed automatically with the package. If your environment requires installing them manually:

install.packages(c("httr", "jsonlite", "jose", "glue"))

In a Dockerfile (RStudio container)

Add a single layer to your image:

RUN R -e 'if (!require("devtools")) install.packages("devtools", repos="https://cloud.r-project.org"); \
          devtools::install_github("AustralianBioCommons/gen3-metadata", subdir = "gen3metadata-R", ref = "v1.4.0")'

No Python or pip step is required.

Loading

library("gen3metadata")

Fetch all metadata

fetch_all_metadata is the primary entry point. It walks the data dictionary in dependency order and fetches data for every node, returning a metadata_collection object where nodes are accessible via $. Call to_df() on it to get data.frames instead.

result <- fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")

# Access each node as raw JSON (nested list)
result$subject
result$demographic

# Or get data.frames
dfs <- to_df(result)
dfs$subject         # data.frame
dfs$demographic     # data.frame

Filtering by data release

fetch_all_metadata accepts a data_release argument that mirrors the Python API. The default is "latest" — each node is inspected for a data_release_date field and only records matching the max ISO date are kept. The selected version and date are emitted per node via message().

# Default: per node, keep records with the max data_release_date
result <- fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")
# node 'subject': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# node 'demographic': selected data_release_date=2024-06-01 data_release='v2.3' (123/22494 records)
# ...

# Pin to a specific release (exact, case-sensitive match on data_release field)
result <- fetch_all_metadata(
    "path/to/credentials.json", "program1", "AusDiab",
    data_release = "v2.3"
)

# Disable filtering — return every record, no filter messages
result <- fetch_all_metadata(
    "path/to/credentials.json", "program1", "AusDiab",
    data_release = NULL
)

# Suppress the per-node message output while keeping the filter active
result <- suppressMessages(
    fetch_all_metadata("path/to/credentials.json", "program1", "AusDiab")
)

Behavior per node:

| data_release value   | Behavior                                                                                       |
|----------------------|------------------------------------------------------------------------------------------------|
| "latest" (default)   | Keep records with the max data_release_date (ISO 8601). Emit selected date and version.        |
| any other string     | Keep records where data_release equals that string exactly. Emit selected version and date.    |
| NULL                 | No filtering. No filter messages emitted.                                                      |

Nodes that have neither a data_release nor a data_release_date field are passed through unchanged with a message. Unparseable ISO dates in "latest" mode are skipped with a warning.

The same data_release argument is also available on fetch_data() for single-node fetches.

Progress messages and error handling

fetch_all_metadata emits per-node progress via message() (visible in interactive sessions and captured by knitr/RMarkdown by default). A DNS/VPN failure produces a clean stop() error rather than an httr traceback, e.g.:

fetch_all_metadata: starting for program1/AusDiab
fetch_all_metadata: fetching data dictionary...
Error: Could not fetch Gen3 data dictionary. Check VPN / network connectivity. Underlying error: ...

All network calls (authentication, dictionary fetch, per-node fetch) have a 30-second timeout, so a flaky network surfaces quickly instead of hanging indefinitely. To silence the progress output:

result <- suppressMessages(fetch_all_metadata(...))

List nodes

get_node_order returns a topologically sorted character vector of node names from the data dictionary.

nodes <- get_node_order("path/to/credentials.json")
# [1] "program" "project" "subject" "sample" "demographic" ...

Fetch a single node

If you only need one node, use Gen3MetadataParser + fetch_data. It returns the raw JSON data as a nested list. Convert to a data.frame yourself if you need one.

key_file_path <- "path/to/credentials.json"

gen3 <- Gen3MetadataParser(key_file_path)
gen3 <- authenticate(gen3)

# Default: filters to latest data_release_date
data <- fetch_data(gen3,
                   program_name = "program1",
                   project_code = "AusDiab",
                   node_label = "subject")

# Pin to a specific release, or pass data_release = NULL to disable filtering
data <- fetch_data(gen3,
                   program_name = "program1",
                   project_code = "AusDiab",
                   node_label = "subject",
                   data_release = "v2.3")

# data is a list of records

# Convert to a data.frame if desired:
df <- do.call(rbind, lapply(data, as.data.frame))
