
--8<-- [start:intro]

HEAL Data Utilities

The HEAL data utilities Python package provides data packaging tools for the HEAL data ecosystem to facilitate data discovery, sharing, and harmonization, with a focus on the HEAL Platform data consultancy (DSC).

Currently, the focus of the repo is on generating data dictionaries (see the Variable level metadata section below). However, in the future, this will be expanded to all HEAL-specific data packaging functions (e.g., study- and file-level metadata and data).

Installation

To install the latest official release of healdata-utils, run the following from your computer's command prompt (note: the package is currently in pre-release, hence the --pre flag):

pip install healdata-utils --pre

Alternatively, to install directly from the GitHub repository:

pip install git+https://github.com/norc-heal/healdata-utils.git
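As a quick sanity check that the installation succeeded, the package's main entry point (the same import used in the Basic usage section below) should import cleanly from a Python session:

# sanity check: the main conversion function should import without error
from healdata_utils.cli import convert_to_vlmd
print("healdata-utils is installed")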

Variable level metadata (data dictionaries)


The healdata-utils variable level metadata (vlmd) tool takes a variety of input file types and exports HEAL-formatted data dictionaries (JSON and CSV formats). Additionally, exported validation (ie "error") reports tell the user whether the exported data dictionary is valid according to HEAL specifications (see the schema repository here).

For supported formats and more detailed, software-specific instructions and recommendations, see here.

Basic usage

The vlmd tool can be used via Python or the command line.

Using from Python

From your current working directory in Python, run:

from healdata_utils.cli import convert_to_vlmd

# description and title are optional. If submitting through the platform, these can be filled out there.
description = "This is a proof of concept to demonstrate the healdata-utils functionality"
title = "Healdata-utils Demonstration Data Dictionary"
healdir = "output" # can also specify a file name if desired (e.g., output/thisismynewdd.csv)
inputpath = "input/my-redcap-data-dictionary-export.csv"
input_type = "csv" # example value; valid choices are the keys of input_descriptions (see below)

data_dictionaries = convert_to_vlmd(
    filepath=inputpath,
    outputdir=healdir,
    inputtype=input_type, # if not specified, inferred from the file suffix
    data_dictionary_props={"title": title, "description": description} # data_dictionary_props is optional
)

This will output the data dictionaries to the specified output directory (see the Output section below) and also save the JSON/CSV versions in the data_dictionaries object.
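For example, to work with the returned object directly (a minimal sketch; the exact key names are an assumption based on the two exported formats, so inspect the object to confirm):

# data_dictionaries bundles both exported versions of the data dictionary;
# the key names below are assumptions -- check data_dictionaries.keys()
print(data_dictionaries.keys())
csv_version = data_dictionaries.get("csvtemplate")
json_version = data_dictionaries.get("jsontemplate")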

For the available input file formats (ie the available choices for the inputtype parameter), run the following from Python:

from healdata_utils.cli import input_descriptions

input_descriptions

The input_descriptions object contains each inputtype choice as the key and its description as the value.
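For example, to print all supported input types with their descriptions (a small sketch assuming input_descriptions behaves like a standard dict, per the description above):

from healdata_utils.cli import input_descriptions

# each key is a valid choice for inputtype; each value describes that format
for input_type, description in input_descriptions.items():
    print(f"{input_type}: {description}")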

Using from the command line

From your current working directory, run the command below. (Note: the \ at the end of each line signals a line continuation, used to split the long single-line command for readability.) Again, the --title and --description options are optional. For descriptions of the different flags/options, run vlmd --help.

vlmd --filepath "data/example_pyreadstat_output.sav" \
--outputdir "output-cli" \
--title "Healdata-utils Demonstration Data Dictionary" \
--description "This is a proof of concept to demonstrate the healdata-utils functionality" 

Output

Both the Python and command-line routes produce a JSON and a CSV version of the HEAL data dictionary in the output folder, along with the validation reports in the errors folder. See below:

  • input/my-redcap-data-dictionary-export.csv: your input file

  • output/errors/heal-csv-errors.json: the validation report for the CSV data dictionary, checked against the frictionless table schema

  • output/errors/heal-json-errors.json: the JSON Schema validation report for the JSON data dictionary

  • output/heal-csvtemplate-data-dictionary.csv: the CSV version of the HEAL data dictionary

  • output/heal-jsontemplate-data-dictionary.json: the JSON version of the HEAL data dictionary

!!! important

    The main difference* between the CSV and JSON data dictionary validation lies in how the data dictionaries are structured and in the additional metadata included in the JSON data dictionary: the CSV data dictionary is a plain tabular representation with no additional metadata, while the JSON data dictionary includes the fields along with additional root-level metadata, namely a title and description.

    *For field-specific differences, see the schemas in the documentation.

Note, only the JSON version will have the user-specified title and description.
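Since only the JSON version carries the title and description, a quick spot-check of the exported file can be done with the standard library (a sketch assuming root-level "title" and "description" keys, per the metadata described above):

import json
from pathlib import Path

# load the exported JSON data dictionary and print its root-level metadata
dd = json.loads(Path("output/heal-jsontemplate-data-dictionary.json").read_text())
print(dd.get("title"))
print(dd.get("description"))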

Interactive notebooks

See the notebooks below for demonstrations of use and workflows with convert_to_vlmd in Python and vlmd on the command line.

Clicking on the "binder badges" will bring you to an interactive notebook page where you can test out the notebooks with healdata-utils pre-installed.

  1. Generating a HEAL data dictionary from a variety of input files
  2. [in development] Creating and iterating over a CSV data dictionary to create a valid data dictionary file: click here

--8<-- [end:intro]
