
Helper utilities for SDMC ad-hoc data processing requests.


sdmc tools

This package contains a collection of functions designed for the standard cleaning and processing of assay data by SDMC before the data is shared with stats.

Installation

The package is hosted on PyPI and can be installed using pip: pip install sdmc-tools.

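  • Python >= 3.8 is required. These functions might break with earlier Python versions.
  • The package depends on the following modules (datetime, math, typing, sys, and time ship with the Python standard library and need no separate installation):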
    • pandas
    • numpy
    • datetime
    • math
    • typing
    • markdown
    • sys
    • time

Usage

Data processing


Python functions and constants for data processing / prep.

The primary function is standard_processing:

import sdmc_tools.process as sdmc

outputs = sdmc.standard_processing(
    input_data = input_data,
    input_data_path="/path/to/input_data.xlsx", 
    guspec_col='guspec', 
    network='hvtn', 
    metadata_dict=hand_appended_metadata, 
    ldms=ldms 
)

To see the function signature and documentation, run help(sdmc.standard_processing) in a Python interpreter, or sdmc.standard_processing? in IPython or Jupyter. Given input_data, the function does the following:

  • merges on ldms, renames columns with standard labels
  • adds a spectype column
  • adds a drawdt column, drops drawdm, drawdd, drawdy
  • for each (key, value) pair in the metadata dict, creates a column named after the key, with every row set to the value
  • standardizes the 'ptid' and 'protocol' columns to be int-formatted strings
  • merges on columns pertaining to sdmc processing
  • rearranges columns into a standardized order
  • converts column names "From This" -> "to_this" format

See https://github.com/beatrixh/sdmc-tools/blob/main/src/sdmc_tools/constants.py for the list of constants accessible.

A usage example is included below.

import pandas as pd
import sdmc_tools.process as sdmc # this contains the main data processing utilities
import sdmc_tools.constants as constants # this contains useful constants.

ldms = pd.read_csv(constants.LDMS_PATH_HVTN, usecols=constants.STANDARD_COLS) #read in ldms
ldms = ldms.loc[ldms.lstudy==302.] #subset ldms to the protocol of interest

ldms


input_data


hand_appended_metadata = {
    'network': 'HVTN',
    'upload_lab_id': 'N4',
    'assay_lab_name': 'Name of Lab Here',
    'instrument': 'SpectraMax',
    'assay_type': 'Neutralizing Antibody (NAb)',
    'specrole': 'Sample',
}

outputs = sdmc.standard_processing(
    input_data = input_data, #a pandas dataframe containing input data
    input_data_path="/path/to/input_data.xlsx", #the path to the original input data
    guspec_col='guspec', #the name of the column containing guspecs within the input data
    network='hvtn', #the relevant network ('hvtn' or 'covpn')
    metadata_dict=hand_appended_metadata, #a dictionary of additional data to append as columns
    ldms=ldms #a pandas dataframe containing the ldms columns we want to merge from
)

outputs


Data dictionary creation


This is a command line tool; it creates a data dictionary for a set of processed outputs.

gen-data-dict takes two positional arguments:

  • the filepath where the outputs are stored,
  • and the desired name of the resulting data dict.
gen-data-dict /path/to/outputs.txt name_of_dictionary.xlsx

If a dictionary of that name does not already exist in the directory where the outputs live, gen-data-dict creates:

  • an xlsx sheet in the same directory as the outputs, with a row for each variable in the outputs, and corresponding definitions for the standard vars. The variables unique to the specific outputs will need to be hand-edited.
  • a .txt log in the same directory with notes about any non-standard variables that have been included, or any standard variables that have been omitted.

If a dictionary of the given name already exists, it will be updated to reflect the variables in the output sheet.
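The behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool's code: STANDARD_DEFINITIONS and sketch_data_dict are hypothetical names, the definitions are placeholders, and the choice to keep definitions already present in an existing dictionary is an assumption about what "updated" means.

```python
# Hypothetical placeholder definitions; the real tool has its own list of standard variables.
STANDARD_DEFINITIONS = {
    "guspec": "placeholder definition for guspec",
    "ptid": "placeholder definition for ptid",
    "drawdt": "placeholder definition for drawdt",
}


def sketch_data_dict(output_columns, existing=None):
    """Build one row per output variable and collect notes for the log.

    existing maps variable names to definitions from a previously created
    dictionary (assumption: those definitions are kept on update).
    """
    existing = existing or {}
    rows = []
    for col in output_columns:
        # Prefer a hand-edited definition, then a standard one, else leave blank.
        definition = existing.get(col) or STANDARD_DEFINITIONS.get(col, "")
        rows.append({"variable": col, "definition": definition})

    # Notes for the log: non-standard variables included, standard ones omitted.
    nonstandard = [c for c in output_columns if c not in STANDARD_DEFINITIONS]
    omitted = [c for c in STANDARD_DEFINITIONS if c not in output_columns]
    return rows, nonstandard, omitted


rows, nonstandard, omitted = sketch_data_dict(["guspec", "ptid", "titer"])
```

Rows with a blank definition correspond to the variables that need hand-editing.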

README creation


This is a command line tool; given a set of processed outputs, it creates a .md file documenting how the outputs were created, and a corresponding .html file compiled from the .md.

gen-readme takes three positional arguments:

  • the filepath to the directory where the outputs are stored
  • the filepath to the raw input data from which the processed outputs were generated
  • the author to attribute the readme to
gen-readme /path/to/outputs_dir/ /path/to/inputs_from_lab.xlsx "Beatrix Haddock"

It will then create

  • a markdown file describing how the outputs were created, including notes of where the inputs are saved. Note that it will assume the processing was standard, so this will need to be corrected for any nonstandard processing. It will search the output directory for the processed data outputs, a pivot summary of the samples, and the processing code. If it doesn't find these there, it will not include notes on these in the markdown.
  • an html file created via compiling the above markdown

regen-readme takes one positional argument:

  • a filepath to the markdown to compile
regen-readme /path/to/my_markdown.md

It then compiles the markdown into an html file with the same name in the same directory. If such an html file already exists, it will be overwritten.
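For readers who want to see what this kind of compilation looks like, here is a minimal sketch using the markdown package listed among the dependencies. The function names html_path_for and compile_markdown are hypothetical, and the tool itself may compile the file differently (for example with extensions or a template).

```python
from pathlib import Path


def html_path_for(markdown_path):
    """Return the .html path with the same name and directory as the markdown file."""
    return Path(markdown_path).with_suffix(".html")


def compile_markdown(markdown_path):
    """Compile a markdown file to html, overwriting any existing html file."""
    import markdown  # third-party package, imported here so the path helper works without it

    source = Path(markdown_path).read_text(encoding="utf-8")
    html = markdown.markdown(source)
    target = html_path_for(markdown_path)
    target.write_text(html, encoding="utf-8")  # write_text replaces any existing content
    return target
```

Calling compile_markdown("/path/to/my_markdown.md") would write /path/to/my_markdown.html next to the source file.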
