Tools to organize, document, and validate the variables of interest in scientific studies

masterfile

Command line usage

masterfile --help will list all the subcommands.

Create

masterfile create masterfile_path out_file

Join

masterfile join masterfile_path out_file

Extract

masterfile extract [-s|--skip ROWS] [--index_column COL]
                      masterfile_path csv_file out_file

Validate

masterfile validate masterfile_path [file [file ...]]

Draft API usage example

import masterfile
# Load all of the .csv files from /path and the dictionary files in
# /path/dictionaries. Settings are read from a 'settings.json' file in /path.
# The .csv files are joined on 'participant_id', which is used as the index.
# Warnings are issued if the data look bad in some way.
mf = masterfile.load('/path')
# Get the associated pandas DataFrame
df = mf.dataframe  # aliased as mf.df

# Variable lookup is less important, since people can consult the data
# dictionaries, so this part of the API will be written later.
v = mf.lookup('sr_t1_panas_pa')
v.contacts  # list_of_names
v.measure.contact  # Someone
v.modality  # Component("self-report")

CSV file format

CSV files should be comma-separated (no surprise there) and have DOS line endings (CRLF). They should not have the stupid UTF-8 signature (the byte order mark) at the start. UTF-8 characters are fine. Missing data is indicated by an empty cell. Quoting should follow Excel's conventions.

Basically, you want Excel-for-Windows-style CSV files with no UTF-8 signature.
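As a rough illustration of the format described above, here is a minimal sketch that writes such a file with Python's standard csv module. The helper name write_masterfile_csv and the sample data are hypothetical and are not part of masterfile.

```python
import csv
import os
import tempfile


def write_masterfile_csv(path, header, rows):
    """Write rows as an Excel-style CSV: CRLF line endings, UTF-8, no BOM.

    Hypothetical helper for illustration; not part of masterfile.
    """
    # newline="" keeps Python from translating the CRLF the csv module emits.
    # encoding="utf-8" (rather than "utf-8-sig") writes no UTF-8 signature.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, dialect="excel")
        writer.writerow(header)
        for row in rows:
            # Missing data (None) becomes an empty cell.
            writer.writerow(["" if v is None else v for v in row])


# Example with made-up data
example_path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_masterfile_csv(
    example_path,
    ["participant_id", "sr_t1_panas_pa"],
    [["p001", 31], ["p002", None], ["p003", "née, quoted"]],
)
with open(example_path, "rb") as f:
    raw = f.read()
```

The "excel" dialect quotes only the cells that need it (here, the one containing a comma), which matches what Excel itself produces.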

Dictionaries

  • CSV format
  • Has AT LEAST two columns: component, short_name
  • Those two columns are the index
  • There shouldn't be any repeats in the index
  • The settings.json file should contain a "components" entry that lists the values allowed in the component column
  • Things with blank component are ignored (TODO: Maybe?)
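The rules above could be checked along these lines. This is only a sketch under stated assumptions: check_dictionary is a hypothetical helper, not masterfile's API, and allowed_components stands in for whatever the "components" entry in settings.json provides. The sample rows are made up.

```python
import csv
import io


def check_dictionary(csv_text, allowed_components):
    """Return a list of problem descriptions for one dictionary file.

    Hypothetical helper for illustration; not part of masterfile.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    fields = reader.fieldnames or []
    missing = [c for c in ("component", "short_name") if c not in fields]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    seen = set()
    for row in reader:
        component = row["component"]
        if component == "":
            continue  # rows with a blank component are ignored
        key = (component, row["short_name"])
        if key in seen:
            problems.append("repeated index entry: %s, %s" % key)
        seen.add(key)
        if component not in allowed_components:
            problems.append("unknown component: " + component)
    return problems


# Example with made-up dictionary rows
example = (
    "component,short_name,description\r\n"
    "measure,panas,Positive and Negative Affect Schedule\r\n"
    "measure,panas,duplicate entry\r\n"
    "mode,xyz,not an allowed component\r\n"
    ",blank,ignored\r\n"
)
problems = check_dictionary(example, {"measure", "timepoint"})
```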

Exclusion files

  • CSV format
  • Live in exclusions/
  • One row per participant, one column per value
  • Has index column, same as data file
  • Blanks mean "Use this value," nonblanks mean "exclude this value"
  • Things in the cells may be codes; these codes may be defined in settings.json
  • If data is excluded for more than one reason, separate codes with ","
  • Not all rows / columns in masterfiles need to be included in exclusion files. Missing rows / columns are treated like blank values.
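To make the exclusion rules concrete, here is a minimal sketch that works on plain dictionaries rather than on masterfile objects. The function names, the in-memory layout, and the example codes are all hypothetical. Replacing an excluded value with None is also an assumption, since the text above does not say what happens to excluded values.

```python
def exclusion_codes(exclusions, participant_id, column):
    """Return the exclusion codes for one value; [] means "use this value".

    `exclusions` is a hypothetical in-memory form of an exclusion file:
    {participant_id: {column: cell_text}}.
    """
    # Rows or columns absent from the exclusion file count as blank cells.
    cell = exclusions.get(participant_id, {}).get(column, "")
    # Multiple reasons are separated by ","
    return [code.strip() for code in cell.split(",") if code.strip()]


def apply_exclusions(data, exclusions):
    """Return a copy of `data` with excluded values replaced by None."""
    return {
        pid: {
            col: (None if exclusion_codes(exclusions, pid, col) else value)
            for col, value in row.items()
        }
        for pid, row in data.items()
    }


# Example with made-up data and made-up exclusion codes
data = {
    "p001": {"sr_t1_panas_pa": 31, "sr_t1_panas_na": 12},
    "p002": {"sr_t1_panas_pa": 28, "sr_t1_panas_na": 15},
}
exclusions = {"p001": {"sr_t1_panas_pa": "noncompliant, equipment"}}
cleaned = apply_exclusions(data, exclusions)
```

Participant p002 does not appear in the exclusion data at all, so every one of its values is kept.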

Data checks

Here are some (all?) of the things to do to verify you have semantically reasonable data:

  • Variable parts not in dictionaries
  • Missing participant_id column
  • Repeated participant_id column
  • Blanks in participant_id column
  • Duplicate columns
  • Column names not matching format
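A few of these checks can be sketched with the standard library alone. The helper below is hypothetical and is not how masterfile implements its validation. It covers the participant_id and duplicate-column checks, and it also flags repeated participant_id values, which is one possible reading of the list above. The dictionary lookup and column-name format checks are left out because their rules are not spelled out here.

```python
import csv
import io
from collections import Counter


def check_data_file(csv_text, index_column="participant_id"):
    """Return a list of problem descriptions for one data file.

    Hypothetical helper for illustration; not part of masterfile.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader, [])
    problems = []

    # Duplicate columns (including a repeated index column)
    for name, count in sorted(Counter(header).items()):
        if count > 1:
            problems.append("duplicate column: " + name)

    if index_column not in header:
        problems.append("missing %s column" % index_column)
        return problems

    idx = header.index(index_column)
    ids = [row[idx] if idx < len(row) else "" for row in reader]
    if any(value == "" for value in ids):
        problems.append("blank %s" % index_column)
    for value, count in sorted(Counter(ids).items()):
        if value and count > 1:
            problems.append("repeated %s: %s" % (index_column, value))
    return problems


# Example with made-up data
example = (
    "participant_id,sr_t1_panas_pa,sr_t1_panas_pa\r\n"
    "p001,31,31\r\n"
    "p001,28,28\r\n"
    ",25,25\r\n"
)
problems = check_data_file(example)
```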

Getting started for development

Create a virtualenv:

virtualenv ~/env/masterfile
source ~/env/masterfile/bin/activate

Install the requirements and this module for development:

pip install -r requirements_dev.txt
pip install -e .

Run tests:

pytest

Run tests across all supported Python versions:

tox

To run the tests under a specific Python version:

tox -e py37

Credits

Written by Nate Vack <njvack@wisc.edu> with help from Dan Fitch <dfitch@wisc.edu>

masterfile packages some wonderful tools: schema and attrs.

schema is copyright (c) 2012 Vladimir Keleshev, vladimir@keleshev.com

attrs is copyright (c) 2015 Hynek Schlawack
