
The FMRIB UKBiobank Normalisation, Parsing And Cleaning Kit


FUNPACK is a Python library for pre-processing of UK BioBank data.

FUNPACK is developed at the Wellcome Centre for Integrative Neuroimaging (WIN@FMRIB), University of Oxford. FUNPACK is in no way endorsed, sanctioned, or validated by the UK BioBank.

FUNPACK comes bundled with metadata about the variables present in UK BioBank data sets. This metadata can be obtained from the UK BioBank online data showcase.

Installation

Install FUNPACK via pip:

pip install fmrib-unpack

Or from conda-forge:

conda install -c conda-forge fmrib-unpack

Introductory notebook

The funpack_demo command will start a Jupyter Notebook which introduces the main features provided by FUNPACK. To run it, you need to install a few additional dependencies:

pip install fmrib-unpack[demo]

You can then start the demo by running funpack_demo.

Usage

General usage is as follows:

funpack [options] output.tsv input1.tsv input2.tsv

You can get information on all of the options by typing funpack --help.

Options can be specified on the command line, and/or stored in a configuration file. For example, the options in the following command line:

funpack \
  --overwrite \
  --import_all \
  --log_file log.txt \
  --icd10_map_file icd_codes.tsv \
  --category 10 \
  --category 11 \
  output.tsv input1.tsv input2.tsv

Could be stored in a configuration file config.txt:

overwrite
import_all
log_file       log.txt
icd10_map_file icd_codes.tsv
category       10
category       11

And then executed as follows:

funpack -cfg config.txt output.tsv input1.tsv input2.tsv
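The correspondence between the two forms can be sketched in Python. This is only an illustration of the mapping described above, not FUNPACK's own parser; the helper name config_to_args is hypothetical, and the sketch assumes each non-blank line holds an option name optionally followed by a value.

```python
import shlex


def config_to_args(config_text):
    """Translate configuration-file lines into equivalent command-line arguments.

    Hypothetical helper for illustration only; FUNPACK's real parser may differ.
    """
    args = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The first word is the option name, anything after it is the value
        name, *value = line.split(None, 1)
        args.append('--' + name)
        if value:
            args.extend(shlex.split(value[0]))
    return args


config = """
overwrite
import_all
log_file       log.txt
icd10_map_file icd_codes.tsv
category       10
category       11
"""

print(config_to_args(config))
```

Each line of the configuration file becomes one option on the command line, and options that may be given more than once (such as category) simply appear on several lines.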

Features

FUNPACK allows you to perform various data sanitisation and processing steps on your data, such as:

  • NA value replacement: Specific values for some columns can be replaced with NA, for example, variables where a value of -1 indicates Do not know.

  • Categorical recoding: Certain categorical columns can be re-coded. For example, variables where a value of 555 represents half can be recoded so that 555 is replaced with 0.5.

  • Child value replacement: NA values within some columns which are dependent upon other columns may have values inserted based on the values of their parent columns.
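To make the three steps above concrete, here is a minimal sketch in plain Python. It is not how FUNPACK implements them; the function names and the example columns are hypothetical, and None stands in for NA.

```python
def replace_na(values, na_values):
    """Replace any value listed in na_values with None (standing in for NA)."""
    return [None if v in na_values else v for v in values]


def recode(values, raw_levels, new_levels):
    """Replace each raw level with the new level at the same position."""
    mapping = dict(zip(raw_levels, new_levels))
    return [mapping.get(v, v) for v in values]


def fill_from_parent(child, parent, parent_value, child_value):
    """Insert child_value where the child is missing and the parent equals parent_value."""
    return [child_value if c is None and p == parent_value else c
            for c, p in zip(child, parent)]


# NA value replacement, then categorical recoding
answers = [2, -1, 555, 3]
answers = replace_na(answers, [-1])
answers = recode(answers, [555], [0.5])
print(answers)

# Child value replacement: a hypothetical "cigarettes per day" column is
# filled with 0 wherever the hypothetical parent "ever smoked" column is 0
ever_smoked = [0, 1, 0]
per_day = [None, 20, None]
print(fill_from_parent(per_day, ever_smoked, 0, 0))
```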

See the introductory notebook for a more comprehensive overview of the features available in FUNPACK.

Built-in rules

FUNPACK contains a large number of built-in rules which have been specifically written to pre-process UK BioBank data variables. These rules are stored in the following files:

  • funpack/configs/fmrib/datacodings_*.tsv: Cleaning rules for data codings

  • funpack/configs/fmrib/variables_*.tsv: Cleaning rules for individual variables

  • funpack/configs/fmrib/processing.tsv: Processing steps

  • funpack/configs/fmrib/categories.tsv: Variable categories

You can use these rules by using the FMRIB configuration profile:

funpack -cfg fmrib output.tsv input.tsv

You can customise or replace these files as you see fit. You can also pass your own versions of these files to FUNPACK via the --variable_file, --datacoding_file, --type_file, --processing_file, and --category_file command-line options respectively. FUNPACK will load all variable and datacoding files, and merge them into a single table which contains the cleaning rules for each variable.

Creating your own rule files

To define rules at the data-coding level, create one or more .tsv files with an ID column containing the data-coding ID, and any of the following columns:

  • NAValues: A comma-separated list of values to replace with NA

  • RawLevels: A comma-separated list of values to be replaced with corresponding values in NewLevels.

  • NewLevels: A comma-separated list of replacement values for each of the values listed in RawLevels.

To apply these rules, pass your .tsv file(s) to funpack with the --datacoding_file option. They will be applied to all variables which use the data-coding(s) listed in the file(s).
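As a rough illustration, a data-coding rule file with the columns described above could be generated with Python's csv module. The data-coding IDs (1001, 1002), the rule values, and the helper name rules_to_tsv are all made up for this example.

```python
import csv
import io

# Hypothetical rules: the IDs and values are for illustration only
rules = [
    {'ID': '1001', 'NAValues': '-1,-3', 'RawLevels': '555', 'NewLevels': '0.5'},
    {'ID': '1002', 'NAValues': '-1',    'RawLevels': '1,2', 'NewLevels': '0,1'},
]


def rules_to_tsv(rows):
    """Format rule rows as tab-separated text with a header line."""
    columns = ['ID', 'NAValues', 'RawLevels', 'NewLevels']
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            delimiter='\t', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Save this text to a .tsv file and pass it via --datacoding_file
print(rules_to_tsv(rules))
```

The values within a cell are separated by commas, while the columns themselves are separated by tabs, so the two never conflict.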

To define rules at the variable level, create one or more .tsv files with an ID column containing the variable ID, and any of the following columns:

  • NAValues: As above

  • RawLevels: As above

  • NewLevels: As above

  • ParentValues: A comma-separated list of expressions on parent variables, defining conditions which should trigger child-value replacement.

  • ChildValues: A comma-separated list of values to insert into the variable when the corresponding expression in ParentValues evaluates to true.

  • Clean: A comma-separated list of cleaning functions to apply to the variable.

Output

The main output of FUNPACK is a plain-text tab-delimited file which contains the input data, after cleaning and processing, potentially with some columns removed, and new columns added.

If you used the --non_numeric_file option, the main output file will only contain the numeric columns; non-numeric columns will be saved to a separate file.

You can use any tool of your choice to load this output file, such as Python, MATLAB, or Excel. It is also possible to pass the output back into FUNPACK.
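For Python, a minimal sketch using only the standard library might look like the following. The column names in the sample are hypothetical, and treating empty fields as missing values is an assumption about how the output is written; the helper name read_tsv is likewise made up.

```python
import csv
import io


def read_tsv(lines):
    """Read tab-delimited lines into a list of dicts, one per row.

    Empty fields are converted to None (assumed here to mean "missing").
    """
    reader = csv.DictReader(lines, delimiter='\t')
    return [{name: (value if value != '' else None)
             for name, value in row.items()}
            for row in reader]


# A small in-memory sample with hypothetical column names; for a real
# file, use: with open('output.tsv', newline='') as f: rows = read_tsv(f)
sample = "eid\t31-0.0\t21022-0.0\n1000001\t1\t54\n1000002\t\t61\n"
rows = read_tsv(io.StringIO(sample))

for row in rows:
    print(row)
```

All values are read as strings here, so numeric columns still need converting. If pandas is available, pandas.read_csv(path, sep='\t') handles type conversion and missing values in one step.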

Loading output into MATLAB

If you are using MATLAB, you have several options for loading the FUNPACK output. The best option is readtable, which will load column names, and will handle both non-numeric data and missing values. Use readtable like so:

data = readtable('out.tsv', 'FileType', 'text');

The readtable function returns a table object, which stores each column as a separate vector (or cell-array for non-numeric columns). If you are only interested in numeric columns, you can retrieve them as an array like this:

data    = data(:, vartype('numeric'));
rawdata = data.Variables;

The readtable function may rename columns to ensure that they are valid MATLAB identifiers. You can retrieve the original names from the table object like so:

colnames        = data.Properties.VariableDescriptions;
colnames        = regexp(colnames, '''(.+)''', 'tokens', 'once');
empty           = cellfun(@isempty, colnames);
colnames(empty) = data.Properties.VariableNames(empty);
colnames        = vertcat(colnames{:});

If you have used the --description_file option, you can load in the descriptions for each column as follows:

descs = readtable('descriptions.tsv', ...
                  'FileType', 'text', ...
                  'Delimiter', '\t',  ...
                  'ReadVariableNames',false);
descs = [descs; {'eid', 'ID'}];
idxs  = cellfun(@(x) find(strcmp(descs.Var1, x)), colnames, ...
                'UniformOutput', false);
idxs  = cell2mat(idxs);
descs = descs.Var2(idxs);

Tests

To run the test suite, you need to install some additional dependencies:

pip install fmrib-unpack[test]

Then you can run the test suite using pytest:

pytest

Citing

If you would like to cite FUNPACK, please refer to its Zenodo page.
