The FMRIB UK Biobank data processing library

ukbparse is a Python library for pre-processing of UK BioBank data.

ukbparse is developed at the Wellcome Centre for Integrative Neuroimaging (WIN@FMRIB), University of Oxford. ukbparse is in no way endorsed, sanctioned, or validated by the UK BioBank.

ukbparse comes bundled with metadata about the variables present in UK BioBank data sets. This metadata can be obtained from the UK BioBank online data showcase.

Installation

Install ukbparse via pip:

pip install ukbparse

Or from conda-forge:

conda install -c conda-forge ukbparse

Introductory notebook

The ukbparse_demo command will start a Jupyter Notebook which introduces the main features provided by ukbparse. To run it, you need to install a few additional dependencies:

pip install ukbparse[demo]

You can then start the demo by running ukbparse_demo.

Usage

General usage is as follows:

ukbparse [options] output.tsv input1.tsv input2.tsv

You can get information on all of the options by typing ukbparse --help.

Options can be specified on the command line, and/or stored in a configuration file. For example, the options in the following command line:

ukbparse \
  --overwrite \
  --import_all \
  --log_file log.txt \
  --icd10_map_file icd_codes.tsv \
  --category 10 \
  --category 11 \
  output.tsv input1.tsv input2.tsv

Could be stored in a configuration file config.txt:

overwrite
import_all
log_file       log.txt
icd10_map_file icd_codes.tsv
category       10
category       11

And then executed as follows:

ukbparse -cfg config.txt output.tsv input1.tsv input2.tsv
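
The correspondence between the two forms is mechanical: each line of the configuration file holds an option name without its leading dashes, optionally followed by a value. As an illustration only, the following Python sketch expands such a file into the equivalent argument list. The config_to_args helper is hypothetical and is not part of ukbparse; it assumes whitespace-separated name/value pairs, as in the example above, and that blank lines can be ignored.

```python
import shlex


def config_to_args(text):
    """Expand config-file text into an equivalent list of command-line arguments.

    Hypothetical helper for illustration only; not part of ukbparse.
    """
    args = []
    for line in text.splitlines():
        tokens = shlex.split(line)
        if not tokens:
            continue  # skip blank lines
        name, values = tokens[0], tokens[1:]
        args.append("--" + name)  # option name gains its leading dashes
        args.extend(values)       # flags such as "overwrite" have no value
    return args


config = """
overwrite
import_all
log_file       log.txt
icd10_map_file icd_codes.tsv
category       10
category       11
"""

args = config_to_args(config)
print(" ".join(["ukbparse"] + args + ["output.tsv", "input1.tsv", "input2.tsv"]))
```

Options that are given more than once, such as category, simply appear on more than one line.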

Customising

ukbparse contains a large number of built-in rules which have been specifically written to pre-process UK BioBank data variables. These rules are stored in the following files:

  • ukbparse/data/variables_*.tsv: Cleaning rules for individual variables

  • ukbparse/data/datacodings_*.tsv: Cleaning rules for data codings

  • ukbparse/data/types.tsv: Cleaning rules for specific types

  • ukbparse/data/processing.tsv: Processing steps

You can customise or replace these files as you see fit. You can also pass your own versions of these files to ukbparse via the --variable_file, --datacoding_file, --type_file and --processing_file command-line options respectively. ukbparse will load all variable and datacoding files, and merge them into a single table which contains the cleaning rules for each variable.

Finally, you can use the --no_builtins option to bypass all of the built-in cleaning and processing rules.

Output

The main output of ukbparse is a plain-text tab-delimited file which contains the input data, after cleaning and processing, potentially with some columns removed, and new columns added.

If you used the --non_numeric_file option, the main output file will only contain the numeric columns; non-numeric columns will be saved to a separate file.

You can use any tool of your choice to load this output file, such as Python, MATLAB, or Excel. It is also possible to pass the output back into ukbparse.
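
For Python, a minimal sketch using only the standard library is shown below. It assumes a tab delimiter, a header row, and that missing values are written as empty fields; the load_tsv helper and the sample column names are made up for illustration and are not part of ukbparse.

```python
import csv
import io


def load_tsv(fileobj):
    """Read a tab-delimited file into (column names, list of row dicts).

    Empty fields become None, fields that parse as numbers become floats,
    and anything else is kept as a string.
    """
    reader = csv.reader(fileobj, delimiter="\t")
    header = next(reader)
    rows = []
    for record in reader:
        if not record:
            continue  # skip completely empty lines
        row = {}
        for name, field in zip(header, record):
            if field == "":
                row[name] = None  # assumed representation of a missing value
            else:
                try:
                    row[name] = float(field)
                except ValueError:
                    row[name] = field  # non-numeric data stays a string
        rows.append(row)
    return header, rows


# Made-up sample standing in for the contents of an output file.
sample = "eid\t1-0.0\t2-0.0\n1\t31.5\t\n2\t\tabc\n"
columns, rows = load_tsv(io.StringIO(sample))
print(columns)
print(rows[0])
```

To read a real file, pass an open file handle instead of the sample, for example open('out.tsv', newline='').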

Loading output into MATLAB

If you are using MATLAB, you have several options for loading the ukbparse output. The best option is readtable, which will load column names, and will handle both non-numeric data and missing values. Use readtable like so:

data = readtable('out.tsv', 'FileType', 'text');

The readtable function returns a table object, which stores each column as a separate vector (or cell-array for non-numeric columns). If you are only interested in numeric columns, you can retrieve them as an array like this:

rawdata =  data(:, vartype('numeric')).Variables;

The readtable function may rename columns to ensure that their names are valid MATLAB identifiers. You can retrieve the original names from the table object like so:

colnames        = data.Properties.VariableDescriptions;
colnames        = regexp(colnames, '''(.+)''', 'tokens', 'once');
empty           = cellfun(@isempty, colnames);
colnames(empty) = data.Properties.VariableNames(empty);
colnames        = vertcat(colnames{:});

If you have used the --description_file option, you can load in the descriptions for each column as follows:

descs = readtable('descriptions.tsv', ...
                  'FileType', 'text', ...
                  'Delimiter', '\t',  ...
                  'ReadVariableNames',false);
descs = [descs; {'eid', 'ID'}];
idxs  = cellfun(@(x) find(strcmp(descs.Var1, x)), colnames, ...
                'UniformOutput', false);
idxs  = cell2mat(idxs);
descs = descs.Var2(idxs);
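
For readers loading the output in Python rather than MATLAB, a rough counterpart of the snippet above might look as follows. It assumes what the MATLAB code implies: the description file is tab-delimited, has no header row, holds a column name and a description on each line, and has no entry for eid. The load_descriptions helper and the sample contents are hypothetical.

```python
import csv
import io


def load_descriptions(fileobj):
    """Map each column name to its description.

    Expects two tab-separated fields per line and no header row.
    """
    descs = {"eid": "ID"}  # the ID column is assumed to have no entry, so add one
    for record in csv.reader(fileobj, delimiter="\t"):
        if len(record) >= 2:
            descs[record[0]] = record[1]
    return descs


# Made-up sample standing in for the contents of descriptions.tsv.
sample = "1-0.0\tFirst variable\n2-0.0\tSecond variable\n"
descs = load_descriptions(io.StringIO(sample))

# Look up descriptions in the same order as the output columns.
colnames = ["eid", "1-0.0", "2-0.0"]
print([descs[name] for name in colnames])
```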

Tests

To run the test suite, you need to install some additional dependencies:

pip install ukbparse[test]

Then you can run the test suite using pytest:

pytest

Citing

If you would like to cite ukbparse, please refer to its Zenodo page (DOI: 10.5281/zenodo.1997626).
