The FUNPACK FMRIB configuration profile
FUNPACK is a Python library for pre-processing of UK BioBank data. The
fmrib-unpack-fmrib-config package contains a configuration profile for FUNPACK
which encodes a large set of cleaning and processing rules for a range of UK
BioBank data fields.
FUNPACK depends on fmrib-unpack-fmrib-config, so if FUNPACK is installed, then
you already have the fmrib configuration profile, and can use it like so:
fmrib_unpack -cfg fmrib_standard out.tsv <input.csv>
Overview
The FUNPACK fmrib_standard configuration profile performs the following steps.
This is an overview - refer to the configuration files for all details:
Data import
All data-fields from the categories listed in fmrib_cats.cfg are imported.
These categories are defined in categories.tsv. Data fields which are not in
any of these categories are not imported.
Notes:
- Some data-field categories which are not of direct interest are explicitly excluded (currently category 100).
- Some categories (specifically 1, 31, 60, 70, 96, 98, and 99) contain secondary/auxiliary data-fields which are not of direct interest, but need to be in the output file. These categories are excluded from some processing steps (see below).
Cleaning/preprocessing
- NA value replacement (removing certain values and replacing them with NA) is performed on all data fields which use the data codings listed in datacodings_navalues.tsv.
- All date/time data-fields are converted into floating point numbers of the form <YYYY>.fraction. This rule is defined in datetime_formatting.tsv, and the conversion logic is defined in the funpack.plugins.fmrib module.
- Categorical quantitative recoding (e.g. replacing potentially quantitative quantised/categorical codings with more monotonic/sensible codings) is performed on all data fields which use the data codings listed in datacodings_recoding.tsv.
- Child value replacement (inferring the values of missing data-fields based on responses to parent data-fields) is performed on all data-fields listed in variables_parentvalues.tsv.
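The first two cleaning steps above can be illustrated with a minimal sketch. This is not the FUNPACK implementation (the real logic lives in the configuration files and in the funpack.plugins.fmrib module). The helper names `replace_na` and `fractional_year` are hypothetical, `None` stands in for NA, and the fraction is assumed to be the elapsed proportion of the calendar year.

```python
import calendar
from datetime import datetime


def replace_na(values, na_codes):
    """Replace every value listed in na_codes with None (standing in for NA)."""
    return [None if v in na_codes else v for v in values]


def fractional_year(dt):
    """Convert a datetime to <YYYY>.fraction.

    Assumption: the fraction is the elapsed share of the calendar year,
    taking leap years into account.
    """
    year_start = datetime(dt.year, 1, 1)
    days_in_year = 366 if calendar.isleap(dt.year) else 365
    elapsed_days = (dt - year_start).total_seconds() / 86400
    return dt.year + elapsed_days / days_in_year
```

For example, `replace_na([3, -1, 0, -3], {-1, -3})` blanks the two illustrative codes, and a timestamp at noon on 2 July 2021 falls exactly halfway through that year.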
Processing
All subsequent processing steps are specified in processing.tsv, and are
described here:
- A number of categorical data fields are binarised - a separate column is created for each category, with a 1 for subjects in that category, or a 0 otherwise.
- ICD10 and ICD9 data-fields 41270 and 41271 are binarised, but instead of containing 1/0, they contain the corresponding diagnosis dates, taken respectively from data-fields 41280 and 41281.
- Sparse columns are removed. For most data-fields, a column is deemed sparse if any of these conditions holds:
  - Contains 50 or fewer data points
  - Has a standard deviation of less than 1e-6 (only applied to numeric data-fields)
  - If categorical, one category comprises 99% or more of all data

  Data-fields from secondary/auxiliary categories are excluded from this sparsity test.
- Columns which were binarised as outlined above are subjected to a different sparsity test - any columns which have fewer than 10 non-0 entries are dropped.
- Redundant columns are removed. Correlation and missingness correlation are calculated between all pairs of columns. If the correlation between a pair of columns exceeds 0.99 and the missingness correlation exceeds 0.2, the column with more missing values is removed. ICD9/10 columns are excluded from this step, along with data-fields from secondary/auxiliary categories.
- New binary columns are generated for the ICD10 and ICD9 in-patient hospital diagnosis data fields 41270 and 41271 (for the columns remaining after the sparsity/redundancy tests) indicating, for each diagnosis, whether it was a primary or secondary diagnosis. This information is obtained from data-fields 41202, 41203, 41204, and 41205, which are subsequently removed from the data set.
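The general sparsity test and the redundancy test described above can be sketched in plain Python. This is an illustration under stated assumptions, not FUNPACK's own code. The function names are hypothetical, `None` stands in for a missing value, the standard deviation is the population form, the 99% share is taken over non-missing values, and the correlation is a signed Pearson correlation over rows where both columns are present.

```python
import math
import statistics
from collections import Counter


def present(values):
    """Drop missing entries (None stands in for NA in this sketch)."""
    return [v for v in values if v is not None]


def is_sparse(values, categorical=False):
    """Apply the three sparsity conditions to a single column."""
    vals = present(values)
    if len(vals) <= 50:                      # 50 or fewer data points
        return True
    if categorical:                          # one category holds >= 99%
        top_count = Counter(vals).most_common(1)[0][1]
        return top_count / len(vals) >= 0.99
    return statistics.pstdev(vals) < 1e-6    # near-constant numeric column


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 where it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def column_to_drop(a, b):
    """Return "a" or "b" if the pair is redundant, otherwise None."""
    both = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    corr = pearson([x for x, _ in both], [y for _, y in both])
    na_corr = pearson([float(x is None) for x in a],
                      [float(y is None) for y in b])
    if corr > 0.99 and na_corr > 0.2:
        # Drop the column with more missing values (ties drop "b" here).
        return "a" if len(present(a)) < len(present(b)) else "b"
    return None
```

Both thresholds must be exceeded before a column is dropped, so two perfectly correlated columns with unrelated missingness patterns are both kept in this sketch.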
Notes on ICD9/ICD10 data-fields
ICD10 in-patient hospital diagnosis codes are available in the raw data in the following data fields:
- 41270: All diagnoses (primary and secondary). Corresponding dates are given in 41280.
- 41201: As above, but containing external causes only. Corresponding dates are not available in a separate data field (but are available in 41270/41280).
- 41202: As above, but containing primary diagnoses only. Corresponding dates are given in 41262.
- 41204: As above, but containing secondary diagnoses only. Corresponding dates are not available in a separate data field (but are available in 41270/41280).
ICD9 diagnosis codes follow the same structure, and are available in data fields 41271 (all diagnoses, dates in 41281), 41203 (primary diagnoses, dates in 41263), and [41205](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41205) (secondary diagnoses).
In the output data, data-fields 41270 (ICD10) and 41271 (ICD9) are re-arranged so that there is one column per diagnosis code. These columns are named as 41270-<code> or 41271-<code>, e.g. 41270-A044, and contain the diagnosis date (taken from 41280 and 41281) for subjects with the diagnosis, or a 0 for subjects without the diagnosis.
Binary columns are also generated for each diagnosis code indicating whether it was a primary or secondary diagnosis - this information is obtained from data fields 41202, 41203, 41204, and 41205. These columns are given names:
41202-<code>.primary
41203-<code>.secondary
41204-<code>.primary
41205-<code>.secondary
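The re-arrangement of 41270 and 41271 into one date-valued column per diagnosis code can be sketched as follows. This is only an illustration of the output layout described above. The function name `binarise_with_dates` and the per-subject list-of-lists input are assumptions, not FUNPACK's internal representation, and dates are assumed to have already been converted to numbers.

```python
def binarise_with_dates(codes_per_subject, dates_per_subject, field="41270"):
    """Build one column per diagnosis code, named <field>-<code>.

    Each column holds the diagnosis date for subjects with that
    diagnosis, and 0 for subjects without it.
    """
    n_subjects = len(codes_per_subject)
    columns = {}
    for i, (codes, dates) in enumerate(zip(codes_per_subject,
                                           dates_per_subject)):
        for code, date in zip(codes, dates):
            # Create the column on first sight of a code, filled with 0.
            column = columns.setdefault(f"{field}-{code}", [0] * n_subjects)
            column[i] = date
    return columns
```

With three subjects, where the first has codes A044 and I10, the second has none, and the third has I10, the result contains the two columns 41270-A044 and 41270-I10, each with three entries.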
Output files
For this command:
fmrib_unpack -cfg fmrib_standard out.tsv <input.csv>
All processed data-fields will be saved to out.tsv. Note that all non-numeric
columns are removed, so this file only contains numeric columns.
The following files are also saved:
- out_log.txt: Log messages, useful for troubleshooting
- out_summary.txt: Summary of all rules applied to every data-field in the input file
- out_description.txt: Description of every column in the output file.
- out_icd10_map.txt: Every ICD10 diagnosis code in the output file, along with its equivalent numeric code and text description
The fmrib_new_release profile (see below) also produces:
- out_unknown_vars.txt: List of all columns from previously unknown/uncategorised data-fields, and whether or not they passed processing and were exported.
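Because out.tsv contains only numeric columns, it can be read with the standard library alone. The sketch below assumes a tab-separated file with a single header row, and assumes that missing values appear as empty cells. Neither detail is confirmed here, so check them against a real output file. `load_numeric_tsv` is a hypothetical helper, not part of FUNPACK.

```python
import csv
import math


def load_numeric_tsv(path):
    """Read a tab-separated file into (header, rows) with float cells.

    Assumption: empty cells denote missing values, mapped to NaN here.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        rows = [
            [float(cell) if cell.strip() else math.nan for cell in row]
            for row in reader
        ]
    return header, rows
```

Calling `header, rows = load_numeric_tsv("out.tsv")` then gives the column names alongside one list of floats per subject.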
Other configuration profiles
The fmrib_standard profile, as described above, is used within FMRIB for the
preprocessing of all non-imaging UKB data. Some other configuration profiles
are also available:
- fmrib: As above, but all data-fields present in the input file(s) are loaded, and logging/additional output files are not generated.
- fmrib_new_release: Equivalent to fmrib_standard, but loads and processes all data-fields (except those in explicitly excluded categories), and outputs a summary of any unknown/uncategorised data-fields.