Utility package for handling UK Biobank data

ukbiobank-loaders

This repository provides an easy way to load UK Biobank data. It consists of a pre-processing script, which converts the raw UK Biobank files into Parquet files that are easier to read, and a library that provides methods to access the processed data.

Installation

To install this package, simply run

pip install ukbiobank-loaders

Python 3.7 or newer is required.

Usage

This section describes how to use the library. Data can be read from both local directories and AWS S3 directories.

Pre-processing

The pre-processing requires the following UK Biobank files, all saved in the same directory <DATA_FOLDER>:

death.txt
death_cause.txt
gp_clinical.txt
gp_scripts.txt
hesin.txt
hesin_diag.txt
hesin_oper.txt

The withdrawn consent file is also needed:

withdrawn_consent.txt

From the terminal, run

update_data.py --raw_dir <DATA_FOLDER> --withdrawn_file <WITHDRAWN_CONSENT_FILE_PATH> --out_dir <OUTPUT_DIR_FOLDER>

The processed data will be saved in a folder named <OUTPUT_DIR_FOLDER>/final.

We found this process to take about 14 minutes on a pod with 4 CPUs and 32 GB of RAM. If the process is reported as Killed, it may be because there is not enough RAM available.
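Before launching the script, it can help to confirm that the required raw files are all present in the data folder. The following is a minimal sketch that assumes the file names listed above; `missing_raw_files` is a hypothetical helper written for illustration and is not part of the package.

```python
from pathlib import Path

# File names as listed in the pre-processing section above.
REQUIRED_FILES = [
    "death.txt",
    "death_cause.txt",
    "gp_clinical.txt",
    "gp_scripts.txt",
    "hesin.txt",
    "hesin_diag.txt",
    "hesin_oper.txt",
]


def missing_raw_files(raw_dir):
    """Return the required file names that are absent from raw_dir.

    Hypothetical helper, not part of ukbiobank-loaders.
    """
    raw_dir = Path(raw_dir)
    return [name for name in REQUIRED_FILES if not (raw_dir / name).is_file()]
```

Calling `missing_raw_files("<DATA_FOLDER>")` returns an empty list when every file is in place. The withdrawn consent file is passed to the script by its own path, so it is not checked here.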

Accessing the data

This is a simple example of how to use the library. Detailed documentation of the methods is given below.

>>> from ukbb_loaders.loaders import load
>>> dl = load.DataLoader(data_dir = "<OUTPUT_DIR_FOLDER>/final")
>>> dl.get_hospital_data("icd10")
    date_of_visit source feature  value
eid
68     1986-04-22  icd10    N181      1
68     1945-05-03  icd10    N181      1
68     1950-04-03  icd10    N181      1
68     1966-08-07  icd10    N181      1
67     1991-03-12  icd10    N181      1
..            ...    ...     ...    ...
73            NaT  icd10    N181      1
48     1997-06-20  icd10    N181      1
48     1945-03-05  icd10    N181      1
48     1956-02-25  icd10    N181      1
48     1981-04-08  icd10    N181      1
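Once loaded, the result is an ordinary pandas dataframe, so the usual pandas operations apply. The sketch below uses a small synthetic frame with the same layout as the output above (the values are made up) to count the patients who have at least one dated record for a given code.

```python
import pandas as pd

# Synthetic stand-in for the frame returned by dl.get_hospital_data("icd10").
# Only the layout follows the example above; the values are invented.
df = pd.DataFrame(
    {
        "date_of_visit": pd.to_datetime(
            ["1986-04-22", "1991-03-12", None, "1997-06-20"]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181", "N181", "I10", "N181"],
        "value": [1, 1, 1, 1],
    },
    index=pd.Index([68, 67, 73, 48], name="eid"),
)

# Keep rows for one code and drop visits without a recorded date (NaT).
n181 = df[(df["feature"] == "N181") & df["date_of_visit"].notna()]

# Number of distinct patients with at least one dated N181 record.
n_patients = n181.index.nunique()
```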

Documentation for ukbb_loaders.loaders.load

Loaders for versioned UKBB data.

DataLoader Objects

class DataLoader()

__init__

def __init__(data_dir: str)

Class for loading UKBB data.

Arguments:

  • data_dir str - The path to the directory containing the processed data. Note that on Windows the path must have forward slashes, e.g. "C:/Users/john/Documents/data_dir"
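On Windows, a path written with backslashes can be converted to the forward-slash form with the standard library. A short sketch, using the example path from the note above:

```python
from pathlib import PureWindowsPath

# A Windows path as it is often copied from the file explorer.
windows_path = r"C:\Users\john\Documents\data_dir"

# as_posix() rewrites the separators as forward slashes.
data_dir = PureWindowsPath(windows_path).as_posix()
```

The resulting string can then be passed as `data_dir` to `DataLoader`.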

get_hospital_data

def get_hospital_data(source: Union[str, List[str]],
                      level=None,
                      patient_list: np.ndarray = None) -> pd.DataFrame

Arguments:

  • source str or list - The coding/representation/source we would like to fetch. It needs to be one or more of:
    • icd10 - for fetching all icd10 related diagnoses.
    • icd9 - for fetching all icd9 related diagnoses.
    • opcs3 - for fetching all opcs3 related operation codes.
    • opcs4 - for fetching all opcs4 related operation codes.
  • level list or string - The level/significance of diagnoses we would like to fetch. It needs to be one or more of the following; defaults to all of them:
    • primary - for fetching only the primary code related to one diagnosis.
    • secondary - for fetching all the secondary (complementary) codes for one diagnosis.
    • external - for fetching diagnosis codes from external sources.
  • patient_list np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.

Returns:

  • df pandas dataframe - A long canonical dataframe with patients as the index and the following columns:
    • date_of_visit: pandas datetime for each hospital visit
    • feature: the different codes used (e.g. the different icd10 codes)
    • source: this is relevant to the source the feature is referring to (e.g. icd10)
    • value: the occurrence value for each row combination (initially 1.)
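The long layout described above lends itself to standard pandas reshaping. The following sketch works on a synthetic frame with the documented columns (the values are made up) and derives the earliest visit per patient and code, as well as a wide patient-by-code count matrix.

```python
import pandas as pd

# Synthetic long frame with the documented columns; values are invented.
long_df = pd.DataFrame(
    {
        "date_of_visit": pd.to_datetime(
            ["1986-04-22", "1945-05-03", "1991-03-12", "1997-06-20"]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181", "N181", "N181", "I10"],
        "value": [1, 1, 1, 1],
    },
    index=pd.Index([68, 68, 67, 48], name="eid"),
)

# Move eid from the index into a regular column for grouping.
flat = long_df.reset_index()

# Earliest recorded date for each (patient, code) pair.
first_seen = flat.groupby(["eid", "feature"])["date_of_visit"].min()

# Wide patient-by-code matrix of occurrence counts; absent pairs become 0.
wide = flat.pivot_table(
    index="eid", columns="feature", values="value", aggfunc="sum", fill_value=0
)
```

Summing the `value` column is what turns the per-row occurrence of 1 into a count per patient and code.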

get_death_data

def get_death_data(level=None,
                   patient_list: np.ndarray = None) -> pd.DataFrame

Method that fetches death information for the UKBB population.

Arguments:

  • level list or string - The level/significance of deaths we would like to fetch. It needs to be one or both of: primary (main cause of death), secondary. Defaults to both.
  • patient_list np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.

Returns:

  • df pandas dataframe - A long canonical dataframe with patients as the index and all recorded death information including death date in the right format.
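The patient_list argument, shared by all the loader methods, is a NumPy array of patient identifiers. The sketch below builds such an array from a list of made-up ids, assuming eids are integers as in the example output earlier. Since the library call needs real data, the expected effect of the restriction is shown on a synthetic frame instead.

```python
import numpy as np
import pandas as pd

# Made-up patient ids, possibly with duplicates.
raw_ids = [48, 67, 67, 68]

# De-duplicated, sorted array suitable for the patient_list argument.
patient_list = np.unique(np.asarray(raw_ids, dtype=np.int64))

# With processed data available, this would be passed as, for example:
#   dl.get_death_data(level="primary", patient_list=patient_list)

# Illustration on a synthetic frame: keep only rows whose eid is listed.
frame = pd.DataFrame(
    {"value": [1, 1, 1]}, index=pd.Index([48, 73, 68], name="eid")
)
subset = frame[frame.index.isin(patient_list)]
```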

get_gp_clinical_data

def get_gp_clinical_data(source=None, patient_list: np.ndarray = None)

Method that fetches GP diagnosis information for the UKBB population.

Arguments:

  • source str or list - Whether to load read_2, read_3 or both. Defaults to both.
  • patient_list np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.

Returns:

  • df pandas dataframe - A long canonical dataframe with patients as the index and all recorded gp information including date in the right format.
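Several arguments in this API accept either a single string or a list, with None meaning "all". The following is a hypothetical helper showing one way such input could be normalised and validated before calling a loader. It is not the library's own code, and `normalise_sources` and `ALLOWED_GP_SOURCES` are names introduced here for illustration.

```python
# Source names as documented for get_gp_clinical_data.
ALLOWED_GP_SOURCES = ("read_2", "read_3")


def normalise_sources(source=None):
    """Return the requested sources as a list, defaulting to all of them.

    Hypothetical helper, not part of ukbiobank-loaders.
    """
    if source is None:
        return list(ALLOWED_GP_SOURCES)
    if isinstance(source, str):
        source = [source]
    unknown = [s for s in source if s not in ALLOWED_GP_SOURCES]
    if unknown:
        raise ValueError(f"Unknown source(s): {unknown}")
    return list(source)
```

Validating up front gives a clear error for a misspelled source name rather than leaving the failure to happen somewhere inside the loader.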

get_gp_medication_data

def get_gp_medication_data(patient_list: np.ndarray = None) -> pd.DataFrame

Arguments:

  • patient_list np.ndarray - The patients to fetch medication data for. If this is empty, all UKBB patients will be used.

Returns:

  • df pandas dataframe - A canonical long dataframe with patients as the index and features as columns.

Acknowledgments

This package is developed using the UK Biobank Resource under Application Number 43138.
