Utility package for handling UK Biobank data

ukbiobank-loaders

This repository provides an easy way to load UK Biobank data. It consists of a pre-processing script, which converts the raw UK Biobank files into Parquet files that are easier to read, and a library that provides methods to access the processed data.

Installation

To install this package, simply run

pip install ukbiobank-loaders

Please note that Python 3.7 or newer is required.

Usage

This section describes how to use the library. Data can be read from both local directories and AWS S3 directories.

Pre-processing

The pre-processing needs the following UK Biobank files, all saved in the same directory <DATA_FOLDER>:

death.txt
death_cause.txt
gp_clinical.txt
gp_scripts.txt
hesin.txt
hesin_diag.txt
hesin_oper.txt

The withdrawn consent file is also needed:

withdrawn_consent.txt
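
Before starting the pre-processing, it can help to confirm that every required file is present. The following is a minimal sketch, not part of the package: `missing_raw_files` is a hypothetical helper, and the file names are copied from the list above. The withdrawn consent file is left out of the check because it is passed separately.

```python
from pathlib import Path
import tempfile

# File names copied from the list of required UK Biobank files above.
REQUIRED_FILES = [
    "death.txt",
    "death_cause.txt",
    "gp_clinical.txt",
    "gp_scripts.txt",
    "hesin.txt",
    "hesin_diag.txt",
    "hesin_oper.txt",
]


def missing_raw_files(raw_dir):
    """Hypothetical helper: return the required file names absent from raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in REQUIRED_FILES if not (raw_dir / name).is_file()]


# Demonstration on a throwaway directory holding only the first five files.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES[:5]:
        (Path(tmp) / name).touch()
    missing = missing_raw_files(tmp)

print(missing)
```

In real use, the helper would be called with the path of <DATA_FOLDER>, and an empty list means all seven files were found.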

From the terminal, run

update_data.py --raw_dir <DATA_FOLDER> --withdrawn_file <WITHDRAWN_CONSENT_FILE_PATH> --out_dir <OUTPUT_DIR_FOLDER>

The processed data will be saved in a folder named <OUTPUT_DIR_FOLDER>/final.

In our tests, this process took about 14 minutes on a pod with 4 CPUs and 32GB of RAM. If the process ends with the message Killed, the most likely cause is that not enough RAM is available.

Accessing the data

This is a simple example of how to use the library. Specific documentation about the methods is given below.

>>> from ukbb_loaders.loaders import load
>>> dl = load.DataLoader(data_dir = "<OUTPUT_DIR_FOLDER>/final")
>>> dl.get_hospital_data("icd10")
    date_of_visit source feature  value
eid
68     1986-04-22  icd10    N181      1
68     1945-05-03  icd10    N181      1
68     1950-04-03  icd10    N181      1
68     1966-08-07  icd10    N181      1
67     1991-03-12  icd10    N181      1
..            ...    ...     ...    ...
73            NaT  icd10    N181      1
48     1997-06-20  icd10    N181      1
48     1945-03-05  icd10    N181      1
48     1956-02-25  icd10    N181      1
48     1981-04-08  icd10    N181      1
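
Because the result is an ordinary pandas dataframe indexed by eid, standard pandas operations apply to it. The sketch below uses a small made-up frame that mimics the layout shown above (it is not real UK Biobank data) and computes the earliest recorded visit per patient.

```python
import pandas as pd

# Made-up toy rows that mimic the documented layout; not real UK Biobank data.
hospital = pd.DataFrame(
    {
        "date_of_visit": pd.to_datetime(
            ["1986-04-22", "1945-05-03", "1991-03-12", None]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181"] * 4,
        "value": [1] * 4,
    },
    index=pd.Index([68, 68, 67, 73], name="eid"),
)

# Earliest recorded visit per patient; a patient with no dated visit stays NaT.
first_visit = hospital.groupby(level="eid")["date_of_visit"].min()

# Patients that have at least one dated record.
dated_patients = first_visit.dropna().index.to_numpy()

print(first_visit)
print(dated_patients)
```

With a real loader, `hospital` would instead be the frame returned by `dl.get_hospital_data("icd10")`.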

Documentation for ukbb_loaders.loaders.load

DataLoader

Loaders for versioned UKBB data.

DataLoader Objects

class DataLoader()

__init__

def __init__(data_dir: str)

Class for loading UKBB data.

Arguments:

data_dir str - The path to the directory containing the processed data. Note that on Windows the path must have forward slashes, e.g. "C:/Users/john/Documents/data_dir"
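
On Windows, a path written with backslashes can be converted to the forward-slash form with the standard library. This is a small sketch using `pathlib`; the example path is the one from the argument description above.

```python
from pathlib import PureWindowsPath

# A Windows path written with backslashes (raw string, so "\U" is not an escape).
windows_path = r"C:\Users\john\Documents\data_dir"

# as_posix() returns the same path with forward slashes.
data_dir = PureWindowsPath(windows_path).as_posix()

print(data_dir)  # C:/Users/john/Documents/data_dir
```

The resulting string can then be passed as `data_dir` when constructing the `DataLoader`.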

get_hospital_data

def get_hospital_data(source: Union[str, List[str]],
                      level=None,
                      patient_list: np.ndarray = None) -> pd.DataFrame

Arguments:

source str or list - The coding/representation/source to fetch. It must be one or more of:
- icd10 - for fetching all icd10 related diagnoses.
- icd9 - for fetching all icd9 related diagnoses.
- opcs3 - for fetching all opcs3 related operation codes.
- opcs4 - for fetching all opcs4 related operation codes.
level list or string - The level/significance of diagnoses to fetch. It must be one or more of the following. Defaults to all of them.
- primary - for fetching only the primary code related to one diagnosis.
- secondary - for fetching all the secondary (complementary) codes for one diagnosis.
- external - for fetching diagnosis codes from external sources.
patient_list np.ndarray - The patients to fetch data for. If this is empty, all UKBB patients will be used.

Returns:

df pandas dataframe - A long canonical dataframe with patients as the index and the following columns:
- date_of_visit: pandas datetime for each hospital visit
- feature: the different codes used (e.g. the different icd10 codes)
- source: this is relevant to the source the feature is referring to (e.g. icd10)
- value: the occurrence value for each row combination (initially 1.)
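
A long dataframe with these columns can be reshaped into one row per patient and one column per code. The following is a sketch on made-up toy data (not real UK Biobank records), using only pandas and the column names documented above.

```python
import pandas as pd

# Made-up toy rows with the documented columns; not real UK Biobank data.
long_df = pd.DataFrame(
    {
        "date_of_visit": pd.to_datetime(
            ["1986-04-22", "1990-01-15", "1991-03-12", "1997-06-20"]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181", "I10", "N181", "I10"],
        "value": [1, 1, 1, 1],
    },
    index=pd.Index([68, 68, 67, 48], name="eid"),
)

# One row per patient, one column per code; 1 if the code was ever recorded,
# 0 otherwise. reset_index() turns eid into a regular column for pivot_table.
wide = long_df.reset_index().pivot_table(
    index="eid",
    columns="feature",
    values="value",
    aggfunc="max",
    fill_value=0,
)

print(wide)
```

Using `aggfunc="max"` yields a presence indicator; `aggfunc="sum"` would count the number of records per code instead.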

get_death_data

def get_death_data(level=None,
                   patient_list: np.ndarray = None) -> pd.DataFrame

Method that fetches death information for the UKBB population.

Arguments:

level list or string - The level/significance of the causes of death to fetch. It must be one or both of: primary (main cause of death), secondary. Defaults to both.
patient_list np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.

Returns:

df pandas dataframe - A long canonical dataframe with patients as the index and all recorded death information including death date in the right format.
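
The patient_list argument makes it possible to restrict a query to a cohort. The sketch below builds such an array from a made-up hospital frame (not real UK Biobank data). It assumes that patient_list holds eid values, which the documentation above does not state explicitly; the final call is shown as a comment because it needs processed data to run.

```python
import numpy as np
import pandas as pd

# Made-up toy rows in the documented hospital layout; not real UK Biobank data.
hospital = pd.DataFrame(
    {
        "source": ["icd10"] * 4,
        "feature": ["N181", "I10", "N181", "N181"],
        "value": [1] * 4,
    },
    index=pd.Index([68, 67, 48, 68], name="eid"),
)

# Unique patients with at least one N181 record, as a NumPy array.
has_code = hospital["feature"] == "N181"
patient_list = hospital.loc[has_code].index.unique().to_numpy()

print(patient_list)

# Assumed usage with a loader built on processed data:
# deaths = dl.get_death_data(level="primary", patient_list=patient_list)
```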

get_gp_clinical_data

def get_gp_clinical_data(source=None, patient_list: np.ndarray = None)

Method that fetches GP diagnosis information for the UKBB population.

Arguments:

source str or list - Whether to load read_2, read_3 or both. Defaults to both.
patient_list np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.

Returns:

df pandas dataframe - A long canonical dataframe with patients as the index and all recorded GP information including date in the right format.

get_gp_medication_data

def get_gp_medication_data(patient_list: np.ndarray = None) -> pd.DataFrame

Arguments:

patient_list np.ndarray - The patients to fetch medication data for. If this is empty, all UKBB patients will be used.

Returns:

df pandas dataframe - A canonical long dataframe with patients as the index and features as columns.

Acknowledgments

This package is developed using the UK Biobank Resource under Application Number 43138.

Release history

1.1.0

May 23, 2023

This version

1.0.0

Mar 30, 2023

Download files

Source Distribution

ukbiobank_loaders-1.0.0.tar.gz (11.8 kB view details)

Uploaded Mar 30, 2023 Source

Built Distribution

ukbiobank_loaders-1.0.0-py3-none-any.whl (14.3 kB view details)

Uploaded Mar 30, 2023 Python 3

File details

Details for the file ukbiobank_loaders-1.0.0.tar.gz.

File metadata

Download URL: ukbiobank_loaders-1.0.0.tar.gz
Upload date: Mar 30, 2023
Size: 11.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for ukbiobank_loaders-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`00fc60863a3d481008161cff64db9d02429efed578734b539c8f2369175c6952`
MD5	`ee52e1f870bb7fbc1125af602c99b6c5`
BLAKE2b-256	`7e06586b7a6a87c863e2f2021364e8b1a3b74153b7f7d0052b72d4146d679240`

File details

Details for the file ukbiobank_loaders-1.0.0-py3-none-any.whl.

File metadata

Download URL: ukbiobank_loaders-1.0.0-py3-none-any.whl
Upload date: Mar 30, 2023
Size: 14.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for ukbiobank_loaders-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e44163cc83c168cce1a3828e4ab1ec9022a3ff9d0cd616818b1a076798e74cb0`
MD5	`e6843c73f469405e3ccf4b231bb3d49b`
BLAKE2b-256	`c6e0fda9a5d91a72ad2a7ad57184993994367f2a01efa4adf8c3a94864b3866f`
