Code for working with the PSYCOP cohort
Project description
Installation
For development
pip install . -e
The -e
flag marks the install as editable, "overwriting" the package as you edit the source files.
Recommended to also add black as a pre-commit hook:
pre-commit install
For use
pip install git+https://github.com/Aarhus-Psychiatry-Research/psycop-ml-utils.git
Usage
Loading data from SQL
Currently only contains one function to load a view from SQL, sql_load
from loaders import sql_load
view = "[FOR_SFI_fritekst_resultat_udfoert_i_psykiatrien_aendret_2011]"
sql = "SELECT * FROM [fct]." + view
df = sql_load(sql, chunksize = None)
Flattening time series
To train baseline models (logistic regression, elastic net, SVM, XGBoost/random forest etc.), we need to represent the longitudinal data in a tabular, flattened way.
In essence, we need to generate a training example for each prediction time, where that example contains "latest_blood_pressure" (float), "X_diagnosis_within_n_hours" (boolean) etc.
To generate this, I propose the time-series flattener class (TimeSeriesFlattener
). It builds a dataset like described above.
TimeSeriesFlattener
class FlattenedDataset:
def __init__():
"""Class containing a time-series flattened.
Args:
prediction_times_df (DataFrame): Dataframe with prediction times.
prediction_timestamp_colname (str, optional): Colname for timestamps. Defaults to "timestamp".
id_colname (str, optional): Colname for patients ids. Defaults to "dw_ek_borger".
"""
def add_outcome():
"""Adds an outcome-column to the dataset
Args:
outcome_df (DataFrame): Cols: dw_ek_borger, datotid, value if relevant.
lookahead_days (float): How far ahead to look for an outcome in days. If none found, use fallback.
resolve_multiple (str): What to do with more than one value within the lookahead.
Suggestions: earliest, latest, mean, max, min.
fallback (List[str]): What to do if no value within the lookahead.
Suggestions: latest, mean_of_patient, mean_of_population, hardcode (qualified guess)
timestamp_colname (str): Column name for timestamps
values_colname (str): Colname for outcome values in outcome_df
id_colname (str): Column name for citizen id
new_col_name (str): Name to use for new col. Automatically generated as '{new_col_name}_within_{lookahead_days}_days'.
Defaults to using values_colname.
"""
def add_predictor():
"""Adds a predictor-column to the dataset
Args:
predictor_df (DataFrame): Cols: dw_ek_borger, datotid, value if relevant.
lookahead_days (float): How far ahead to look for an outcome in days. If none found, use fallback.
resolve_multiple (str): What to do with more than one value within the lookahead.
Suggestions: earliest, latest, mean, max, min.
fallback (List[str]): What to do if no value within the lookahead.
Suggestions: latest, mean_of_patient, mean_of_population, hardcode (qualified guess)
outcome_colname (str): What to name the column
id_colname (str): Column name for citizen id
timestamp_colname (str): Column name for timestamps
"""
Inspiration-code can be found in previous commits.
Example
- Update examples as API matures
import FlattenedDataset
dataset = FlattenedDataset(prediction_times_df = prediction_times, prediction_timestamp_colname = "timestamp", id_colname = "dw_ek_borger")
dataset.add_outcome(
outcome_df=type_2_diabetes_df,
lookahead_days=730,
resolve_multiple="max",
fallback=[0],
name="t2d",
)
dataset.add_predictor(
predictor=hba1c,
lookback_window=365,
resolve_multiple="max",
fallback=["latest", 40],
name="hba1c",
)
Dataset now looks like this:
dw_ek_borger | datetime_prediction | outc_t2d_within_next_730_days | pred_max_hba1c_within_prev_365_days |
---|---|---|---|
1 | yyyy-mm-dd hh:mm:ss | 0 | 48 |
2 | yyyy-mm-dd hh:mm:ss | 0 | 40 |
3 | yyyy-mm-dd hh:mm:ss | 1 | 44 |
For binary outcomes, add_predictor
with fallback = [0]
would take a df with only the times where the event occurred, and then generate 0's for the rest.
I propose we create the above functionality on a just-in-time basis, building the features as we need them.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for psycopmlutils-0.1.0-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 65a1892abb4a239585cbeb019431788b0d9682728101b82ad64fcafed8813366 |
|
MD5 | 7b480193439e0dbac501d222d40ca1a4 |
|
BLAKE2b-256 | fb82c05d2d84cf541d0c7a8a3c41c4266a6a37fc47bfc4d9b2f7e4dd2d03791c |