Code for working with the PSYCOP cohort
Project description
Installation
For development
pip install -e .
The -e flag marks the install as editable, so changes you make to the source files take effect without reinstalling the package.
We also recommend adding black as a pre-commit hook:
pre-commit install
For use
pip install git+https://github.com/Aarhus-Psychiatry-Research/psycop-ml-utils.git
sql_load
Currently contains a single function, sql_load, which loads a view from SQL.
from loaders import sql_load
view = ...
df = sql_load(...)
TimeSeriesFlattener
To train baseline models (logistic regression, elastic net, SVM, XGBoost/random forest etc.), we need to represent the longitudinal data in a tabular, flattened way.
In essence, we need to generate a training example for each prediction time, where that example contains "latest_blood_pressure" (float), "X_diagnosis_within_n_hours" (boolean) etc.
To generate this, I propose the time-series flattener class (TimeSeriesFlattener). It builds a dataset as described above.
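The flattening idea can be illustrated with a minimal, dependency-free sketch. It is not the package's implementation: the toy records, the helper `latest_value_within`, and the 30-day window are all assumptions made for illustration, and plain dicts stand in for dataframes.

```python
from datetime import datetime, timedelta

# Hypothetical toy data; column names follow the conventions used in this README.
prediction_times = [
    {"dw_ek_borger": 1, "prediction_time": datetime(2021, 1, 10)},
    {"dw_ek_borger": 2, "prediction_time": datetime(2021, 1, 10)},
]
blood_pressure = [
    {"dw_ek_borger": 1, "datotid": datetime(2021, 1, 1), "value": 120.0},
    {"dw_ek_borger": 1, "datotid": datetime(2021, 1, 8), "value": 135.0},
    {"dw_ek_borger": 2, "datotid": datetime(2020, 6, 1), "value": 110.0},
]


def latest_value_within(records, patient_id, prediction_time, lookback_days):
    """Return the patient's most recent value inside the lookback window, or None."""
    window_start = prediction_time - timedelta(days=lookback_days)
    in_window = [
        r
        for r in records
        if r["dw_ek_borger"] == patient_id
        and window_start <= r["datotid"] <= prediction_time
    ]
    if not in_window:
        return None
    return max(in_window, key=lambda r: r["datotid"])["value"]


# One flattened row per prediction time.
flattened = [
    {
        **row,
        "latest_blood_pressure": latest_value_within(
            blood_pressure, row["dw_ek_borger"], row["prediction_time"], lookback_days=30
        ),
    }
    for row in prediction_times
]
```

Each row in `flattened` corresponds to one prediction time. A patient with no record inside the window gets `None`, which is where a fallback strategy would apply.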
TimeSeriesFlattener
class FlattenedTimeSeries:
    Attributes:
        prediction_df (dataframe): Cols: dw_ek_borger, prediction_time, (value if relevant).
    Methods:
        add_outcome
            outcome_df (dataframe): Cols: dw_ek_borger, datotid, (value if relevant).
            lookahead_window (float): How far ahead to look for an outcome. If none found, use fallback.
            resolve_multiple (str): How to handle more than one record within the lookahead window. Suggestions: earliest, latest, mean, max, min.
            fallback (list): How to handle lack of a record within the lookahead window. Suggestions: latest, mean_of_patient, mean_of_population, hardcode (qualified guess)
            name (str): What to name the column
        add_predictor
            predictor (dataframe): Cols: dw_ek_borger, datotid, (value if relevant).
            lookback_window (float): How far back to look for a predictor. If none found, use fallback.
            resolve_multiple (str): How to handle more than one record within the lookback window. Suggestions: earliest, latest, mean, max, min.
            fallback (list): How to handle lack of a record within the lookback window. Suggestions: latest, mean_of_patient, mean_of_population, hardcode (qualified guess)
            name (str): What to name the column
Inspiration-code can be found in previous commits.
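As a rough sketch of how resolve_multiple and fallback could interact, the snippet below collapses the records found in a window to a single value. The `RESOLVERS` mapping and the `resolve` function are hypothetical names, timestamps are plain integers, and the fallback is simplified to a single hardcoded value rather than a list of strategies.

```python
from statistics import mean

# Hypothetical mapping from strategy name to aggregation function.
# Each function receives (timestamp, value) tuples sorted by timestamp.
RESOLVERS = {
    "earliest": lambda records: records[0][1],
    "latest": lambda records: records[-1][1],
    "mean": lambda records: mean(v for _, v in records),
    "max": lambda records: max(v for _, v in records),
    "min": lambda records: min(v for _, v in records),
}


def resolve(records, resolve_multiple, fallback):
    """Collapse the records in a window to one value, or return the fallback if empty."""
    if not records:
        return fallback
    ordered = sorted(records, key=lambda r: r[0])
    return RESOLVERS[resolve_multiple](ordered)


# Two hypothetical HbA1c measurements inside the window: (day, value).
hba1c_in_window = [(5, 48), (1, 44)]
highest = resolve(hba1c_in_window, "max", fallback=40)
first = resolve(hba1c_in_window, "earliest", fallback=40)
missing = resolve([], "max", fallback=40)
```

A dictionary of strategies keeps new options (for example a count or a sum) a one-line addition. Supporting a chain of fallbacks such as "latest" followed by a hardcoded value would need extra logic not shown here.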
Example
import FlattenedTimeSeries
dataset = FlattenedTimeSeries(prediction_df = prediction_times)
dataset.add_outcome(
    outcome=type_2_diabetes,
    lookahead_window=730,
    resolve_multiple="max",
    fallback=[0],
    name="t2d",
)
dataset.add_predictor(
    predictor=hba1c,
    lookback_window=365,
    resolve_multiple="max",
    fallback=["latest", 40],
    name="hba1c",
)
Dataset now looks like this:
dw_ek_borger | datetime_prediction | outc_t2d_within_next_730_days | pred_max_hba1c_within_prev_365_days |
---|---|---|---|
1 | yyyy-mm-dd hh:mm:ss | 0 | 48 |
2 | yyyy-mm-dd hh:mm:ss | 0 | 40 |
3 | yyyy-mm-dd hh:mm:ss | 1 | 44 |
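The column names in the table appear to follow a fixed pattern. The helpers below reproduce that pattern as inferred from this one example. The function names are hypothetical, and the real class may build names differently.

```python
def outcome_column_name(name: str, lookahead_days: int) -> str:
    """Build an outcome column name following the pattern seen in the example table."""
    return f"outc_{name}_within_next_{lookahead_days}_days"


def predictor_column_name(name: str, resolve_multiple: str, lookback_days: int) -> str:
    """Build a predictor column name following the pattern seen in the example table."""
    return f"pred_{resolve_multiple}_{name}_within_prev_{lookback_days}_days"


outcome_col = outcome_column_name("t2d", 730)
predictor_col = predictor_column_name("hba1c", "max", 365)
```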
For binary outcomes, add_predictor with fallback = [0] would take a df with only the times where the event occurred, and then generate 0's for the rest.
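The binary case can be sketched as follows, assuming an event table with one row per occurrence. The function `event_within_lookahead` and the toy data are hypothetical, and plain dicts again stand in for dataframes.

```python
from datetime import datetime, timedelta

# Hypothetical event table: one row per time the event occurred.
t2d_events = [{"dw_ek_borger": 3, "datotid": datetime(2021, 6, 1)}]

prediction_times = [
    {"dw_ek_borger": 1, "prediction_time": datetime(2021, 1, 1)},
    {"dw_ek_borger": 3, "prediction_time": datetime(2021, 1, 1)},
]


def event_within_lookahead(events, patient_id, prediction_time, lookahead_days, fallback=0):
    """Return 1 if the patient has an event inside the lookahead window, else the fallback."""
    window_end = prediction_time + timedelta(days=lookahead_days)
    for event in events:
        if (
            event["dw_ek_borger"] == patient_id
            and prediction_time < event["datotid"] <= window_end
        ):
            return 1
    return fallback


# Patients without a recorded event in the window receive the fallback value 0.
labels = [
    event_within_lookahead(t2d_events, row["dw_ek_borger"], row["prediction_time"], 730)
    for row in prediction_times
]
```

Only positive events need to be stored. The zeros are generated at flattening time for every prediction time that has no matching event.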
I propose we create the above functionality on a just-in-time basis, building the features as we need them.