Module for sepsis predictions

Predictions sepsis

Instruction

Predictions sepsis is a module built on pandas, torch, and scikit-learn for working with the MIMIC dataset. With a few function calls you can aggregate the data and train a model that predicts whether a patient has a given disease. By default, the module is set up to train on and predict sepsis. The names of the source tables and their columns can be changed through function arguments, so you can choose where the data is aggregated from.

Installation

To install the module, use the following command:

pip install predictions-sepsis

or

pip3 install predictions-sepsis

Usage

You can import functions from the module into your Python file to aggregate data from MIMIC, fill in missing values, compress the data by patient, and train your model.

Examples

Aggregate patient diagnoses data

import predictions_sepsis as ps

ps.get_diagnoses(patient_diagnoses_csv='path_to_patient_diagnoses.csv', 
                 all_diagnoses_csv='path_to_all_diagnoses.csv',
                 output_file_csv='gottenDiagnoses.csv')

Aggregate patient SSIR data

import predictions_sepsis as ps

ps.get_ssir(chartevents_csv='chartevents.csv', subject_id_col='subject_id', itemid_col='itemid',
            charttime_col='charttime', value_col='value', valuenum_col='valuenum', valueuom_col='valueuom',
            itemids=None, rest_columns=None, output_csv='ssir.csv')

Combine Diagnoses and SSIR Data

import predictions_sepsis as ps

ps.combine_diagnoses_and_ssir(gotten_diagnoses_csv='gottenDiagnoses.csv', 
                              ssir_csv='path_to_ssir.csv',
                              output_file='diagnoses_and_ssir.csv')

Aggregate patient blood analysis data from chartevents.csv and labevents.csv and combine it with diagnoses and SSIR data

import predictions_sepsis as ps

ps.merge_diagnoses_and_ssir_with_blood(diagnoses_and_ssir_csv='diagnoses_and_ssir.csv',
                                       blood_csv='path_to_blood.csv',
                                       chartevents_csv='path_to_chartevents.csv',
                                       output_csv='merged_data.csv')

Compress Data by patient

import predictions_sepsis as ps

ps.compress(df_to_compress='balanced_data.csv', 
            output_csv='compressed_data.csv')
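
To illustrate what compressing by patient can mean, here is a minimal pandas sketch that collapses several rows per patient into one. It assumes the first non-missing value of each column is kept; the module's own rule is not documented here, and the sample table is made up.

```python
import pandas as pd

# Made-up sample: two partial rows per patient
long_df = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "Amylase": [50.0, None, None, 61.0],
    "White Blood Cells": [None, 7.2, 9.1, None],
    "has_sepsis": [1, 1, 0, 0],
})

# GroupBy.first() takes the first non-null entry of each column within a group
compressed = long_df.groupby("subject_id", as_index=False).first()
print(compressed)
```

After this step each subject_id appears exactly once, which is the shape a per-patient classifier needs.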

Choose top non-sepsis patients to balance

import predictions_sepsis as ps

ps.choose(compressed_df_csv='compressed_data.csv', 
          output_file='final_balanced_data.csv')
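
One possible way to balance the classes is sketched below with pandas. It assumes that "top" means the non-sepsis patients with the fewest missing values, and that as many of them are kept as there are sepsis patients; both are assumptions, not the documented behaviour of ps.choose. The sample table is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4, 5, 6],
    "Amylase": [50.0, 61.0, None, 47.0, None, 55.0],
    "White Blood Cells": [7.2, None, None, 8.0, 6.5, 9.9],
    "has_sepsis": [1, 0, 0, 0, 0, 1],
})

sepsis = df[df["has_sepsis"] == 1]
others = df[df["has_sepsis"] == 0]

# Rank non-sepsis patients by how many values they are missing (fewest first)
missing = others.isna().sum(axis=1)
ranked = missing.sort_values(kind="mergesort").index  # mergesort keeps ties in original order
top_others = others.loc[ranked[: len(sepsis)]]

balanced = pd.concat([sepsis, top_others], ignore_index=True)
print(balanced["subject_id"].tolist())
```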

Fill missing values with mode

import predictions_sepsis as ps

ps.fill_values(balanced_csv='final_balanced_data.csv', 
               strategy='most_frequent', 
               output_csv='filled_data.csv')
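
The strategy name 'most_frequent' matches the mode-based strategy of scikit-learn's SimpleImputer. Whether fill_values uses that class internally is an assumption; the pandas sketch below only shows what filling with the mode does to a made-up table.

```python
import pandas as pd

df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "Amylase": [50.0, None, 50.0, 70.0],
    "has_sepsis": [1, 0, 0, 1],
})

# mode() returns one row per mode; the first row holds the
# most frequent value of each column (the smallest one on ties)
modes = df.mode().iloc[0]
filled = df.fillna(modes)
print(filled["Amylase"].tolist())
```

The missing Amylase value is replaced by 50.0, the value that occurs most often in that column.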

Train model

import predictions_sepsis as ps
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
model = ps.train_model(df_to_train_csv='filled_data.csv', 
                       categorical_col=['Large Platelets'], 
                       columns_to_train_on=['Amylase'], 
                       model=RandomForestClassifier(), 
                       single_cat_column='White Blood Cells', 
                       has_disease_col='has_sepsis', 
                       subject_id_col='subject_id', 
                       valueuom_col='valueuom', 
                       scaler=MinMaxScaler(), 
                       random_state=42, 
                       test_size=0.2)

Second way (without the module's helper functions)

Collecting features of the dataset

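import pandas as pd
from tqdm import tqdm

# file_path, output_path and item_ids_set must be defined beforehand.
# item_ids_set has to hold item IDs as strings, because values read from the file are strings.
result = {}

with open(file_path) as f:
    f.readline()  # skip the header row
    for line in tqdm(f):
        values = line.rstrip('\n').split(',')
        # Column positions: 0 = subject_id, 6 = itemid, 8 = valuenum
        subject_id = values[0]
        item_id = values[6]
        valuenum = values[8]
        if item_id in item_ids_set:
            # Keep the latest value seen for each (patient, item) pair
            result.setdefault(subject_id, {})[item_id] = valuenum

# One row per patient, one column per item ID
table = pd.DataFrame.from_dict(result, orient='index')
table['subject_id'] = table.index

table.to_csv(output_path, index=False)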
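
The loop above splits each line on commas, which misaligns the columns whenever a field contains a quoted comma. A minimal sketch of the same collection step using the standard csv module follows. The function name collect_features, the lookup of columns by header name, and the sample rows are assumptions made for illustration.

```python
import csv
import io


def collect_features(lines, item_ids):
    """Return {subject_id: {itemid: valuenum}} for the requested item IDs.

    `lines` is any iterable of CSV lines (an open file works).
    Later rows overwrite earlier ones for the same patient and item,
    as in the loop above.
    """
    wanted = {str(i) for i in item_ids}
    result = {}
    for row in csv.DictReader(lines):
        if row["itemid"] in wanted:
            result.setdefault(row["subject_id"], {})[row["itemid"]] = row["valuenum"]
    return result


# Made-up sample rows; the first one has a comma inside a quoted field
sample = io.StringIO(
    "subject_id,itemid,value,valuenum\n"
    '1,220045,"80, regular",80\n'
    "1,220045,85,85\n"
    "2,999999,1,1\n"
    "2,220210,18,18\n"
)
features = collect_features(sample, item_ids=[220045, 220210])
print(features)
```

The resulting dictionary can be passed to pd.DataFrame.from_dict(..., orient='index') exactly as before.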

Add a target to the dataset

target_subjects = drgcodes.loc[drgcodes['drg_code'].isin([870, 871, 872]), 'subject_id']
merged_data.loc[merged_data['subject_id'].isin(target_subjects), 'diagnosis'] = 1
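
If the diagnosis column does not exist yet, the assignment above creates it and leaves every unmatched patient as NaN. The sketch below, on made-up tables, initialises the column to 0 first so that the target is a clean 0/1 label. Note that if drg_code is read as text, the codes must be compared as strings instead.

```python
import pandas as pd

# Made-up sample tables; real files carry more columns
drgcodes = pd.DataFrame({"subject_id": [1, 2, 3], "drg_code": [870, 100, 872]})
merged_data = pd.DataFrame({"subject_id": [1, 2, 3, 4],
                            "Amylase": [50.0, 61.0, 47.0, 55.0]})

# Start every patient at 0 so that unmatched rows are not left as NaN
merged_data["diagnosis"] = 0
target_subjects = drgcodes.loc[drgcodes["drg_code"].isin([870, 871, 872]), "subject_id"]
merged_data.loc[merged_data["subject_id"].isin(target_subjects), "diagnosis"] = 1

print(merged_data["diagnosis"].tolist())
```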

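Filling in missing values using the NoNa library

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# nona comes from the NoNa package; X is the feature table with missing values
nona(
    data=X,
    algreg=make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=0.1)),
    algclass=RandomForestClassifier(max_depth=2, random_state=0)
)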

Removing class imbalance using SMOTE

from imblearn.over_sampling import SMOTE

# Only the training split is resampled
smote = SMOTE(random_state=random_state)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
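
To show the idea behind SMOTE, here is a simplified pure-Python sketch: each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours. It is an illustration of the principle only, not the imblearn implementation; the function name smote_sketch and the sample points are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Neighbours of `base` within the minority class, nearest first
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))
```

Because every new point lies between two existing minority samples, the synthetic data stays inside the region the minority class already occupies.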

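Train a TabNet model

import torch
from pytorch_tabnet.pretraining import TabNetPretrainer
from pytorch_tabnet.tab_model import TabNetClassifier

# Step 1: self-supervised pretraining on the features only (no labels)
unsupervised_model = TabNetPretrainer(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=pretraining_lr),
    mask_type=mask_type
)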

unsupervised_model.fit(
    X_train=X_train.values,
    eval_set=[X_val.values],
    pretraining_ratio=pretraining_ratio,
)

clf = TabNetClassifier(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=training_lr),
    scheduler_params=scheduler_params,
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    mask_type=mask_type
)

clf.fit(
    X_train=X_train.values, y_train=y_train.values,
    eval_set=[(X_val.values, y_val.values)],
    eval_metric=['auc'],
    max_epochs=max_epochs,
    patience=patience,
    from_unsupervised=unsupervised_model
)
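
The scheduler_params dictionary is passed to torch.optim.lr_scheduler.StepLR, which takes step_size and gamma and multiplies the learning rate by gamma every step_size epochs. The sketch below reproduces that arithmetic in plain Python; the parameter values and the starting learning rate are hypothetical, since the original does not state them.

```python
def step_lr(initial_lr, step_size, gamma, epoch):
    """Learning rate that a StepLR schedule yields at a given epoch."""
    return initial_lr * gamma ** (epoch // step_size)


scheduler_params = {"step_size": 10, "gamma": 0.9}  # hypothetical values
training_lr = 0.02                                  # hypothetical value

schedule = [step_lr(training_lr, epoch=e, **scheduler_params) for e in (0, 9, 10, 25)]
print(schedule)
```

With these values the rate stays at 0.02 for the first ten epochs, drops to about 0.018 at epoch 10, and to about 0.0162 by epoch 25.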

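Looking at the metrics

from sklearn.metrics import f1_score, precision_score, recall_score

# loaded_clf is a trained TabNetClassifier, e.g. clf from the previous step or a model restored from disk
result = loaded_clf.predict(X_test.values)
accuracy = (result == y_test.values).mean()
precision = precision_score(y_test.values, result)
recall = recall_score(y_test.values, result)
f1 = f1_score(y_test.values, result)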
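
For reference, the four metrics can be computed by hand from the confusion counts. The sketch below does so in plain Python on made-up labels; the helper name binary_metrics is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data such as sepsis labels, accuracy alone can look good while recall is poor, which is why precision, recall, and F1 are reported alongside it.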
