Cardea

These details have been verified by PyPI

Maintainers

BFar furuicheng mit_dai_lab smish

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

Cardea

This library is under development. Please contact dai-lab@mit.edu or any of the contributors for more information.

License: MIT
Development Status: Pre-Alpha
Homepage: https://github.com/MLBazaar/Cardea
Documentation: https://MLBazaar.github.io/Cardea

Overview

Cardea is a machine learning library built on top of schemas that support electronic health records (EHR). The library uses a number of AutoML tools developed under The Human Data Interaction Project at the Data to AI Lab at MIT.

Our goal is to provide an easy-to-use library for developing machine learning models from electronic health records. Typical usage involves interacting with our API to develop prediction models.

A series of sequential processes is applied to build a machine learning model. These processes are triggered through our APIs and perform the following steps:

- Loading data using the automatic data assembler, which converts data from its raw format into an entityset representation.
- Data labeling, which creates label times that provide (1) the time index indicating the timespan over which features are computed and (2) the encoded labels of the prediction task. This step is essential for the feature engineering phase.
- Featurization, which automatically engineers features from the data to generate a feature matrix.
- Modeling, which builds, trains, and tunes the machine learning model using the modeling component.

To learn more about how we structure our machine learning process and our data structures, read our documentation at https://MLBazaar.github.io/Cardea.

Quickstart

Install with pip

The easiest and recommended way to install Cardea is using pip:

pip install cardea

This will pull and install the latest stable release from PyPI.

Quickstart

In this short tutorial we will guide you through a series of steps that will help you get started with Cardea.

First, load the core class to work with:

from cardea import Cardea

cardea = Cardea()

We then plug in our data. In this example, we load a pre-processed version of the Kaggle dataset Medical Appointment No Shows. To use this dataset, download the data and unzip it in the root directory, or run the command:

curl -O https://dai-cardea.s3.amazonaws.com/kaggle.zip && unzip -d kaggle kaggle.zip

To load the data, pass it to the loader using the following command:

cardea.load_entityset(data='kaggle')

Tip: To load local data, pass the folder path to data.

To verify that the data has been loaded, inspect the loaded entityset through cardea.es, which should output the following:

Entityset: kaggle
  Entities:
    Address [Rows: 81, Columns: 2]
    Appointment_Participant [Rows: 6100, Columns: 2]
    Appointment [Rows: 110527, Columns: 5]
    CodeableConcept [Rows: 4, Columns: 2]
    Coding [Rows: 3, Columns: 2]
    Identifier [Rows: 227151, Columns: 1]
    Observation [Rows: 110527, Columns: 3]
    Patient [Rows: 6100, Columns: 4]
    Reference [Rows: 6100, Columns: 1]
  Relationships:
    Appointment_Participant.actor -> Reference.identifier
    Appointment.participant -> Appointment_Participant.object_id
    CodeableConcept.coding -> Coding.object_id
    Observation.code -> CodeableConcept.object_id
    Observation.subject -> Reference.identifier
    Patient.address -> Address.object_id

The output shown represents the entityset data structure, where cardea.es is composed of entities and relationships. You can read more about entitysets in the documentation.
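As a rough illustration of that structure (not Cardea's or its underlying library's actual implementation), an entityset can be pictured as a set of named tables plus links from a child column to a parent column. In the sketch below, only the Patient.address -> Address.object_id relationship comes from the output above; the rows, the `city` and `identifier` columns, and the `dangling_references` helper are hypothetical.

```python
# Toy model of an entityset: named tables plus relationships between columns.
# The rows and the "city" / "identifier" columns are made up for illustration.
entities = {
    "Address": [
        {"object_id": 1, "city": "A"},
        {"object_id": 2, "city": "B"},
    ],
    "Patient": [
        {"identifier": 10, "address": 1},
        {"identifier": 11, "address": 2},
        {"identifier": 12, "address": 1},
    ],
}

# Each relationship is (child entity, child column, parent entity, parent column).
relationships = [("Patient", "address", "Address", "object_id")]


def dangling_references(entities, relationships):
    """Return (child, column, value) for child values with no matching parent row."""
    missing = []
    for child, child_col, parent, parent_col in relationships:
        parent_keys = {row[parent_col] for row in entities[parent]}
        for row in entities[child]:
            if row[child_col] not in parent_keys:
                missing.append((child, child_col, row[child_col]))
    return missing


print(dangling_references(entities, relationships))  # prints []
```

Relationships of this kind are what allow features to be aggregated across tables, for example summarizing appointments per patient.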

From there, you can select the prediction problem you aim to solve by specifying the name of its class, which in turn returns the label_times of the problem.

label_times = cardea.select_problem('MissedAppointment')

label_times summarizes, for each instance in the dataset, (1) its corresponding label and (2) the time index that indicates the timespan allowed for calculating features for that instance.

          cutoff_time  instance_id      label
0 2015-11-10 07:13:56      5030230     noshow
1 2015-12-03 08:17:28      5122866  fulfilled
2 2015-12-07 10:40:59      5134197  fulfilled
3 2015-12-07 10:42:42      5134220     noshow
4 2015-12-07 10:43:01      5134223     noshow

You can read more about label_times in the documentation.
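To make the role of the cutoff time concrete, here is a minimal sketch in plain Python. It assumes a toy event log and a hypothetical `count_prior_visits` helper, neither of which is part of Cardea, and a `patient` key that the real label_times table does not have. The point is only that a feature for an instance may use data up to its cutoff time and nothing later.

```python
from datetime import datetime

# Hypothetical event log: (patient id, timestamp of a past appointment).
events = [
    ("p1", datetime(2015, 11, 1, 9, 0)),
    ("p1", datetime(2015, 11, 9, 9, 0)),
    ("p1", datetime(2015, 12, 1, 9, 0)),
    ("p2", datetime(2015, 12, 5, 9, 0)),
]

# Toy label times: one cutoff per instance. The "patient" key is an
# assumption added so the example can look up that patient's events.
toy_label_times = [
    {"instance_id": 5030230, "patient": "p1",
     "cutoff_time": datetime(2015, 11, 10, 7, 13, 56), "label": "noshow"},
    {"instance_id": 5134197, "patient": "p2",
     "cutoff_time": datetime(2015, 12, 7, 10, 40, 59), "label": "fulfilled"},
]


def count_prior_visits(patient, cutoff_time):
    """Count a patient's events at or before the cutoff, ignoring later ones."""
    return sum(1 for pid, ts in events if pid == patient and ts <= cutoff_time)


features = {
    row["instance_id"]: count_prior_visits(row["patient"], row["cutoff_time"])
    for row in toy_label_times
}
print(features)  # {5030230: 2, 5134197: 1}
```

The December event for `p1` is excluded from the first instance because it happens after that instance's cutoff, which is how cutoff times guard against leaking future information into the features.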

Then, you can perform the AutoML steps and take advantage of Cardea.

Cardea extracts features through automated feature engineering, given the label_times pertaining to the problem you aim to solve:

feature_matrix = cardea.generate_features(label_times[:1000])

Warning: Featurizing the data might take a while depending on the size of the data. For demonstration, we only featurize the first 1000 records.

Once we have the features, we can split the data into training and testing sets:

y = list(feature_matrix.pop('label'))

X = feature_matrix.values

X_train, X_test, y_train, y_test = cardea.train_test_split(
    X, y, test_size=0.2, shuffle=True)
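How cardea.train_test_split is implemented is not covered here, although its arguments resemble those of scikit-learn's function of the same name. As a sketch of what a shuffled 80/20 split does conceptually, the following uses only the standard library; `shuffled_split` and the toy data are hypothetical.

```python
import random


def shuffled_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a `test_size` fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same positions so rows and labels stay aligned.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


toy_X = [[i, i * 2] for i in range(10)]
toy_y = [i % 2 for i in range(10)]
parts = shuffled_split(toy_X, toy_y)
print(len(parts[0]), len(parts[1]))  # 8 2
```

Shuffling before splitting matters when rows are ordered, for example by time or by label, since an unshuffled split could put all of one class in the test set.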

Now that the feature matrix is properly divided, we can use it to train our machine learning pipeline (Modeling), optimizing hyperparameters and finding the best model:

cardea.select_pipeline('Random Forest')
cardea.fit(X_train, y_train)
y_pred = cardea.predict(X_test)

Finally, you can evaluate the performance of the model:

cardea.evaluate(X, y, test_size=0.2, shuffle=True)

which returns the scoring metrics, depending on the type of problem:

{'Accuracy': 0.75, 
 'F1 Macro': 0.5098039215686274, 
 'Precision': 0.5183001719479243, 
 'Recall': 0.5123528436411872}
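An accuracy well above the macro-averaged scores, as in the output above, is consistent with imbalanced classes, where a model can score well on the majority class while doing poorly on the minority one. The sketch below shows how macro averages are commonly computed; `macro_scores` and the toy labels are hypothetical and are not Cardea's evaluation code.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus unweighted (macro) averages of per-class precision, recall, F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "Accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "F1 Macro": sum(f1s) / n,
        "Precision": sum(precisions) / n,
        "Recall": sum(recalls) / n,
    }


# Toy case: 8 of 10 appointments are fulfilled and the model always
# predicts "fulfilled".
toy_true = ["fulfilled"] * 8 + ["noshow"] * 2
toy_pred = ["fulfilled"] * 10
scores = macro_scores(toy_true, toy_pred)
print(scores["Accuracy"], scores["Recall"])  # 0.8 0.5
```

In this toy case the model never identifies a no-show, yet accuracy is still 0.8; the macro recall of 0.5 exposes the gap because each class counts equally regardless of its size.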

Citation

If you use Cardea for your research, please consider citing the following paper:

Sarah Alnegheimish; Najat Alrashed; Faisal Aleissa; Shahad Althobaiti; Dongyu Liu; Mansour Alsaleh; Kalyan Veeramachaneni. Cardea: An Open Automated Machine Learning Framework for Electronic Health Records. IEEE DSAA 2020.

@inproceedings{alnegheimish2020cardea,
  title={Cardea: An Open Automated Machine Learning Framework for Electronic Health Records},
  author={Alnegheimish, Sarah and Alrashed, Najat and Aleissa, Faisal and Althobaiti, Shahad and Liu, Dongyu and Alsaleh, Mansour and Veeramachaneni, Kalyan},
  booktitle={2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)},
  pages={536--545},
  year={2020},
  organization={IEEE}
}

History

0.1.2 - 2021-02-19

New Modeler component

Invalid default metric name - Issue #82 by @ChengFR
Parameter 'presort' in 'sklearn.ensemble.GradientBoostingClassifier' has been deprecated since sklearn v0.22 - Issue #80 by @ChengFR
Loading demo data in either FHIR or MIMIC format - Issue #79 by @sarahmish
Fix a customized primitive: Categorizer - Issue #75 by @ChengFR
Update Cardea Class - Issue #73 by @sarahmish
Clean up the modeler - Issue #71 by @ChengFR
Update and clean up the dependencies - Issue #70 by @ChengFR

0.1.1 - 2020-12-11

Benchmark framework

Link google colab to Cardea and add badge README.md - Issue #67 by @sarahmish
Modeler load pipelines instead of lists of primitives enhancement - Issue #65 by @ChengFR
Benchmark testing apis enhancement - Issue #64 by @ChengFR
Update documentation theme enhancement - Issue #62 by @sarahmish
Primitive setup enhancement - Issue #61 by @sarahmish & @ChengFR

0.1.0 - 2020-09-15

Release on PyPI: https://pypi.org/project/cardea/

Analysis notebooks enhancement - Issue #58 by @sarahmish
MIMIC III data loader enhancement - Issue #57 by @sarahmish
Freeze package on analysis compatibility - Issue #55 by @sarahmish

Release history

0.1.2 - Feb 19, 2021 (this version)
0.1.2.dev1 - Feb 19, 2021 (pre-release)
0.1.2.dev0 - Feb 19, 2021 (pre-release)
0.1.1 - Dec 11, 2020
0.1.1.dev0 - Dec 11, 2020 (pre-release)
0.1.0 - Sep 15, 2020
0.1.0.dev3 - Sep 14, 2020 (pre-release)
0.1.0.dev2 - Sep 14, 2020 (pre-release)
0.1.0.dev1 - Sep 14, 2020 (pre-release)
0.0.2 - Mar 20, 2019
0.0.1 - Sep 19, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cardea-0.1.2.tar.gz (421.6 kB)

Uploaded Feb 19, 2021 Source

Built Distribution

cardea-0.1.2-py2.py3-none-any.whl (401.0 kB)

Uploaded Feb 19, 2021 Python 2 Python 3

File details

Details for the file cardea-0.1.2.tar.gz.

File metadata

Download URL: cardea-0.1.2.tar.gz
Upload date: Feb 19, 2021
Size: 421.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0.post20200309 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.6.10

File hashes

Hashes for cardea-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`699204f2d0ca73b34b17d2b87b8e650d61e2a96899eaca520a5847b2751417ee`
MD5	`885c4797dacdc1e3836cf3d9062b3b36`
BLAKE2b-256	`799e4099b2dc4fb04e7f581cfd6cee12ee1405c67c6c90a0a0d3be89f43ada68`
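One way to check a downloaded file against the SHA256 digest listed above is to hash it locally with Python's standard hashlib module. This is a minimal sketch: the `sha256_of` helper is hypothetical, and the local path assumes the archive was saved to the current directory under its original name.

```python
import hashlib
import os

# SHA256 digest copied from the table above for cardea-0.1.2.tar.gz.
EXPECTED_SHA256 = "699204f2d0ca73b34b17d2b87b8e650d61e2a96899eaca520a5847b2751417ee"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


path = "cardea-0.1.2.tar.gz"  # assumed download location
if os.path.exists(path):
    print("match" if sha256_of(path) == EXPECTED_SHA256 else "MISMATCH")
else:
    print("file not found; download it first")
```

A mismatch means the file differs from the one uploaded to PyPI and should not be installed.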

File details

Details for the file cardea-0.1.2-py2.py3-none-any.whl.

File metadata

Download URL: cardea-0.1.2-py2.py3-none-any.whl
Upload date: Feb 19, 2021
Size: 401.0 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0.post20200309 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.6.10

File hashes

Hashes for cardea-0.1.2-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`4313618624f8f99b02a359bd63f757c36388abb5ce86ff25c8dbbafed3635130`
MD5	`c868ad407ff456780450152f5b286772`
BLAKE2b-256	`1cd178ff3df13c5aeb952524a63821b3adbbf4bef92b4fca93928f4d07502c80`