Implementation of event-based models for degenerative diseases.
EBM
This is the Python package for implementing Event Based Models for Disease Progression.
Installation
pip install alabebm
Change Log
- 2025-02-26 (V 0.3.4)
  - Modified the shuffle_order function to ensure full derangement, making convergence faster.
- 2025-03-06 (V 0.4.0)
  - Use pyproject.toml instead.
  - Update conjuage_priors_algo.py, now without using the auxiliary variable of participant_stages. Kept the uncertainties just like in soft_kmeans_algo.py.
- 2025-03-07 (V 0.4.2)
  - Compute new_ln_likelihood_new_theta_phi based on new_theta_phi_estimates, which is based on stage_likelihoods_posteriors, which in turn is based on the newly proposed order and the previous theta_phi_estimates.
  - Update theta_phi_estimates with new_theta_phi_estimates only if the new order is accepted.
  - The fallback theta_phi_estimates is the previous parameters rather than theta_phi_default.
  - Use all_accepted_orders.append(current_order_dict.copy()) to make sure the results are not mutated.
  - Previously, I calculated new_ln_likelihood and stage_likelihoods_posteriors based on the newly proposed order and the previous theta_phi_estimates, and directly updated theta_phi_estimates whether or not the new order was accepted.
  - Previously, I left out copy() in all_accepted_orders.append(current_order_dict.copy()), which was inaccurate.
- 2025-03-17 (V 0.4.3)
  - Added skip and title_detail parameters in the save_traceplot function.
- 2025-03-18 (V 0.4.4)
  - Add optional horizontal bar indicating upper limit in trace plot.
- 2025-03-18 (V 0.4.7)
  - Allowed keeping all cols (keep_all_cols) in data generation.
- 2025-03-18 (V 0.4.9)
  - Copy data_we_have and use data_we_have.loc[:, 'S_n'] in soft kmeans algo when preprocessing participant and biomarker data.
- 2025-03-20 (V 0.5.1)
  - In hard kmeans, updated delta = ln_likelihood - current_ln_likelihood, and in soft kmeans and conjugate priors, made sure I am using delta = new_ln_likelihood_new_theta_phi - current_ln_likelihood.
  - In each iteration, use theta_phi_estimates = theta_phi_default.copy() first. This means stage_likelihoods_posteriors is based on the default theta_phi, not the previous iteration.
- 2025-03-21 (V 0.6.0)
  - Integrated all three algorithms into just one file, algorithms/algorithm.py.
  - Changed the algorithm name of soft_kmeans to mle (maximum likelihood estimation).
  - Moved all helper functions from the algorithm script to utils/data_processing.py.
- 2025-03-22 (V 0.7.6)
  - Current state should include both the current accepted order and its associated theta/phi. When updating theta/phi at the start of each iteration, use the current state's theta/phi (1) in the calculation of stage likelihoods, (2) as the fallback if either of the biomarker's clusters is empty or has only one measurement, and (3) as the prior mean and variance.
  - Set conjugate_priors as the default algorithm.
  - (Tried using the cluster's mean and var as the prior, but the results are not as good as using the current state's theta/phi as the prior.)
- 2025-03-24 (V 0.7.8)
  - In heatmap, reorder the biomarkers according to the most likely order.
  - In results.json, list the biomarkers according to their order rather than alphabetically.
  - Modified obtain_most_likely_order_dic so that we assign stages for biomarkers that have the highest probabilities first.
  - In results.json, output the order associated with the highest total log likelihood. Also, calculate the Kendall's tau and p value between it and the original order (if provided).
- 2025-03-25 (V 0.8.1)
  - In heatmap, reorder according to the order with highest log likelihood. Also, add the number just like (1).
  - Able to add title detail to heatmaps and traceplots.
  - Able to add fname_prefix in run_ebm().
- 2025-03-29 (V 0.8.9)
  - Added em algorithm.
  - Added Dirichlet-Multinomial Model to describe uncertainty of stage distribution (a multinomial distribution of all disease stages, because we cannot always assume all disease stages are equally likely).
  - prior_v default set to be 1.
  - Default to use Dirichlet distribution instead of uniform distribution.
  - Change data filename from 50|100_1 to 50_100_1.
  - Modified the mle algorithm to make sure the output does not contain np.nan (by using the fallback).
- 2025-03-30 (V 0.9.2)
  - Completely changed generate_data.py. Now incorporates the modified data generation model based on DEBM2019.
  - Rank the original order by the value (ascending), if original order exists.
  - Able to skip saving traceplots and/or heatmaps.
- 2025-03-31 (V 0.9.4)
  - Able to store final theta phi estimates and the final stage likelihood posterior to results.json.
- 2025-04-02 (V 0.9.5)
  - Added kde algorithm.
  - Initial kmeans used seeded Kmeans + conjugate priors.
- 2025-04-03 (V 0.9.7)
  - Improved kde.
  - Added Dirichlet and beta parameters randomization.
- 2025-04-05 (V 0.9.9)
  - Updated generate_data.py to align with the experimental design.
- 2025-04-06 (V 0.9.9.3)
  - Make kmeans.py more robust. Now try 100 times to randomize the assignment for the diseased group if the initial kmeans failed.
  - Add algorithm and ml_order in results.json.
  - After generating data, create true_order_and_stages_dict to store all filenames' true biomarker order and all participants' stages. For continuous kjs, use the bisect_right algorithm to get the ranking order.
- 2025-04-07 (V 0.9.9.7)
  - Modified generate_data.py to allow Experiment 9.
  - In generate_data.py, added the function of randomly flipping the direction of progression in the sigmoid model. Also made sure this random direction is consistent across participants.
  - In the spirit of "Do not repeat yourself", deleted "R" and "rho" in params.json. Instead, compute them each time I generate data. The time difference is minimal.
  - Added comparisons between true stages and most likely stages.
  - Reorganized the results.json.
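The 0.3.4, 0.4.2 and 0.5.1 entries above describe three pieces of the sampler: a proposal that fully deranges the shuffled positions, an acceptance test on delta, and updating theta/phi only when the proposal is accepted. The following is a minimal sketch of how those pieces could fit together, not the package's actual code. The shuffle_order body, metropolis_step, toy_ln_likelihood, and the layout of the state dict are all assumptions made for illustration, and the toy likelihood is invented so the example can run on its own.

```python
import math
import random


def shuffle_order(order, n_shuffle, rng):
    """Return a copy of `order` where `n_shuffle` random positions are
    permuted so that none of them keeps its old value (a derangement)."""
    if not 2 <= n_shuffle <= len(order):
        raise ValueError("n_shuffle must be between 2 and len(order)")
    positions = rng.sample(range(len(order)), n_shuffle)
    old_values = [order[i] for i in positions]
    new_values = old_values[:]
    # Rejection sampling: reshuffle until every selected position changed.
    while any(a == b for a, b in zip(old_values, new_values)):
        rng.shuffle(new_values)
    proposed = list(order)
    for pos, value in zip(positions, new_values):
        proposed[pos] = value
    return proposed


def metropolis_step(state, ln_likelihood_fn, n_shuffle, rng):
    """One accept/reject step; theta/phi change only if the order is accepted."""
    proposed_order = shuffle_order(state["order"], n_shuffle, rng)
    new_ln_likelihood, new_theta_phi = ln_likelihood_fn(
        proposed_order, state["theta_phi"]
    )
    delta = new_ln_likelihood - state["ln_likelihood"]
    # Accept with probability min(1, exp(delta)).
    if delta >= 0 or rng.random() < math.exp(delta):
        accepted_state = {
            "order": proposed_order,
            "theta_phi": new_theta_phi,
            "ln_likelihood": new_ln_likelihood,
        }
        return accepted_state, True
    return state, False


def toy_ln_likelihood(order, theta_phi):
    """Invented stand-in: orders closer to 1..n score higher."""
    score = -sum(abs(value - (idx + 1)) for idx, value in enumerate(order))
    return float(score), dict(theta_phi)


rng = random.Random(0)
state = {"order": [3, 1, 2, 5, 4], "theta_phi": {}, "ln_likelihood": 0.0}
state["ln_likelihood"], _ = toy_ln_likelihood(state["order"], state["theta_phi"])

all_accepted_orders = []
for _ in range(200):
    state, accepted = metropolis_step(state, toy_ln_likelihood, 2, rng)
    if accepted:
        # Store a copy so later steps cannot mutate the recorded order.
        all_accepted_orders.append(list(state["order"]))

print(len(all_accepted_orders), state["order"])
```

Returning the untouched state on rejection is what keeps the previous theta/phi as the fallback for the next iteration.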
Generate Random Data
from alabebm import generate, get_params_path, get_biomarker_order_path
import json

# Get path to default parameters
params_file = get_params_path()

# Get path to biomarker_order and load it
biomarker_order_json = get_biomarker_order_path()
with open(biomarker_order_json, 'r') as file:
    biomarker_order = json.load(file)

generate(
    biomarker_order=biomarker_order,
    real_theta_phi_file=params_file,  # Use default parameters
    js=[50, 100],
    rs=[0.1, 0.5],
    num_of_datasets_per_combination=2,
    output_dir='my_data',
    seed=None,
    prefix=None,
    suffix=None,
    keep_all_cols=False
)
Run MCMC Algorithms
from alabebm import run_ebm
from alabebm.data import get_sample_data_path
import os

print("Current Working Directory:", os.getcwd())

# 'soft_kmeans' was renamed to 'mle' in V 0.6.0 (see Change Log)
for algorithm in ['mle', 'conjugate_priors', 'hard_kmeans']:
    results = run_ebm(
        data_file=get_sample_data_path('25|50_10.csv'),  # Use the path helper
        algorithm=algorithm,
        n_iter=2000,
        n_shuffle=2,
        burn_in=1000,
        thinning=20,
        correct_ordering=None,
        plot_title_detail="",
    )
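To give a feel for what burn_in and thinning do to the n_iter iterations, here is a small sketch of the conventional post-processing: discard the first burn_in iterations, then keep every thinning-th one. The helper select_samples is hypothetical, and the package may use a different indexing convention internally.

```python
def select_samples(all_samples, burn_in, thinning):
    """Drop the first `burn_in` samples, then keep every `thinning`-th one."""
    if burn_in < 0 or thinning < 1:
        raise ValueError("burn_in must be >= 0 and thinning must be >= 1")
    return all_samples[burn_in::thinning]


# Same settings as the run_ebm call above; the "samples" are just
# iteration indices so the effect of the slicing is easy to see.
n_iter, burn_in, thinning = 2000, 1000, 20
kept = select_samples(list(range(n_iter)), burn_in, thinning)
print(len(kept), kept[0], kept[-1])  # 50 1000 1980
```

Under this convention, the settings above leave 50 orders from which the most likely order would be derived.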
Interpreting the results
After running the algorithm, you'll get the results in the folder named after the algorithm, for example conjugate_priors, including:
- heatmaps contains the heatmap. Note that the number following each biomarker, such as (1), indicates the position of this biomarker in the order that is associated with the highest likelihood. (You can see the traceplots folder for the likelihood history.)
- records contains the logging information of the algorithm.
- traceplots contains the traceplots of the log likelihood trajectory.
- results contains json files. Example of a result json:
{
    "n_iter": 200,
    "most_likely_order": {
        "HIP-FCI": 1,
        "PCC-FCI": 2,
        "FUS-GMI": 3,
        "P-Tau": 4,
        "AB": 5,
        "HIP-GMI": 6,
        "MMSE": 7,
        "ADAS": 8,
        "AVLT-Sum": 9,
        "FUS-FCI": 10
    },
    "kendalls_tau": 0.6,
    "p_value": 0.016666115520282188,
    "original_order": {
        "HIP-FCI": 1,
        "PCC-FCI": 2,
        "AB": 3,
        "P-Tau": 4,
        "MMSE": 5,
        "ADAS": 6,
        "HIP-GMI": 7,
        "AVLT-Sum": 8,
        "FUS-GMI": 9,
        "FUS-FCI": 10
    },
    "order_with_higest_ll": {
        "HIP-FCI": 1,
        "PCC-FCI": 2,
        "FUS-GMI": 3,
        "AB": 4,
        "P-Tau": 5,
        "HIP-GMI": 6,
        "MMSE": 7,
        "ADAS": 8,
        "AVLT-Sum": 9,
        "FUS-FCI": 10
    },
    "kendalls_tau2": 0.6444444444444444,
    "p_value2": 0.009148478835978836
}
n_iter is the number of iterations. most_likely_order is the most likely order across all iteration results, after applying burn-in and thinning. kendalls_tau and p_value compare the most likely order with the original order (if provided). order_with_higest_ll is the order associated with the highest log likelihood. kendalls_tau2 and p_value2 compare order_with_higest_ll with the original order (if provided).
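The two tau values in the example can be checked by hand. The sketch below computes Kendall's tau for rankings without ties in plain Python. The function kendalls_tau is written here for illustration and is not the package's implementation; it does not compute the p value, for which scipy.stats.kendalltau would be one option.

```python
from itertools import combinations


def kendalls_tau(order_a, order_b):
    """Kendall's tau for two rankings of the same items, assuming no ties."""
    if set(order_a) != set(order_b):
        raise ValueError("both orders must rank the same items")
    if len(order_a) < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # Positive product: the pair is ranked the same way in both orders.
        sign = (order_a[x] - order_a[y]) * (order_b[x] - order_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(order_a) * (len(order_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Orders copied from the example results.json above.
most_likely_order = {
    "HIP-FCI": 1, "PCC-FCI": 2, "FUS-GMI": 3, "P-Tau": 4, "AB": 5,
    "HIP-GMI": 6, "MMSE": 7, "ADAS": 8, "AVLT-Sum": 9, "FUS-FCI": 10,
}
original_order = {
    "HIP-FCI": 1, "PCC-FCI": 2, "AB": 3, "P-Tau": 4, "MMSE": 5,
    "ADAS": 6, "HIP-GMI": 7, "AVLT-Sum": 8, "FUS-GMI": 9, "FUS-FCI": 10,
}
order_with_higest_ll = {
    "HIP-FCI": 1, "PCC-FCI": 2, "FUS-GMI": 3, "AB": 4, "P-Tau": 5,
    "HIP-GMI": 6, "MMSE": 7, "ADAS": 8, "AVLT-Sum": 9, "FUS-FCI": 10,
}

tau = kendalls_tau(most_likely_order, original_order)
tau2 = kendalls_tau(order_with_higest_ll, original_order)
print(round(tau, 4), round(tau2, 4))  # 0.6 0.6444
```

Of the 45 biomarker pairs, 9 are discordant for the most likely order and 8 for the order with the highest log likelihood, which gives 27/45 and 29/45 respectively.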
Input data
The input data should have at least four columns:
- participant: int
- biomarker: str
- measurement: float
- diseased: bool
An example is https://raw.githubusercontent.com/hongtaoh/alabEBM/refs/heads/main/alabEBM/tests/my_data/10%7C100_0.csv
The data should be in a tidy format, i.e.,
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
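As a quick sanity check before running the algorithms, the required columns and types can be verified with the standard library alone. This is a sketch under assumptions: check_tidy_rows is a hypothetical helper, the sample rows are made up for illustration, and it assumes booleans are written as True/False in the CSV.

```python
import csv
import io

REQUIRED_COLUMNS = {"participant", "biomarker", "measurement", "diseased"}


def check_tidy_rows(csv_text):
    """Parse tidy CSV text and coerce the four required columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        rows.append({
            "participant": int(row["participant"]),
            "biomarker": row["biomarker"],
            "measurement": float(row["measurement"]),
            # Assumption: booleans are serialized as "True" / "False".
            "diseased": row["diseased"].strip().lower() == "true",
        })
    return rows


# Made-up rows: one row per (participant, biomarker) observation.
sample = """participant,biomarker,measurement,diseased
0,MMSE,28.0,False
0,ADAS,6.5,False
1,MMSE,21.5,True
1,ADAS,19.0,True
"""

rows = check_tidy_rows(sample)
print(len(rows), rows[2])
```

Each participant appears once per biomarker, so a dataset with J participants and N biomarkers has J × N rows.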
Features
- Multiple MCMC algorithms:
  - Conjugate Priors
  - Hard K-means
  - MLE
- Data generation utilities
- Extensive logging