Stage-Aware Event-Based Modeling for Disease Progression
Project description
fastsaebm
fastsaebm is a Python package for generating and analyzing biomarker data using Stage-Aware Event-Based Modeling (SA-EBM). It supports various data generation experiments and EBM algorithms to estimate biomarker orderings and disease stages. This package is designed for researchers and data scientists working with biomarker progression analysis.
For detailed methodology, refer to our paper.
Installation
Install fastsaebm using pip:
pip install fastsaebm
Ensure you have Python 3.8+ and the required dependencies installed. For a full list of dependencies, see requirements.txt.
Data generation
Examples of how to generate data are at ./pysaebm/test/gen.py.
Because in each generation, the ordering is randomized, you will see a true_order_and_stages.json that tells you the corresponding true stages and true order for each output csv file.
The source codes for data generation can be seen in ./pysaebm/data_generate.py.
This is the full generate parameters:
def generate(
experiment_name: str = "sn_kjOrdinalDM_xnjNormal",
params_file: str = 'params.json',
js: List[int] = [50, 200, 500, 1000],
rs: List[float] = [0.1, 0.25, 0.5, 0.75, 0.9],
num_of_datasets_per_combination: int = 50,
output_dir: str = 'data',
seed: Optional[int] = None,
dirichlet_alpha: Optional[Dict[str, List[float]]] = {
'uniform': [100],
'multinomial': [0.4013728324975898,
1.0910444770153345,
2.30974117596663,
3.8081194066281103,
4.889722107892335,
4.889722107892335,
3.8081194066281103,
2.30974117596663,
1.0910444770153345,
0.4013728324975898]
},
beta_params: Dict[str, Dict[str, float]] = {
'near_normal': {'alpha': 2.0, 'beta': 2.0},
'uniform': {'alpha': 1, 'beta': 1},
'regular': {'alpha': 5, 'beta': 2}
},
prefix: Optional[str] = None,
suffix: Optional[str] = None,
keep_all_cols: bool = False,
fixed_biomarker_order: bool = False,
noise_std_parameter: float = 0.05,
) -> Dict[str, Dict[str, int]]:
Explanations:
experiment_nameshould be one of the these:
experiment_names = [
"sn_kjOrdinalDM_xnjNormal", # Experiment 1: Ordinal kj with Dirichlet-Multinomial, Normal Xnj
"sn_kjOrdinalDM_xnjNonNormal", # Experiment 2: Ordinal kj with Dirichlet-Multinomial, Non-Normal Xnj
"sn_kjOrdinalUniform_xnjNormal", # Experiment 3: Ordinal kj with Uniform distribution, Normal Xnj
"sn_kjOrdinalUniform_xnjNonNormal", # Experiment 4: Ordinal kj with Uniform distribution, Non-Normal Xnj
"sn_kjContinuousUniform", # Experiment 5: Continuous kj with Uniform distribution
"sn_kjContinuousBeta", # Experiment 6: Continuous kj with Beta distribution
"xiNearNormal_kjContinuousUniform", # Experiment 7: Near-normal Xi with Continuous Uniform kj
"xiNearNormal_kjContinuousBeta", # Experiment 8: Near-normal Xi with Continuous Beta kj
"xiNearNormalWithNoise_kjContinuousBeta", # Experiment 9: Same as Exp 8 but with noises to xi
]
You can find the explanation to these terms from our paper.
params_file: The path to the parameters in json. Example is ./pysaebm/data/params.json. You should specify each biomarker'stheta_mean,theta_std,phi_mean, andphi_std.js: An array of integers indicating the number of participants you want.rs: An array of floats indicating the number of healthy ratios.num_of_datasets_per_combination: The number of repetitions for each j-r combination.output_dir: The directory where you want to save the generated data.seed: An integer serving as the seed for the randomness.dirichlet_alpha: This should be a dictionary where keys areuniformandmultinomial. They correspond tokjOrdinalUniformandkjOrdinalDM.beta_params: A dictionary where keys arenear_normal,uniform, andregular, corresponding toxiNearNormal,kjContinuousUniformandkjContinuousBeta.prefix: Optional prefix for the output csv file names.suffix: Optional suffix for the output csv file names.keep_all_cols: Whether to include additional metadata columns (k_j, event_time, affected)- fixed_biomarker_order: If True, will use the order as in the
params_file. If False, will randomize the ordering. - noise_std_parameter: the parameter in N(0, N \cdot noise_std_parameter) in experiment 9
Note that you need to make sure the dirichlet_alpha['multinomial'] has the same length as your params dict (as in your params_file).
Run EBM Algorithms
Examples of how to run algorithms and get results is at ./pysaebm/test/test.py.
This explains the parameters well enough:
def run_ebm(
algorithm: str,
data_file: str,
output_dir: str,
output_folder: Optional[str] = None,
n_iter: int = 2000,
n_shuffle: int = 2,
burn_in: int = 500,
thinning: int = 1,
true_order_dict: Optional[Dict[str, int]] = None,
true_stages: Optional[List[int]] = None,
plot_title_detail: Optional[str] = "",
fname_prefix: Optional[str] = "",
skip_heatmap: Optional[bool] = False,
skip_traceplot: Optional[bool] = False,
# Strength of the prior belief in prior estimate of the mean (μ), set to 1 as default
prior_n: float = 1.0,
# Prior degrees of freedom, influencing the certainty of prior estimate of the variance (σ²), set to 1 as default
prior_v: float = 1.0,
bw_method: str = 'scott',
seed: int = 123,
) -> Dict[str, Union[str, int, float, Dict, List]]:
"""
Run the metropolis hastings algorithm and save results
Args:
algorithm (str): Choose from 'hard_kmeans', 'mle', 'em', 'kde', and 'conjugate_priors'.
data_file (str): Path to the input CSV file with biomarker data.
output_dir (str): Path to the directory to store all the results.
output_folder (str): Optional. If not provided, all results will be saved to output_dir/algorithm.
If provided, results will be saved to output_dir/output_folder
n_iter (int): Number of iterations for the Metropolis-Hastings algorithm.
n_shuffle (int): Number of shuffles per iteration.
burn_in (int): Burn-in period for the MCMC chain.
thinning (int): Thinning interval for the MCMC chain.
true_order_dict (Optional[Dict[str, int]]): biomarker name: the correct order of it (if known)
true_stages (Optional[List[int]]): true stages for all participants (if known)
plot_title_detail (Optional[str]): optional string to add to plot title, as suffix.
fname_prefix (Optional[str]): the prefix of heatmap, traceplot, results.json, and logs file, e.g., 5_50_0_heatmap_conjugate_priors.png
In the example, there are no prefix strings.
skip_heatmap (Optional[bool]): whether to save heatmaps. True if you want to skip saving heatmaps and save space.
skip_traceplot (Optional[bool]): whether to save traceplots. True if you want to skip saving traceplots and save space.
prior_n (strength of belief in prior of mean): default to be 1.0
prior_v (prior degree of freedom) are the weakly informative priors, default to be 1.0
bw_method (str): bandwidth selection method in kde
seed (int): for reproducibility
Returns:
Dict[str, Union[str, int, float, Dict, List]]: Results including everything, e.g., Kendall's tau and p-value.
"""
Some extra explanations:
n_iter: In general, above 2k is recommended. 10k should be sufficient if you have <= 10 biomarkers.burn_inandthinning: The idea behind the two parameters is that we will only use some of the results from all iterations inn_iter. We will do this: if(i > burn_in) & (i % thinning == 0), then we will use the result from that iterationi. Usually, we setthinningto be 1.
After running the run_ebm, you'll see a folder named as your output_dir. Each algorithm will have its subfolders.
The results are organized this way:
recordsfolder contains the loggings.heatmapsfolder contains all the heatmaps. An example is
Biomarkers in the y-axis are ranked according to the ordering that has the highest likelihood among all iterations. Each cell indicates the probability that a certain biomarker falls in a certain stage. Note, however, these probabilities are calculated based on all the iterations that satisfy (i > burn_in) & (i % thinning == 0).
In the heatmap, the sum of each col and each row is 1.
traceplotsfolder contains all the traceplots (starting from iteration 40, not iteration 0). Those plots will be useful to diagnose whether EBM algorithms are working correctly. It's totally okay for the plots to show fluctuation (because biomarker distribution and stage distributions are re-calculated each iterations). You should not, however, see a clear downward trend.
resultsfolder contains all important results injson. Each file contains
results = {
"algorithm": algorithm,
"runtime": end_time - start_time,
"N_MCMC": n_iter,
"n_shuffle": n_shuffle,
"burn_in": burn_in,
"thinning": thinning,
'healthy_ratio': healthy_ratio,
"max_log_likelihood": float(max(log_likelihoods)),
"kendalls_tau": tau,
"p_value": p_value,
"mean_absolute_error": mae,
'current_pi': current_pi.tolist(),
# updated pi is the pi for all stages, including 0
'updated_pi': updated_pi.tolist(),
'true_order': true_order_result,
"order_with_highest_ll": {k: int(v) for k, v in zip(biomarker_names, order_with_highest_ll)},
"true_stages": true_stages,
'ml_stages': ml_stages,
"stage_likelihood_posterior": final_stage_post_dict,
"final_theta_phi_params": final_theta_phi_dict,
}
algorithmis the algorithm you are running with.runtimein seconds.N_MCMCis the number of MCMC iterations.n_shuffleis how many biomarkers to shuffle places in Metropolis Hastings. Default is 2.burn_inandthinningare explained above.healthy_ratiois the percentage of non-diseased participants in the dataset.max_log_likelihoodis the max data log likelihood. The scriptmhgeneratesall_accepted_orders,log_likelihoods,current_theta_phi,current_stage_post, andcurrent_pi.max_log_likelihoodsis the max oflog_likelihoods.kendalls_tauandp_valueare the result of comparing theorder_with_highest_llwith the true order (if provided).mean_absolute_erroris the result of comparingml_stagesandtrue_stages(if provided). This measurement is for when we include estimation of healthy participants as well. That is to say, we estimate their disease stages without knowing they are healthy.current_piis the probability distribution of all disease stages.current_piis the probability distribution of all stages, including 0.true_orderis the dictionary explaining the stage each biomarker is at.order_with_highest_llis the ordering that corresponds to themax_log_likelihoods.true_stagesis an array of each participant's disease stage.ml_stagesis the array of most likely disease stages.true_stages_diseasedis the array of actual stages for diseased participants only.'ml_stages_diseasedis the array of the estimated stages for diseased participants only.stage_likelihood_posterioris a dictionary detailing each participant's probability in each of the possible stages.final_theta_phi_paramsis a dictionary detailing the parameters for each biomarker. If you are usingkde, you'll seetheta_weightsandphi_weights. If you use other algorithms, you will see thetheta_mean,theta_std,phi_meanandphi_std.
Use your own data
You are more than welcome to use your own data. After all, the very purpose of pysaebm is: to allow you to analyze your own data. However, you do have to make sure that the input data have at least four columns:
- participant: int
- biomarker: str
- measurement: float
- diseased: bool
The participant column should be integers from 0 to J-1 where J is the number of participants.
Samples are available at ./pysaebm/data/samples/.
The data should be in a tidy format, i.e.,
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
Change Log
- 2025-08-03 (V 0.0.6)
- Implemented the numpy version.
- Implemented numba in functions.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fastsaebm-0.0.6.tar.gz.
File metadata
- Download URL: fastsaebm-0.0.6.tar.gz
- Upload date:
- Size: 54.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ff596dc473c93b9ef092a054d8081d0b73c9956f5509d9e298159d8bfe415ffe
|
|
| MD5 |
fb58feef383627281ca3758804144e0e
|
|
| BLAKE2b-256 |
d3eab9fa7d9eef1c847d6d9860e0a0bf0d22b4182853c8adcf40f39f677142bf
|
File details
Details for the file fastsaebm-0.0.6-py3-none-any.whl.
File metadata
- Download URL: fastsaebm-0.0.6-py3-none-any.whl
- Upload date:
- Size: 52.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1b3eaaebdfc4f2ac1eb087d6049b46c0b536c5209867916aab8ff1ae21a9d9da
|
|
| MD5 |
4bf15d74f3c04e66801c3c31276448b6
|
|
| BLAKE2b-256 |
8e52781074d89f005d86816be672e5e6094d4df8146e6d5f88986205c763b5b4
|