Stage-Aware Event-Based Modeling for Disease Progression

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

`pysaebm`

pysaebm is a Python package for generating and analyzing biomarker data using Stage-Aware Event-Based Modeling (SA-EBM). It supports various data generation experiments and EBM algorithms to estimate biomarker orderings and disease stages. This package is designed for researchers and data scientists working with biomarker progression analysis.

For detailed methodology, refer to our paper.

Installation

Install pysaebm using pip:

pip install pysaebm

Ensure you have Python 3.8+ and the required dependencies installed. For a full list of dependencies, see requirements.txt.

Data generation

Examples of how to generate data are at ./pysaebm/test/gen.py.

Because in each generation, the ordering is randomized, you will see a true_order_and_stages.json that tells you the corresponding true stages and true order for each output csv file.

The source codes for data generation can be seen in ./pysaebm/data_generate.py.

This is the full generate parameters:

def generate(
   experiment_name: str = "sn_kjOrdinalDM_xnjNormal",
   params_file: str = 'params.json',
   js: List[int] = [50, 200, 500, 1000],
   rs: List[float] = [0.1, 0.25, 0.5, 0.75, 0.9],
   num_of_datasets_per_combination: int = 50,
   output_dir: str = 'data',
   seed: Optional[int] = None,
   dirichlet_alpha: Optional[Dict[str, List[float]]] = {
       'uniform': [100],
       'multinomial': [0.4013728324975898,
                       1.0910444770153345,
                       2.30974117596663,
                       3.8081194066281103,
                       4.889722107892335,
                       4.889722107892335,
                       3.8081194066281103,
                       2.30974117596663,
                       1.0910444770153345,
                       0.4013728324975898]
   },
   beta_params: Dict[str, Dict[str, float]] = {
       'near_normal': {'alpha': 2.0, 'beta': 2.0},
       'uniform': {'alpha': 1, 'beta': 1},
       'regular': {'alpha': 5, 'beta': 2}
   },
   prefix: Optional[str] = None,
   suffix: Optional[str] = None,
   keep_all_cols: bool = False,
   fixed_biomarker_order: bool = False,
   noise_std_parameter: float = 0.05,
) -> Dict[str, Dict[str, int]]:

Explanations:

experiment_name should be one of the these:

experiment_names = [
  "sn_kjOrdinalDM_xnjNormal",     # Experiment 1: Ordinal kj with Dirichlet-Multinomial, Normal Xnj
  "sn_kjOrdinalDM_xnjNonNormal",  # Experiment 2: Ordinal kj with Dirichlet-Multinomial, Non-Normal Xnj
  "sn_kjOrdinalUniform_xnjNormal", # Experiment 3: Ordinal kj with Uniform distribution, Normal Xnj
  "sn_kjOrdinalUniform_xnjNonNormal", # Experiment 4: Ordinal kj with Uniform distribution, Non-Normal Xnj
  "sn_kjContinuousUniform",       # Experiment 5: Continuous kj with Uniform distribution
  "sn_kjContinuousBeta",          # Experiment 6: Continuous kj with Beta distribution
  "xiNearNormal_kjContinuousUniform", # Experiment 7: Near-normal Xi with Continuous Uniform kj
  "xiNearNormal_kjContinuousBeta", # Experiment 8: Near-normal Xi with Continuous Beta kj
  "xiNearNormalWithNoise_kjContinuousBeta", # Experiment 9: Same as Exp 8 but with noises to xi
]

You can find the explanation to these terms from our paper.

params_file: The path to the parameters in json. Example is ./pysaebm/data/params.json. You should specify each biomarker's theta_mean, theta_std, phi_mean, and phi_std.
js: An array of integers indicating the number of participants you want.
rs: An array of floats indicating the number of healthy ratios.
num_of_datasets_per_combination: The number of repetitions for each j-r combination.
output_dir: The directory where you want to save the generated data.
seed: An integer serving as the seed for the randomness.
dirichlet_alpha: This should be a dictionary where keys are uniform and multinomial. They correspond to kjOrdinalUniform and kjOrdinalDM.
beta_params: A dictionary where keys are near_normal, uniform, and regular, corresponding to xiNearNormal, kjContinuousUniform and kjContinuousBeta.
prefix: Optional prefix for the output csv file names.
suffix: Optional suffix for the output csv file names.
keep_all_cols: Whether to include additional metadata columns (k_j, event_time, affected)
fixed_biomarker_order: If True, will use the order as in the params_file. If False, will randomize the ordering.
noise_std_parameter: the parameter in N(0, N \cdot noise_std_parameter) in experiment 9

Note that you need to make sure the dirichlet_alpha['multinomial'] has the same length as your params dict (as in your params_file).

Run EBM Algorithms

Examples of how to run algorithms and get results is at ./pysaebm/test/test.py.

This explains the parameters well enough:

def run_ebm(
   algorithm: str,
   data_file: str,
   output_dir: str,
   output_folder: Optional[str] = None,
   n_iter: int = 2000,
   n_shuffle: int = 2,
   burn_in: int = 500,
   thinning: int = 1,
   true_order_dict: Optional[Dict[str, int]] = None,
   true_stages: Optional[List[int]] = None,
   plot_title_detail: Optional[str] = "",
   fname_prefix: Optional[str] = "",
   skip_heatmap: Optional[bool] = False,
   skip_traceplot: Optional[bool] = False,
   # Strength of the prior belief in prior estimate of the mean (μ), set to 1 as default
   prior_n: float = 1.0,
   # Prior degrees of freedom, influencing the certainty of prior estimate of the variance (σ²), set to 1 as default
   prior_v: float = 1.0,
   bw_method: str = 'scott',
   seed: int = 123,
) -> Dict[str, Union[str, int, float, Dict, List]]:
   """
   Run the metropolis hastings algorithm and save results


   Args:
       algorithm (str): Choose from 'hard_kmeans', 'mle', 'em', 'kde', and 'conjugate_priors'.
       data_file (str): Path to the input CSV file with biomarker data.
       output_dir (str): Path to the directory to store all the results.
       output_folder (str): Optional. If not provided, all results will be saved to output_dir/algorithm.
           If provided, results will be saved to output_dir/output_folder
       n_iter (int): Number of iterations for the Metropolis-Hastings algorithm.
       n_shuffle (int): Number of shuffles per iteration.
       burn_in (int): Burn-in period for the MCMC chain.
       thinning (int): Thinning interval for the MCMC chain.
       true_order_dict (Optional[Dict[str, int]]): biomarker name: the correct order of it (if known)
       true_stages (Optional[List[int]]): true stages for all participants (if known)
       plot_title_detail (Optional[str]): optional string to add to plot title, as suffix.
       fname_prefix (Optional[str]): the prefix of heatmap, traceplot, results.json, and logs file, e.g., 5_50_0_heatmap_conjugate_priors.png
           In the example, there are no prefix strings.
       skip_heatmap (Optional[bool]): whether to save heatmaps. True if you want to skip saving heatmaps and save space.
       skip_traceplot (Optional[bool]): whether to save traceplots. True if you want to skip saving traceplots and save space.
       prior_n (strength of belief in prior of mean): default to be 1.0
       prior_v (prior degree of freedom) are the weakly informative priors, default to be 1.0
       bw_method (str): bandwidth selection method in kde
       seed (int): for reproducibility

   Returns:
       Dict[str, Union[str, int, float, Dict, List]]: Results including everything, e.g., Kendall's tau and p-value.
   """

Some extra explanations:

n_iter: In general, above 2k is recommended. 10k should be sufficient if you have <= 10 biomarkers.
burn_in and thinning: The idea behind the two parameters is that we will only use some of the results from all iterations in n_iter. We will do this: if (i > burn_in) & (i % thinning == 0), then we will use the result from that iteration i. Usually, we set thinning to be 1.

After running the run_ebm, you'll see a folder named as your output_dir. Each algorithm will have its subfolders.

The results are organized this way:

records folder contains the loggings.
heatmaps folder contains all the heatmaps. An example is

An example of heatmap

Biomarkers in the y-axis are ranked according to the ordering that has the highest likelihood among all iterations. Each cell indicates the probability that a certain biomarker falls in a certain stage. Note, however, these probabilities are calculated based on all the iterations that satisfy (i > burn_in) & (i % thinning == 0).

In the heatmap, the sum of each col and each row is 1.

traceplots folder contains all the traceplots (starting from iteration 40, not iteration 0). Those plots will be useful to diagnose whether EBM algorithms are working correctly. It's totally okay for the plots to show fluctuation (because biomarker distribution and stage distributions are re-calculated each iterations). You should not, however, see a clear downward trend.

An example of traceplot

results folder contains all important results in json. Each file contains

results = {
        "algorithm": algorithm,
        "runtime": end_time - start_time,
        "N_MCMC": n_iter,
        "n_shuffle": n_shuffle,
        "burn_in": burn_in,
        "thinning": thinning,
        'healthy_ratio': healthy_ratio,
        "max_log_likelihood": float(max(log_likelihoods)),
        "kendalls_tau": tau,
        "p_value": p_value,
        "mean_absolute_error": mae,
        'current_pi': current_pi.tolist(),
        # updated pi is the pi for all stages, including 0
        'updated_pi': updated_pi.tolist(),
        'true_order': true_order_result,
        "order_with_highest_ll": {k: int(v) for k, v in zip(biomarker_names, order_with_highest_ll)},
        "true_stages": true_stages,
        'ml_stages': ml_stages,
        "stage_likelihood_posterior": final_stage_post_dict,
        "final_theta_phi_params": final_theta_phi_dict,
    }

algorithm is the algorithm you are running with.
runtime in seconds.
N_MCMC is the number of MCMC iterations.
n_shuffle is how many biomarkers to shuffle places in Metropolis Hastings. Default is 2.
burn_in and thinning are explained above.
healthy_ratio is the percentage of non-diseased participants in the dataset.
max_log_likelihood is the max data log likelihood. The script mh generates all_accepted_orders, log_likelihoods, current_theta_phi, current_stage_post, and current_pi. max_log_likelihoods is the max of log_likelihoods.
kendalls_tau and p_value are the result of comparing the order_with_highest_ll with the true order (if provided).
mean_absolute_error is the result of comparing ml_stages and true_stages (if provided). This measurement is for when we include estimation of healthy participants as well. That is to say, we estimate their disease stages without knowing they are healthy.
current_pi is the probability distribution of all disease stages.
current_pi is the probability distribution of all stages, including 0.
true_order is the dictionary explaining the stage each biomarker is at.
order_with_highest_ll is the ordering that corresponds to the max_log_likelihoods.
true_stages is an array of each participant's disease stage.
ml_stages is the array of most likely disease stages.
true_stages_diseased is the array of actual stages for diseased participants only.
'ml_stages_diseased is the array of the estimated stages for diseased participants only.
stage_likelihood_posterior is a dictionary detailing each participant's probability in each of the possible stages.
final_theta_phi_params is a dictionary detailing the parameters for each biomarker. If you are using kde, you'll see theta_weights and phi_weights. If you use other algorithms, you will see the theta_mean, theta_std, phi_mean and phi_std.

Use your own data

You are more than welcome to use your own data. After all, the very purpose of pysaebm is: to allow you to analyze your own data. However, you do have to make sure that the input data have at least four columns:

participant: int
biomarker: str
measurement: float
diseased: bool

The participant column should be integers from 0 to J-1 where J is the number of participants.

Samples are available at ./pysaebm/data/samples/.

The data should be in a tidy format, i.e.,

Each variable is a column.
Each observation is a row.
Each type of observational unit is a table.

Change Log

2025-08-03 (V 0.0.5)
- Implemented the numpy version.
- Implemented numba in functions.

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

0.0.6

Aug 4, 2025

This version

0.0.5

Aug 4, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastsaebm-0.0.5.tar.gz (54.7 kB view details)

Uploaded Aug 4, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

fastsaebm-0.0.5-py3-none-any.whl (52.6 kB view details)

Uploaded Aug 4, 2025 Python 3

File details

Details for the file fastsaebm-0.0.5.tar.gz.

File metadata

Download URL: fastsaebm-0.0.5.tar.gz
Upload date: Aug 4, 2025
Size: 54.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.19

File hashes

Hashes for fastsaebm-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`9ddbd0748145ccb3a22a57f87228523809ee1845e6cbee71cc1a2297a29f41a3`
MD5	`3d3f04e7fde0975433ab13b3e5d88444`
BLAKE2b-256	`50ae2861717bac861015fe0bfba4425cf3806aa40648319fcc9fbd6069c8e4a6`

See more details on using hashes here.

File details

Details for the file fastsaebm-0.0.5-py3-none-any.whl.

File metadata

Download URL: fastsaebm-0.0.5-py3-none-any.whl
Upload date: Aug 4, 2025
Size: 52.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.19

File hashes

Hashes for fastsaebm-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`64b088ae704b5c9c17939148aef00708cda054998a7c70611d3b38fd3989a531`
MD5	`f875e1353986eae85b3892912c959979`
BLAKE2b-256	`1ee8d671ad7b906f9e2e88b1b3698c6376d464fcc6b70692613e2db45f67907a`

See more details on using hashes here.

fastsaebm 0.0.5

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

`pysaebm`

Installation

Data generation

Run EBM Algorithms

Use your own data

Change Log

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes