Progressive Genome Segment Enhancement (PGSE)
Overview
PGSE is an algorithm for predicting phenotypes from whole genome sequencing (WGS) data. It was initially developed for the prediction of antimicrobial minimum inhibitory concentration (MIC) in bacterial strains. Compared with traditional $k$-mer based XGBoost models, PGSE has higher accuracy, lower memory consumption, and shorter runtime. PGSE can also run on distributed systems.
Contributors
Dr Yinzheng (William) Zhong, University of Liverpool (algorithm design & Python implementation)
Dr Alessandro Gerada, University of Liverpool (conceptualisation, R package, funding)
Prof William Hope, University of Liverpool (conceptualisation, funding, supervision)
License
This project is licensed under the PolyForm Noncommercial License 1.0.0. See the LICENSE.md file for details.
Installation
PyPI
Make sure Python is installed (3.9 or later) and install pgse from PyPI:
pip install pgse
Conda
To use in a conda environment:
conda create -n pgse python=3.11
conda activate pgse
python -m pip install pgse
pgse is now available to import.
R
To use PGSE through R, install the package in an R session using:
install.packages("devtools")
devtools::install_github("yinzheng-zhong/PGSE", subdir = "R-package")
Usage
Training
Single node/machine
Import the pipeline from the package and run it as shown below. You can use your own argument parser or the one provided by pgse. Alternatively, you can instantiate the pipeline through a wrapper that provides the parameters directly.
# You can use your own argument parser or use the one provided by pgse.
# Or instantiate the pipeline with a wrapper that provides the parameters directly.
from pgse.environment.args import get_parser
from pgse import TrainingPipeline

if __name__ == "__main__":
    parser = get_parser()
    args = parser.parse_args()

    pipeline = TrainingPipeline(
        args.data_dir,
        args.label_file,
        args.pre_kfold_info_file,
        args.save_file,
        args.export_file,
        args.k,
        args.ext,
        args.target,
        args.features,
        args.folds,
        args.ea_min,
        args.ea_max,
        args.num_rounds,
        args.lr,
        args.dist,
        args.nodes,
        args.workers
    )

    pipeline.run()
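The wrapper mentioned above could look like the following minimal sketch. `TrainingConfig` and `as_positional_args` are hypothetical names and are not part of pgse. The field order mirrors the positional order of the `TrainingPipeline` call shown above. The field types, and the default of 1 for `nodes`, are assumptions; the other defaults are taken from the example `pgse-train` command in this README.

```python
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass
class TrainingConfig:
    """Hypothetical parameter holder (not part of pgse).

    Field order follows the positional order of the TrainingPipeline call
    in this README; the types are assumptions.
    """
    data_dir: str
    label_file: str
    pre_kfold_info_file: Optional[str] = None
    save_file: Optional[str] = None
    export_file: Optional[str] = None
    k: int = 6
    ext: int = 2
    target: int = 70
    features: int = 10000
    folds: int = 5
    ea_min: int = 0
    ea_max: int = 64
    num_rounds: int = 6000
    lr: float = 0.001
    dist: int = 0
    nodes: int = 1  # assumed default, not shown in the README
    workers: int = 8

    def as_positional_args(self) -> tuple:
        # dataclasses.astuple preserves the field declaration order.
        return astuple(self)


config = TrainingConfig(data_dir="data/", label_file="labels.csv")
positional = config.as_positional_args()
print(positional)

# With pgse installed, the tuple could then be unpacked into the pipeline:
#   from pgse import TrainingPipeline
#   TrainingPipeline(*positional).run()
```

Keeping the parameters in one object makes it easier to log or reuse a configuration across runs without going through the command line.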
Alternatively, to run PGSE as a standalone program on a local machine, install the package and use the following command as an example:
pgse-train \
--label-file "../<path_to>/<your_labels>.csv" \
--data-dir "../<your_data_dir>/" \
--pre-kfold-info-file "../<k_fold_information>.json" \
--save-file "../<saved progress>.save" \
--export-file "../<exported files>" \
--workers 8 \
--features 10000 \
--dist 0 \
--k 6 \
--target 70 \
--ext 2 \
--lr 0.001 \
--num-rounds 6000 \
--folds 5 \
--ea-max 64 \
--ea-min 0
- --label-file (Required): path to the .csv label file. The label file is a CSV file with the following format:

  | labels | files     |
  | ------ | --------- |
  | 7      | file1.fna |
  | 7      | file2.fna |
  | 6      | file3.fna |

  The labels are the target values for the prediction task. The files are the file names (.fna files under --data-dir) containing the genome sequences.
- --data-dir (Required): path to the data directory containing the .fna files. PGSE retrieves the genome sequences using this path and the file names in the label file.
- --pre-kfold-info-file: path to the predefined k-fold info JSON file. This is not required but is useful if you want to compare PGSE with other systems. Without it, PGSE splits the data into k folds randomly using a fixed seed. E.g.

      {
        "fold_0": [
          "Sample_208-MOLMIC_E33.scaffolds.fna",
          "Sample_726-MOLMIC_F29.scaffolds.fna",
          "Sample_474-MOLMIC_I14.scaffolds.fna",
          "Sample_111-MOLMIC_C61.scaffolds.fna",
          "Sample_087-MOLMIC_C25.scaffolds.fna",
          "Sample_467-MOLMIC_I6.scaffolds.fna",
          "..."
        ],
        "fold_1": [
          "Sample_208-MOLMIC_E33.scaffolds.fna",
          "Sample_726-MOLMIC_F29.scaffolds.fna",
          "Sample_474-MOLMIC_I14.scaffolds.fna",
          "Sample_111-MOLMIC_C61.scaffolds.fna",
          "Sample_087-MOLMIC_C25.scaffolds.fna",
          "Sample_467-MOLMIC_I6.scaffolds.fna",
          "..."
        ],
        "...": [
          "..."
        ]
      }

- --save-file: file to save the progress. This is useful if you want to resume the training process.
- --export-file: file to export the results, normally given without an extension. This name is used to store the selected genome segments in a .txt file and the trained model in a .json file.
- --workers: number of workers per node.
- --features: maximum number of features to keep after the feature importance calculation and ranking.
- --dist: whether to use distributed computation. 0 for running on a single node/machine, 1 for running on multiple nodes.
- --k: initial k-mer size.
- --target: maximum segment length to extend to.
- --ext: extension length in each round. This is the extension parameter p from the paper.
- --lr: learning rate.
- --num-rounds: maximum number of rounds for the training process.
- --folds: number of folds for the k-fold cross-validation.
- --ea-max: maximum number of censored essential agreement values. This is not needed unless you want more accurate EA information in the console output during training.
- --ea-min: minimum number of censored essential agreement values. Similar to --ea-max.
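The label file and k-fold info formats described above can be produced with a few lines of standard-library Python. The following is a minimal sketch rather than PGSE's own code: `read_label_file` and `make_fold_info` are hypothetical helpers, and the sketch assumes each file belongs to exactly one `fold_i` list, which this README does not state explicitly.

```python
import csv
import json
import random
import tempfile
from pathlib import Path


def read_label_file(label_file):
    """Return (label, file name) pairs from a CSV with `labels` and `files` columns."""
    with open(label_file, newline="") as handle:
        return [(row["labels"], row["files"]) for row in csv.DictReader(handle)]


def make_fold_info(file_names, folds, seed=0):
    """Shuffle the file names with a fixed seed and deal them into `folds` lists."""
    shuffled = list(file_names)
    random.Random(seed).shuffle(shuffled)
    return {f"fold_{i}": shuffled[i::folds] for i in range(folds)}


with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny label file in the format shown in the table above.
    label_path = Path(tmp) / "labels.csv"
    with open(label_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["labels", "files"])
        writer.writerows([(7, "file1.fna"), (7, "file2.fna"), (6, "file3.fna")])

    rows = read_label_file(label_path)  # labels are read back as strings
    fold_info = make_fold_info([name for _, name in rows], folds=3)

    # Save the fold assignment as JSON, keyed fold_0, fold_1, ...
    fold_path = Path(tmp) / "kfold.json"
    fold_path.write_text(json.dumps(fold_info, indent=2))
    print(fold_path.read_text())
```

Writing the fold assignment once and passing the same JSON file to every system under comparison keeps the splits identical across experiments.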
Distributed computation
To run PGSE on a distributed system, you need an environment-specific setup. Several examples of running PGSE with Slurm are provided under the slurm-scripts directory.
- job-pgse-array.sh: Run PGSE on a cluster using Slurm with multiple nodes for multiple antibiotics using array jobs. Here --dist is set to 0 as each task runs separately.
- job-pgse-dist.sh: Run PGSE on a cluster using Slurm with multiple nodes for a single antibiotic. Here --dist is set to 1 as the task runs across different nodes.
- job-pgse-single.sh: Run PGSE on a Slurm cluster with a single node for a single antibiotic. Here --dist is set to 0.
Inferencing
An example of how this can be done is provided in main-pgse-inf.py.
from pgse import InferencePipeline

MODEL_PATH = '../volatile/var/result-k6-CAZ-perf_fold_0.json'
SEGMENT_PATH = '../volatile/var/result-k6-CAZ-perf_fold_0.csv'

if __name__ == "__main__":
    # Instantiate the pipeline
    pipeline = InferencePipeline(MODEL_PATH, SEGMENT_PATH, workers=8)

    # files as a list of paths to the fasta files
    EG_1 = [
        '../volatile/cgr/Sample_002-MOLMIC_B2.scaffolds.fna',
        '../volatile/cgr/Sample_394-MOLMIC_H8.scaffolds.fna',
        '../volatile/cgr/Sample_385-MOLMIC_G79.scaffolds.fna',
        '../volatile/cgr/Sample_622-MOLMIC_K68.scaffolds.fna',
        '../volatile/cgr/Sample_252-MOLMIC_F2.scaffolds.fna',
        '../volatile/cgr/Sample_208-MOLMIC_E33.scaffolds.fna',
        '../volatile/cgr/Sample_443-MOLMIC_H62.scaffolds.fna',
        '../volatile/cgr/Sample_565-MOLMIC_J66.scaffolds.fna',
        '../volatile/cgr/Sample_339-MOLMIC_G29.scaffolds.fna',
        '../volatile/cgr/Sample_418-MOLMIC_H33.scaffolds.fna',
    ]
    result_1 = pipeline.run(EG_1)
    print(result_1)

    EG_2 = [
        '../volatile/cgr/Sample_394-MOLMIC_H8.scaffolds.fna',
        '../volatile/cgr/Sample_385-MOLMIC_G79.scaffolds.fna',
        '../volatile/cgr/Sample_622-MOLMIC_K68.scaffolds.fna',
        '../volatile/cgr/Sample_252-MOLMIC_F2.scaffolds.fna'
    ]
    result_2 = pipeline.run(EG_2)
    print(result_2)
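Instead of typing the paths by hand, the list of FASTA files can be collected from a directory. The sketch below uses only the standard library; `list_fasta_files` is a hypothetical helper and is not part of pgse.

```python
import tempfile
from pathlib import Path


def list_fasta_files(data_dir, suffix=".fna"):
    """Return sorted paths (as strings) of the files in data_dir ending in suffix."""
    return sorted(
        str(path)
        for path in Path(data_dir).iterdir()
        if path.is_file() and path.name.endswith(suffix)
    )


with tempfile.TemporaryDirectory() as tmp:
    # Create two dummy FASTA files and one unrelated file for the demonstration.
    for name in ("b.scaffolds.fna", "a.scaffolds.fna", "notes.txt"):
        (Path(tmp) / name).write_text(">contig\nACGT\n")

    files = list_fasta_files(tmp)
    file_names = [Path(f).name for f in files]
    print(file_names)

# With pgse installed, the list could be passed to the pipeline as shown above:
#   result = pipeline.run(files)
```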
To run the inference pipeline as a standalone program, install the package and use the following command as an example:
pgse-predict \
--model-file "../<path_to_model>.json" \
--segment-file "../<path_to_segment>.csv" \
--data-dir "../<your_data_dir>/" \
--workers 8
R package
To use PGSE through the R package, consult the package
[documentation](https://github.com/yinzheng-zhong/PGSE/tree/main/R-package/).
For Development
To build the package, run the following command:
```bash
rm -rf dist/ build/ pgse.egg-info/
python -m build
```
Then upload the package to PyPI using:
python -m twine upload dist/*
To install the package locally, run:
pip install -e .
Acknowledgements
This work was funded, in part, by UKRI and the Wellcome Trust.
This work was undertaken on Barkla, part of the High Performance Computing facilities at the University of Liverpool, UK.
Common Issues
XGBoost training is only using one core.
Some Linux distributions need the environment variable OMP_NUM_THREADS=<num threads> to be set before XGBoost can use multiple cores.
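When PGSE is driven from a Python script, the variable can also be set from within the script. This is a minimal sketch: using `os.cpu_count()` as the thread count is an assumption, and setting the variable before importing xgboost or pgse is a precaution so that the OpenMP runtime sees it when it starts.

```python
import os

# Hypothetical choice: use every core the OS reports; adjust as needed.
num_threads = os.cpu_count() or 1

# Set the variable before importing xgboost or pgse.
os.environ["OMP_NUM_THREADS"] = str(num_threads)

print(os.environ["OMP_NUM_THREADS"])
```

This is meant to have the same effect as exporting OMP_NUM_THREADS in the shell before launching the program.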
Q & A
Why do we perform feature partitioning?
There are four reasons why feature partitioning is crucial in PGSE. First, feature partitioning is a memory reduction technique. The model is trained on a subset of the features at a time, so memory consumption is reduced and RAM usage stays relatively stable regardless of the total number of features. Second, feature partitioning helps to parallelise the training process. Each partition can be trained on a different worker across different nodes. This is particularly useful because XGBoost training accounts for most of the time in the training process. Third, from the experiments we have conducted, we found that feature dimensionality affects the model's optimal hyperparameters. For example, higher feature dimensionality generally requires a shallower tree depth. PGSE is a dynamic system, and the total number of features can differ in each round. Therefore, partitioning the features into similarly sized subsets helps to minimise the impact of feature dimensionality on the model's hyperparameters. Finally, feature partitioning helps to preserve the feature importance information from XGBoost. Likely due to the pruning process, more feature importance information is lost (becomes 0) as the dimensionality increases.
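The idea of splitting the features into similarly sized partitions can be sketched as follows. This is an illustration only, not PGSE's implementation: `partition_features` and `max_partition_size` are hypothetical names, and the sketch splits feature indices into contiguous ranges whose sizes differ by at most one.

```python
import math


def partition_features(n_features, max_partition_size):
    """Split feature indices 0..n_features-1 into similarly sized contiguous ranges."""
    if n_features <= 0:
        return []
    # Smallest number of partitions that keeps every partition within the size limit.
    n_parts = math.ceil(n_features / max_partition_size)
    base, extra = divmod(n_features, n_parts)
    partitions, start = [], 0
    for i in range(n_parts):
        # The first `extra` partitions take one additional feature each.
        size = base + (1 if i < extra else 0)
        partitions.append(range(start, start + size))
        start += size
    return partitions


parts = partition_features(n_features=10, max_partition_size=4)
print([list(p) for p in parts])
# prints [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each partition could then be handed to a separate worker, so the model in each worker always sees a similar number of features regardless of how the total changes between rounds.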
Why do we eliminate features?
If segment A is extended into segment B, A becomes a subsequence of B. For pairs like A and B, we only need to keep the
one with the higher feature importance. Extension and elimination are two crucial parts of the PGSE system: extension grows the
genome segments longer, and elimination guarantees that the growth will eventually stop. Additionally, elimination
guarantees the convergence of the system, as the feature dimensionality will start decreasing at some point until
all features stop growing.
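One possible reading of this elimination rule is sketched below. It is not PGSE's implementation: `eliminate_nested` is a hypothetical helper, it treats any substring relationship as a nested pair (rather than tracking which segment was extended from which), and it uses a simple greedy pass that keeps a segment only if it is not nested with a more important segment that has already been kept.

```python
def eliminate_nested(importance):
    """Keep the more important member of each nested (substring) pair.

    `importance` maps a segment string to its importance score.
    Ties are resolved in favour of the longer segment.
    """
    # Visit segments from most to least important, longer segments first on ties.
    ranked = sorted(importance, key=lambda seg: (-importance[seg], -len(seg)))
    kept = []
    for seg in ranked:
        # A segment is dropped if it contains, or is contained in, a kept segment.
        nested = any(seg in other or other in seg for other in kept)
        if not nested:
            kept.append(seg)
    return kept


scores = {
    "ACGTAC": 0.2,    # contained in ACGTACGT, which scores higher
    "ACGTACGT": 0.5,
    "TTGACA": 0.1,
    "GGTTGACA": 0.05,  # contains TTGACA, which scores higher
}
kept_segments = eliminate_nested(scores)
print(kept_segments)
# prints ['ACGTACGT', 'TTGACA']
```

Because every nested pair loses at least one member, the number of segments cannot grow without bound, which matches the convergence argument above.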