Progressive Genome Segment Enhancement (PGSE)

Overview

PGSE is an algorithm for predicting phenotypes from whole genome sequencing (WGS) data. It was initially developed for the prediction of antimicrobial minimum inhibitory concentration (MIC) in bacterial strains. PGSE achieves higher accuracy, lower memory consumption, and shorter runtime than traditional $k$-mer-based XGBoost models. PGSE can also run on distributed systems.

Contributors

Dr Yinzheng (William) Zhong, University of Liverpool (algorithm design & Python implementation)

Dr Alessandro Gerada, University of Liverpool (conceptualisation, R package, funding)

Prof William Hope, University of Liverpool (conceptualisation, funding, supervision)

License

This project is licensed under the PolyForm Noncommercial License 1.0.0. See the LICENSE.md file for details.

Installation

PyPI

Make sure Python is installed (3.9 or later) and install pgse from PyPI:

pip install pgse

Conda

To use PGSE in a conda environment:

conda create -n pgse python=3.11
conda activate pgse
python -m pip install pgse

pgse is now available to import.

R

To use PGSE through R, install the package in an R session using:

install.packages("devtools")
devtools::install_github("yinzheng-zhong/PGSE", subdir = "R-package")

Usage

Training

Single node/machine

Import the pipeline from the package and run it as shown here. You can use your own argument parser or the one provided by pgse. Alternatively, you can instantiate the pipeline through a wrapper that supplies the parameters directly.

# You can use your own argument parser or use the one provided by pgse.
# Or instantiate the pipeline with a wrapper that provides the parameters directly.
from pgse.environment.args import get_parser
from pgse import TrainingPipeline

if __name__ == "__main__":
  parser = get_parser()
  args = parser.parse_args()

  pipeline = TrainingPipeline(
    args.data_dir,
    args.label_file,
    args.pre_kfold_info_file,
    args.save_file,
    args.export_file,
    args.k,
    args.ext,
    args.target,
    args.features,
    args.folds,
    args.ea_min,
    args.ea_max,
    args.num_rounds,
    args.lr,
    args.dist,
    args.nodes,
    args.workers
  )

  pipeline.run()
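The wrapper option mentioned above could look like the following minimal sketch. `TrainingParams` and `run_training` are hypothetical names that are not part of pgse. The field order mirrors the positional arguments in the example above, and the default values are borrowed from the pgse-train example in this README, so they are not necessarily the package's own defaults. Whether `pre_kfold_info_file` may be `None` and the value of `nodes` are assumptions.

```python
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass
class TrainingParams:
    """Hypothetical parameter holder; field order matches the positional call above."""
    data_dir: str
    label_file: str
    pre_kfold_info_file: Optional[str]  # assumption: None means no predefined folds
    save_file: str
    export_file: str
    k: int = 6
    ext: int = 2
    target: int = 70
    features: int = 10000
    folds: int = 5
    ea_min: int = 0
    ea_max: int = 64
    num_rounds: int = 6000
    lr: float = 0.001
    dist: int = 0
    nodes: int = 1  # assumption: one node when dist is 0
    workers: int = 8


def run_training(params: TrainingParams) -> None:
    """Build the pipeline from the parameter holder and run it."""
    # Imported lazily so this module can be loaded without pgse installed.
    from pgse import TrainingPipeline

    # astuple() returns the field values in declaration order.
    pipeline = TrainingPipeline(*astuple(params))
    pipeline.run()
```

Keeping the parameters in one object makes it easier to log or serialise a run's configuration, at the cost of having to keep the field order in sync with the pipeline's signature.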

Alternatively, to run PGSE as a standalone program on a local machine, install the package and use the following command as an example:

pgse-train \
        --label-file "../<path_to>/<your_labels>.csv" \
        --data-dir "../<your_data_dir>/" \
        --pre-kfold-info-file "../<k_fold_information>.json" \
        --save-file "../<saved progress>.save" \
        --export-file "../<exported files>" \
        --workers 8 \
        --features 10000 \
        --dist 0 \
        --k 6 \
        --target 70 \
        --ext 2 \
        --lr 0.001 \
        --num-rounds 6000 \
        --folds 5 \
        --ea-max 64 \
        --ea-min 0

  • --label-file (Required): path to the .csv label file.

    Here the label file is a csv file with the following format:

    | labels | files     |
    | ------ | --------- |
    | 7      | file1.fna |
    | 7      | file2.fna |
    | 6      | file3.fna |
    

    The labels are the target values for the prediction task. The files are the file names (.fna files under --data-dir) containing the genome sequences.

  • --data-dir (Required): path to the data directory containing the .fna files. PGSE locates the genome sequences by combining this path with the file names in the label file.

  • --pre-kfold-info-file: path to the predefined k-fold info JSON file. This is optional, but useful if you want to compare PGSE with other systems on the same splits. Without it, PGSE splits the data into k folds randomly using a fixed seed. For example:

    {
        "fold_0": [
            "Sample_208-MOLMIC_E33.scaffolds.fna",
            "Sample_726-MOLMIC_F29.scaffolds.fna",
            "Sample_474-MOLMIC_I14.scaffolds.fna",
            "Sample_111-MOLMIC_C61.scaffolds.fna",
            "Sample_087-MOLMIC_C25.scaffolds.fna",
            "Sample_467-MOLMIC_I6.scaffolds.fna",
            "..."
        ],
        "fold_1": [
            "Sample_208-MOLMIC_E33.scaffolds.fna",
            "Sample_726-MOLMIC_F29.scaffolds.fna",
            "Sample_474-MOLMIC_I14.scaffolds.fna",
            "Sample_111-MOLMIC_C61.scaffolds.fna",
            "Sample_087-MOLMIC_C25.scaffolds.fna",
            "Sample_467-MOLMIC_I6.scaffolds.fna",
            "..."
        ],
        "...": [
            "..."
        ]
    }
    
  • --save-file: file to save the progress. This is useful if you want to resume the training process.

  • --export-file: file to export the results, normally given without an extension. This name is used to store the selected genome segments in a .txt file and the trained model in a .json file.

  • --workers: number of workers per node.

  • --features: Maximum number of features to keep after the feature importance calculation and ranking.

  • --dist: Using distributed computation or not. 0 for running on a single node/machine, 1 for running on multiple nodes.

  • --k: initial k-mer size.

  • --target: Maximum segment length to extend to.

  • --ext: Extension length in each round. Extension parameter p from the paper.

  • --lr: learning rate.

  • --num-rounds: Maximum rounds for the training process.

  • --folds: Number of folds for the k-fold cross-validation.

  • --ea-max: Maximum number of censored essential agreement (EA) values. You only need this if you want more accurate EA information in the console output during training.

  • --ea-min: Minimum number of censored essential agreement (EA) values. Used in the same way as --ea-max.
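The label file and the optional k-fold file described in the option list above can be produced with a few lines of standard-library Python. The sketch below is not part of pgse: the helper names `write_label_file` and `write_kfold_file` are hypothetical, and it assumes that each file is assigned to exactly one fold, which the option description does not state explicitly.

```python
import csv
import json
import random
from pathlib import Path


def write_label_file(labels_by_file, out_path):
    """Write a CSV with the `labels` and `files` columns shown above."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["labels", "files"])
        for file_name, label in labels_by_file.items():
            writer.writerow([label, file_name])


def write_kfold_file(file_names, folds, out_path, seed=0):
    """Shuffle file names with a fixed seed and deal them into fold_0, fold_1, ..."""
    shuffled = sorted(file_names)
    random.Random(seed).shuffle(shuffled)
    # Round-robin slicing puts every file in exactly one fold (an assumption).
    assignment = {f"fold_{i}": shuffled[i::folds] for i in range(folds)}
    Path(out_path).write_text(json.dumps(assignment, indent=4))
    return assignment
```

Writing the folds to a file once and passing it through --pre-kfold-info-file lets other tools be evaluated on the same splits.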

Distributed computation

To run PGSE on a distributed system, use the setup specific to your environment. The slurm-scripts directory contains several examples of running PGSE with Slurm:

  • job-pgse-array.sh: Run PGSE on a cluster using Slurm with multiple nodes for multiple antibiotics using array jobs. Here --dist is set to 0 as each task runs separately.
  • job-pgse-dist.sh: Run PGSE on a cluster using Slurm with multiple nodes for a single antibiotic. Here --dist is set to 1 as the task runs across different nodes.
  • job-pgse-single.sh: Run PGSE on a Slurm cluster with a single node for a single antibiotic. Here --dist is set to 0.

Inference

An example of how this can be done is provided in main-pgse-inf.py.

from pgse import InferencePipeline

MODEL_PATH = '../volatile/var/result-k6-CAZ-perf_fold_0.json'
SEGMENT_PATH = '../volatile/var/result-k6-CAZ-perf_fold_0.csv'

if __name__ == "__main__":
    # Instantiate the pipeline
    pipeline = InferencePipeline(MODEL_PATH, SEGMENT_PATH, workers=8)

    # files as a list of paths to the fasta files
    EG_1 = [
        '../volatile/cgr/Sample_002-MOLMIC_B2.scaffolds.fna',
        '../volatile/cgr/Sample_394-MOLMIC_H8.scaffolds.fna',
        '../volatile/cgr/Sample_385-MOLMIC_G79.scaffolds.fna',
        '../volatile/cgr/Sample_622-MOLMIC_K68.scaffolds.fna',
        '../volatile/cgr/Sample_252-MOLMIC_F2.scaffolds.fna',
        '../volatile/cgr/Sample_208-MOLMIC_E33.scaffolds.fna',
        '../volatile/cgr/Sample_443-MOLMIC_H62.scaffolds.fna',
        '../volatile/cgr/Sample_565-MOLMIC_J66.scaffolds.fna',
        '../volatile/cgr/Sample_339-MOLMIC_G29.scaffolds.fna',
        '../volatile/cgr/Sample_418-MOLMIC_H33.scaffolds.fna',
    ]

    result_1 = pipeline.run(EG_1)
    print(result_1)

    EG_2 = [
        '../volatile/cgr/Sample_394-MOLMIC_H8.scaffolds.fna',
        '../volatile/cgr/Sample_385-MOLMIC_G79.scaffolds.fna',
        '../volatile/cgr/Sample_622-MOLMIC_K68.scaffolds.fna',
        '../volatile/cgr/Sample_252-MOLMIC_F2.scaffolds.fna'
    ]

    result_2 = pipeline.run(EG_2)
    print(result_2)
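Rather than listing every path by hand, the file list can be collected from a directory. The following is a minimal sketch using only the standard library; `collect_fasta` and `in_batches` are hypothetical helpers that are not part of pgse, and the `.fna` suffix is taken from the examples above.

```python
from pathlib import Path


def collect_fasta(data_dir, suffix=".fna"):
    """Return sorted path strings for every file in data_dir ending with suffix."""
    return sorted(
        str(p) for p in Path(data_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


def in_batches(paths, batch_size):
    """Yield consecutive slices of paths with at most batch_size items each."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]
```

Each batch can then be passed to `pipeline.run(batch)` in the same way as `EG_1` and `EG_2` above.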

To run the inference pipeline as a standalone program, install the package and use the following command as an example:

pgse-predict \
        --model-file "../<path_to_model>.json" \
        --segment-file "../<path_to_segment>.csv" \
        --data-dir "../<your_data_dir>/" \
        --workers 8

R package

To use PGSE through the R package, consult the package documentation at https://github.com/yinzheng-zhong/PGSE/tree/main/R-package/.

For Development

To build the package, run the following commands:

rm -rf dist/ build/ pgse.egg-info/
python -m build

Then upload the package to PyPI using:

python -m twine upload dist/*

To install the package locally, run:

pip install -e .

Acknowledgements

This work was funded, in part, by UKRI and the Wellcome Trust.

This work was undertaken on Barkla, part of the High Performance Computing facilities at the University of Liverpool, UK.

Common Issues

XGBoost training is only using one core.

Some Linux distributions need the environment variable OMP_NUM_THREADS=<num threads> to be set before XGBoost can use multiple cores.
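When the variable cannot be exported in the shell (for example with `export OMP_NUM_THREADS=8`), it can also be set from Python. The sketch below uses only the standard library; `ensure_omp_threads` is a hypothetical helper, and it assumes the value needs to be in place before pgse or xgboost is imported, which is the usual behaviour of OpenMP runtimes.

```python
import os


def ensure_omp_threads(num_threads=None):
    """Set OMP_NUM_THREADS if it is not already set and return its value."""
    if num_threads is None:
        num_threads = os.cpu_count() or 1
    # setdefault leaves an existing value (e.g. one set by Slurm) untouched.
    os.environ.setdefault("OMP_NUM_THREADS", str(num_threads))
    return os.environ["OMP_NUM_THREADS"]


# Call this before importing pgse or xgboost.
ensure_omp_threads(8)
```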

Q & A

Why do we perform feature partitioning?

Feature partitioning is crucial in PGSE for four reasons:

  • Memory reduction. The model is trained on a subset of the features at a time, so memory consumption is reduced and RAM usage stays relatively stable regardless of the total number of features.

  • Parallelisation. Each partition can be trained on a different worker across different nodes. This is particularly useful because XGBoost training takes up most of the time in the training process.

  • Stable hyperparameters. In our experiments, feature dimensionality affected the model's optimal hyperparameters; for example, higher feature dimensionality generally requires a shallower tree depth. PGSE is a dynamic system, and the total number of features can differ in each round. Partitioning the features into similarly sized subsets therefore helps to minimise the impact of feature dimensionality on the model's hyperparameters.

  • Preserved feature importance. Feature partitioning helps to preserve the feature importance information from XGBoost. Likely due to the pruning process, more feature importance values are lost (become 0) as the dimensionality increases.
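The idea of splitting the features into similarly sized subsets can be illustrated with a short, standard-library sketch. This is not PGSE's actual implementation: `partition_features`, the size cap, and the seeded shuffle are assumptions made for illustration only.

```python
import random


def partition_features(feature_ids, max_partition_size, seed=0):
    """Shuffle feature ids and split them into near-equal partitions."""
    ids = list(feature_ids)
    random.Random(seed).shuffle(ids)
    # Smallest number of partitions that respects the size cap (ceiling division).
    n_parts = max(1, -(-len(ids) // max_partition_size))
    # Round-robin slicing keeps partition sizes within one of each other.
    return [ids[i::n_parts] for i in range(n_parts)]
```

Each partition could then be handed to a separate worker that trains one model on those columns only, so that peak memory depends on the partition size rather than on the total number of features.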

Why do we eliminate features?

If segment A is extended into segment B, A becomes a subsequence of B. For pairs like A and B, we only need to keep the one with the higher feature importance. Extension and elimination are the two crucial parts of the PGSE system: extension grows the genome segments longer, and elimination guarantees that the growth eventually stops. Elimination also guarantees that the system converges, because the feature dimensionality starts decreasing at some point and keeps doing so until all features stop growing.
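The pairwise rule described above can be sketched as a greedy filter over segments ranked by importance. This is an illustration under stated assumptions, not PGSE's actual code: `eliminate_nested` is a hypothetical name, segments are plain strings, and "A is a subsequence of B" is interpreted as A being a contiguous substring of B.

```python
def eliminate_nested(importance):
    """Keep, for nested segment pairs, only the segment with higher importance.

    `importance` maps a segment string to its feature importance.
    """
    kept = []
    # Visit segments from most to least important (ties broken alphabetically).
    for seg in sorted(importance, key=lambda s: (-importance[s], s)):
        # Drop a segment that contains, or is contained in, a kept segment.
        if not any(seg in other or other in seg for other in kept):
            kept.append(seg)
    return {seg: importance[seg] for seg in kept}


scores = {"ACGT": 0.2, "ACGTAC": 0.5, "TTGA": 0.1, "TTGACC": 0.05}
survivors = eliminate_nested(scores)
```

In this example the extended segment `ACGTAC` replaces `ACGT`, while `TTGA` is kept over its less important extension `TTGACC`, so the number of features shrinks from four to two.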
