Hierarchical Clustering of cgMLST profiles

.. image:: https://img.shields.io/pypi/v/phiercc.svg
   :alt: pHierCC on the Python Package Index (PyPI)
   :target: https://pypi.python.org/pypi/phiercc

.. image:: https://img.shields.io/conda/vn/zhemin/phiercc.svg
   :alt: pHierCC on the Anaconda Cloud
   :target: https://anaconda.org/zhemin/phiercc

Hosted by

.. image:: https://warwick.ac.uk/fac/sci/med/research/biomedical/mi/enterobase/enterobase.jpg?maxWidth=300
   :alt: The EnteroBase Website
   :target: https://enterobase.warwick.ac.uk

HierCC (Hierarchical clustering of cgMLST)

HierCC is a multi-level clustering scheme for population assignments based on core genome Multi-Locus Sequence Types (cgMLSTs). HierCC has been implemented in EnteroBase <https://enterobase.warwick.ac.uk>_ since 2018.

pHierCC

pHierCC is an independent Python package that generates and evaluates a HierCC scheme based on any cgMLST scheme. pHierCC is open source software made available under the GPL-3.0 License <https://github.com/zheminzhou/HierCC/blob/master/LICENSE>_.

  • If you use pHierCC in work contributing to a scientific publication, we ask that you cite our preprint below:

Zhou Z, Charlesworth J, Achtman M (2020) HierCC: A multi-level clustering scheme for population assignments based on core genome MLST. bioRxiv. DOI: https://doi.org/10.1101/2020.11.25.397539

  • If you use HierCC assignments that are hosted in EnteroBase, we ask that you cite our publication:

Zhou Z, Alikhan NF, Mohamed K, the Agama Study Group, Achtman M (2020) The EnteroBase user's guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny and Escherichia core genomic diversity. Genome Res. 30:138-152. DOI: https://dx.doi.org/10.1101%2Fgr.251678.119

Installation

  • With Python 3.6 or later, pHierCC can be installed and upgraded directly via pip, each with a single terminal command::

    pip install pHierCC
    pip install --upgrade pHierCC

  • pHierCC is also made available as an Anaconda package, and can be installed via conda with the following command::

    conda install -c zhemin phiercc

Alternatively, you may wish to download the GitHub repo and install the dependencies yourself as shown below.

Python version

pHierCC is currently supported and tested on three Python versions:

  • 3.6
  • 3.7
  • 3.8 (recommended)

Python 3.9 is currently NOT supported, because Numba, one of the libraries that pHierCC depends on, is not compatible with Python 3.9. This issue is expected to be resolved in early 2021 according to this thread <https://github.com/numba/numba/issues/6345>_.

Python libraries

pHierCC requires:

  • numpy <https://numpy.org/>_ (>=1.18.1)
  • scipy <https://www.scipy.org/>_ (>=1.3.2)
  • pandas <https://pandas.pydata.org/>_ (>=0.24.2)
  • numba <https://numba.pydata.org/>_ (>=0.38.0)
  • scikit-learn <https://scikit-learn.org/>_ (>=0.23.1)
  • matplotlib <https://matplotlib.org/>_ (>=3.2.1)
  • Click <https://click.palletsprojects.com/en/7.x/>_ (>=7.0)
  • SharedArray <https://pypi.org/project/SharedArray/>_ (>=3.2.1)

Download dataset

A toy dataset of cgMLST profiles is hosted in this repository. It can be downloaded using this command::

    curl -o YERwgMLST.cgMLSTv1.profile.gz https://raw.githubusercontent.com/zheminzhou/pHierCC/master/examples/YERwgMLST.cgMLSTv1.profile.gz
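
Once downloaded, the profile can be inspected from Python. The sketch below is not part of pHierCC. It assumes only what the usage text states (a tab-separated table whose first column holds the ST numbers and whose remaining columns hold allelic numbers, optionally gzipped) and additionally assumes that the first row is a header. The function name ``read_profiles`` is a hypothetical helper.

```python
import csv
import gzip


def read_profiles(path):
    """Read a tab-separated allelic profile table, plain text or gzipped.

    Returns (header, rows): header is the list of column names (assumed to be
    the first row), and rows maps each ST identifier (first column) to the
    list of its allele values, kept as strings.
    """
    # Choose the opener from the file extension; both are opened in text mode.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = {fields[0]: fields[1:] for fields in reader if fields}
    return header, rows
```

For example, ``header, rows = read_profiles("YERwgMLST.cgMLSTv1.profile.gz")`` would give the number of STs as ``len(rows)`` and the number of loci as ``len(header) - 1``.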

Run pHierCC

pHierCC can be run on the toy dataset using the following command::

    pHierCC -p YERwgMLST.cgMLSTv1.profile.gz -o YERwgMLST.cgMLSTv1.HierCC

And the full usage of pHierCC is::

    $ pHierCC --help
    Usage: pHierCC [OPTIONS]

      pHierCC takes a file containing allelic profiles (as in
      https://pubmlst.org/data/) and works out hierarchical clusters of the full
      dataset based on a minimum-spanning tree.

    Options:
      -p, --profile TEXT           [INPUT] name of a profile file consisting of a
                                   table of columns of the ST numbers and the
                                   allelic numbers, separated by tabs. Can be
                                   GZIPped.  [required]

      -o, --output TEXT            [OUTPUT] Prefix for the output files consisting
                                   of a  NUMPY and a TEXT version of the
                                   clustering result.   [required]

      -a, --append TEXT            [INPUT; optional] The NPZ output of a previous
                                   pHierCC run (Default: None).

      -m, --allowed_missing FLOAT  [INPUT; optional] Allowed proportion of missing
                                   genes in pairwise comparisons (Default: 0.03).

      -n, --n_proc INTEGER         [INPUT; optional] Number of processes (CPUs) to
                                   use (Default: 4).

      --help                       Show this message and exit.

pHierCC inputs

pHierCC runs in two modes. 'Development mode' builds a multi-level hierarchical clustering scheme from scratch, whilst 'Production mode' assigns new in-coming genomes to clusters incrementally, without changing the cluster assignments of any existing genome. You can find technical details in the Supplementary Text of the bioRxiv preprint <https://doi.org/10.1101/2020.11.25.397539>_.

  • 'Development mode' requires only one file (--profile) containing allelic profiles of cgMLST STs, in either plain text or GZIP format. You can find additional examples of allelic profiles in https://pubmlst.org/data.
  • 'Production mode' is triggered when an additional option, '--append', is provided with an NPZ file containing a pre-existing multi-level assignment, which is part of the output (see below) of a previous pHierCC run.
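
The difference between the two modes comes down to whether ``--append`` is passed. The sketch below builds the argument list for either mode using only the options listed in the usage text above. The helper name ``phiercc_command`` and the file names in the production-mode example are hypothetical placeholders, not files shipped with pHierCC.

```python
import subprocess


def phiercc_command(profile, output_prefix, previous_npz=None, n_proc=4):
    """Build the argument list for a pHierCC run.

    Without previous_npz this is a 'Development mode' run. Passing the NPZ
    output of an earlier run adds --append (-a), which triggers
    'Production mode'.
    """
    cmd = ["pHierCC", "-p", profile, "-o", output_prefix, "-n", str(n_proc)]
    if previous_npz is not None:
        cmd += ["-a", previous_npz]
    return cmd


# Development mode: build the scheme from scratch on the toy dataset.
dev = phiercc_command("YERwgMLST.cgMLSTv1.profile.gz", "YERwgMLST.cgMLSTv1.HierCC")

# Production mode: the three file names here are hypothetical placeholders.
prod = phiercc_command("new_profiles.gz", "updated.HierCC",
                       previous_npz="previous_run.npz")

# Uncomment to execute (requires pHierCC to be installed and on the PATH):
# subprocess.run(dev, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.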

pHierCC outputs

Both modes of pHierCC generate two outputs:

  • .npz
  • .HierCC.gz

Both output files contain the same multi-level clustering assignments for every cgMLST ST. The NPZ file is used as input for running pHierCC in production mode, whilst the HierCC.gz file is human readable. The first three lines of the .HierCC.gz file look like this::

#ST_id

The first column is the cgMLST ST, and the remaining columns are the clustering results, from almost identical (HC0) to completely different.
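
As an illustration, the text output can be loaded to list which STs share a cluster at a given level. This is a minimal sketch, not part of pHierCC. It assumes the file is tab-separated, that its first row is a header whose first field is the ST column (``#ST_id``), and that each remaining column is named after its level (for example ``HC0``). The helper name ``clusters_at_level`` is hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def clusters_at_level(path, level):
    """Group ST identifiers by their cluster assignment at one HierCC level.

    path  -- gzipped text output of pHierCC (assumed tab-separated)
    level -- name of the level column to group by, e.g. "HC0"
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        st_column = reader.fieldnames[0]  # expected to be "#ST_id"
        groups = defaultdict(list)
        for row in reader:
            groups[row[level]].append(row[st_column])
    return dict(groups)
```

The returned dictionary maps each cluster identifier at the chosen level to the list of STs assigned to it, so ``len(clusters_at_level(path, level))`` is the number of clusters at that level.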

Run HCCeval

HCCeval evaluates the thousands of clustering levels generated by pHierCC and identifies potentially biologically meaningful clustering levels. HCCeval can be run on the HierCC results of the toy dataset with the following command::

    HCCeval -p YERwgMLST.cgMLSTv1.profile.gz -c YERwgMLST.cgMLSTv1.HierCC.HierCC.gz -o YERwgMLST.cgMLSTv1.HierCC.eval

And the full usage of HCCeval is::

    $ HCCeval --help
    Usage: HCCeval [OPTIONS]

      evalHCC evaluates a HierCC scheme using varied statistic summaries.

    Options:
      -p, --profile TEXT      [INPUT] Name of a profile file consisting of a table
                              of columns of the ST numbers and the allelic
                              numbers, separated by tabs. Can be GZIPped.
                              [required]

      -c, --cluster TEXT      [INPUT] Name of the pHierCC text output. Can be
                              GZIPped.  [required]

      -o, --output TEXT       [OUTPUT] Prefix for the two output files.
                              [required]

      -s, --stepwise INTEGER  [INPUT; optional] Evaluate every <stepwise> levels
                              (Default: 10).

      -n, --n_proc INTEGER    [INPUT; optional] Number of processes (CPUs) to use
                              (Default: 4).

      --help                  Show this message and exit.

HCCeval inputs

HCCeval requires two inputs:

  • (--profile) A file containing allelic profiles, in plain text or gzipped (see pHierCC inputs <README.rst#phiercc-inputs>_).
  • (--cluster) The human readable .HierCC.gz output by pHierCC (see pHierCC outputs <README.rst#phiercc-outputs>_).

HCCeval outputs

HCCeval generates two outputs of the same evaluation results:

  • .val.tsv
  • .val.pdf

The PDF file is a visualization of the TSV file. You can find examples of the PDF outputs in the supplemental Figure S1 <https://www.biorxiv.org/content/biorxiv/early/2020/11/26/2020.11.25.397539/DC1/embed/media-1.pdf>_ of the preprint. Both files contain two statistical evaluations of the clustering levels:

  1. Normalized Mutual Information (NMI) <https://en.wikipedia.org/wiki/Mutual_information>_ (Kvalseth TO 1987 <https://ieeexplore.ieee.org/abstract/document/4309069>_). NMI measures the similarity of two different clusterings of a dataset as a harmonic mean of homogeneity and completeness. It is similar to the better-known Rand Index, but gives more accurate estimates for datasets <https://jmlr.csail.mit.edu/papers/volume17/15-627/15-627>_ that contain many small clusters, which is often the case for HierCC clustering. HCCeval calculates an NMI score for each pairwise combination of HierCC levels based on the clustering of cgSTs at each level.

  2. Silhouette score <https://en.wikipedia.org/wiki/Silhouette_(clustering)>_ (Rousseeuw PJ 1987 <https://www.sciencedirect.com/science/article/pii/0377042787901257>_). The Silhouette score estimates the cohesiveness of a clustering result by measuring how similar a cgST is to the cgSTs within its own cluster (cohesion) compared with cgSTs in other clusters (separation). The Silhouette score ranges between -1 and +1, where a high value indicates a robust clustering.
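
Both statistics can be illustrated in a few lines of plain Python. This is a textbook sketch and makes no claim about how HCCeval computes them internally. The NMI here is normalised by the arithmetic mean of the two entropies, which is algebraically the same as the harmonic mean of homogeneity and completeness. The Silhouette function takes a full pairwise distance matrix, needs at least two clusters, and gives singleton clusters a score of 0, which is a common convention assumed here. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """NMI of two clusterings, normalised by the mean of their entropies."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = entropy(labels_a) + entropy(labels_b)
    # Two single-cluster partitions carry no entropy; treat them as identical.
    return 1.0 if denom == 0 else 2 * mutual_info / denom


def silhouette(distance, labels):
    """Mean Silhouette score from a full pairwise distance matrix.

    Requires at least two distinct clusters in `labels`.
    """
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = members[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(distance[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster (separation)
        b = min(
            sum(distance[i][j] for j in other) / len(other)
            for other_lab, other in members.items() if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(labels)
```

scikit-learn, one of the listed dependencies, provides equivalent functions as ``sklearn.metrics.normalized_mutual_info_score`` and ``sklearn.metrics.silhouette_score``, which are preferable for real datasets.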

In practice, 'stable blocks' are identified from HierCC clustering using NMI. Each stable block of NMI scores consists of a continuous set of HierCC levels that define highly similar clusters (NMI >= 0.9). This indicates that the clusters generated by these HierCC levels are robust to modest changes of the clustering thresholds. The most cohesive HierCC level in each stable block (i.e., the level within each block with the greatest Silhouette score) is likely to represent natural microbial population structure.
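
One possible way to turn this description into a procedure is sketched below. The exact rule for delimiting blocks is not specified here, so the greedy rule used (a level joins the current block only if its NMI with every level already in the block reaches the threshold) is an assumption, as are the function names and the ``min_size`` parameter that discards single-level blocks.

```python
def stable_blocks(nmi_matrix, threshold=0.9):
    """Greedily split level indices 0..n-1 into contiguous blocks.

    nmi_matrix[i][j] is the NMI between evaluated levels i and j, ordered by
    increasing HierCC level. A level is added to the current block only if
    its NMI with every level already in that block is >= threshold.
    """
    blocks = []
    current = []
    for level in range(len(nmi_matrix)):
        if current and all(nmi_matrix[level][k] >= threshold for k in current):
            current.append(level)
        else:
            if current:
                blocks.append(current)
            current = [level]
    if current:
        blocks.append(current)
    return blocks


def best_level_per_block(blocks, silhouette_scores, min_size=2):
    """Return, for each block of at least min_size levels, the index of the
    level with the greatest Silhouette score."""
    return [
        max(block, key=lambda level: silhouette_scores[level])
        for block in blocks
        if len(block) >= min_size
    ]
```

With the default ``--stepwise`` of 10, index ``i`` in these lists would correspond to every tenth HierCC level, so the returned indices need to be mapped back to level names before interpretation.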
