Hierarchical Clustering of cgMLST profiles

.. image:: https://img.shields.io/pypi/v/phiercc.svg
   :alt: pHierCC on the Python Package Index (PyPI)
   :target: https://pypi.python.org/pypi/phiercc

.. image:: https://img.shields.io/conda/vn/zhemin/phiercc.svg
   :alt: pHierCC on the Anaconda Cloud
   :target: https://anaconda.org/zhemin/phiercc

Hosted by

.. image:: https://warwick.ac.uk/fac/sci/med/research/biomedical/mi/enterobase/enterobase.jpg?maxWidth=300
   :alt: The EnteroBase Website
   :target: https://enterobase.warwick.ac.uk

HierCC (Hierarchical clustering of cgMLST)

HierCC is a multi-level clustering scheme for population assignments based on core genome Multi-Locus Sequence Types (cgMLSTs). HierCC has been implemented in `EnteroBase <https://enterobase.warwick.ac.uk>`_ since 2018.

pHierCC

pHierCC is an independent Python package that generates and evaluates a HierCC scheme based on any cgMLST scheme. pHierCC is open-source software made available under the `GPL-3.0 License <https://github.com/zheminzhou/HierCC/blob/master/LICENSE>`_.

  • If you use pHierCC in work contributing to a scientific publication, we ask that you cite our preprint below:

Zhou Z, Charlesworth J, Achtman M (2020) HierCC: A multi-level clustering scheme for population assignments based on core genome MLST. bioRxiv. DOI: https://doi.org/10.1101/2020.11.25.397539

  • If you use HierCC assignments that are hosted in EnteroBase, we ask that you cite our publication:

Zhou Z, Alikhan NF, Mohamed K, the Agama Study Group, Achtman M (2020) The EnteroBase user's guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny and Escherichia core genomic diversity. Genome Res. 30:138-152. DOI: https://doi.org/10.1101/gr.251678.119

Installation

  • With Python 3.6 onwards, pHierCC can be installed or upgraded via pip, each with a single terminal command::

    pip install pHierCC
    pip install --upgrade pHierCC

  • pHierCC is also made available as an Anaconda package, and can be installed via conda with the following command::

    conda install -c zhemin phiercc

Alternatively, you may wish to download the GitHub repo and install the dependencies yourself as shown below.

Python version

pHierCC is currently supported and tested on three Python versions:

  • 3.6
  • 3.7
  • 3.8 (recommended)

Python 3.9 is currently NOT supported because Numba, one of the libraries that pHierCC depends on, is not compatible with Python 3.9. This issue is expected to be resolved in early 2021, according to `this thread <https://github.com/numba/numba/issues/6345>`_.

Python libraries

pHierCC requires:

  • `numpy <https://numpy.org/>`_ (>=1.18.1)
  • `scipy <https://www.scipy.org/>`_ (>=1.3.2)
  • `pandas <https://pandas.pydata.org/>`_ (>=0.24.2)
  • `numba <https://numba.pydata.org/>`_ (>=0.38.0)
  • `scikit-learn <https://scikit-learn.org/>`_ (>=0.23.1)
  • `matplotlib <https://matplotlib.org/>`_ (>=3.2.1)
  • `Click <https://click.palletsprojects.com/en/7.x/>`_ (>=7.0)
  • `SharedArray <https://pypi.org/project/SharedArray/>`_ (>=3.2.1)

Download dataset

A toy dataset of cgMLST profiles is hosted in this repository. It can be downloaded using this command::

    curl -o YERwgMLST.cgMLSTv1.profile.gz https://raw.githubusercontent.com/zheminzhou/pHierCC/master/examples/YERwgMLST.cgMLSTv1.profile.gz

Run pHierCC

pHierCC can be run on the toy dataset using the following command::

    pHierCC -p YERwgMLST.cgMLSTv1.profile.gz -o YERwgMLST.cgMLSTv1.HierCC

And the full usage of pHierCC is::

    $ pHierCC --help
    Usage: pHierCC [OPTIONS]

      pHierCC takes a file containing allelic profiles (as in
      https://pubmlst.org/data/) and works out hierarchical clusters of the full
      dataset based on a minimum-spanning tree.

    Options:
      -p, --profile TEXT           [INPUT] name of a profile file consisting of a
                                   table of columns of the ST numbers and the
                                   allelic numbers, separated by tabs. Can be
                                   GZIPped.  [required]

      -o, --output TEXT            [OUTPUT] Prefix for the output files consisting
                                   of a  NUMPY and a TEXT version of the
                                   clustering result.   [required]

      -a, --append TEXT            [INPUT; optional] The NPZ output of a previous
                                   pHierCC run (Default: None).

      -m, --allowed_missing FLOAT  [INPUT; optional] Allowed proportion of missing
                                   genes in pairwise comparisons (Default: 0.03).

      -n, --n_proc INTEGER         [INPUT; optional] Number of processes (CPUs) to
                                   use (Default: 4).

      --help                       Show this message and exit.

pHierCC inputs

pHierCC runs in two modes. 'Development mode' builds a multi-level hierarchical clustering scheme from scratch, whilst 'Production mode' assigns new incoming genomes to clusters incrementally, without changing the cluster assignments of any existing genome. You can find technical details in the Supplementary Text of the `bioRxiv preprint <https://doi.org/10.1101/2020.11.25.397539>`_.

  • 'Development mode' requires only one file (--profile) containing allelic profiles of cgMLST STs, in either plain text or GZIP format. You can find additional examples of allelic profiles at https://pubmlst.org/data.
  • 'Production mode' is triggered when an additional option, '--append', is provided with an NPZ file containing a pre-existing multi-level assignment, which is part of the output (see below) of a previous pHierCC run.
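The choice between the two modes comes down to whether ``--append`` is passed. The sketch below is only an illustration of that rule: ``phiercc_command`` is a hypothetical helper (not part of pHierCC) that assembles an argument list from the options shown in the ``--help`` output above. The production-mode file names are made up, and the NPZ name assumes that the output prefix is followed by ``.npz``.

```python
from typing import List, Optional


def phiercc_command(profile: str, output_prefix: str,
                    append_npz: Optional[str] = None,
                    allowed_missing: float = 0.03,
                    n_proc: int = 4) -> List[str]:
    """Build a pHierCC argument list (hypothetical helper).

    Leaving append_npz as None gives development mode; passing the NPZ
    file of a previous run adds --append and so selects production mode.
    """
    cmd = ["pHierCC", "-p", profile, "-o", output_prefix,
           "-m", str(allowed_missing), "-n", str(n_proc)]
    if append_npz is not None:
        cmd += ["-a", append_npz]
    return cmd


# Development mode: build a scheme from scratch on the toy dataset.
dev_cmd = phiercc_command("YERwgMLST.cgMLSTv1.profile.gz",
                          "YERwgMLST.cgMLSTv1.HierCC")

# Production mode: file names here are illustrative assumptions.
prod_cmd = phiercc_command("new_profiles.gz", "updated.HierCC",
                           append_npz="YERwgMLST.cgMLSTv1.HierCC.npz")

print(" ".join(dev_cmd))
print(" ".join(prod_cmd))
```

Either list can then be handed to ``subprocess.run(cmd, check=True)`` once pHierCC is installed.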

pHierCC outputs

Both modes of pHierCC generate two outputs:

  • ``<prefix>.npz``
  • ``<prefix>.HierCC.gz``

Here ``<prefix>`` is the value given to --output.

Both output files contain the same multi-level clustering assignments for every cgMLST ST. The NPZ file is used as input for running pHierCC in production mode, whilst the HierCC.gz file is human readable. The first three lines of the .HierCC.gz file look like this (columns HC3 to HC1550 are abbreviated as ``...``)::

    #ST_id  HC0  HC1  HC2  ...  HC1551  HC1552  HC1553
    1       1    1    1    ...  1       1       1
    2       2    2    2    ...  1       1       1

The first column is the cgMLST ST, and the remaining columns are the clustering results, from almost identical (HC0) to completely different.
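To look up cluster assignments programmatically, the text output can be parsed with the standard library alone. This is a minimal sketch, assuming the layout shown above: a header line starting with ``#``, then one row per ST with whitespace-separated fields. ``read_hiercc`` and ``cluster_of`` are hypothetical helper names, and the demo file is synthetic rather than real pHierCC output.

```python
import gzip
import os
import tempfile
from typing import Dict, List, Tuple


def read_hiercc(path: str) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (level names, {ST id: cluster ids per level}) from a .HierCC.gz file."""
    levels: List[str] = []
    assignments: Dict[str, List[str]] = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if line.startswith("#"):
                levels = fields[1:]  # drop the "#ST_id" column name
            else:
                assignments[fields[0]] = fields[1:]
    return levels, assignments


def cluster_of(levels: List[str], assignments: Dict[str, List[str]],
               st_id: str, level: str) -> str:
    """Cluster id of one ST at one named level, e.g. 'HC2'."""
    return assignments[st_id][levels.index(level)]


# Synthetic three-level file, for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.HierCC.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("#ST_id\tHC0\tHC1\tHC2\n1\t1\t1\t1\n2\t2\t2\t1\n")

levels, assignments = read_hiercc(demo_path)
print(cluster_of(levels, assignments, "2", "HC0"))
print(cluster_of(levels, assignments, "2", "HC2"))
```

Cluster ids are kept as strings so that nothing is assumed about their numeric range.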

Run HCCeval

HCCeval evaluates the thousands of clustering levels generated by pHierCC and identifies potentially biologically meaningful clustering levels. HCCeval can be run on the HierCC results of the toy dataset with the following command::

    HCCeval -p YERwgMLST.cgMLSTv1.profile.gz -c YERwgMLST.cgMLSTv1.HierCC.HierCC.gz -o YERwgMLST.cgMLSTv1.HierCC.eval

And the full usage of HCCeval is::

    $ HCCeval --help
    Usage: HCCeval [OPTIONS]

      evalHCC evaluates a HierCC scheme using varied statistic summaries.

    Options:
      -p, --profile TEXT      [INPUT] Name of a profile file consisting of a table
                              of columns of the ST numbers and the allelic
                              numbers, separated by tabs. Can be GZIPped.
                              [required]

      -c, --cluster TEXT      [INPUT] Name of the pHierCC text output. Can be
                              GZIPped.  [required]

      -o, --output TEXT       [OUTPUT] Prefix for the two output files.
                              [required]

      -s, --stepwise INTEGER  [INPUT; optional] Evaluate every <stepwise> levels
                              (Default: 10).

      -n, --n_proc INTEGER    [INPUT; optional] Number of processes (CPUs) to use
                              (Default: 4).

      --help                  Show this message and exit.

HCCeval inputs

HCCeval requires two inputs:

  • (--profile) A file containing allelic profiles, in plain text or gzipped (see `pHierCC inputs <README.rst#phiercc-inputs>`_).
  • (--cluster) The human readable .HierCC.gz output by pHierCC (see `pHierCC outputs <README.rst#phiercc-outputs>`_).
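Because the --cluster input is the text output of a previous pHierCC run, its name can be derived from the prefix that was passed to ``pHierCC -o``. The sketch below shows that naming relationship as it appears in the toy-dataset commands. ``hcceval_command`` is a hypothetical helper, not part of the package.

```python
from typing import List


def hcceval_command(profile: str, hiercc_prefix: str, eval_prefix: str,
                    stepwise: int = 10, n_proc: int = 4) -> List[str]:
    """Build an HCCeval argument list from a pHierCC output prefix (hypothetical helper)."""
    # pHierCC writes its text output as "<prefix>.HierCC.gz".
    cluster_file = hiercc_prefix + ".HierCC.gz"
    return ["HCCeval", "-p", profile, "-c", cluster_file, "-o", eval_prefix,
            "-s", str(stepwise), "-n", str(n_proc)]


eval_cmd = hcceval_command("YERwgMLST.cgMLSTv1.profile.gz",
                           "YERwgMLST.cgMLSTv1.HierCC",
                           "YERwgMLST.cgMLSTv1.HierCC.eval")
print(" ".join(eval_cmd))
```

With the default values, this reproduces the toy-dataset command above with ``-s 10 -n 4`` spelled out explicitly.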

HCCeval outputs

HCCeval generates two outputs of the same evaluation results:

  • .val.tsv
  • .val.pdf

The PDF file is a visualization of the TSV file. You can find examples of the PDF outputs in `supplemental Figure S1 <https://www.biorxiv.org/content/biorxiv/early/2020/11/26/2020.11.25.397539/DC1/embed/media-1.pdf>`_ of the preprint. Both files contain two statistical evaluations of the clustering levels:

  1. `Normalized Mutual Information (NMI) <https://en.wikipedia.org/wiki/Mutual_information>`_ (`Kvalseth TO 1987 <https://ieeexplore.ieee.org/abstract/document/4309069>`_). NMI measures the similarity of two different clusterings of a dataset as a harmonic mean of homogeneity and completeness. It is similar to the better-known Rand Index, but `gives more accurate estimates for datasets <https://jmlr.csail.mit.edu/papers/volume17/15-627/15-627>`_ that contain many small clusters, which is often the case for HierCC clustering. HCCeval calculates an NMI score for each pairwise combination of HierCC levels based on the clustering of cgSTs at each level.

  2. `Silhouette score <https://en.wikipedia.org/wiki/Silhouette_(clustering)>`_ (`Rousseeuw PJ 1987 <https://www.sciencedirect.com/science/article/pii/0377042787901257>`_). The Silhouette score estimates the cohesiveness of a clustering result by comparing how similar a cgST is to the other cgSTs within its own cluster (cohesion) with how similar it is to cgSTs in other clusters (separation). The Silhouette score ranges between -1 and +1, where a high value indicates a robust clustering.
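The two statistics can be illustrated with a small pure-Python sketch. This is not HCCeval's implementation. It assumes that NMI is normalised by the arithmetic mean of the two entropies, which is equivalent to the harmonic mean of homogeneity and completeness. It also assumes that the Silhouette score is averaged over all items of a precomputed distance matrix. The labels and distances below are toy values.

```python
from collections import Counter
from math import log
from typing import Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Mutual information of two labellings, normalised by their mean entropy."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((nij / n) * log(n * nij / (ca[i] * cb[j]))
             for (i, j), nij in cab.items())
    ha, hb = entropy(a), entropy(b)
    if ha == 0 and hb == 0:
        return 1.0  # both labellings put everything in one cluster
    return 2 * mi / (ha + hb)


def silhouette(dist: List[List[float]], labels: Sequence[Hashable]) -> float:
    """Mean silhouette from a full pairwise distance matrix (needs >= 2 clusters)."""
    members = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = members[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)  # cohesion
        b = min(sum(dist[i][j] for j in other) / len(other)          # separation
                for other_lab, other in members.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: four items at positions 0, 1, 10 and 11 on a line.
positions = [0, 1, 10, 11]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
good = [1, 1, 2, 2]   # groups nearby items together
bad = [1, 2, 1, 2]    # splits nearby items apart

print(nmi(good, [5, 5, 7, 7]), nmi(good, bad))
print(silhouette(toy_dist, good), silhouette(toy_dist, bad))
```

Two labellings that define the same partition give an NMI of 1 regardless of the cluster names. The well-separated grouping gives a Silhouette score close to +1, while the mixed grouping gives a negative one.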

In practice, 'stable blocks' are identified from HierCC clustering using NMI. Each stable block of NMI scores consists of a continuous set of HierCC levels that define highly similar clusters (NMI >= 0.9). This indicates that the clusters generated by these HierCC levels are robust to modest changes of the clustering thresholds. The most cohesive HierCC level in each stable block (i.e. the level within each block with the greatest Silhouette score) is likely to represent natural microbial population structure.
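One possible reading of this procedure is sketched below. It assumes a square matrix of pairwise NMI scores between evaluated levels and one Silhouette score per level. Blocks are grown greedily: a level joins the current block only if its NMI against every level already in the block reaches the threshold. The function names, the greedy rule and the numbers are illustrative assumptions, not the way HCCeval reports its results.

```python
from typing import List, Sequence, Tuple


def stable_blocks(nmi_matrix: Sequence[Sequence[float]],
                  threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Group consecutive level indices into (first, last) blocks.

    A new block starts whenever the next level has NMI below the
    threshold against any level of the current block.
    """
    n = len(nmi_matrix)
    blocks, start = [], 0
    for k in range(1, n + 1):
        if k == n or any(nmi_matrix[j][k] < threshold for j in range(start, k)):
            blocks.append((start, k - 1))
            start = k
    return blocks


def best_level_per_block(blocks: Sequence[Tuple[int, int]],
                         silhouette_scores: Sequence[float]) -> List[int]:
    """Index of the level with the greatest Silhouette score in each block."""
    return [max(range(lo, hi + 1), key=lambda i: silhouette_scores[i])
            for lo, hi in blocks]


# Made-up scores for four evaluated levels.
toy_nmi = [
    [1.00, 0.95, 0.50, 0.50],
    [0.95, 1.00, 0.50, 0.50],
    [0.50, 0.50, 1.00, 0.97],
    [0.50, 0.50, 0.97, 1.00],
]
toy_silhouette = [0.2, 0.6, 0.7, 0.3]

blocks = stable_blocks(toy_nmi)
picks = best_level_per_block(blocks, toy_silhouette)
print(blocks, picks)
```

In this toy input the first two and the last two levels form separate blocks, and the most cohesive level is picked from each.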
