Pango-collapse
CLI to collapse Pango lineages for reporting
Install
Install from PyPI with pip.
pip install pango-collapse
Usage
pango-collapse takes a CSV file of SARS-CoV-2 samples (input.csv) with a column (default Lineage) indicating the Pango lineage of the samples (e.g. output from pangoLEARN, Nextclade, USHER, etc.).
$ cat input.csv
Lineage |
---|
BA.5.2.1 |
BA.4.6 |
BE.1 |
pango-collapse will collapse lineages up to the first user-defined parent lineage (specified in a text file with --collapse-file). If the sample lineage has no parent lineage in the user-defined collapse file, the compressed lineage will be returned. To collapse up to either A or B, add A and B to the collapse file. By default (i.e. if no collapse file is specified) pango-collapse uses the collapse file shipped with the installed version of pango-collapse. Use --latest to load the latest version of the collapse file from GitHub at run time.
$ cat collapse.txt
BA.5
BE.1
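The collapsing rule described above can be sketched in Python. This is a minimal illustration, not the pango-collapse implementation: the ALIASES table is a hand-written, hypothetical subset of the Pango alias file, the function names uncompress and collapse are made up for this example, and it assumes that "first parent" means the nearest ancestor listed in the collapse file.

```python
# Hypothetical, hand-written subset of the Pango alias table (illustration only).
ALIASES = {"BA": "B.1.1.529", "BE": "B.1.1.529.5.3.1"}


def uncompress(lineage):
    """Replace the leading alias (e.g. BA) with the lineage it stands for."""
    prefix, _, rest = lineage.partition(".")
    base = ALIASES.get(prefix, prefix)
    return f"{base}.{rest}" if rest else base


def collapse(lineage, collapse_to, strict=False):
    """Return the nearest entry of collapse_to that is the lineage or one of its parents.

    If nothing matches, return the compressed lineage unchanged,
    or None when strict is True.
    """
    # Map each full (uncompressed) target back to the name used in the collapse file.
    targets = {uncompress(name): name for name in collapse_to}
    parts = uncompress(lineage).split(".")
    # Walk from the lineage itself up towards the root, one level at a time.
    while parts:
        candidate = ".".join(parts)
        if candidate in targets:
            return targets[candidate]
        parts.pop()
    return None if strict else lineage


for name in ["BA.5.2.1", "BA.4.6", "BE.1"]:
    print(name, "->", collapse(name, ["BA.5", "BE.1"]))
# BA.5.2.1 -> BA.5
# BA.4.6 -> BA.4.6
# BE.1 -> BE.1
```

BA.4.6 has no parent in the collapse file, so the sketch returns it unchanged; passing strict=True would return None for it instead.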
pango-collapse will produce an output file which is a copy of the input file plus three columns: Lineage_full (the uncompressed lineage), Lineage_family (the parent lineage the sample was collapsed up to) and Lineage_expanded (the expanded lineage format).
pango-collapse input.csv --collapse-file collapse.txt -o output.csv
$ cat output.csv
Lineage | Lineage_full | Lineage_family | Lineage_expanded |
---|---|---|---|
BA.5.2.1 | B.1.1.529.5.2.1 | BA.5 | B.1.1.529:BA.5.2.1 |
BA.4.6 | B.1.1.529.4.6 | BA.4.6 | B.1.1.529:BA.4.6 |
BE.1 | B.1.1.529.5.3.1.1 | BE.1 | B.1.1.529:BA.5.3.1:BE.1 |
Expanded lineage format
The Lineage_expanded column provides a human-readable and searchable version of Pango lineages. The delimiter (:) separates each alias level in the full lineage. You can determine the parental lineages of a lineage in expanded format by reading from right to left. For example, in the lineage B.1.1.529:BA.5.3.1:BE.1 we can see that BE.1 comes from BA.5.3.1, which in turn comes from B.1.1.529.
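Reading an expanded lineage from right to left amounts to splitting on the delimiter and reversing the result. A minimal Python sketch, where parental_chain is a hypothetical helper name and not part of pango-collapse:

```python
def parental_chain(expanded):
    """Return the alias levels of an expanded lineage, most specific first."""
    return expanded.split(":")[::-1]


print(parental_chain("B.1.1.529:BA.5.3.1:BE.1"))
# ['BE.1', 'BA.5.3.1', 'B.1.1.529']
```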
Expanded lineages can be converted to full lineages by removing each delimiter together with the alias letters that follow it. Compressed lineages can be obtained by taking the final component of the expanded lineage.
$ echo "B.1.1.529:BA.5.3.1:BE.1" | sed -E 's/:[A-Za-z]+//g'
B.1.1.529.5.3.1.1 # full lineage
$ echo "B.1.1.529:BA.5.3.1:BE.1" | awk -F: '{print $NF}'
BE.1 # compressed lineage
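The same two conversions can be done in Python with the standard library. This is a sketch that mirrors the sed and awk commands above; the function names are hypothetical and not part of pango-collapse.

```python
import re


def expanded_to_full(expanded):
    """Drop each ':' together with the alias letters that follow it."""
    return re.sub(r":[A-Za-z]+", "", expanded)


def expanded_to_compressed(expanded):
    """Keep only the final component after the last ':'."""
    return expanded.rsplit(":", 1)[-1]


print(expanded_to_full("B.1.1.529:BA.5.3.1:BE.1"))        # B.1.1.529.5.3.1.1
print(expanded_to_compressed("B.1.1.529:BA.5.3.1:BE.1"))  # BE.1
```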
Dropping leading components from an expanded lineage does not change the lineage it names; the omitted parental lineages simply become implicit.
B.1.1.529:BA.5.3.1:BE.1 == BA.5.3.1:BE.1 == BE.1
Lineages in expanded format are easily searched with regex. Exact matches can be found by matching the end of the expanded lineage using the $ anchor, e.g. :BE.1$ to exactly match the BE.1 lineage. Sublineages can be found by simply checking whether the expanded lineage contains the parental lineage of interest.
$ grep ":BA.5" output.csv
BA.5.2.1,B.1.1.529.5.2.1,BA.5,B.1.1.529:BA.5.2.1
BE.1,B.1.1.529.5.3.1.1,BE.1,B.1.1.529:BA.5.3.1:BE.1
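In a regex an unescaped . matches any character, so a pattern such as :BA.5 could also match an unrelated name that merely starts with the same characters. The following Python sketch shows one stricter way to run both searches; the helper names are hypothetical, and the BA.50.1 value in the last check is an invented lineage used only to show the boundary handling.

```python
import re


def is_exact(expanded, lineage):
    """True if the expanded lineage names exactly this lineage."""
    return re.search(rf"(^|:){re.escape(lineage)}$", expanded) is not None


def is_in_family(expanded, lineage):
    """True if the expanded lineage is this lineage or one of its sublineages."""
    # The lineage must start a component and be followed by '.', ':' or the end.
    return re.search(rf"(^|:){re.escape(lineage)}(\.|:|$)", expanded) is not None


print(is_exact("B.1.1.529:BA.5.3.1:BE.1", "BE.1"))      # True
print(is_in_family("B.1.1.529:BA.5.3.1:BE.1", "BA.5"))  # True
print(is_in_family("B.1.1.529:BA.4.6", "BA.5"))         # False
print(is_in_family("B.1.1.529:BA.50.1", "BA.5"))        # False (invented lineage)
```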
Nextclade example
This example shows how to use some of the pango-collapse features by collapsing the Pango lineages in the output from Nextclade.
Produce a nextclade.tsv file from a nextclade analysis (there is an example file in tests/data).
We are only interested in the major sublineages of Omicron, i.e. BA.1-BA.5. We can therefore make a collapse file with the following:
$ cat collapse.txt
BA.1
BA.2
BA.3
BA.4
BA.5
Note: BA is an alias of B.1.1.529. However, as we have not included B.1.1.529 in our collapse file, any samples designated B.1.1.529 will not be included.
Run the following command to collapse the omicron sub-lineages:
pango-collapse -c collapse.txt -o nextclade_collapsed_omicron.tsv -l Nextclade_pango --strict nextclade.tsv
The -l (--lineage-column) flag tells pango-collapse to look for the compressed lineage in the Nextclade_pango column in the nextclade.tsv file.
The --strict flag tells pango-collapse to use strict mode, i.e. only report lineages in the collapse file. If the lineage cannot be collapsed then no value is returned in the collapse column.
We can visualise the results in pandas:
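import pandas as pd
# Load the collapsed output written by the pango-collapse command above.
df = pd.read_csv("nextclade_collapsed_omicron.tsv", sep="\t")
# In strict mode, lineages outside the collapse file are empty; label them "Other".
df["Lineage_family"] = df["Lineage_family"].fillna("Other")
# Bar chart of sample counts per lineage family (plotting requires matplotlib).
df["Lineage_family"].value_counts().plot(kind="bar")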
--help
Usage: pango-collapse [OPTIONS] INPUT
Collapse Pango sublineages up to user defined parent lineages.
╭─ Arguments ─────────────────────────────────────────────────────────────────────╮
│ * input FILE Path to input CSV/TSV with Lineage column. │
│ [default: None] │
│ [required] │
╰─────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────╮
│ * --output -o FILE Path to output CSV/TSV with Lineage │
│ column. │
│ [default: None] │
│ [required] │
│ --collapse-file -c PATH Path to collapse file with lineages (one │
│ per line) to collapse up to. Defaults to │
│ collapse file shipped with this version │
│ of pango-collapse. │
│ [default: │
│ /Users/wwirth/Library/CloudStorage/OneD… │
│ --lineage-column -l TEXT Column to extract from input file for │
│ lineage. │
│ [default: Lineage] │
│ --full-column -f TEXT Column to use for the uncompressed │
│ output. │
│ [default: Lineage_full] │
│ --collapse-column -k TEXT Column to use for the collapsed output. │
│ [default: Lineage_family] │
│ --expand-column -e TEXT Column to use for the expanded output. │
│ [default: Lineage_expanded] │
│ --alias-file -a PATH Path to Pango Alias file for │
│ pango_aliasor. Will download latest file │
│ if not supplied. │
│ [default: None] │
│ --strict -s If a lineage is not in the collapse file │
│ return None instead of the compressed │
│ lineage. │
│ --latest -u Load the collapse from from a url │
│ (--url). │
│ --url TEXT Url to use when loading the collapse │
│ file with --latest. │
│ [default: │
│ https://raw.githubusercontent.com/MDU-P… │
│ --version -v Print the current version number and │
│ exit. │
│ --install-completion Install completion for the current │
│ shell. │
│ --show-completion Show completion for the current shell, │
│ to copy it or customize the │
│ installation. │
│ --help -h Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────╯