Visualize lineages over time, with phylogenetic context, based on viral genomes
epiMuller README
About
Author
Jennifer L Havens
Purpose
Visualize lineages over time, with phylogenetic context, based on viral genomes
Language
Python3
Inputs
Alignment, collection date, PANGO lineage, Nextstrain JSON files, and timetree
Source code availability
Documentation availability
Quick start
pip install epimuller
epimuller [-h] [-oDir OUTDIRECTORY] -oP OUTPREFIX -n
INNEXTSTRAIN -m INMETA [-p INPANGOLIN]
[-f TRAITOFINTERSTFILE]
[-k TRAITOFINTERSTKEY]
[-aa AAVOCLIST [AAVOCLIST ...]]
[-t TIMEWINDOW] [-s STARTDATE] [-e ENDDATE]
[-mt MINTIME] [-min MINTOTALCOUNT]
[-c CASES_NAME] [-l {date,time}]
[-lp {Right,Max,Start,End}]
SOME EXAMPLES
Examples for full run
To prepare the input files used in these examples, see Example_CommandsFromScratch.txt
epimuller \
-n inputData/GISAID_NYCPHL_04_29/02_nextstrainResults \
-m inputData/GISAID_NYCPHL_04_29/gisaid_2021_04_30_00_rename.tsv \
-oDir 03_results_NYCPHL_April29 \
-oP 01_defaultAAList \
-c inputData/CITY_US-NY_NYC_outbreakinfo_epidemiology_data_2021-04-30.tsv
epimuller \
-n inputData/GISAID_NYCPHL_04_29/02_nextstrainResults \
-m inputData/GISAID_NYCPHL_04_29/gisaid_2021_04_30_00_rename.tsv \
-oDir 03_results_NYCPHL_April29 \
-oP 02_pangolin \
-c inputData/CITY_US-NY_NYC_outbreakinfo_epidemiology_data_2021-04-30.tsv \
--traitOfInterstFile traits.json \
--traitOfInterstKey lineage \
-lp Max \
-min 100
epimuller \
-n inputData/GISAID_NYCPHL_04_29/02_nextstrainResults \
-m inputData/GISAID_NYCPHL_04_29/gisaid_2021_04_30_00_rename.tsv \
-oDir 03_results_NYCPHL_April29 \
-oP 03_selectedAA \
-c inputData/CITY_US-NY_NYC_outbreakinfo_epidemiology_data_2021-04-30.tsv \
-aa 'SE484K' 'S*452*' \
-min 50 \
-mt 20
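If you drive epimuller from a Python script rather than a shell, the same invocations can be assembled as an argument list. The sketch below is a convenience wrapper, not part of epimuller: the helper names `build_epimuller_args` and `run_epimuller` are hypothetical, and only flags shown in the usage above are used.

```python
import subprocess


def build_epimuller_args(nextstrain_dir, meta_tsv, out_prefix, out_dir=None,
                         cases_tsv=None, aa_list=None, min_total=None,
                         min_time=None):
    """Assemble an argv list for the epimuller CLI (hypothetical helper)."""
    # -n, -m and -oP are the required flags in the usage line.
    args = ["epimuller", "-n", nextstrain_dir, "-m", meta_tsv, "-oP", out_prefix]
    if out_dir is not None:
        args += ["-oDir", out_dir]
    if cases_tsv is not None:
        args += ["-c", cases_tsv]
    if aa_list:
        # -aa takes one or more patterns, e.g. "SE484K" or "S*452*".
        args += ["-aa", *aa_list]
    if min_total is not None:
        args += ["-min", str(min_total)]
    if min_time is not None:
        args += ["-mt", str(min_time)]
    return args


def run_epimuller(args):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(args, check=True)
```

Passing the arguments as a list, without a shell, means patterns such as `S*452*` are not subject to shell glob expansion and need no quoting.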
Known edge cases / features to add
Known edge cases that are not yet handled correctly, and features I intend to add. If you run into anything else, please let me know at https://github.com/jennifer-bio/epimuller
- nt_muts: not set up for nucleotide mutations (only amino acid mutations or traits)
- only takes Nextstrain JSON files; I intend to add support for TreeTime output
- feel free to ignore the undefined.svg file that gets made; it is a by-product of measuring text size to space out labels
- add plot and font size as command-line arguments
Additional features
Color
To specify the color of a clade: in the --parentHierarchy_name file (an input to drawMuller.py), add a column named "color" containing a hex color value (starting with #) for each clade you want to color.
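One way to add that column without editing the file by hand is sketched below. This is an illustration, not epimuller code: the function name `add_color_column` is hypothetical, the name of the column that holds clade names is passed in as `clade_col` because the README does not state it, and only six-digit `#RRGGBB` values are accepted here. Clades you do not list are left blank; whether drawMuller.py accepts blank cells is an assumption you should check against your own run.

```python
import csv
import re

HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def add_color_column(in_csv, out_csv, clade_col, colors):
    """Copy a hierarchy CSV, adding a "color" column for the clades in `colors`.

    clade_col: name of the column holding clade names (assumption: check your file).
    colors: dict mapping clade name -> hex color such as "#1f77b4".
    """
    for clade, value in colors.items():
        if not HEX_COLOR.match(value):
            raise ValueError(f"{clade}: {value!r} is not a #RRGGBB hex color")

    with open(in_csv, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        if clade_col not in fieldnames:
            raise KeyError(f"column {clade_col!r} not found in {in_csv}")
        rows = list(reader)

    if "color" not in fieldnames:
        fieldnames.append("color")

    with open(out_csv, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            # Requested colors win; otherwise keep any existing value, else blank.
            row["color"] = colors.get(row[clade_col], row.get("color") or "")
            writer.writerow(row)
```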
Plot and font size
In scripts/drawMuller.py, near the top of the script, change the values of the WIDTH, HEIGHT, LEGENDWIDTH (space on the right side of the plot for labels), MARGIN, or FONTSIZE variables as desired. Then run from the source code with:
> python scripts/drawMuller.py [Arguments]
Parse GISAID fasta for metadata
epimuller-parse
If you have downloaded sequences from GISAID under the Search tab, you can parse the sequence names into a metadata file (format tested as of 2021-04-30)
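To make the idea concrete, here is a minimal sketch of pulling a strain name and collection date out of FASTA headers. It is not the implementation of epimuller-parse. It assumes headers of the form `>name|EPI_ISL_id|YYYY-MM-DD`, which is an assumption about the download format that you should verify against your own file. The function names and the `gisaid_epi_isl` column name are hypothetical.

```python
import csv


def parse_gisaid_headers(fasta_lines):
    """Yield dicts with 'strain', 'gisaid_epi_isl' and 'date' from FASTA headers.

    Assumes each header looks like '>name|EPI_ISL_id|YYYY-MM-DD' (an assumption).
    """
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].strip().split("|")
        if len(fields) < 3:
            continue  # header does not fit the assumed layout; skip, do not guess
        yield {"strain": fields[0], "gisaid_epi_isl": fields[1], "date": fields[2]}


def write_metadata_tsv(records, handle):
    """Write the records as a TSV with the 'strain' and 'date' columns -m expects."""
    writer = csv.DictWriter(handle, fieldnames=["strain", "gisaid_epi_isl", "date"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
```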
ARGUMENTS
optional arguments:
-h, --help show this help message and exit
Options for full report:
-oDir OUTDIRECTORY, --outDirectory OUTDIRECTORY
folder for output (default: ./)
-oP OUTPREFIX, --outPrefix OUTPREFIX
prefix of out files within outDirectory (default:
None)
Options passed to epimuller-define:
-n INNEXTSTRAIN, --inNextstrain INNEXTSTRAIN
nextstrain results with tree.nwk and
[traitOfInterst].json (default: None)
-m INMETA, --inMeta INMETA
metadata tsv with 'strain' and 'date' cols, optional:
cols of trait of interest; and pangolin col named:
'lineage' or 'pangolin_lin' (default: None)
-p INPANGOLIN, --inPangolin INPANGOLIN
pangolin output lineage_report.csv file, if argument
not supplied looks in inMeta for col with
'pangolin_lin' or 'lineage' (default: metadata)
-f TRAITOFINTERSTFILE, --traitOfInterstFile TRAITOFINTERSTFILE
name of nextstrain [traitOfInterst].json in
'inNextstrain' folder (default: aa_muts.json)
-k TRAITOFINTERSTKEY, --traitOfInterstKey TRAITOFINTERSTKEY
key for trait of interest in json file (default:
aa_muts)
-aa AAVOCLIST [AAVOCLIST ...], --aaVOClist AAVOCLIST [AAVOCLIST ...]
list of aa of interest in form
[GENE][*ORAncAA][site][*ORtoAA] ex. S*501*, gaps
represented by X (default: None)
-t TIMEWINDOW, --timeWindow TIMEWINDOW
number of days for sampling window (default: 7)
-s STARTDATE, --startDate STARTDATE
start date in iso format YYYY-MM-DD or 'firstDate'
which sets start date to first date in metadata
(default: 2020-03-01)
-e ENDDATE, --endDate ENDDATE
end date in iso format YYYY-MM-DD or 'lastDate' which
sets end date as last date in metadata (default:
lastDate)
Options passed to epimuller-draw:
-mt MINTIME, --MINTIME MINTIME
minimum time point to start plotting (default: 30)
-min MINTOTALCOUNT, --MINTOTALCOUNT MINTOTALCOUNT
minimum total count for group to be included (default:
10)
-c CASES_NAME, --cases_name CASES_NAME
file with cases - formatted with 'date' in ISO format
and 'confirmed_rolling' cases, in tsv format (default:
None)
-l {date,time}, --xlabel {date,time}
Format of x axis label: ISO date format or timepoints
from start (default: date)
-lp {Right,Max,Start,End}, --labelPosition {Right,Max,Start,End}
choose position of clade labels (default: Right)
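The `-aa` pattern form `[GENE][*ORAncAA][site][*ORtoAA]` can be read as: gene name, ancestral amino acid or `*` for any, site number, derived amino acid or `*` for any. The sketch below parses and applies such patterns. It is an illustration of the notation, not epimuller's own parser; the names `parse_aa_pattern` and `matches` are hypothetical, and the shortest possible gene name is assumed when a pattern is ambiguous.

```python
import re

# gene (lazy, so the shortest gene name wins), ancestral AA or *, site, derived AA or *
AA_PATTERN = re.compile(
    r"^(?P<gene>[A-Za-z0-9]+?)(?P<anc>[A-Z*])(?P<site>\d+)(?P<to>[A-Z*])$"
)


def parse_aa_pattern(pattern):
    """Split a pattern such as 'SE484K' or 'S*452*' into (gene, anc, site, to)."""
    m = AA_PATTERN.match(pattern)
    if m is None:
        raise ValueError(f"not in [GENE][*ORAncAA][site][*ORtoAA] form: {pattern!r}")
    return m["gene"], m["anc"], int(m["site"]), m["to"]


def matches(pattern, gene, anc, site, to):
    """True if the mutation anc->to at `site` of `gene` satisfies the pattern.

    '*' in the pattern matches any amino acid; gaps are written as 'X'.
    """
    p_gene, p_anc, p_site, p_to = parse_aa_pattern(pattern)
    return (p_gene == gene and p_site == site
            and p_anc in ("*", anc) and p_to in ("*", to))
```

For example, `matches("S*452*", "S", "L", 452, "R")` is true because both amino acid positions are wildcards, while `matches("SE484K", "S", "E", 484, "Q")` is false because the derived amino acid differs.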
Only make abundance and hierarchy files
usage: epimuller-define [-h] -n INNEXTSTRAIN -m INMETA [-p INPANGOLIN]
[-f TRAITOFINTERSTFILE] [-k TRAITOFINTERSTKEY]
[-aa AAVOCLIST [AAVOCLIST ...]]
[-oDir OUTDIRECTORY] -oP OUTPREFIX
[-t TIMEWINDOW] [-s STARTDATE] [-e ENDDATE]
optional arguments:
-h, --help show this help message and exit
-n INNEXTSTRAIN, --inNextstrain INNEXTSTRAIN
nextstrain results with tree.nwk and
[traitOfInterst].json (default: None)
-m INMETA, --inMeta INMETA
metadata tsv with 'strain' and 'date' cols, optional:
cols of trait of interest; and pangolin col named:
'lineage' or 'pangolin_lin' (default: None)
-p INPANGOLIN, --inPangolin INPANGOLIN
pangolin output lineage_report.csv file, if argument
not supplied looks in inMeta for col with
'pangolin_lin' or 'lineage' (default: metadata)
-f TRAITOFINTERSTFILE, --traitOfInterstFile TRAITOFINTERSTFILE
name of nextstrain [traitOfInterst].json in
'inNextstrain' folder (default: aa_muts.json)
-k TRAITOFINTERSTKEY, --traitOfInterstKey TRAITOFINTERSTKEY
key for trait of interest in json file (default:
aa_muts)
-aa AAVOCLIST [AAVOCLIST ...], --aaVOClist AAVOCLIST [AAVOCLIST ...]
list of aa of interest in form
[GENE][*ORAncAA][site][*ORtoAA] ex. S*501*, gaps
represented by X (default: None)
-oDir OUTDIRECTORY, --outDirectory OUTDIRECTORY
folder for output (default: ./)
-oP OUTPREFIX, --outPrefix OUTPREFIX
prefix of out files within outDirectory (default:
None)
-t TIMEWINDOW, --timeWindow TIMEWINDOW
number of days for sampling window (default: 7)
-s STARTDATE, --startDate STARTDATE
start date in iso format YYYY-MM-DD or 'firstDate'
which is in metadata (default: 2020-03-01)
-e ENDDATE, --endDate ENDDATE
end date in iso format YYYY-MM-DD or 'lastDate' which
is in metadata (default: lastDate)
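Before running epimuller-define it can help to confirm that the metadata TSV has the columns described above. The following check is a sketch under stated assumptions, not part of epimuller: the function name `check_metadata` is hypothetical, and it flags any date that is not a full YYYY-MM-DD value, even though the README does not say how epimuller itself treats partial dates.

```python
import csv
from datetime import date


def check_metadata(handle):
    """Return a list of problems found in a metadata TSV (empty list means OK)."""
    reader = csv.DictReader(handle, delimiter="\t")
    cols = reader.fieldnames or []

    problems = [f"missing required column '{c}'"
                for c in ("strain", "date") if c not in cols]
    if not {"lineage", "pangolin_lin"} & set(cols):
        # Not fatal in itself: lineages can come from -p lineage_report.csv instead.
        problems.append("no 'lineage' or 'pangolin_lin' column; supply -p instead")
    if problems:
        return problems

    # Header is line 1, so data rows start at line 2.
    for n, row in enumerate(reader, start=2):
        try:
            date.fromisoformat(row["date"])
        except (TypeError, ValueError):
            problems.append(f"line {n}: date {row['date']!r} is not YYYY-MM-DD")
    return problems
```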
Only plot
usage: epimuller-draw [-h] -p PARENTHIERARCHY_NAME -a ABUNDANCE_NAME
[-c CASES_NAME] -o OUTFOLDER [-mt MINTIME]
[-min MINTOTALCOUNT] [-l {date,time}]
[-lp {Right,Max,Start,End}]
optional arguments:
-h, --help show this help message and exit
-p PARENTHIERARCHY_NAME, --parentHierarchy_name PARENTHIERARCHY_NAME
csv output from mutationLinages_report.py with child
parent col (default: None)
-a ABUNDANCE_NAME, --abundance_name ABUNDANCE_NAME
csv output from mutationLinages_report.py with
abundances of clades (default: None)
-c CASES_NAME, --cases_name CASES_NAME
file with cases - formatted with 'date' in ISO format
and 'confirmed_rolling' cases, in tsv format (default:
None)
-o OUTFOLDER, --outFolder OUTFOLDER
folder for output (default: None)
-mt MINTIME, --MINTIME MINTIME
minimum time point to start plotting (default: 30)
-min MINTOTALCOUNT, --MINTOTALCOUNT MINTOTALCOUNT
minimum total count for group to be included (default:
10)
-l {date,time}, --xlabel {date,time}
Format of x axis label: ISO date format or timepoints
from start (default: date)
-lp {Right,Max,Start,End}, --labelPosition {Right,Max,Start,End}
choose position of clade labels (default: Right)
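The `-c` cases file needs a 'date' column in ISO format and a 'confirmed_rolling' column, tab-separated. If you only have daily counts, a file in that shape could be produced as sketched below. The helper name `write_cases_tsv` is hypothetical, and treating 'confirmed_rolling' as a trailing 7-day mean of daily confirmed cases is an assumption; the README only fixes the column names.

```python
import csv
from collections import deque


def write_cases_tsv(daily_counts, handle, window=7):
    """Write 'date' and 'confirmed_rolling' columns as TSV.

    daily_counts: list of (iso_date, new_cases) tuples, sorted, one per day.
    window: number of trailing days to average (7 is an assumption).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["date", "confirmed_rolling"])
    recent = deque(maxlen=window)  # holds at most the last `window` counts
    for iso_date, count in daily_counts:
        recent.append(count)
        # Early rows average over fewer days until the window fills.
        writer.writerow([iso_date, round(sum(recent) / len(recent), 2)])
```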
Citation
Please link to the GitHub repository (https://github.com/jennifer-bio/epimuller) if you have used epimuller in your research.
Extra notes on GISAID
If you use GISAID data, please acknowledge the contributors, for example with the language suggested by GISAID.
Hashes for epimuller-0.0.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9844185d7241949fdab84f110cd37ef06fb6dc0a93b93553dcb51ccd2909772d
MD5 | f4e4b4a5c6cd4caae43e2435b05ad250
BLAKE2b-256 | 57892fa02f40caf2b6ceb93847d9a2b9e54817749d7681bf1e56dde92f936514