Subsampling of rooted phylogenetic trees using phylogenetic diversity and location proportions.
Project description
GEO subsampler
Geo_subsampler subsamples a given phylogenetic tree to rebalance the samples at different locations according to user-specified proportions. Moreover, for each location the kept samples are chosen in a balanced way over the sampling intervals (e.g. months). With these constraints in mind, the script uses phylogenetic diversity [Faith 1992] to pick the samples to be removed. Additional options allow to keep all the samples before a certain data, and to ensure a minimal number of samples picked by location, despite the other criteria.
Article
If you find geo_sampler useful, please cite:
A Zhukova, L Blassel, F Lemoine, M Morel, J Voznica, O Gascuel (2021) Origin, evolution and global spread of SARS-CoV-2 CRAS 344(1): 57-75 doi:10.5802/crbiol.29.
Installation
To install geo_subsampler, first install python 3, then run:
pip3 install geo_subsampler
Input data
As an input, one needs to provide a NON-dated phylogenetical tree in newick format, a metadata table containing tip names, locations and states, in tab-delimited (by default) or csv format (to be specified with '--sep ,' option). To subsample according to user-specified proportions, one should also input a location case counts, as tab(or comma, see above)-separated table whose first column contains locations and the second case counts.
Example
The folder example_data contains an example of an input tree (covid.nwk) representing an early SARS-COV-2 epidemic, the corresponding metadata table (metadata.tab), and a case count table (cases.tab).
The input tree contains 11 167 sampled tips.
The metadata table is a tab-separated file, containing tip ids in the first column, their countries of sampling in the second column, and the sampling dates in the third column:
id | country | sampling date |
---|---|---|
EPI_ISL_402119 | China | 30/12/2019 |
EPI_ISL_402123 | China | 24/12/2019 |
EPI_ISL_403962 | Thailand | 08/01/2020 |
... | ... | ... |
The case count table contains numbers of declared cases for each country:
country | cases |
---|---|
China | 84024 |
Thailand | 3017 |
... | ... |
The following geo_subsampler command subsamples the input tree according to the case proportions and (as much as possible) equally between the months, in order to keep 1000 tips:
geo_subsampler --tree example_data/covid.nwk --metadata example_data/metadata.tab \
--location_column country --date_column "sampling date" --cases example_data/cases.tab \
--output_dir example_data/results --size 1000
The resulting tree is put into example_data/results folder: (covid.subsampled.0.nwk). This folder also contains the ids of the tips retained in the subsampled tree: (covid.subsampled.0.ids), and two tables with the statistics on the subsampling: case_counts.tab and case_counts_per_time.tab.
Detailed options
- --tree TREE Path to the input phylogeny (NOT time-scaled) in newick format.
- --metadata METADATA Path to the metadata table containing location and date annotations, in a tab-delimited format.
- --sep SEP Separator used in the metadata and case tables. By default a tab-separated table is assumed.
- --index_column INDEX_COLUMN number (starting from zero) of the index column (containing tree tip names) in the metadata table. By default is the first column (corresponding to the number 0)
- --location_column LOCATION_COLUMN name of the column containing location annotations in the metadata table.
- --date_column DATE_COLUMN name of the column containing date annotations in the metadata table.
- --cases CASES A tab-separated file with two columns. The first column lists the locations, while the second column contains the numbers of declared cases or proportions for the corresponding locations
- --start_date START_DATE If specified, all the cases before this date will be included in all the sub-sampled data sets.
- --size SIZE Target size of the sub-sampled data set (in number of samples). By default, will be set to a half of the data set represented by the input tree.
- --repetitions REPETITIONS Number of sub-sampled trees to produce. By default 1.
- --output_dir OUTPUT_DIR Path to the directory where the sub-sampled results should be saved.
- --min_cases MIN_CASES Minimum number of samples to retain for each location.
- --date_precision {year,month,day} Precision for homogeneous subsampling over time within each location. By default (month) will aim at distributing selected location samples equally over months.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file geo_subsampler-0.2.tar.gz
.
File metadata
- Download URL: geo_subsampler-0.2.tar.gz
- Upload date:
- Size: 22.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.57.0 urllib3/1.26.5 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8a0c023301e24d49ff078fababb7cccda979ff97cd01754b91d414cd7e3883df |
|
MD5 | e26b2b2ae5483568bc5faa25d2aff15d |
|
BLAKE2b-256 | 42bc24c831821b6787b817c331780870176b0db1c143bfdc47f1764314f1514b |
File details
Details for the file geo_subsampler-0.2-py3-none-any.whl
.
File metadata
- Download URL: geo_subsampler-0.2-py3-none-any.whl
- Upload date:
- Size: 23.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.57.0 urllib3/1.26.5 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 561fc9207043c1d46532cf5f5bb3189819efbfbf38c832b0a361c382b6c67ba3 |
|
MD5 | 6c4fc4dd27956e3f6396e106a415910b |
|
BLAKE2b-256 | 7b168b5dc05c60fbace79a6fb2a8fea66af7d3a51308a6b4d1c963cd59dd8dc3 |