Infer ARGs for SARS-Cov2 in tskit format

These details have not been verified by PyPI

Project description

sc2ts

Sc2ts stands for "SARS-CoV-2 to tree sequence" (pronounced "scoots" optionally) and consists of

A method to infer Ancestral Recombination Graphs (ARGs) from SARS-CoV-2 data at pandemic scale
A lightweight wrapper around tskit Python APIs specialised for the output of sc2ts which enables efficient node metadata access.
A lightweight wrapper around Zarr Python which enables convenient and efficient access to the full Viridian dataset (alignments and metadata) in a single file using the VCF Zarr specification.

Please see the preprint for details.

Installation

Install sc2ts from PyPI:

python -m pip install sc2ts

This installs the minimum requirement to enable the ARG analysis and Datasets. To run inference, you must install some extra dependencies using the 'inference' optional extra:

python -m pip install sc2ts[inference]

ARG analysis API

The sc2ts API provides two convenience functions to compute summary dataframes for the nodes and mutations in a sc2ts-output ARG.

To see some examples, first download the (31MB) sc2ts inferred ARG from Zenodo:

curl -O https://zenodo.org/records/17558489/files/sc2ts_viridian_v1.2.trees.tsz

We can then use these like

import sc2ts
import tszip

ts = tszip.load("sc2ts_viridian_v1.2.trees.tsz")

df_node = sc2ts.node_data(ts)
df_mutation = sc2ts.mutation_data(ts)

See the live demo for a browser based interactive demo of using these dataframes for real-time pandemic-scale analysis.

Dataset API

Sc2ts also provides a convenient API for accessing large-scale alignments and metadata stored in VCF Zarr format.

Resources:

See this notebook for an example in which we access the data variant-by-variant and which explains the low-level data encoding
See the VCF Zarr publication for more details on and benchmarks on this dataset.

TODO Add some references to API documentation

Inference

Command line inference

Inference is intended to be run from the command-line primarily, and most likely orchestrated via a shell script or Snakemake file, etc.

The CLI is split into subcommands. Get help by running the CLI without arguments:

python3 -m sc2ts

TODO document the process of getting a Zarr dataset and using it

Inference

Here we'll run through a quick example of how to get inference running on a local machine using an example config file, using the Viridian data downloaded from Zenodo.

Prerequisites

First, install the "inference" version of sc2ts from pypi:

python -m pip install sc2ts[inference]

This is essential! The base install of sc2ts contains the minimal dependencies required to access the analysis utilities outlined above.

Then, download the (401MB) Viridian dataset in VCF Zarr format from Zenodo:

curl -O https://zenodo.org/records/16314739/files/viridian_mafft_2024-10-14_v1.vcz.zip

CLI

Inference is performed using the CLI, which is composed of number of subcommands. See the online help for more information:

python -m sc2ts --help

Primary inference

Primary inference is performed using the infer subcommand of the CLI, and all parameters are specified using a toml file.

The example config file can be used to perform inference over a short period, to demonstrate how sc2ts works:

python3 -m sc2ts infer example_config.toml --stop=2020-02-02

Once this finishes (it should take a few minutes and requires ~5GB RAM), the results of the inference will be in the example_inference directory (as specified in the config file) and look something like this:

$ tree example_inference
example_inference
├── ex1
│   ├── ex1_2020-01-01.ts
│   ├── ex1_2020-01-10.ts
│   ├── ex1_2020-01-12.ts
│   ├── ex1_2020-01-19.ts
│   ├── ex1_2020-01-24.ts
│   ├── ex1_2020-01-25.ts
│   ├── ex1_2020-01-28.ts
│   ├── ex1_2020-01-29.ts
│   ├── ex1_2020-01-30.ts
│   ├── ex1_2020-01-31.ts
│   ├── ex1_2020-02-01.ts
│   └── ex1_init.ts
├── ex1.log
└── ex1.matches.db

Here we've run inference for all dates in January 2020 for which we have data, plus the 1st Feb. The results of inference for each day are stored in the example_inference/ex1 directory as tskit files representing the ARG inferred up to that day. There is a lot of redundancy in keeping all these daily files lying around, but it is useful to be able to go back to the state of the ARG at a particular date and they don't take up much space.

The file ex1.log contains the log file. The config file set the log-level to 2, which is full debug output. There is a lot of useful information in there, and it can be very helpful when debugging, so we recommend keeping the logs.

The ex1.matches.db is the "match DB" which stores information about the HMM match for each sample. This is mainly used to store exact matches found during inference.

The ARGs output during primary inference (this step here) have a lot of debugging metadata included (see the section on the Debug utilities below)

Primary inference can be stopped and picked up again at any point using the --start option.

Postprocessing

Once we've finished primary inference we can run postprocessing to perform a few housekeeping tasks. Continuing the example above:

$ python3 -m sc2ts postprocess -vv \
    --match-db example_inference/ex1.matches.db \
    example_inference/ex1/ex1_2020-02-01.ts     \
    example_inference/ex1_2020-02-01_pp.ts

Among other things, this incorporates the exact matches in the match DB into the final ARG.

Generating final analysis file

To generate the final analysis-ready file (used as input to the analysis APIs above) we need to run minimise-metadata. This removes all but the most necessary metadata from the ARG, and recodes node metadata using the struct codec for efficiency. On our example above:

$ python -m sc2ts minimise-metadata \
    -m strain sample_id \
    -m Viridian_pangolin pango \
    example_inference/ex1_2020-02-01_pp.ts \
    example_inference/ex1_2020-02-01_pp_mm.ts

This recodes the metadata in the input tree sequence such that the existing strain field is renamed to sample_id (for compatibility with VCF Zarr) and the Viridian_pangolin field (extracted from the Viridian metadata) is renamed to pango.

We can then use the analysis APIs on this file:

import sc2ts
import tskit

ts = tskit.load("example_inference/ex1_2020-02-01_pp_mm.ts")
dfn = sc2ts.node_data(ts)
print(dfn)

giving something like:

   pango         sample_id  node_id  is_sample  is_recombinant  num_mutations       date
0         Vestigial_ignore        0      False           False              0 2019-12-25
1          Wuhan/Hu-1/2019        1      False           False              0 2019-12-26
2      A       SRR11772659        2       True           False              1 2020-01-19
3      B       SRR11397727        3       True           False              0 2020-01-24
4      B       SRR11397730        4       True           False              0 2020-01-24
..   ...               ...      ...        ...             ...            ...        ...
60     A       SRR11597177       60       True           False              0 2020-01-30
61     A       SRR11597197       61       True           False              0 2020-01-30
62     B       SRR11597144       62       True           False              0 2020-02-01
63     B       SRR11597148       63       True           False              0 2020-02-01
64     B       SRR25229386       64       True           False              0 2020-02-01

Development

To run the unit tests, use

python3 -m pytest

You may need to regenerate some cached test fixtures occasionaly (particularly if getting cryptic errors when running the test suite). To do this, run

rm -fR tests/data/cache/

and rerun tests as above.

Debug utilities

The tree sequence files output during primary inference have a lot of debugging metadata, and there are some developer tools for inspecting this in the sc2ts.debug package. In particular, the ArgInfo class has a lot of useful utilities designed to be used in a Jupyter notebook. Note that matplotlib is required for these. Use it like:

import sc2ts.debug as sd
import tskit

ts = tskit.load("path_to_daily_inference.ts")
ai = sd.ArgInfo(ts)
ai # view summary in notebook

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

1.0.2

Mar 5, 2026

1.0.1

Nov 28, 2025

1.0.0

Nov 23, 2025

1.0.0b1 pre-release

Nov 21, 2025

This version

1.0.0a1 pre-release

Nov 19, 2025

0.0.3

May 31, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sc2ts-1.0.0a1-py3-none-any.whl (113.2 kB view details)

Uploaded Nov 19, 2025 Python 3

File details

Details for the file sc2ts-1.0.0a1-py3-none-any.whl.

File metadata

Download URL: sc2ts-1.0.0a1-py3-none-any.whl
Upload date: Nov 19, 2025
Size: 113.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for sc2ts-1.0.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1e6abc6c7672126b7a24352607fca0a2b51ef59037aee778ae8a36d29cb38ff0`
MD5	`4237170b40aebd383c6cc36873b32f86`
BLAKE2b-256	`24290110884d3ff17bb8b27f5f3ab5824dd2fe65d436d3cd761e850eeff533f2`

See more details on using hashes here.

sc2ts 1.0.0a1

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

sc2ts

Installation

ARG analysis API

Dataset API

Inference

Command line inference

Inference

Prerequisites

CLI

Primary inference

Postprocessing

Generating final analysis file

Development

Debug utilities

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

File details

File metadata

File hashes