
Track the persistence (or loss) of mutations during long-term passaging


██    ██  █████  ██████  ████████ ██████   █████   ██████ ██   ██ ███████ ██████
██    ██ ██   ██ ██   ██    ██    ██   ██ ██   ██ ██      ██  ██  ██      ██   ██
██    ██ ███████ ██████     ██    ██████  ███████ ██      █████   █████   ██████
 ██  ██  ██   ██ ██   ██    ██    ██   ██ ██   ██ ██      ██  ██  ██      ██   ██
  ████   ██   ██ ██   ██    ██    ██   ██ ██   ██  ██████ ██   ██ ███████ ██   ██


vartracker

A bioinformatics pipeline to summarise variants called against a reference in a longitudinal study design. It was written to investigate longitudinal sequencing data from long-term passaging of SARS-CoV-2; in principle, it could be extended to other organisms.

Author: Dr Charles Foster

Features

  • Track mutation persistence across longitudinal samples
  • Comprehensive variant analysis including amino acid consequences
  • Built-in SARS-CoV-2 reference data and annotations
  • Integration with functional mutation databases (pokay)
  • Automated plotting and statistical analysis
  • Support for both SNPs and indels
  • Quality control metrics for variants

Installation

Requires Python 3.11 or newer.

Recommended installation (mamba)

Create the fully pinned, reproducible environment (uses strict channel priority as configured in environment.yml), then install vartracker from PyPI:

mamba create -n vartracker -f environment.yml
mamba activate vartracker
pip install vartracker

If you prefer conda:

conda env create -n vartracker -f environment.yml
conda activate vartracker
pip install vartracker

On Apple Silicon (macOS ARM), the environment may not solve with native packages. Use the x86_64 subdir or Docker instead:

CONDA_SUBDIR=osx-64 mamba create -n vartracker -f environment.yml
mamba activate vartracker

From PyPI (Python-only)

pip install vartracker

External Dependencies

vartracker shells out to a handful of bioinformatics tools. Make sure they are discoverable on PATH before running the CLI. Minimum tested versions are tracked in docs/DEPENDENCIES.md.

  • bcftools and tabix – required for all modes
  • samtools, lofreq, fastp, bwa, and snakemake – required for the bam and end-to-end Snakemake workflows

If you only plan to run vartracker vcf against pre-generated VCFs, the first pair is sufficient. The additional tools are needed whenever you ask vartracker to align reads or call variants for you.
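
As a quick pre-flight check, the dependency split above can be tested from Python. This is a minimal sketch, not part of vartracker: the helper name `missing_tools` is hypothetical, and the tool lists are copied from the bullets above.

```python
import shutil

# Tool names taken from the dependency list above.
CORE_TOOLS = ["bcftools", "tabix"]
WORKFLOW_TOOLS = ["samtools", "lofreq", "fastp", "bwa", "snakemake"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Report what is missing for each group of modes.
for label, tools in (
    ("vcf mode", CORE_TOOLS),
    ("bam / end-to-end modes", CORE_TOOLS + WORKFLOW_TOOLS),
):
    missing = missing_tools(tools)
    print(f"{label}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```

An empty list from `missing_tools` only shows that an executable with that name is on PATH; it does not check the minimum versions listed in docs/DEPENDENCIES.md.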

Note: the pinned micromamba environment installs tabix/bgzip via htslib.

Installing the external tools

On macOS:

# Using Homebrew
brew install bcftools htslib samtools fastp bwa
# lofreq is available via bioconda (requires conda/mamba)
conda install -c bioconda lofreq

# Using MacPorts
sudo port install bcftools htslib samtools fastp bwa

On Linux (Ubuntu/Debian):

sudo apt-get update
sudo apt-get install bcftools tabix samtools fastp bwa
# lofreq is easiest to install via bioconda on Debian-based systems:
conda install -c bioconda lofreq

On Linux (CentOS/RHEL/Fedora):

# CentOS/RHEL with EPEL
sudo yum install epel-release
sudo yum install bcftools htslib samtools fastp bwa

# Fedora
sudo dnf install bcftools htslib samtools fastp bwa
# Install lofreq via bioconda on RPM-based systems:
conda install -c bioconda lofreq

Using conda:

conda install -c bioconda bcftools samtools tabix fastp bwa lofreq

Development Installation

For development or to get the latest version (requires Python 3.11+):

git clone https://github.com/charlesfoster/vartracker.git
cd vartracker
pip install -e ".[dev]"
pre-commit install

Docker

Build a container image that bundles Python, vartracker, and all external bioinformatics tools:

docker build -t vartracker:latest .

Docker provides a self-contained, reproducible option. If you publish the image, record its digest and export both variables below before running, so that the image reference and digest are included in the run manifest:

export VARTRACKER_CONTAINER_IMAGE=ghcr.io/your-org/vartracker:2.0.0
export VARTRACKER_CONTAINER_DIGEST=sha256:...

Run workflows by mounting your data directory into the container. The command below analyses an input CSV located in the inputs/ subdirectory of the current directory and writes the output to results/ in the same directory:

docker run --rm -v "$(pwd)":/workspace vartracker \
  vcf /workspace/inputs/vcf_inputs.csv \
  --outdir /workspace/results

Quick Start

After installation, vartracker will be available as a command-line tool:

vartracker --help

Typical commands

# Analyse pre-called VCFs plus coverage files
vartracker vcf path/to/vcf_inputs.csv --outdir results/vcf_run

# Run BAMs through the Snakemake workflow, then summarise variants
vartracker bam path/to/bam_inputs.csv \
  --snakemake-outdir work/bam_pipeline \
  --outdir results/bam_summary

# Start from raw reads (FASTQ) and run the full pipeline
vartracker end-to-end path/to/read_inputs.csv \
  --cores 12 \
  --outdir results/e2e_summary

# Generate a template spreadsheet for a directory of files
vartracker generate --mode e2e --dir data/passaging --out inputs.csv

# Exercise the bundled smoke-test dataset
vartracker vcf --test
vartracker bam --test
vartracker end-to-end --test

All modes understand --test, which copies the example dataset from vartracker/test_data into a temporary directory, resolves relative paths, and runs the appropriate workflow.

Input Spreadsheets

Every CLI mode reads the same canonical columns:

  • sample_name (required) – display name for the sample
  • sample_number (required) – passage/order index used in longitudinal plots
  • reads1, reads2 – FASTQ paths (required for end-to-end, optional elsewhere). The pipeline can also run in single-end mode (leave the reads2 column empty), but that mode is less thoroughly tested.
  • bam – BAM file aligned against the SARS-CoV-2 reference
  • vcf – bgzipped VCF containing variant calls with depth (DP) and allele-frequency tags
  • coverage – per-base coverage TSV with columns reference<TAB>position<TAB>depth

Mode-specific expectations:

  • VCF mode requires vcf and coverage, while leaving reads*/bam empty.
  • BAM mode requires bam and will fill vcf + coverage during the workflow.
  • End-to-end mode requires reads1 (and optionally reads2); remaining fields are generated.
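
The per-mode requirements above can be checked before launching a run. The sketch below is illustrative only: `missing_fields` is a hypothetical helper, and vartracker's own validation may apply further rules (for example, rejecting populated reads*/bam columns in VCF mode).

```python
# Columns required in every mode, plus the extra columns each mode needs,
# as described in the lists above.
ALWAYS_REQUIRED = ("sample_name", "sample_number")
MODE_REQUIRED = {
    "vcf": ("vcf", "coverage"),
    "bam": ("bam",),
    "e2e": ("reads1",),
}


def missing_fields(row, mode):
    """Return the required column names that are absent or blank in `row`."""
    required = ALWAYS_REQUIRED + MODE_REQUIRED[mode]
    return [name for name in required if not (row.get(name) or "").strip()]


row = {"sample_name": "P0", "sample_number": "0", "vcf": "p0.vcf.gz", "coverage": ""}
print(missing_fields(row, "vcf"))  # ['coverage']
print(missing_fields(row, "bam"))  # ['bam']
```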

Relative paths are resolved with respect to the CSV location, so you can store the sheet alongside your sequencing artefacts. The generate subcommand can scaffold a CSV and highlight missing files.
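
If you prefer to build the sheet by hand rather than with the generate subcommand, the sketch below writes a CSV with the canonical columns and shows how a relative path resolves against the CSV location. The column order, the helper names (`write_sheet`, `resolve_path`), and the example file names are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Canonical column names from the list above; the order is an assumption.
COLUMNS = ["sample_name", "sample_number", "reads1", "reads2", "bam", "vcf", "coverage"]


def write_sheet(csv_path, rows):
    """Write dict rows to `csv_path`, leaving absent columns blank."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
        writer.writeheader()
        writer.writerows(rows)


def resolve_path(csv_path, value):
    """Resolve a sheet entry relative to the directory containing the CSV."""
    if not value:
        return None
    path = Path(value)
    return path if path.is_absolute() else Path(csv_path).parent / path


with tempfile.TemporaryDirectory() as tmp:
    sheet = Path(tmp) / "vcf_inputs.csv"
    write_sheet(sheet, [
        {"sample_name": "P0", "sample_number": 0,
         "vcf": "vcf/P0.vcf.gz", "coverage": "depth/P0_depth.txt"},
    ])
    with open(sheet, newline="") as handle:
        for row in csv.DictReader(handle):
            print(row["sample_name"], resolve_path(sheet, row["vcf"]))
```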

Coverage files can be produced with samtools depth -aa sample.bam > sample_depth.txt or bedtools genomecov -ibam sample.bam -d. The file name suffix does not matter; vartracker checks for both .depth.txt and _depth.txt patterns when preparing its internal test dataset.
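
To sanity-check a coverage file before a run, a short parser for the three-column layout is enough. This is a sketch under stated assumptions: the function name `summarise_coverage` and the `min_depth` threshold of 10 are hypothetical and are not vartracker settings; lines starting with `#` are skipped in case a header was written.

```python
def summarise_coverage(lines, min_depth=10):
    """Return (n_positions, mean_depth, fraction of positions with depth >= min_depth)."""
    depths = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and an optional header
        reference, position, depth = line.split("\t")
        depths.append(int(depth))
    if not depths:
        return 0, 0.0, 0.0
    covered = sum(d >= min_depth for d in depths)
    return len(depths), sum(depths) / len(depths), covered / len(depths)


# Toy rows in the reference<TAB>position<TAB>depth layout.
example = ["NC_045512.2\t1\t0", "NC_045512.2\t2\t12", "NC_045512.2\t3\t30"]
n_positions, mean_depth, breadth = summarise_coverage(example)
print(n_positions, mean_depth, round(breadth, 3))
```

For a real file, pass an open file handle instead of the list, for example `summarise_coverage(open("sample_depth.txt"))`.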

Mode-specific options

  • vartracker vcf – accepts plotting and filtering options such as --min-snv-freq, --min-indel-freq, --allele-frequency-tag, --name, --outdir, --passage-cap, --manifest-level, and pokay controls (--search-pokay, --pokay-csv, --download-pokay). Use --test to run the bundled smoke test.
  • vartracker bam – everything from vcf, plus Snakemake options: --snakemake-outdir, --cores, --snakemake-dryrun, --verbose, --redo.
  • vartracker end-to-end – similar to bam, with an optional --primer-bed for amplicon clipping.
  • vartracker generate – specify --mode (vcf, bam, or e2e), --dir to scan, --out for the CSV, and --dry-run to preview without writing a file.
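
When scripting several runs, it can help to assemble the command from the documented options rather than concatenating strings. The sketch below only builds and prints the argument list; `build_vcf_command` is a hypothetical helper, and it assumes --name takes a free-text value.

```python
import shlex


def build_vcf_command(sheet, outdir, name=None, pokay_csv=None):
    """Assemble an argv list for `vartracker vcf` using options listed above."""
    argv = ["vartracker", "vcf", str(sheet), "--outdir", str(outdir)]
    if name is not None:
        argv += ["--name", name]  # assumption: --name accepts a free-text label
    if pokay_csv is not None:
        argv += ["--search-pokay", "--pokay-csv", str(pokay_csv)]
    return argv


argv = build_vcf_command("inputs/vcf_inputs.csv", "results/vcf_run", name="passage_series")
print(shlex.join(argv))
# To execute the command: subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.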

Using with pokay Database

To search mutations against functional databases:

  1. Set up pokay database (optional):
parse_pokay pokay_database.csv

This command automatically downloads the required literature files from the pokay repository into pokay_literature/NC_045512 (override with --download-dir) and writes the processed CSV for downstream analysis. You can also let vartracker download and parse the database on demand with the --download-pokay flag.

  2. Run vartracker with pokay search:
vartracker vcf input_data.csv --search-pokay --pokay-csv pokay_database.csv --outdir results/

Alternatively, omit --pokay-csv and pass --download-pokay to fetch the database automatically during execution.

Command Line Reference

usage: main.py [-h] [-V] {vcf,bam,end-to-end,e2e,generate,describe-output} ...

positional arguments:
  {vcf,bam,end-to-end,e2e,generate,describe-output}
    vcf                 Analyse VCF inputs
    bam                 Run the BAM preprocessing workflow
    end-to-end (e2e)    Run the end-to-end workflow (Snakemake + vartracker)
    generate            Generate input spreadsheets from an existing directory of files
    describe-output     Print the output schema for results tables

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit

Use vartracker <subcommand> --help to inspect the full list of mode-specific arguments.

Installation Test

After installation you can verify the workflows using the bundled demonstration dataset:

vartracker vcf --test --outdir vartracker_vcf_test_results
vartracker bam --test --outdir vartracker_bam_test_results
vartracker end-to-end --test --outdir vartracker_e2e_test_results

Each command copies the example dataset, resolves relative paths, checks for the required external tools, and writes a self-contained set of results.

Output

vartracker produces several output files:

  • results.csv: Comprehensive variant analysis with all metrics
  • results_metadata.json: Output schema version and results metadata
  • new_mutations.csv: Mutations not present in the first sample
  • persistent_new_mutations.csv: New mutations that persist to the final sample
  • cumulative_mutations.pdf: Plot showing mutation accumulation over time
  • mutations_per_gene.pdf: Gene-wise mutation statistics
  • variant_allele_frequency_heatmap.html: Interactive heatmap with optional pokay annotations
  • variant_allele_frequency_heatmap.pdf: Heatmap of variant allele frequencies across passages
  • pokay_database_hits.*.csv: Functional annotation results (if pokay used)
  • run_metadata.json: Provenance manifest capturing inputs, tool versions, and run status

By default the manifest is lightweight. Use --manifest-level deep to checksum all referenced input files (FASTQ/BAM/VCF/coverage) and include file sizes.
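
To illustrate what checksumming an input involves, the sketch below records the size and a SHA-256 digest of a file. It is not vartracker's implementation: the hash algorithm, the helper name `describe_file`, and the returned keys are all assumptions, and the actual manifest layout may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(path, chunk_size=1 << 20):
    """Return the size and SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "path": str(path),
        "size_bytes": Path(path).stat().st_size,
        "sha256": digest.hexdigest(),
    }


# Demonstrate on a small temporary file standing in for a VCF or BAM.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_bytes(b"abc")
    print(describe_file(demo))
```

Reading in chunks keeps memory use flat for large FASTQ or BAM files.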

Output schema

The results table schema is documented in docs/OUTPUT_SCHEMA.md. You can also print it from the CLI:

vartracker describe-output

To write the schema to a file instead, use:

vartracker describe-output --out docs/output_schema.csv
vartracker describe-output --out docs/output_schema.json --format json

What does vartracker do?

The pipeline performs the following analysis:

  1. VCF Standardization: Normalizes and standardizes input VCF files

  2. Annotation: Adds amino acid consequences using bcftools csq

  3. Variant Merging: Combines all longitudinal samples

  4. Comprehensive Analysis: For each variant, determines:

    • Gene location and amino acid consequences
    • Variant type (SNP/indel) and change type (synonymous/missense/etc.)
    • Persistence across samples (new/original, persistent/transient)
    • Quality control metrics
    • Amino acid property changes
    • Allele frequency dynamics
  5. Visualization: Generates plots for mutation accumulation and gene-wise statistics

  6. Functional Annotation: (optional) Searches against literature databases for known functional impacts
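
The new/original and persistent/transient labels in step 4 can be illustrated with a toy classifier. This is a simplified sketch, not vartracker's logic: it assumes "new" means absent from the first sample and "persistent" means present in the final sample, and it ignores allele-frequency thresholds and quality filters. The variant names are placeholders.

```python
def classify(presence):
    """Label one variant from presence flags ordered by sample_number."""
    if not any(presence):
        return "absent"
    origin = "original" if presence[0] else "new"
    fate = "persistent" if presence[-1] else "transient"
    return f"{origin}_{fate}"


# Presence per passage for three placeholder variants.
toy = {
    "variant_a": [True, True, True],
    "variant_b": [False, True, True],
    "variant_c": [False, True, False],
}
labels = {name: classify(flags) for name, flags in toy.items()}
new_mutations = [name for name, label in labels.items() if label.startswith("new")]
persistent_new = [name for name, label in labels.items() if label == "new_persistent"]
print(labels)
print(new_mutations, persistent_new)
```

Under these assumptions, `new_mutations` and `persistent_new` correspond in spirit to new_mutations.csv and persistent_new_mutations.csv.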

Citation

When using vartracker, please cite the software release you used. Citation metadata is provided in CITATION.cff, and GitHub releases are archived on Zenodo (DOI will appear here once minted).

Suggested software citation (replace the DOI/version with your release):

  • Foster, C. (2026). vartracker (Version 2.0.0). Zenodo. DOI: 10.5281/zenodo.XXXXXXX

Also cite relevant methods or data sources, for example:

  • Foster CSP, et al. Long-term serial passaging of SARS-CoV-2 reveals signatures of convergent evolution. Journal of Virology. 2025;99: e00363-25. doi:10.1128/jvi.00363-25
  • Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10. doi:10.1093/gigascience/giab008
  • Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017;33: 2037–2039. doi:10.1093/bioinformatics/btx100
  • Wilm A, Aw PPK, Bertrand D, Yeo GHT, Ong SH, Wong CH, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40: 11189–11201. doi:10.1093/nar/gks918
  • Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34: i884–i890. doi:10.1093/bioinformatics/bty560
  • Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013 [cited 13 Apr 2021]. Available: https://arxiv.org/abs/1303.3997v2
  • Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10: 33. doi:10.12688/f1000research.29032.2

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

If you encounter any issues or have questions:

  1. Check the documentation
  2. Search existing issues
  3. Create a new issue with detailed information about your problem
