
Tool for the integration of viral consensus sequences obtained by de novo and mapping strategies, supported by prior information.


PriorCons

Prior‑guided consensus integration for viral genomes


🧭 Introduction

PriorCons improves viral consensus sequences by safely recovering missing information while preserving reliability.

The software integrates:

  • A high‑confidence consensus sequence (FASTA) generated using a stringent pipeline. This sequence is trusted but may contain masked regions (Ns).
  • The reference genome used during assembly.
  • A candidate consensus sequence that is less conservative but potentially more informative (for example, produced with relaxed filtering or alternative assembly).

The objective is to fill gaps in the high‑confidence consensus using information from the candidate sequence — but only when supported by evolutionary evidence — so that coverage increases without introducing sequencing artefacts.

To achieve this, PriorCons uses evolutionary priors derived from large collections of genomes for the same virus or subtype aligned to the reference. These priors model expected variation and provide statistical thresholds that guide integration decisions.


📦 Installation

PriorCons can be installed via Conda (recommended for bioinformatics) or PyPI:

Using Conda

conda install -c bioconda priorcons


Using Pip

pip install priorcons



⚡ Quickstart + CLI Examples

Follow these steps to generate an integrated consensus using PriorCons.

1. Prepare the Priors Database

You need a collection of viral sequences (e.g., from GISAID or NCBI) relevant to your sample.

  • Alignment is critical: Use MAFFT in reference-anchored mode (e.g. --add --keeplength) to keep coordinates consistent when building priors.
  • Include the Reference: Ensure your reference sequence is included in this FASTA file.
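Before building priors, it can help to confirm that the aligned FASTA meets both requirements above. The following is a minimal sketch, not part of PriorCons: the helper names `read_fasta` and `check_alignment` and the record identifier `ref` are hypothetical, and the check only verifies that the reference is present and that every sequence has the same aligned length.

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: sequence} (id = first word of header)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def check_alignment(records, reference_id):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if reference_id not in records:
        problems.append(f"reference '{reference_id}' not found")
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        problems.append(f"sequences have different lengths: {sorted(lengths)}")
    return problems


# Hypothetical toy alignment; real input would be read from database_aligned.fasta.
example = """>ref
ACGTACGT
>sample1
ACGTNNGT
>sample2
ACG-ACGT
"""
print(check_alignment(read_fasta(example), "ref"))  # prints []
```

An empty list suggests that coordinates are consistent across the collection; it does not say anything about alignment quality.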

2. Build the Priors

Run the build-priors command to create the empirical distribution of variation.

priorcons build-priors --input database_aligned.fasta --output virus_priors.json

3. Run integrate-consensus

Once you have the priors, align your three sequences (Trusted, Candidate, and Reference) and run the integration.

Alignment Recommendation: Since you are only aligning 3 sequences, use a high-sensitivity strategy. We recommend MAFFT with the following parameters:

mafft --localpair --maxiterate 1000 input.fasta > aligned_input.fasta

Running the integration:

priorcons integrate-consensus \
    --aligned-fasta aligned_input.fasta \
    --priors virus_priors.json \
    --output integrated_consensus.fasta

🔬 Workflow Overview

PriorCons uses a window-based approach to statistically validate and fill gaps in viral assemblies.

  1. Slide overlapping windows across the genome.
  2. Detect windows with missing regions (Ns) in the trusted consensus.
  3. Evaluate the corresponding candidate window using the priors.
  4. Accept the candidate window only if its score is evolutionarily plausible (it does not exceed the statistical threshold).
  5. Produce an integrated consensus with increased completeness and maintained accuracy.
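
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the PriorCons implementation: the function names, the window size of 100 and step of 50, and the `score` callable (which stands in for the prior-based scoring) are all hypothetical.

```python
def sliding_windows(length, window=100, step=50):
    """Yield half-open (start, end) intervals that cover [0, length)."""
    if length <= 0:
        return
    start = 0
    while True:
        end = min(start + window, length)
        yield start, end
        if end == length:
            break
        start += step


def integrate(trusted, candidate, score, threshold, window=100, step=50):
    """Fill Ns in `trusted` from `candidate` in windows whose score passes."""
    if len(trusted) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    result = list(trusted)
    for start, end in sliding_windows(len(trusted), window, step):
        if "N" not in trusted[start:end]:
            continue  # nothing is missing in this window
        if score(candidate[start:end], start) > threshold:
            continue  # candidate window looks atypical: keep the Ns
        for i in range(start, end):
            if trusted[i] == "N" and candidate[i] != "N":
                result[i] = candidate[i]
    return "".join(result)


trusted = "ACGTNNNNACGT"
candidate = "ACGTACGTACGT"
# Placeholder scorers: one always plausible, one always atypical.
print(integrate(trusted, candidate, lambda seq, start: 0.1, 1.0, window=8, step=4))
# prints ACGTACGTACGT
print(integrate(trusted, candidate, lambda seq, start: 5.0, 1.0, window=8, step=4))
# prints ACGTNNNNACGT
```

In this sketch only positions that are N in the trusted consensus are ever changed, so trusted bases are never overwritten by the candidate.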

🧮 Methodology

1. Probability distributions per position

For each window of size $W$ bases, and each position $j$:

$$P_j(b)=\frac{c_j(b)+\alpha}{\sum_{x\in\{A,C,G,T\}}\left(c_j(x)+\alpha\right)}$$

Where:

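  • $c_j(b)$ is the number of sequences in the priors alignment that carry base $b$ at position $j$.
  • $\alpha$ is a pseudocount, which keeps bases that were never observed at a position from receiving zero probability.
  • N bases are ignored when counting.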
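
As a worked illustration of the formula, the sketch below computes $P_j(b)$ for one alignment column. The function name and the default `alpha=1.0` are assumptions for the example, and treating every non-ACGT character (gaps included) like N is a simplification that the source does not specify.

```python
from collections import Counter

BASES = "ACGT"


def position_probabilities(column, alpha=1.0):
    """Smoothed base frequencies for one alignment column (one char per sequence)."""
    # Count only A, C, G, T; N and any other symbol are skipped.
    counts = Counter(b for b in column.upper() if b in BASES)
    total = sum(counts[b] + alpha for b in BASES)
    return {b: (counts[b] + alpha) / total for b in BASES}


# Column with three A's and one N: c(A) = 3, other counts are 0.
probs = position_probabilities("AAAN", alpha=1.0)
print(probs["A"], probs["C"])  # 4/7 and 1/7
```

With $\alpha = 1$ the denominator is $3 + 4\alpha = 7$, so $P(A) = 4/7$ and each unobserved base gets $1/7$.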

2. Log‑likelihood of a sequence

Given a sequence $Q$:

$$\log L(Q \mid \text{window}) = \sum_j \log P_j(q_j)$$

Normalized negative log‑likelihood:

$$\text{nLL}(Q) = -\frac{1}{N_{\text{valid}}} \sum_j \log P_j(q_j)$$

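Here $N_{\text{valid}}$ is presumably the number of positions that contribute to the sum, that is, positions where $q_j$ is not N. Lower values indicate sequences that are more consistent with the expected variation.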
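
A minimal sketch of the normalized score follows. It assumes the per-position probabilities are already available as a list of dictionaries, and it interprets $N_{\text{valid}}$ as the count of positions that were actually scored; both the function name and that interpretation are assumptions, not confirmed details of PriorCons.

```python
import math


def normalized_nll(query, probs):
    """Mean negative log-probability of `query` over its scorable positions.

    `probs` holds one {base: probability} dict per window position.
    Returns None when no position can be scored.
    """
    total, n_valid = 0.0, 0
    for base, p in zip(query.upper(), probs):
        if base not in p:
            continue  # N, gaps and ambiguity codes are skipped
        total += math.log(p[base])
        n_valid += 1
    if n_valid == 0:
        return None
    return -total / n_valid


# Hypothetical window of four positions where A dominates.
window_probs = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}] * 4
print(normalized_nll("AAAA", window_probs))  # typical sequence: low score
print(normalized_nll("CCNC", window_probs))  # atypical sequence: higher score
```

Dividing by the number of scored positions keeps windows with many Ns comparable to fully resolved ones.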

3. Empirical thresholds

All sequences in the priors collection are scored to obtain an empirical nLL distribution. The 95th percentile of that distribution is used as the cutoff: candidate windows whose nLL exceeds it are considered atypical and are rejected during integration.
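
The cutoff can be illustrated with the standard library alone. In this sketch the function names are hypothetical, the scores are synthetic, and the interpolation method used for the percentile is an assumption; whether PriorCons keeps one threshold per window or a global one is not stated here.

```python
import statistics


def empirical_threshold(scores, percentile=95):
    """Return the given percentile of `scores` (linear interpolation)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return cuts[percentile - 1]


def is_accepted(score, threshold):
    """A window is accepted when its nLL does not exceed the threshold."""
    return score <= threshold


# Synthetic nLL values standing in for the scores of the prior sequences.
prior_scores = [i / 100 for i in range(1, 101)]
threshold = empirical_threshold(prior_scores)
print(is_accepted(0.50, threshold))  # True
print(is_accepted(0.99, threshold))  # False
```

By construction, roughly 5% of the prior sequences themselves score above this cutoff, which is the price of using an empirical percentile as the acceptance rule.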


📊 Outputs

  • Integrated consensus FASTA: The final integrated sequence.
  • Window‑level QC trace: A file containing scores for each window.
  • Summary QC metrics: Summary metrics regarding coverage and changes performed.

📚 Citing

This software was developed by Germán Vallejo Palma at the Instituto de Salud Carlos III (ISCIII) — National Centre of Microbiology, Respiratory Viruses and Influenza Unit.

If you use this software in a publication, report, or product, please cite the appropriate authors and include the above attribution.
