Tool for the integration of viral consensus sequences obtained by de novo and mapping strategies, supported by prior information.

PriorCons

Prior‑guided consensus integration for viral genomes


🧭 Introduction

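PriorCons improves the completeness of viral consensus sequences by recovering masked positions only where the recovered sequence is consistent with the variation expected for that virus, so that reliability is preserved.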

The software integrates:

  • A high‑confidence consensus sequence (FASTA) generated using a stringent pipeline. This sequence is trusted but may contain masked regions (Ns).
  • The reference genome used during assembly.
  • A candidate consensus sequence that is less conservative but potentially more informative (for example, produced with relaxed filtering or alternative assembly).

The objective is to fill gaps in the high‑confidence consensus using information from the candidate sequence, but only where that information is supported by evolutionary evidence, so that coverage increases without introducing sequencing artefacts.

To achieve this, PriorCons uses evolutionary priors derived from large collections of genomes for the same virus or subtype aligned to the reference. These priors model expected variation and provide statistical thresholds that guide integration decisions.


📦 Installation

PriorCons can be installed via Conda (recommended for bioinformatics) or PyPI:

Using Conda

conda install -c bioconda priorcons

Using Pip

pip install priorcons


⚡ Quickstart + CLI Examples

Follow these steps to generate an integrated consensus using PriorCons.

1. Prepare the Priors Database

You need a collection of viral sequences (e.g., from GISAID or NCBI) relevant to your sample.

  • Alignment is critical: Use MAFFT in reference-anchored mode (e.g. --add --keeplength) to keep coordinates consistent when building priors.
  • Include the Reference: Ensure your reference sequence is included in this FASTA file.
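The reference-anchored alignment can be scripted, for example as in the sketch below. It assumes MAFFT is installed and on the PATH; the helper names and file names are hypothetical, and only the `--add` and `--keeplength` options mentioned above are used.

```python
import subprocess
from pathlib import Path


def build_mafft_add_command(reference_fasta: str, database_fasta: str) -> list[str]:
    """Return the MAFFT argument list for a reference-anchored alignment.

    --add aligns the database sequences to the existing reference, and
    --keeplength drops insertions so the output keeps the reference length.
    """
    return ["mafft", "--add", database_fasta, "--keeplength", reference_fasta]


def align_to_reference(reference_fasta: str, database_fasta: str, output_fasta: str) -> None:
    """Run MAFFT and write its standard output (the alignment) to output_fasta."""
    command = build_mafft_add_command(reference_fasta, database_fasta)
    with Path(output_fasta).open("w") as handle:
        subprocess.run(command, stdout=handle, check=True)
```

Calling `align_to_reference("reference.fasta", "database.fasta", "database_aligned.fasta")` would then produce an alignment in reference coordinates. MAFFT writes the existing reference alignment followed by the added sequences, so the reference is part of the output file.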

2. Build the Priors

Run the build-priors command to create the empirical distribution of variation.

priorcons build-priors --input database_aligned.fasta --output virus_priors.json

3. Run integrate-consensus

Once you have the priors, align your three sequences (Trusted, Candidate, and Reference) and run the integration.

Alignment Recommendation: Because only three sequences are aligned, a slow but accurate strategy is affordable. We recommend MAFFT's iterative local-pair mode (L-INS-i), selected with the following parameters:

mafft --localpair --maxiterate 1000 input.fasta > aligned_input.fasta
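The `input.fasta` file passed to MAFFT has to contain all three sequences. A minimal sketch for assembling it is shown below; the helper name, the file names, and the record order (reference, trusted, candidate) are assumptions made for illustration, so check the PriorCons documentation for the order it expects.

```python
from pathlib import Path


def combine_fasta(paths: list[str], output_path: str) -> int:
    """Concatenate FASTA files into one file and return the number of records written."""
    n_records = 0
    with Path(output_path).open("w") as out:
        for path in paths:
            text = Path(path).read_text().strip()
            if not text:
                continue  # skip empty files instead of writing blank lines
            # Each FASTA record starts with a ">" header line.
            n_records += sum(1 for line in text.splitlines() if line.startswith(">"))
            out.write(text + "\n")
    return n_records
```

For example, `combine_fasta(["reference.fasta", "trusted.fasta", "candidate.fasta"], "input.fasta")` should return 3 when each input holds a single record; any other value suggests a multi-record or empty input file.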

Running the integration:

priorcons integrate-consensus \
    --aligned-fasta aligned_input.fasta \
    --priors virus_priors.json \
    --output integrated_consensus.fasta

🔬 Workflow Overview

PriorCons uses a window-based approach to statistically validate and fill gaps in viral assemblies.

  1. Slide overlapping windows across the genome.
  2. Detect windows with missing regions (Ns) in the trusted consensus.
  3. Evaluate the corresponding candidate window using the priors.
  4. Accept the candidate window only if its score is evolutionarily plausible (below the statistical threshold).
  5. Produce an integrated consensus with increased completeness and maintained accuracy.
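The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PriorCons implementation: the function name, the window size and step, the single global threshold, and the choice to copy only masked positions (rather than the whole window) are all hypothetical. The scoring function is passed in as a callable.

```python
from typing import Callable


def integrate_by_windows(
    trusted: str,
    candidate: str,
    score_window: Callable[[int, str], float],
    threshold: float,
    window_size: int = 100,  # assumed value, for illustration only
    step: int = 50,          # assumed value; step < window_size gives overlap
) -> str:
    """Fill N positions of `trusted` from `candidate` in windows that pass the threshold."""
    if len(trusted) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    result = list(trusted.upper())
    cand = candidate.upper()
    # Step 1: slide overlapping windows across the genome.
    for start in range(0, len(result), step):
        end = min(start + window_size, len(result))
        # Step 2: only windows that still contain masked positions are of interest.
        if "N" not in result[start:end]:
            continue
        # Steps 3 and 4: score the candidate window and reject atypical ones.
        if score_window(start, cand[start:end]) > threshold:
            continue
        # Step 5: copy candidate bases into the masked positions only.
        for i in range(start, end):
            if result[i] == "N" and cand[i] in "ACGT":
                result[i] = cand[i]
    return "".join(result)
```

Because trusted bases are never overwritten in this sketch, a rejected or accepted window can only change positions that were masked to begin with.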

🧮 Methodology

1. Probability distributions per position

For each window of size $W$ bases, and each position $j$:

$$P_j(b)=\frac{c_j(b)+\alpha}{\sum_{x\in\{A,C,G,T\}}(c_j(x)+\alpha)}$$

Where:

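  • $c_j(b)$ is the number of sequences in the priors alignment with base $b$ at position $j$.
  • $\alpha$ is a pseudocount that keeps unobserved bases from receiving zero probability.
  • N bases are ignored when counting.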
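As a small worked illustration of the formula, the sketch below computes the smoothed distribution for a single alignment column. The function name and the default `alpha=1.0` are assumptions; the value PriorCons uses is not stated here.

```python
from collections import Counter

BASES = "ACGT"


def position_probabilities(column: str, alpha: float = 1.0) -> dict[str, float]:
    """Smoothed base frequencies for one alignment column.

    Characters other than A, C, G and T (for example N) are not counted.
    """
    counts = Counter(base for base in column.upper() if base in BASES)
    total = sum(counts[base] + alpha for base in BASES)
    return {base: (counts[base] + alpha) / total for base in BASES}
```

For the column `"AAAGN"` with `alpha=1.0`, the counts are A=3 and G=1 (the N is skipped), the denominator is 4 + 4 = 8, and the result is P(A)=0.5, P(C)=0.125, P(G)=0.25, P(T)=0.125.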

2. Log‑likelihood of a sequence

Given a sequence $Q$:

$$\log L(Q \mid \text{window}) = \sum_j \log P_j(q_j)$$

Normalized negative log‑likelihood:

$$\text{nLL}(Q) = -\frac{1}{N_{\text{valid}}} \sum_j \log P_j(q_j)$$

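Here $N_{\text{valid}}$ is the number of window positions at which $Q$ has a called base; N positions are skipped. Lower values indicate sequences consistent with expected variation.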
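The normalized score can be computed directly from a per-position profile, as in the following sketch. The function name and the profile layout (one dictionary of base probabilities per position) are assumptions, as is the choice to skip every character that is not A, C, G or T rather than only N.

```python
import math


def normalized_nll(query: str, profile: list[dict[str, float]]) -> float:
    """Mean negative log-probability of `query` under a per-position profile.

    Positions where the query has no A, C, G or T are skipped, so the
    average runs over the N_valid called positions only.
    """
    if len(query) != len(profile):
        raise ValueError("query and profile must have the same length")
    log_probs = [
        math.log(probs[base])
        for base, probs in zip(query.upper(), profile)
        if base in probs
    ]
    if not log_probs:
        return float("nan")  # nothing to score in this window
    return -sum(log_probs) / len(log_probs)
```

Dividing by the number of valid positions keeps scores comparable between windows with different amounts of masking, which a raw log-likelihood sum would not.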

3. Empirical thresholds

All sequences in the priors alignment are scored to obtain an empirical nLL distribution. Its 95th percentile is used as the cutoff: windows whose nLL exceeds this threshold are considered atypical and rejected during integration.
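A minimal sketch of the thresholding step follows. The function names are hypothetical, and the percentile uses linear interpolation between order statistics, which is an assumption; PriorCons may compute it differently.

```python
def percentile(values: list[float], q: float) -> float:
    """q-th percentile (0 to 100) with linear interpolation between sorted values."""
    if not values:
        raise ValueError("at least one score is required")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def is_plausible(score: float, threshold: float) -> bool:
    """A window is kept unless its score exceeds the threshold."""
    return score <= threshold
```

With this convention, roughly 5% of the sequences used to build the priors would themselves be rejected in any given window, which is the false-rejection rate implied by a 95th-percentile cutoff.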


📊 Outputs

  • Integrated consensus FASTA: The final integrated sequence.
  • Window‑level QC trace: A file containing scores for each window.
  • Summary QC metrics: Overall figures on coverage and on the changes made to the trusted consensus.

📚 Citing

This software was developed by Germán Vallejo Palma at the Instituto de Salud Carlos III (ISCIII) — National Centre of Microbiology, Respiratory Viruses and Influenza Unit.

If you use this software in a publication, report, or product, please cite the appropriate authors and include the above attribution.
