Tool for the integration of viral consensus sequences obtained by de novo and mapping strategies, supported by prior information.
PriorCons
Prior‑guided consensus integration for viral genomes
🧭 Introduction
PriorCons improves viral consensus sequences by safely recovering missing information while preserving reliability.
The software integrates:
- A high‑confidence consensus sequence (FASTA) generated using a stringent pipeline. This sequence is trusted but may contain masked regions (Ns).
- The reference genome used during assembly.
- A candidate consensus sequence that is less conservative but potentially more informative (for example, produced with relaxed filtering or alternative assembly).
The objective is to fill gaps in the high‑confidence consensus using information from the candidate sequence — but only when supported by evolutionary evidence — so that coverage increases without introducing sequencing artefacts.
To achieve this, PriorCons uses evolutionary priors derived from large collections of genomes for the same virus or subtype aligned to the reference. These priors model expected variation and provide statistical thresholds that guide integration decisions.
📦 Installation
PriorCons can be installed via Conda (recommended for bioinformatics) or PyPI:
Using Conda
conda install -c bioconda priorcons
Using Pip
pip install priorcons
⚡ Quickstart + CLI Examples
Follow these steps to generate an integrated consensus using PriorCons.
1. Prepare the Priors Database
You need a collection of viral sequences (e.g., from GISAID or NCBI) relevant to your sample.
- Alignment is critical: Use MAFFT in reference-anchored mode (e.g. --add --keeplength) to keep coordinates consistent when building priors.
- Include the Reference: Ensure your reference sequence is included in this FASTA file.
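Before building priors, it can help to confirm that the aligned FASTA really is coordinate-consistent. The sketch below is not part of PriorCons: `read_fasta` and `check_alignment` are hypothetical helpers, and the check assumes that every record in a reference-anchored alignment has the same length and that the reference is identified by the first token of its header.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into {sequence_id: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def check_alignment(records: Dict[str, str], reference_id: str) -> int:
    """Return the shared alignment length, or raise ValueError."""
    if reference_id not in records:
        raise ValueError(f"reference {reference_id!r} not found in alignment")
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return lengths.pop()
```

To use it on a file, pass an open file handle, for example `check_alignment(read_fasta(open("database_aligned.fasta")), "my_reference_id")`, where the reference ID is whatever your FASTA header uses.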
2. Build the Priors
Run the build-priors command to create the empirical distribution of variation.
priorcons build-priors --input database_aligned.fasta --output virus_priors.json
3. Run integrate-consensus
Once you have the priors, align your three sequences (Trusted, Candidate, and Reference) and run the integration.
Alignment Recommendation: Since you are only aligning 3 sequences, use a high-sensitivity strategy. We recommend MAFFT with the following parameters:
mafft --localpair --maxiterate 1000 input.fasta > aligned_input.fasta
Running the integration:
priorcons integrate-consensus \
--aligned-fasta aligned_input.fasta \
--priors virus_priors.json \
--output integrated_consensus.fasta
🔬 Workflow Overview
PriorCons uses a window-based approach to statistically validate and fill gaps in viral assemblies.
- Slide overlapping windows across the genome.
- Detect windows with missing regions (Ns) in the trusted consensus.
- Evaluate the corresponding candidate window using the priors.
- Accept the candidate window only if its score is evolutionarily plausible (at or below the statistical threshold).
- Produce an integrated consensus with increased completeness and maintained accuracy.
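The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PriorCons implementation: the window size and step, the `is_plausible` callback (standing in for the prior-based test), and the choice to fill only N positions while never overwriting trusted bases are all hypothetical.

```python
from typing import Callable, Iterator, Tuple


def windows(length: int, size: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for overlapping windows covering [0, length)."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    start = 0
    while start < length:
        yield start, min(start + size, length)
        if start + size >= length:
            break  # the last window already reaches the end
        start += step


def integrate(trusted: str, candidate: str,
              is_plausible: Callable[[str, int], bool],
              size: int = 100, step: int = 50) -> str:
    """Fill Ns in `trusted` from `candidate` in windows that pass the test."""
    if len(trusted) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    result = list(trusted)
    for start, end in windows(len(trusted), size, step):
        if "N" not in trusted[start:end]:
            continue  # nothing to recover in this window
        if not is_plausible(candidate[start:end], start):
            continue  # candidate window is atypical: keep the Ns
        for i in range(start, end):
            # Assumption: only masked positions are filled.
            if trusted[i] == "N" and candidate[i] != "N":
                result[i] = candidate[i]
    return "".join(result)
```

In a real run, `is_plausible` would compare the window's score against the threshold stored in the priors; here it is left as a callback so the control flow stays visible.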
🧮 Methodology
1. Probability distributions per position
For each window of size $W$ bases, and each position $j$:
$$P_j(b)=\frac{c_j(b)+\alpha}{\sum_{x\in\{A,C,G,T\}}(c_j(x)+\alpha)}$$
Where:
- $c_j(b)$ is the count of base $b$ at position $j$ across the aligned prior sequences.
- $\alpha$ is a pseudocount.
- Bases N are ignored.
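As a minimal sketch of the formula for $P_j(b)$, the function below computes the smoothed distribution for one alignment column. The function name and the default $\alpha = 1$ are assumptions made for illustration; the sketch also ignores every character other than A, C, G and T (including gaps), which goes slightly beyond the "N is ignored" rule stated above.

```python
from collections import Counter
from typing import Dict

BASES = "ACGT"


def position_probabilities(column: str, alpha: float = 1.0) -> Dict[str, float]:
    """Return P_j(b) for one alignment column, with pseudocount alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive to avoid zero probabilities")
    # Count only A/C/G/T; N and any other symbol are skipped.
    counts = Counter(b for b in column.upper() if b in BASES)
    total = sum(counts[b] + alpha for b in BASES)
    return {b: (counts[b] + alpha) / total for b in BASES}
```

For the column `AAACN` with $\alpha = 1$, the counts are A = 3 and C = 1, the denominator is $4 + 4 = 8$, and the result is A = 0.5, C = 0.25, G = T = 0.125. The pseudocount keeps unseen bases at a small non-zero probability so that the logarithm in the next step is always defined.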
2. Log‑likelihood of a sequence
Given a sequence $Q$:
$$\log L(Q \mid \text{window}) = \sum_j \log P_j(q_j)$$
Normalized negative log‑likelihood:
$$\text{nLL}(Q) = -\frac{1}{N_{\text{valid}}} \sum_j \log P_j(q_j)$$
Here $N_{\text{valid}}$ is the number of valid (non-N) positions of $Q$ in the window. Lower values indicate sequences consistent with expected variation.
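The normalized score can be sketched as follows. The function name and input layout (one probability dictionary per window position) are assumptions for illustration, as is the choice to skip any query character that has no entry in the dictionary, so that N and gaps do not contribute to the sum or to $N_{\text{valid}}$.

```python
import math
from typing import Dict, Sequence


def normalized_nll(query: str, probs: Sequence[Dict[str, float]]) -> float:
    """Return nLL(Q): mean negative log-probability over valid positions."""
    if len(query) != len(probs):
        raise ValueError("query and probability table must have equal length")
    total = 0.0
    n_valid = 0
    for base, p in zip(query.upper(), probs):
        if base not in p:
            continue  # N, gaps and other symbols are not scored
        total += math.log(p[base])
        n_valid += 1
    if n_valid == 0:
        raise ValueError("no valid positions to score")
    return -total / n_valid
```

Dividing by the number of scored positions makes windows with different amounts of masking comparable, which is what allows a single cutoff to be applied to them.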
3. Empirical thresholds
All sequences are scored to obtain an nLL distribution. The 95th percentile is used as a cutoff: windows exceeding this threshold are considered atypical and rejected during integration.
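A minimal sketch of the cutoff, assuming a linearly interpolated percentile; the text does not say which percentile definition PriorCons uses, so the interpolation rule and both function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty sample."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def is_atypical(score: float, threshold: float) -> bool:
    """A window is rejected when its nLL exceeds the threshold."""
    return score > threshold
```

With this rule, about 5% of the sequences used to build the priors would themselves be flagged as atypical in any given window, which is the intended trade-off of a 95th percentile cutoff.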
📊 Outputs
- Integrated consensus FASTA: The final integrated sequence.
- Window‑level QC trace: A file containing scores for each window.
- Summary QC metrics: Summary metrics regarding coverage and changes performed.
📚 Citing
This software was developed by Germán Vallejo Palma at the Instituto de Salud Carlos III (ISCIII) — National Centre of Microbiology, Respiratory Viruses and Influenza Unit.
If you use this software in a publication, report, or product, please cite the appropriate authors and include the above attribution.