Tools for consensus sequence assembly and frameshift correction (mixedAssembly)

Project description

Viral Mixed Assembly Pipeline

This repository provides tools to:

Run a mixed assembly pipeline (run_mixed_assembly.py) that merges IRMA and ABACAS information, applies quality control at the sliding-window level, and restores IRMA-specific insertions relative to the reference.
Build empirical priors (build_priors.py) from large multiple-sequence alignments, used later to evaluate windows during mixed assembly.
Remove frameshifts from an alignment with the reference (remove_frameshifts.py)
Provide supporting functions via utils scripts.

Installation

pip install mixedassembly

CLI usage

# Create priors
mixedassembly build-priors --input sequences.fasta --ref REF_ID --output priors.parquet

# Run mixed assembly
mixedassembly run-mixed-assembly --input alignment.aln --ref REF_ID --prior priors.parquet --output_dir results

# Correct frameshifts
mixedassembly remove-frameshifts --aln consensus.aln --out corrected.fasta

🚀 Main Script: run_mixed_assembly.py

This is the entrypoint of the pipeline. It creates a mixed consensus sequence by combining IRMA and ABACAS outputs, both aligned to a reference sequence, but only after performing quality control (QC) at the window level.

🔑 Inputs

--input → path to an alignment file (.aln) containing at least:
- the reference sequence
- the IRMA consensus sequence
- the ABACAS consensus sequence
--ref → ID of the reference sequence in the alignment.
--prior → path to a priors table (.parquet) generated with build_priors.py.
--output_dir → directory to save the results.

🧪 Workflow

Filter alignment by reference coordinates → all sequences are trimmed to the reference coordinate system.
Create overlapping windows → sliding windows across the genome (size and overlap defined in the priors file).
Window QC →
- Check if ABACAS window is more informative than IRMA, and if it passes some basic qc assumptions (<50 Ns,only 1 or 2 "fragments")
- For each valid window,compute the normalized negative log-likelihood (nLL) of the ABACAS sequence against the prior distribution. Thus, evaluating how rare is the substitution pattern in the window compared to the prior information.
- Compare to the nLL_p95 threshold in the priors table.
- If ABACAS passes QC → keep ABACAS bases for that window.
- Otherwise → keep IRMA bases.
Consensus construction → merge windows into a full consensus sequence.
Insertion restoration → insertions that IRMA had relative to the reference (lost when filtering by reference coordinates) are reinserted into the final consensus.
QC reporting → compute coverage, substitutions, and insertion metrics comparing the final mixed consensus to IRMA.

📦 Outputs

The script produces three files inside --output_dir:

Mixed consensus FASTA
- File: <basename>-MIX_ASSEMBLY.fasta
- Contains the final consensus sequence after merging and reinserting insertions.
Window QC trace (CSV)
- File: windows_trace.csv
- One row per window, recording:
  - start, end → genomic coordinates.
  - MISSING_IRMA, MISSING_ABACAS → counts of missing bases.
  - ABACAS_MORE_INFO → whether ABACAS has fewer missing bases than IRMA.
  - ABACAS_FRAGMENTS → fragmentation level of ABACAS in this window (keep: 0 < n fragments < 3 ).
  - WINDOW_PRIOR_nLL_p95 → threshold from priors.
  - WINDOW_SCORE_nLL → score of ABACAS in this window.
  - WINDOW_QC_PASSED → True/False decision.
Consensus QC summary (JSON)
- File: qc.json
- Provides overall metrics comparing the IRMA consensus and the mixed consensus:
  - IRMA_COVERAGE → % of genome covered in IRMA.
  - MIXED_COVERAGE → % of genome covered in mixed consensus.
  - IRMA_SUBSTITUTIONS → substitutions vs. reference in IRMA.
  - MIXED_SUBSTITUTIONS → substitutions vs. reference in mixed consensus.
  - EXPECTED_SUBSTITUTIONS → expected number of substitutions, extrapolated from IRMA.
  - OBS-EXP_SUBSTITUTIONS → difference between observed and expected substitutions.
  - N_INSERTIONS → number of insertions added back.
  - TOTAL_INSERTIONS_LENGTH → total inserted length.
  - INSERTIONS → list of insertions with their coordinates.

▶️ Example run

python run_mixed_assembly.py
--input /path/to/250694-RSVWGS.aln
--ref RSV_BD
--prior /path/to/RSVBD_win100_ovlp50_priors.parquet
--output_dir results

This will generate:

results/250694-RSVWGS-MIX_ASSEMBLY.fasta
results/windows_trace.csv
results/qc.json

🛠 Script: build_priors.py

This script creates empirical priors (overlapped windows) from a large multiple sequence alignment.
These priors are later used by run_mixed_assembly.py to evaluate windows.

🔑 Inputs

-i / --input → aligned FASTA file with multiple sequences.
-r / --ref → ID of the reference sequence.
-o / --output → output file (.parquet).
--win → window size (default: 100).
--overlap → overlap size (default: 10).

▶️ Example run

python build_priors.py
-i alignment.fasta
-r ReferenceID
-o priors.parquet
--win 100
--overlap 10

📦 Output

A .parquet file with one row per window, containing:

start, end → window coordinates.
nLL_p95, nLL_p99 → empirical thresholds.
profile → base probability distributions for each position in the window.

🧮 Methodology (build_priors.py)

1. Probability distributions per position

For each window of size W bases (e.g., W = 100), and for each position j within that window, we compute the probability of observing each nucleotide:

P_j(b)

Where:

= number of sequences with base at position .
$\alpha$ = pseudocount (Laplace smoothing, default $\alpha=1$ ) to avoid zero probabilities.
Bases N are ignored in the counts.

This gives a per-position categorical distribution.

2. Log-likelihood of a sequence in a window

Given a query sequence , we compute its probability under the window profile.
For each valid (non-N) position with observed base q_j :

logL

The normalized negative log-likelihood (nLL) is:

nLL

Where:

= number of positions in the window where has a non-N base.

Smaller nLL values indicate sequences more likely under the empirical profile.

3. Empirical priors

To characterize "normal variation" for each window:

Score all sequences from the alignment against the window profile.
Collect the distribution of nLL values.
Extract percentiles (e.g., 95th and 99th) to serve as thresholds.

Thus, for each window we store:

The distribution (profile).
Empirical thresholds: nLL_p95 and nLL_p99.

A new sequence can later be compared:

If nLL < nLL_p95 → typical.
If nLL > nLL_p99 → unusually variable, possibly unreliable region.

� Supporting utils

Several utility scripts provide reusable functions for both processes:

utils.py → basic alignment and scoring functions:
- load_alignment, extract_ref_positions, sliding_windows, score_window.
utils_mixed_assembly.py → additional helpers for mixed assembly:
- missingness and fragmentation counts,
- insertion handling,
- QC calculations,
- consensus merging,
- window evaluation wrapper.

These modular functions keep the pipeline clean and reusable.

🧹 Script: remove_frameshifts.py

This script is used to detect and mask problematic gaps in a consensus sequence aligned against a reference.
Frameshifts appear when deletions are not multiples of 3 or when gaps are too long, which can shift the reading frame.

The algorithm replaces those regions with Ns (and optionally adds padding on the sides), preventing errors in downstream analyses.

🔑 Inputs

--aln → MSA FASTA file (must include reference and consensus).
--ref-index → index of the reference in the alignment (default: 0).
--cons-index → index of the consensus in the alignment (default: 1).
--min-gap → minimum gap length (multiple of 3) that triggers aggressive masking (default: 15).
--pad → number of additional bases to mask around the gap (default: 0).
--out → output FASTA file with the corrected consensus.

📦 Output

FASTA file with the masked consensus → long gaps or gaps not multiple of 3 are replaced by Ns.

▶️ Example run

Assume we have an alignment example.aln.fasta with two sequences:

index 0: reference
index 1: preliminary consensus

python remove_frameshifts.py \
  --aln example.aln.fasta \
  --ref-index 0 \
  --cons-index 1 \
  --min-gap 12 \
  --pad 0 \
  --out consensus_fixed.fasta

🔍 Example input (simplified):

Reference
ATGGCTTACG CTGGA CTG

Preliminary consensus
ATGGCTTACG-TGGA------G

Here we see:

A deletion of 1 bases → not a multiple of 3 → potential frameshift.
A deletion of 6 bases → a multiple of 3 and < min gap → leave gap.

📤 Output (`consensus_fixed.fasta`):

consensus_masked ATGGCTTACGNTGGA---G

Thus, frameshifts are corrected by masking with Ns, preserving the reading frame validity.

Project details

Release history Release notifications | RSS feed

0.2.1

Oct 21, 2025

0.2.0

Oct 17, 2025

This version

0.1.4

Oct 3, 2025

0.1.3

Oct 3, 2025

0.1.2

Oct 3, 2025

0.1.1

Sep 30, 2025

0.1.0

Sep 30, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mixedassembly-0.1.4.tar.gz (20.7 kB view details)

Uploaded Oct 3, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

mixedassembly-0.1.4-py3-none-any.whl (19.8 kB view details)

Uploaded Oct 3, 2025 Python 3

File details

Details for the file mixedassembly-0.1.4.tar.gz.

File metadata

Download URL: mixedassembly-0.1.4.tar.gz
Upload date: Oct 3, 2025
Size: 20.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for mixedassembly-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`0727eca57449aebec7ff53a2bf11d96af3675bc9c5fccdab9b444c427e195573`
MD5	`202509c1410217ea41483082891e2de1`
BLAKE2b-256	`324b59c43433ba8f5f910f57b2bca1674fed94f0a0b3a20039c3b20656629b29`

See more details on using hashes here.

File details

Details for the file mixedassembly-0.1.4-py3-none-any.whl.

File metadata

Download URL: mixedassembly-0.1.4-py3-none-any.whl
Upload date: Oct 3, 2025
Size: 19.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for mixedassembly-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`596819986ff01fa17ca9aaceee8248f79fc198c9a07b471e6b768577906becb3`
MD5	`bbdc292e900792fedea8f3ba3ac7a774`
BLAKE2b-256	`3dbdec2505af8305595814758ad9ad1034982afcb7c98f6f1d982891ebee3e19`

See more details on using hashes here.

mixedassembly 0.1.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

Viral Mixed Assembly Pipeline

Installation

CLI usage

🚀 Main Script: run_mixed_assembly.py

🔑 Inputs

🧪 Workflow

📦 Outputs

▶️ Example run

python run_mixed_assembly.py --input /path/to/250694-RSVWGS.aln --ref RSV_BD --prior /path/to/RSVBD_win100_ovlp50_priors.parquet --output_dir results

🛠 Script: build_priors.py

🔑 Inputs

▶️ Example run

python build_priors.py -i alignment.fasta -r ReferenceID -o priors.parquet --win 100 --overlap 10

📦 Output

🧮 Methodology (build_priors.py)

1. Probability distributions per position

2. Log-likelihood of a sequence in a window

3. Empirical priors

� Supporting utils

🧹 Script: remove_frameshifts.py

🔑 Inputs

📦 Output

▶️ Example run

🔍 Example input (simplified):

📤 Output (consensus_fixed.fasta):

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

python run_mixed_assembly.py
--input /path/to/250694-RSVWGS.aln
--ref RSV_BD
--prior /path/to/RSVBD_win100_ovlp50_priors.parquet
--output_dir results

python build_priors.py
-i alignment.fasta
-r ReferenceID
-o priors.parquet
--win 100
--overlap 10

📤 Output (`consensus_fixed.fasta`):