Skip to main content

a quality control tool for bacterial ONT-only assemblies

Project description

Alpaqa: Assembly-Level Profiling And Quality Assessment

Alpaqa analyzes base-level quality scores in bacterial genome assemblies generated from Oxford Nanopore sequencing data.

The tool masks noisy regions and quantifies the number of Low-Quality Bases (LQBs) per megabase, a metric we found to correlate with overall base-level assembly accuracy. In addition, Alpaqa performs binomial tests to identify 4-, 5-, and 6-mer motifs that are significantly associated with LQBs. These motifs often represent systematic, error-prone sequence contexts and may correspond to targets of DNA modification systems.

Alpaqa operates on polished assemblies generated with Dorado polish or Medaka2. By detecting systematic error signatures, Alpaqa allows users to flag assemblies that may appear complete but contain motif-specific inaccuracies. This is particularly valuable in applications requiring high base-level accuracy, such as outbreak investigation and transmission analysis.

An automated bacterial genome assembly pipeline for ONT data which includes Medaka2 polishing, quality assessment with alpaqa, and masking of low-quality bases is available at boap.

Installation

Via pip

Alpaqa can be installed directly from PyPI:

pip install alpaqa-bio

Via uv (recommended)

If you use uv, you can run it without installation:

uvx --from alpaqa-bio alpaqa --help

Or install it:

uv tool install alpaqa-bio

Generating input files for alpaqa

Alpaqa relies on fastq assembly files with phred quality scores generated with dorado polish or medaka2 (-q flag). The tool has been tested with data generated with SUP@v5.0, SUP@v5.2, HAC@v5.0, and HAC@v5.2 basecalling models. SUP data is strongly recommended.

Using medaka2

READS=reads.fastq.gz
ASSEMBLY=draft_assembly.fasta
THREADS=16

# medaka2
medaka_consensus -i $READS -d $ASSEMBLY -o polished_assembly -t $THREADS --bacteria -q

Using dorado polish

READS=reads.fastq.gz
ASSEMBLY=draft_assembly.fasta
THREADS=16

dorado aligner $ASSEMBLY $READS | samtools sort -@ $THREADS -o aligned.bam
samtools index aligned.bam
dorado polish aligned.bam $ASSEMBLY -t $THREADS -o polished_assembly --bacteria -q

Alpaqa usage

Once installed, run alpaqa by passing your FASTQ assembly files:

alpaqa -i *.fastq --threads 16

Options

usage: alpaqa -i assembly.fastq -o output.tsv --threads 16 [options]

ALPAQA v0.1.1

options:
  -h, --help            show this help message and exit
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        Input fastq assembly file(s).
  -o OUTPUT, --output OUTPUT
                        Output TSV filename. (default: alpaqa_report.tsv)
  --report              Generate detailed reports on kmer analysis.
  -t THREADS, --threads THREADS
                        Number of threads. (default: 1)
  -v, --version         show program's version number and exit

Advanced:
  --all-contigs         Analyze ALL contigs > min_contig_len (Default is longest contig only).
  --no-mask             Disable masking of LQB dense regions.
  --lqb-threshold LQB_THRESHOLD
                        Phred Q-score threshold (inclusive). (default: 5)
  --min-contig-len MIN_CONTIG_LEN
                        Minimum contig length to include. (default: 50000)
  --window-size WINDOW_SIZE
                        Window size (bp) for scanning LQB dense regions. (default: 5000)
  --density-multiplier DENSITY_MULTIPLIER
                        Multiplier for baseline density to trigger masking. (default: 5.0)
  --mask-floor MASK_FLOOR
                        Absolute minimum LQB density (0.001 = 0.1%). (default: 0.001)

Output

Alpaqa generates a tab-separated (TSV) report. Each row represents one input assembly file with the following fields:

Field Description
Filename Name of the input FASTQ file.
AvgQ Average Phred quality score across all bases in the file.
LQB_raw/Mbp Low-Quality Bases per Megabase (Raw): Total count of bases with a Phred score between 1 and 5 in the entire file, normalized per million bases.
LQB/Mbp Low-Quality Bases per Megabase (Filtered): The density of low-quality bases in the longst contig after excluding (masking) unreliable LQB-dense regions.
MaskThresh Dynamic Masking Threshold: The density threshold that triggered masking. Calculated as 5x the baseline density, but is at least 0.1% (0.001) to prevent over-masking assemblies.
Masked_Bases Total number of bases removed from the analysis.
Contigs_Analyzed Shown as X//Y, where X is the number of contigs analyzed (default: longest contig only) and Y is the total number of contigs in the file.
Bases_Analyzed Total number of bases analyzed.
Sig4m / 5m / 6m Significant Motifs: The top 3 DNA patterns (4, 5, and 6-mers) most significantly linked to quality drops (Q1-5) based on binomial testing. Includes the motif and the percentage of its occurrences that were low quality.

Interpretation

For assemblies generated with SUP@v5.2 data, following thresholds were established:

  • <5 LQBs/Mbp: Highly accurate. These assemblies typically yield identical or near-identical cgMLST profiles when compared to Illumina-polished references.
  • 5 to 10 LQBs/Mbp: Potentially reliable but should be interpreted with caution. Masking low-quality bases with fastq2a.py (see below) is recommended for downstream cgMLST analyses.
  • >10 LQBs/Mbp: Usually unsuitable for high-resolution genotyping due to excessive base-level errors.

Beyond systematic sequencing errors, elevated LQB counts may indicate sample contamination, insufficient sequencing depth, or low read quality, which can be further investigated using tools such as nanoq and CheckM.

LQB_vs_errors

Masking low-quality bases for downstream genotyping

The helper script fastq2a.py can be used to convert FASTQ assembly files into FASTA format while masking low-quality bases. Replacing bases below a user-defined Q-score threshold with "N" helps maintain robustness in downstream bacterial cgMLST analyses. Tools such as Ridom SeqSphere exclude alleles containing these ambiguous bases to prevent false distance calculations. However, the final count of remaining target loci should be monitored to ensure sufficient resolution for high-resolution typing. For assemblies with ~5 to 10 LQBs/Mpb, masking bases with qscores ≤10 provides a good balance between accuracy and genomic resolution.

Note: if you installed via pip/uv, you can run this helper script using:

python -m alpaqa.fastq2a -i assembly.fastq -o assembly.masked.fasta -q 10

License

GNU General Public License, version 3

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

alpaqa_bio-0.1.1.tar.gz (21.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

alpaqa_bio-0.1.1-py3-none-any.whl (21.4 kB view details)

Uploaded Python 3

File details

Details for the file alpaqa_bio-0.1.1.tar.gz.

File metadata

  • Download URL: alpaqa_bio-0.1.1.tar.gz
  • Upload date:
  • Size: 21.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for alpaqa_bio-0.1.1.tar.gz
Algorithm Hash digest
SHA256 63d6bcd46cc724d47cf8cbd680c4189e60783a377c3dff990334d52162cbb6b8
MD5 5050934777c10dd9b498e0e002977fe9
BLAKE2b-256 c961f1c1f0677c446e07f01382121e3b271c52890d8941ae706d5dc2ff7e8abb

See more details on using hashes here.

File details

Details for the file alpaqa_bio-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: alpaqa_bio-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 21.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for alpaqa_bio-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3749225bfd6c8c8e92a7e31314371ab3733d2292011dc71676ce6eb8805e6708
MD5 83c52c4bd60e3a0a4febb3231ddcae60
BLAKE2b-256 578677c6a26c95b35f33628612df456365892a08b39a669ca9becc73a29ecbb2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page