Run assembler (Canu, Flye, Hifiasm) on a set of long read files

LORA — Long Read Assembly pipeline

Overview: Assemble long reads (PacBio HiFi, PacBio subreads, Nanopore) into high-quality genome assemblies with optional polishing, annotation, and quality assessment.

Input: BAM files from PacBio sequencers, or FastQ files from Nanopore or PacBio HiFi sequencers.

Output: HTML reports with per-sample assembly statistics, coverage, BLAST identification, BUSCO scores, and optional annotation.

Status: Production

Citation: Cokelaer et al. (2017), ‘Sequana’: a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, doi:10.21105/joss.00352

Zenodo DOI: 10.5281/zenodo.19330782

bioRxiv: https://www.biorxiv.org/content/10.64898/2026.01.06.697901v1

Pipeline DAG

Installation

pip install sequana-lora

To upgrade an existing installation:

pip install sequana-lora --upgrade

Quick Start

Step 1 — prepare the working directory:

sequana_lora \
    --input-directory /path/to/reads \
    --data-type pacbio-hifi \
    --assembler flye \
    --genome-size 3m \
    --apptainer-prefix /path/to/containers

This creates a lora/ working directory containing config.yaml and a lora.sh launch script.

Step 2 — review the configuration (optional but recommended):

cd lora
cat config.yaml   # adjust parameters as needed

Step 3 — run the pipeline:

sh lora.sh

Or launch directly from step 1 with --execute (skips the review step):

sequana_lora ... --execute

To watch live progress in the terminal, add --monitor:

sequana_lora ... --execute --monitor

Required options

Three options are always required:

--assembler

Assembler to use. Choices: flye (recommended for HiFi), canu, hifiasm, unicycler, necat, pecat.

--data-type

Technology and quality of the input reads:

Value        Description
pacbio-hifi  PacBio HiFi / CCS reads (Q20+)
pacbio-raw   PacBio CLR / subreads (raw)
pacbio-corr  PacBio corrected reads
nano-hq      Nanopore Q20+ reads (e.g. R10.4 with SUP basecalling)
nano-raw     Nanopore standard reads
nano-corr    Nanopore corrected reads

--genome-size

Estimated genome size, e.g. 3m (3 Mb), 2.5g (2.5 Gb). Required by Flye; used by Canu for coverage reporting.
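
As an illustration of the size notation, the sketch below converts a string such as 3m or 2.5g into a number of bases. It is not LORA's own parser: the function name parse_genome_size is hypothetical, and accepting a k suffix or a plain integer is an assumption that goes beyond the two examples given above.

```python
def parse_genome_size(text: str) -> int:
    """Convert a size such as '3m' or '2.5g' into a number of bases.

    Hypothetical helper: the 'k' suffix and the bare-number form are
    assumptions, only 'm' and 'g' appear in the documentation above.
    """
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return round(float(text))  # plain number of bases


print(parse_genome_size("3m"))    # 3000000
print(parse_genome_size("2.5g"))  # 2500000000
```

Dividing the total number of sequenced bases by this value gives a rough expected depth, which is a quick sanity check before starting an assembly.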

Common Examples

Nanopore (bacteria, full quality pipeline)

Use --mode bacteria to enable sequana_coverage, prokka, busco, and checkm in one shot:

sequana_lora \
    --input-directory /data/nanopore \
    --data-type nano-hq \
    --assembler flye \
    --genome-size 3m \
    --apptainer-prefix /shared/containers \
    --mode bacteria \
    --busco-lineage bacteria \
    --checkm-rank genus \
    --checkm-name Streptococcus

PacBio subreads (with CCS construction)

If your input is raw PacBio BAM files (subreads), LORA can build CCS/HiFi reads first:

sequana_lora \
    --input-directory /data/subreads \
    --data-type pacbio-raw \
    --assembler flye \
    --genome-size 3m \
    --pacbio-build-ccs \
    --pacbio-ccs-min-passes 10 \
    --pacbio-ccs-min-rq 0.99

Or, if you have multiple BAM files per sample, provide a CSV:

sequana_lora \
    --pacbio-input-csv samples.csv \
    --data-type pacbio-raw \
    --assembler flye \
    --genome-size 3m

The CSV format is: one row per sample with columns sample,file1[,file2,...].
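
A sample sheet in that layout can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the helper name write_sample_sheet and the example file names are hypothetical, and no header row is written because the description above does not say whether one is expected (check this against your LORA version).

```python
import csv


def write_sample_sheet(samples: dict[str, list[str]], path: str) -> None:
    """Write one row per sample: sample name followed by its BAM files.

    Assumption: no header row; adapt if the pipeline expects one.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for name, bam_files in samples.items():
            if not bam_files:
                raise ValueError(f"sample {name!r} has no BAM file")
            writer.writerow([name, *bam_files])
```

For example, write_sample_sheet({"strainA": ["a_run1.bam", "a_run2.bam"], "strainB": ["b_run1.bam"]}, "samples.csv") produces two rows, the first with three columns and the second with two.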

Optional Steps

Coverage analysis

Compute the depth and breadth of coverage for each contig with sequana_coverage. This step is highly recommended as a check on assembly quality:

--do-coverage
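
To make the two metrics concrete, the sketch below computes them from a list of per-base depths for one contig. It only illustrates the definitions; it is not how sequana_coverage works internally, and the function name coverage_summary and the min_depth threshold are assumptions made for this example.

```python
def coverage_summary(depths: list[int], min_depth: int = 1) -> tuple[float, float]:
    """Return (mean depth, breadth) for one contig.

    depth   = average number of reads covering a position
    breadth = fraction of positions covered by at least min_depth reads
    """
    if not depths:
        raise ValueError("empty contig")
    mean_depth = sum(depths) / len(depths)
    breadth = sum(d >= min_depth for d in depths) / len(depths)
    return mean_depth, breadth


# Toy contig of 8 positions, two of them uncovered
mean_depth, breadth = coverage_summary([0, 0, 10, 12, 11, 9, 10, 8])
print(f"mean depth {mean_depth:.1f}x, breadth {breadth:.0%}")  # mean depth 7.5x, breadth 75%
```

A contig with high mean depth but low breadth is worth inspecting, since the reads pile up on only part of it.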

BUSCO completeness

Assess genome completeness against a lineage-specific marker gene set:

--busco-lineage bacteria          # auto-download bacteria lineage
--busco-lineage streptococcales   # specific clade
--busco-print-lineages            # list all available lineages

Prokka annotation

Annotate contigs (bacterial genomes):

--do-prokka

CheckM genome quality

Estimate completeness and contamination for bacterial genomes:

--checkm-rank genus --checkm-name Streptococcus

Use an invalid --checkm-name value to get a list of valid names for a given rank, e.g. --checkm-rank genus --checkm-name HELP.

Polypolish (Illumina polishing)

Polish long-read contigs with paired-end Illumina data:

--do-polypolish \
--polypolish-input-directory /data/illumina \
--polypolish-input-pattern "*.fastq.gz" \
--polypolish-input-readtag "_R[12]_"
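
The read tag is a regular expression that marks the R1/R2 part of a file name. As a rough illustration of how such a tag can be used to pair files, the sketch below groups file names by the text that precedes the tag. This is an assumption about the general idea, not a description of how LORA or Sequana pairs files; the helper pair_by_readtag and the file names are hypothetical.

```python
import re
from collections import defaultdict


def pair_by_readtag(filenames: list[str], readtag: str = r"_R[12]_") -> dict[str, list[str]]:
    """Group file names by the prefix found before the read tag."""
    pattern = re.compile(readtag)
    groups = defaultdict(list)
    for name in sorted(filenames):  # sorting puts R1 before R2
        match = pattern.search(name)
        if match is None:
            continue  # file does not carry the tag, ignore it
        groups[name[:match.start()]].append(name)
    return dict(groups)


files = [
    "strainA_R2_001.fastq.gz",
    "strainA_R1_001.fastq.gz",
    "strainB_R1_001.fastq.gz",
    "strainB_R2_001.fastq.gz",
    "README.txt",
]
for sample, pair in pair_by_readtag(files).items():
    print(sample, pair)
```

Running a check like this on the Illumina directory before launching the pipeline helps spot samples with a missing mate.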

Circularisation

Explicit circularisation with Circlator (Flye performs this automatically):

--do-circlator

BLAST identification

BLAST aligns each contig against a nucleotide database to identify the assembled sequences. The top hits appear in the HTML report.

Local BLAST

Requires a local BLAST+ installation and a downloaded nt database (~270 GB). This is the fastest option and has no network dependency:

--blastdb /path/to/blast/databases

Remote BLAST (NCBI)

No local database required — jobs are submitted to NCBI’s BLAST servers. Enable by providing an email address:

--blast-email your@email.com

Jobs are submitted sequentially (one contig at a time) to avoid IP-level CPU throttling by NCBI. The default database is nt; use --blast-remote-db to change it:

--blast-email your@email.com --blast-remote-db refseq_genomic

Restricting the search to an organism group (strongly recommended)

Searching all of nt for a large contig is slow and prone to NCBI CPU throttling. Restrict the search by editing config.yaml after running sequana_lora. The entrez_query parameter is equivalent to filling the “Organism” box on the NCBI BLAST web form.

Option 1 — curated bacterial reference genomes (fastest, recommended for bacteria)

Use refseq_genomic as the database and restrict to bacteria. RefSeq contains only complete, curated reference genomes — a much smaller and higher-quality search space than nt:

blast:
    remote_db: 'refseq_genomic'
    entrez_query: 'Bacteria[Organism]'

Option 2 — all bacteria in nt, reference sequences only

Stay on nt but filter to RefSeq-quality entries:

blast:
    remote_db: 'nt'
    entrez_query: 'Bacteria[Organism] AND refseq[filter]'

Option 3 — single genus

Useful when the organism is known:

blast:
    remote_db: 'nt'
    entrez_query: 'Streptococcus[Organism]'
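
To show what the three options amount to, the sketch below maps remote_db and entrez_query onto the DATABASE and ENTREZ_QUERY parameters of an NCBI BLAST URL API submission. It only builds the encoded request body and sends nothing. LORA itself relies on bioservices for this, so the function build_put_body is a hypothetical illustration, and the assumption that the config values map one-to-one onto these parameters is mine.

```python
from urllib.parse import parse_qs, urlencode


def build_put_body(sequence: str, remote_db: str = "nt", entrez_query: str = "") -> str:
    """Build the form-encoded body of a BLAST 'Put' (submit) request."""
    params = {
        "CMD": "Put",
        "PROGRAM": "blastn",
        "DATABASE": remote_db,
        "QUERY": sequence,
    }
    if entrez_query:
        # Same effect as the "Organism" box on the web form
        params["ENTREZ_QUERY"] = entrez_query
    return urlencode(params)


options = [
    ("refseq_genomic", "Bacteria[Organism]"),           # Option 1
    ("nt", "Bacteria[Organism] AND refseq[filter]"),    # Option 2
    ("nt", "Streptococcus[Organism]"),                  # Option 3
]
for remote_db, query in options:
    body = build_put_body("ACGTACGT", remote_db, query)
    print(parse_qs(body)["DATABASE"], parse_qs(body)["ENTREZ_QUERY"])
```

The square brackets and spaces in the query are percent-encoded by urlencode, which is why the value can be written in plain form in config.yaml.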

NCBI API key (optional but recommended)

Register for a free NCBI API key at https://www.ncbi.nlm.nih.gov/account/ (sign in → API Key Management). It raises the rate limit from 3 to 10 requests/second and reduces CPU throttling for large queries. Add it to config.yaml:

blast:
    api_key: 'YOUR_KEY_HERE'
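
The rate limits above translate into a minimum delay between consecutive requests. The sketch below is a generic client-side throttle built on those two numbers; it is not taken from LORA, and the names min_interval and Throttle are hypothetical.

```python
import time


def min_interval(has_api_key: bool) -> float:
    """Seconds to leave between requests: 10 per second with a key, 3 without."""
    requests_per_second = 10 if has_api_key else 3
    return 1.0 / requests_per_second


class Throttle:
    """Sleep just long enough to keep requests at least `interval` apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock      # injectable for testing
        self.sleep = sleep
        self.last = None        # time of the previous request

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Calling throttle.wait() before each request keeps a script within the limit, whether or not a key is configured.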

HPC / SLURM cluster

On a cluster with SLURM, pass --profile slurm:

sequana_lora \
    --input-directory /data/hifi \
    --data-type pacbio-hifi \
    --assembler flye \
    --genome-size 3m \
    --profile slurm \
    --slurm-queue fast \
    --jobs 40 \
    --apptainer-prefix /shared/containers

Per-rule memory and thread settings are controlled via the resources blocks in config.yaml.

Apptainer / Singularity (no system installs needed)

Every tool runs inside a pre-built container. Point --apptainer-prefix to a shared directory so images are downloaded once and reused across projects:

--apptainer-prefix /shared/containers

Images are downloaded automatically on first run from Zenodo. Pass extra bind mounts with --apptainer-args if your data lives outside $HOME:

--apptainer-args "-B /data:/data"

Configuration file

After running sequana_lora, a config.yaml is created in the working directory. All pipeline parameters can be tuned there. Key sections:

  • assembler — which assembler to use

  • flye / canu / hifiasm — assembler-specific options

  • fastp — read filtering (minimum length, etc.)

  • blast — BLAST settings including entrez_query and api_key

  • busco / prokka / checkm — optional QC tools

  • sequana_coverage — coverage analysis parameters

  • multiqc — aggregated report settings

Full reference: config.yaml

Pipeline overview

  1. Read filtering — fastp removes reads below the minimum length threshold.

  2. [Optional] CCS — build HiFi reads from PacBio subreads (ccs tool).

  3. Assembly — Flye / Canu / Hifiasm / Unicycler / NECAT / PECAT.

  4. [Optional] Circularisation — Circlator (or built into Flye).

  5. [Optional] Polishing — Polypolish with paired-end Illumina reads.

  6. Contig sorting — SeqKit sorts contigs by length (largest first).

  7. Read mapping — Minimap2 maps reads back to contigs; Mosdepth computes coverage.

  8. [Optional] Coverage analysis — sequana_coverage per contig.

  9. Quality assessment — QUAST assembly statistics.

  10. [Optional] BLAST — top hits per contig (local or remote NCBI).

  11. [Optional] BUSCO — genome completeness.

  12. [Optional] Prokka — genome annotation.

  13. [Optional] CheckM — contamination and completeness for bacteria.

  14. Reports — per-sample HTML report and a multi-sample summary.
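
Two of the steps above are simple enough to sketch in a few lines: sorting contigs by length (step 6) and one of the usual assembly statistics, N50 (step 9). The pipeline delegates these to SeqKit and QUAST; the code below is only a toy illustration with hypothetical function names, not a replacement for either tool.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {contig name: sequence}."""
    chunks: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def sort_by_length(contigs: dict[str, str]) -> list[tuple[str, str]]:
    """Largest contig first."""
    return sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)


def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


fasta = ">c1\nACGT\n>c2\nACGTACGTAC\n>c3\nACGTAC\n"
ordered = sort_by_length(read_fasta(fasta))
print([name for name, _ in ordered])           # ['c2', 'c3', 'c1']
print(n50([len(seq) for _, seq in ordered]))   # 10
```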

Changelog

Version

Description

1.1.0

  • remote BLAST via NCBI URL API (no local database needed); sequential submission to avoid IP-level CPU throttling

  • entrez_query support to restrict BLAST to a taxonomic group (e.g. Bacteria[Organism], refseq_genomic) — equivalent to the “Organism” box on the NCBI BLAST web form

  • NCBI API key support for higher rate limits (blast.api_key)

  • bioservices >= 1.16.0 required for NCBIBlastAPI

  • retry logic when NCBI returns READY with empty result set

  • improved HTML reports: Sequana logo in header, back-to-summary button, FASTA download link per sample, GC content in coverage table, informative amber warning box when BLAST returns no hits

  • update busco container to busco_6.0.0

  • fix sequana_coverage log redirection (was showing 2 s in monitor)

  • updated README with BLAST, entrez_query and refseq_genomic docs

1.0.0

  • uniformised extension with other pipelines. fix regression on schema file

  • update sequana container to v0.16.5

  • add unicycler apptainer

  • add checkm module to help users choose the correct marker and name

  • replace --pacbio and --nanopore with --data-type; pacbio is now split into three sub-categories: pacbio-raw, pacbio-hifi and pacbio-corr

  • add bandage if assembly graph is available

  • fixed hifiasm container to use newest version

  • improved report html

  • make genome-size compulsory

  • add fastp as preprocessing tool

  • remove presets in favor of click options

  • CCS defaults to hifi. pacbio presets in config set to pacbio-hifi

  • blast removed from default; users must set blast DB themselves

  • busco lineage downloaded from the web

  • CANU preset changes: pacbio → pacbio-hifi

  • CANU-correction preset changes: pacbio → pacbio-hifi

  • FLYE preset changes: pacbio-raw → pacbio-hifi

  • remote BLAST via NCBI URL API (no local database needed)

  • entrez_query support to restrict BLAST to a taxonomic group

  • NCBI API key support for higher rate limits

0.3.0

  • Use click instead of argparse

  • added multiqc / checkm / unicycler

0.2.0

  • add apptainers in most rules

  • remove utils.smk to move rulegraph inside main pipeline

  • rename lora.smk into lora.rules for consistency with other pipelines

  • add checkm in the pipeline and HTML report

0.1.0

First release.
