
A Python package for curating virus genome alignments and phylogenies and flagging QC issues


raccoon


Rigorous Alignment Curation: Cleanup Of Outliers and Noise

Raccoon is a lightweight toolkit for alignment and phylogenetic QC workflows. It identifies problematic sites (e.g., clustered SNPs, SNPs near Ns/gaps, and frame‑breaking indels) and produces mask files and summaries for downstream analyses.



Use cases

  • Flag clustered SNPs that may indicate contamination, recombination, or misalignment.
  • Detect SNPs adjacent to low-coverage regions (Ns) or gaps.
  • Identify frame-breaking indels in coding regions using a GenBank reference.
  • Generate mask files to exclude suspect sites prior to phylogenetic or evolutionary analyses.
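
To make the "SNPs adjacent to Ns or gaps" idea concrete, here is a minimal sketch of one way such a check could work. It is not raccoon's implementation: the function name `adjacent_snps`, the `window` parameter, and the rule that a SNP counts as adjacent when an N (or gap) lies within `window` columns on either side are all assumptions made for illustration.

```python
def adjacent_snps(ref: str, seq: str, window: int, chars: str = "N") -> list[int]:
    """Return 0-based columns where seq has a SNP near one of `chars`.

    Hypothetical helper: a SNP is a column where both ref and seq hold a
    real base and the bases differ. It is flagged if any character from
    `chars` occurs in seq within `window` columns on either side.
    """
    ref_u, seq_u = ref.upper(), seq.upper()
    flagged = []
    for i, (r, s) in enumerate(zip(ref_u, seq_u)):
        # Skip matches and columns that are themselves N or gap.
        if r == s or r in "N-" or s in "N-":
            continue
        lo = max(0, i - window)
        hi = min(len(seq_u), i + window + 1)
        if any(c in chars for c in seq_u[lo:hi]):
            flagged.append(i)
    return flagged


ref = "ACGTACGTACGT"
seq = "ACGAANNTACTT"  # SNPs at columns 3 and 10, Ns at columns 5 and 6
print(adjacent_snps(ref, seq, window=2))             # N-adjacent SNPs
print(adjacent_snps(ref, seq, window=2, chars="-"))  # gap-adjacent SNPs
```

Passing `chars="-"` reuses the same logic for gap-adjacent SNPs, which is why the two checks are often described together.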

Installation

From source:

pip install .

For development (editable install):

pip install -e .

Quickstart

CLI usage

Show help:

raccoon --help

Sequence QC (seq-qc)

Basic usage:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta

With metadata-driven headers:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
  -m metadata.csv other_metadata.csv \
  --metadata-id-field sample \
  --metadata-location-field location \
  --metadata-date-field date \
  --header-separator '|'

With a custom header template:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
  -m metadata.csv --header-fields "{id}|{country}|{date}"

Key options:

  • -m, --metadata: metadata CSV file(s) for header harmonisation
  • --metadata-delimiter: metadata delimiter (default: ,; tab is auto-detected for .tsv files)
  • --metadata-id-field: metadata ID column (default: sample)
  • --metadata-location-field: metadata location column (default: location)
  • --metadata-date-field: metadata date column (default: date)
  • --header-fields: template for custom headers (e.g. {id}|{country}|{date})
  • --header-separator: separator used for non-template harmonised headers (default: |)
  • --seq-id-delimiter: delimiter for parsing IDs from input headers (default: |)
  • --seq-id-field-index: 0-based field index for parsed sequence ID (default: 0)
  • --min-length: minimum sequence length to keep
  • --max-n-content: maximum N-content proportion to keep
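
The options above combine three simple ideas: parsing an ID out of an input header, filling a header template from a metadata row, and filtering on length and N content. The sketch below illustrates them in plain Python. The helper names (`parse_seq_id`, `build_header`, `keep_sequence`) are hypothetical, and details such as whether thresholds are inclusive are assumptions, not a description of raccoon's internals.

```python
def parse_seq_id(header: str, delimiter: str = "|", field_index: int = 0) -> str:
    """Pick one field of a FASTA header as the sequence ID (0-based index)."""
    return header.split(delimiter)[field_index]


def build_header(template: str, row: dict) -> str:
    """Fill a template such as "{id}|{country}|{date}" from one metadata row."""
    return template.format(**row)


def n_content(seq: str) -> float:
    """Proportion of N characters in a sequence (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def keep_sequence(seq: str, min_length: int, max_n_content: float) -> bool:
    """Assumed rule: keep if long enough and N proportion is at most the threshold."""
    return len(seq) >= min_length and n_content(seq) <= max_n_content


# Illustrative metadata row; the column names are made up for this example.
row = {"id": parse_seq_id("sampleA|raw|run7"), "country": "UK", "date": "2024-01-15"}
print(build_header("{id}|{country}|{date}", row))  # sampleA|UK|2024-01-15
print(keep_sequence("ACGTNNNNAC", min_length=8, max_n_content=0.5))  # True
```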

Alignment QC (aln-qc)

Basic usage:

raccoon aln-qc <alignment.fasta> -d outdir

With GenBank reference for frame-break checks:

raccoon aln-qc <alignment.fasta> -d outdir \
  --genbank <reference.gb> --reference-id <ref_id>

Disable selected flag classes:

raccoon aln-qc <alignment.fasta> -d outdir \
  --no-flag-n-adjacent --no-flag-gap-adjacent

Key options:

  • --max-n-content: N-content threshold for flagging
  • --cluster-window: window size (bp) for clustered SNP detection
  • --cluster-count: minimum SNPs in-window to mark as clustered
  • --no-flag-clustered: skip clustered SNP flagging
  • --no-flag-n-adjacent: skip N-adjacent SNP flagging
  • --no-flag-gap-adjacent: skip gap-adjacent SNP flagging
  • --no-flag-frame-break: skip frame-breaking indel flagging
  • --flag-removal-threshold: mark a sequence for removal when its flagged-site count exceeds this value
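
The interaction between `--cluster-window` and `--cluster-count` is easiest to see with a sliding window over SNP positions. The following is a sketch under stated assumptions, not raccoon's code: SNPs are taken to be columns where a sequence differs from a reference row (ignoring N and gap), and a group is "clustered" when at least `count` SNPs span fewer than `window` columns. Both function names are hypothetical.

```python
def snp_positions(ref: str, seq: str) -> list[int]:
    """0-based alignment columns where seq differs from ref, ignoring N and gaps."""
    ignore = {"N", "-"}
    return [
        i
        for i, (r, s) in enumerate(zip(ref.upper(), seq.upper()))
        if r != s and r not in ignore and s not in ignore
    ]


def clustered(positions: list[int], window: int, count: int) -> set[int]:
    """Positions that belong to any group of >= count SNPs spanning < window columns."""
    positions = sorted(positions)
    flagged: set[int] = set()
    start = 0
    for end, pos in enumerate(positions):
        # Advance the left edge until the group spans fewer than `window` columns.
        while pos - positions[start] >= window:
            start += 1
        if end - start + 1 >= count:
            flagged.update(positions[start:end + 1])
    return flagged


ref = "A" * 20
seq = "AACACCAAAAAAAAACAAAA"  # SNPs at columns 2, 4, 5 and 15
print(sorted(clustered(snp_positions(ref, seq), window=5, count=3)))  # [2, 4, 5]
```

The isolated SNP at column 15 is left alone because no window of 5 columns around it contains three SNPs.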

Apply mask (mask)

raccoon mask <alignment.fasta> \
  --mask-file results/alignment_qc/mask_sites.csv \
  -d results/alignment_qc

Key options:

  • --mask-file: mask CSV file from aln-qc
  • --mask-character: character to use for masking (default: ?)
  • -o, --outfile: output masked alignment file name
  • -d, --outdir: output directory
  • -t, --sequence-type: nt or aa (default: nt)
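
Conceptually, masking replaces each listed site in each affected sequence with the mask character. The sketch below shows that idea on in-memory data. It deliberately does not read the mask CSV, because its column layout is not described here. The `sites` mapping of sequence name to 1-based positions is an assumed structure, and `apply_mask` is a hypothetical helper rather than part of raccoon.

```python
def apply_mask(
    alignment: dict[str, str],
    sites: dict[str, list[int]],
    mask_character: str = "?",
) -> dict[str, str]:
    """Return a copy of the alignment with the listed sites masked.

    `sites` maps a sequence name to 1-based alignment positions (assumed layout).
    """
    masked = {}
    for name, seq in alignment.items():
        chars = list(seq)
        for pos in sites.get(name, []):
            chars[pos - 1] = mask_character  # convert 1-based to 0-based
        masked[name] = "".join(chars)
    return masked


alignment = {"s1": "ACGTACGT", "s2": "ACGTACGT"}
masked = apply_mask(alignment, {"s1": [2, 5]})
print(masked["s1"])  # A?GT?CGT
print(masked["s2"])  # ACGTACGT
```

Because every masked site keeps its column, the alignment length is unchanged and downstream tools see the same coordinates.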

Phylogenetic QC (tree-qc)

Basic usage:

raccoon tree-qc --tree <treefile> -d outdir \
  --alignment <alignment.fasta> --asr-state <treefile>.state \
  --run-adar --adar-window 300 --adar-min-count 3

Key options:

  • -t, --tree: input phylogeny file (required)
  • --tree-format: auto, newick, or nexus
  • --assembly-refs: assembly/reference FASTA used for mapping
  • --outgroup-ids: comma-separated outgroup sequence IDs
  • --mask-file: optional mask CSV with sites to ignore
  • --tip-fields: template for parsing tip-label fields
  • --tip-field-delimiter: delimiter used for tip field parsing
  • --tip-date-field: field name treated as date in tip parsing
  • --midpoint-root: midpoint-root the tree for report visualisation (ignored with --asr-state)
  • --long-branch-sd: SD threshold for long-branch flagging
  • --run-apobec: run APOBEC3 checks
  • --run-adar: run ADAR checks
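
An SD-based long-branch threshold such as `--long-branch-sd` is commonly interpreted as "flag branches longer than the mean plus k standard deviations". The sketch below shows that interpretation only. Whether raccoon uses the mean, the population or sample SD, or a different statistic altogether is not stated here, so treat every choice in this block (including the `long_branches` name) as an assumption.

```python
import statistics


def long_branches(branch_lengths: dict[str, float], sd_threshold: float) -> list[str]:
    """Names of branches longer than mean + sd_threshold * SD (population SD assumed)."""
    lengths = list(branch_lengths.values())
    mean = statistics.mean(lengths)
    sd = statistics.pstdev(lengths)
    cutoff = mean + sd_threshold * sd
    return sorted(name for name, length in branch_lengths.items() if length > cutoff)


# Made-up branch lengths in substitutions per site, for illustration only.
lengths = {"a": 0.010, "b": 0.012, "c": 0.011, "d": 0.009, "e": 0.200}
print(long_branches(lengths, sd_threshold=1.5))  # ['e']
```

With very few branches a single outlier inflates the SD, so a high threshold can fail to flag anything; this is a property of the statistic, not of any particular tool.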

See full CLI details in [docs/cli.md](docs/cli.md).

## Mask notes

Mask output uses the following note values:

| Note | Meaning |
| --- | --- |
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites whose length breaks the CDS reading frame. |
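
As an illustration of the `frame_break` note, the sketch below finds gap runs inside a coding region whose length is not a multiple of three. This is one plausible reading of "frame-breaking", not raccoon's actual rule. The function name and the 0-based, half-open CDS coordinates are assumptions, and a real workflow would take the CDS location from the GenBank reference rather than hard-coding it.

```python
import re


def frame_breaking_gaps(seq: str, cds_start: int, cds_end: int) -> list[tuple[int, int]]:
    """Gap runs within [cds_start, cds_end) whose length is not a multiple of 3.

    Returns (start, end) pairs as 0-based, half-open alignment coordinates.
    Only deletions in `seq` are covered; insertions would appear as gaps
    in the reference row and need the same check applied there.
    """
    region = seq[cds_start:cds_end]
    breaks = []
    for match in re.finditer(r"-+", region):
        if (match.end() - match.start()) % 3 != 0:
            breaks.append((cds_start + match.start(), cds_start + match.end()))
    return breaks


# ATG AAA -- CCC GGG --- TTT TAA: the 2-column gap shifts the frame, the 3-column gap does not.
seq = "ATGAAA--CCCGGG---TTTTAA"
print(frame_breaking_gaps(seq, 0, len(seq)))  # [(6, 8)]
```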

## Example data

The [examples](examples) folder includes a constructed alignment and GenBank reference suitable for quick testing:

- [examples/constructed_alignment.fasta](examples/constructed_alignment.fasta)
- [examples/constructed_reference.gb](examples/constructed_reference.gb)
