A Python package for curating virus genome alignments and phylogenies and flagging QC issues

raccoon

Rigorous Alignment Curation: Cleanup Of Outliers and Noise

Raccoon is a lightweight toolkit for alignment and phylogenetic QC workflows. It identifies problematic sites (e.g., clustered SNPs, SNPs near Ns/gaps, and frame‑breaking indels) and produces mask files and summaries for downstream analyses.


Use cases

  • Flag clustered SNPs that may indicate contamination, recombination, or misalignment.
  • Detect SNPs adjacent to low-coverage regions (Ns) or gaps.
  • Identify frame-breaking indels in coding regions using a GenBank reference.
  • Generate mask files to exclude suspect sites prior to phylogenetic or evolutionary analyses.

Installation

From source:

pip install .

For development (editable install):

pip install -e .

Quickstart

CLI usage

Show help:

raccoon --help

Sequence QC (seq-qc)

Basic usage:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta

With metadata-driven headers:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
  -m metadata.csv other_metadata.csv \
  --metadata-id-field sample \
  --metadata-location-field location \
  --metadata-date-field date \
  --header-separator '|'

With a custom header template:

raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
  -m metadata.csv --header-fields "{id}|{country}|{date}"
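
As a rough illustration of how a template such as `{id}|{country}|{date}` could be filled from a metadata row, here is a minimal sketch using Python's `str.format_map`. The helper name `build_header` and the choice to leave an ID unchanged when no metadata row matches are assumptions made for this example, not raccoon's actual implementation.

```python
import csv
import io


def build_header(seq_id, metadata, template="{id}|{country}|{date}"):
    """Fill a header template from a metadata row (hypothetical helper)."""
    row = metadata.get(seq_id)
    if row is None:
        # Assumption: keep the original ID when no metadata row matches.
        return seq_id
    fields = dict(row)
    fields["id"] = seq_id
    return template.format_map(fields)


# Toy metadata table keyed on the "sample" column.
metadata_csv = "sample,country,date\ns1,Kenya,2024-01-05\n"
metadata = {row["sample"]: row for row in csv.DictReader(io.StringIO(metadata_csv))}

print(build_header("s1", metadata))  # s1|Kenya|2024-01-05
print(build_header("s2", metadata))  # s2
```

A template field that is missing from the metadata row raises `KeyError` in this sketch, which makes misspelled column names easy to spot.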

Key options:

  • -m, --metadata: metadata CSV file(s) for header harmonisation
  • --metadata-delimiter: metadata delimiter (default: ,; .tsv files are auto-detected)
  • --metadata-id-field: metadata ID column (default: sample)
  • --metadata-location-field: metadata location column (default: location)
  • --metadata-date-field: metadata date column (default: date)
  • --header-fields: template for custom headers (e.g. {id}|{country}|{date})
  • --header-separator: separator used for non-template harmonised headers (default: |)
  • --seq-id-delimiter: delimiter for parsing IDs from input headers (default: |)
  • --seq-id-field-index: 0-based field index for parsed sequence ID (default: 0)
  • --min-length: minimum sequence length to keep
  • --max-n-content: maximum N-content proportion to keep
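
The two filtering options can be pictured with a small sketch. It assumes that length is counted over the full sequence string and that both thresholds are inclusive; raccoon may count or compare differently, and the function names here are hypothetical.

```python
def n_content(seq):
    """Proportion of N characters in a sequence (0.0 for empty input)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def passes_filters(seq, min_length=0, max_n_content=1.0):
    """Keep a sequence if it is long enough and not too N-rich."""
    return len(seq) >= min_length and n_content(seq) <= max_n_content


records = {"a": "ACGTNNACGT", "b": "ACG", "c": "ACGTACGTAC"}
kept = [
    name
    for name, seq in records.items()
    if passes_filters(seq, min_length=5, max_n_content=0.1)
]
print(kept)  # ['c']
```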

Alignment QC (aln-qc)

Basic usage:

raccoon aln-qc <alignment.fasta> -d outdir

With GenBank reference for frame-break checks:

raccoon aln-qc <alignment.fasta> -d outdir \
  --genbank <reference.gb> --reference-id <ref_id>
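
To make the idea of a frame-breaking indel concrete, the sketch below reports gap runs inside a coding region whose length is not a multiple of three. This is only one plausible reading of the check: the function name, the 0-based end-exclusive coordinates, and the restriction to deletions (gaps in the query sequence) are assumptions, and raccoon's own logic, which works from the GenBank annotation, may differ.

```python
import re


def frame_breaking_gaps(seq, cds_start, cds_end):
    """Return (start, length) for gap runs in a CDS that are not a multiple of 3.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    cds = seq[cds_start:cds_end]
    breaks = []
    for match in re.finditer(r"-+", cds):
        length = match.end() - match.start()
        if length % 3 != 0:
            breaks.append((cds_start + match.start(), length))
    return breaks


seq = "ATGAAA--CCCGGG---TTTTAA"
print(frame_breaking_gaps(seq, 0, len(seq)))  # [(6, 2)]
```

The three-base gap at position 14 is not reported because it removes a whole number of codons and leaves the frame intact.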

Disable selected flag classes:

raccoon aln-qc <alignment.fasta> -d outdir \
  --no-flag-n-adjacent --no-flag-gap-adjacent

Key options:

  • --max-n-content: N-content threshold for flagging
  • --cluster-window: window size (bp) for clustered SNP detection
  • --cluster-count: minimum SNPs in-window to mark as clustered
  • --no-flag-clustered: skip clustered SNP flagging
  • --no-flag-n-adjacent: skip N-adjacent SNP flagging
  • --no-flag-gap-adjacent: skip gap-adjacent SNP flagging
  • --no-flag-frame-break: skip frame-breaking indel flagging
  • --flag-removal-threshold: mark sequence for removal above this flagged-site count
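
The interaction between `--cluster-window` and `--cluster-count` can be sketched as a sliding window over sorted SNP positions. The function below is an illustrative assumption about how such a check might work, with made-up default values; it is not taken from raccoon's source.

```python
def clustered_snps(snp_positions, window=10, min_count=3):
    """Return positions in any group of >= min_count SNPs spanning < window bp."""
    positions = sorted(snp_positions)
    flagged = set()
    start = 0
    for end in range(len(positions)):
        # Advance the left edge until the span is shorter than `window` bp.
        while positions[end] - positions[start] >= window:
            start += 1
        if end - start + 1 >= min_count:
            flagged.update(positions[start:end + 1])
    return sorted(flagged)


print(clustered_snps([5, 7, 9, 40, 100, 104, 300], window=10, min_count=3))
# [5, 7, 9]
```

Positions 100 and 104 sit close together but are not flagged, because two SNPs fall short of the `min_count` of three.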

Apply mask (mask)

raccoon mask <alignment.fasta> \
  --mask-file results/alignment_qc/mask_sites.csv \
  -d results/alignment_qc

Key options:

  • --mask-file: mask CSV file from aln-qc
  • --mask-character: character to use for masking (default: ?)
  • -o, --outfile: output masked alignment file name
  • -d, --outdir: output directory
  • -t, --sequence-type: nt or aa (default: nt)
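
Conceptually, masking replaces each listed site with the mask character. The sketch below assumes the mask has already been reduced to a list of 1-based alignment positions applied to every sequence; the real `mask_sites.csv` layout is not described here, so the parsing step is deliberately left out and `mask_sequence` is a hypothetical helper.

```python
def mask_sequence(seq, sites, mask_char="?"):
    """Replace the given 1-based positions in seq with mask_char."""
    chars = list(seq)
    for pos in sites:
        # Ignore positions that fall outside the sequence.
        if 1 <= pos <= len(chars):
            chars[pos - 1] = mask_char
    return "".join(chars)


alignment = {"s1": "ACGTACGT", "s2": "ACGAACGA"}
masked = {name: mask_sequence(seq, [4, 8]) for name, seq in alignment.items()}
print(masked)  # {'s1': 'ACG?ACG?', 's2': 'ACG?ACG?'}
```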

Phylogenetic QC (tree-qc)

Basic usage:

raccoon tree-qc --tree <treefile> -d outdir \
  --alignment <alignment.fasta> --asr-state <treefile>.state \
  --run-adar --adar-window 300 --adar-min-count 3

Key options:

  • -t, --tree: input phylogeny file (required)
  • --tree-format: auto, newick, or nexus
  • --assembly-refs: assembly/reference FASTA used for mapping
  • --outgroup-ids: comma-separated outgroup sequence IDs
  • --mask-file: optional mask CSV with sites to ignore
  • --tip-fields: template for parsing tip-label fields
  • --tip-field-delimiter: delimiter used for tip field parsing
  • --tip-date-field: field name treated as date in tip parsing
  • --midpoint-root: midpoint-root tree for report visualisation (ignored with --asr-state)
  • --long-branch-sd: SD threshold for long-branch flagging
  • --run-apobec: run APOBEC3 checks
  • --run-adar: run ADAR checks

See full CLI details in [docs/cli.md](docs/cli.md).

## Mask notes

Mask output uses the following note values:

| Note | Meaning |
| --- | --- |
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites that break the reading frame of a CDS. |
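
The `N_adjacent` and `gap_adjacent` notes describe the same kind of test with a different character. A minimal sketch, assuming 0-based SNP positions and a symmetric window (both are choices made for this example, as is the function name), might look like this:

```python
def adjacent_snps(seq, snp_positions, char="N", window=2):
    """Return 0-based SNP positions with `char` within `window` bases either side."""
    flagged = []
    for pos in snp_positions:
        lo = max(0, pos - window)
        hi = min(len(seq), pos + window + 1)
        # Look at the flanking bases only, not the SNP site itself.
        neighbourhood = seq[lo:pos] + seq[pos + 1:hi]
        if char in neighbourhood.upper():
            flagged.append(pos)
    return flagged


seq = "ACGTNNNACGTACGT-ACGT"
print(adjacent_snps(seq, [3, 10, 14], char="N"))  # [3]
print(adjacent_snps(seq, [3, 10, 14], char="-"))  # [14]
```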

## Example data

The [examples](examples) folder includes a constructed alignment and GenBank reference suitable for quick testing:

- [examples/constructed_alignment.fasta](examples/constructed_alignment.fasta)
- [examples/constructed_reference.gb](examples/constructed_reference.gb)
