A Python package for curating virus genome alignments and phylogenies and flagging QC issues

raccoon

Rigorous Alignment Curation: Cleanup Of Outliers and Noise

Raccoon is a lightweight toolkit for alignment and phylogenetic QC workflows. It identifies problematic sites (e.g., clustered SNPs, SNPs near Ns/gaps, and frame-breaking indels) and produces mask files and summaries for downstream analyses.


Use cases

  • Flag clustered SNPs that may indicate contamination, recombination, or misalignment.
  • Detect SNPs adjacent to low-coverage regions (Ns) or gaps.
  • Identify frame-breaking indels in coding regions using a GenBank reference.
  • Generate mask files to exclude suspect sites prior to phylogenetic or evolutionary analyses.

Installation

From source:

pip install .

For development (editable install):

pip install -e .

Quickstart

raccoon aln-qc examples/constructed_alignment.fasta -d outdir \
  --genbank examples/constructed_reference.gb --reference-id ref

Outputs:

  • mask_sites.csv
  • alignment_qc_summary.txt
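One way the mask might be consumed downstream is to replace flagged alignment columns with N before building a tree. The sketch below is illustrative only: `apply_mask` is a hypothetical helper, not part of raccoon, and it assumes flagged sites are 1-based alignment columns. The column layout of mask_sites.csv is not described here, so inspect the file before adapting this.

```python
def apply_mask(sequence, sites, mask_char="N"):
    """Return a copy of `sequence` with the given 1-based columns masked.

    Hypothetical helper for illustration; not part of raccoon.
    Sites outside the sequence are ignored.
    """
    chars = list(sequence)
    for site in sites:
        if 1 <= site <= len(chars):
            chars[site - 1] = mask_char
    return "".join(chars)


masked = apply_mask("ACGTACGT", [2, 5])
print(masked)  # ANGTNCGT
```

Masking with N rather than deleting columns keeps alignment coordinates stable, which makes it easier to compare masked and unmasked analyses.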

CLI usage

Show help:

raccoon --help

Alignment QC:

raccoon aln-qc <alignment.fasta> -d outdir

With a GenBank reference for frame-break detection:

raccoon aln-qc <alignment.fasta> -d outdir \
  --genbank <reference.gb> --reference-id <ref_id>

Masking toggles (every mask type is enabled by default; pass the --no- form of a flag to disable it):

raccoon aln-qc <alignment.fasta> -d outdir \
  --no-mask-n-adjacent --no-mask-gap-adjacent
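To illustrate what these two toggles control, here is a minimal sketch of how SNPs next to Ns or gaps could be identified in a single aligned sequence. It is not raccoon's implementation: `snps_adjacent_to` and its `window` parameter are hypothetical, and the distance raccoon treats as "adjacent" is an assumption here.

```python
def snps_adjacent_to(sequence, snp_sites, char="N", window=2):
    """Return 1-based SNP sites that have `char` within `window` columns on either side.

    Hypothetical helper for illustration; not part of raccoon.
    """
    seq = sequence.upper()
    flagged = []
    for site in snp_sites:
        lo = max(0, site - 1 - window)   # leftmost 0-based index to inspect
        hi = min(len(seq), site + window)  # exclusive right bound
        if char in seq[lo:hi]:
            flagged.append(site)
    return flagged


aligned = "ACGTNNNACGTACG-ACGT"
snps = [3, 9, 12, 16]
print(snps_adjacent_to(aligned, snps, char="N"))  # [3, 9]
print(snps_adjacent_to(aligned, snps, char="-"))  # [16]
```

SNPs bordering low-coverage or gapped regions are more likely to be alignment or basecalling artefacts, which is why they are candidates for masking.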

Key alignment options:

  • --n-threshold: maximum fraction of Ns allowed per sequence before the sequence is flagged.
  • --cluster-window: window size (bp) for clustered SNP detection.
  • --cluster-count: minimum SNPs within a window to flag as clustered.
  • --mask-clustered/--no-mask-clustered: include/exclude clustered SNPs in the mask.
  • --mask-n-adjacent/--no-mask-n-adjacent: include/exclude SNPs adjacent to Ns in the mask.
  • --mask-gap-adjacent/--no-mask-gap-adjacent: include/exclude SNPs adjacent to gaps in the mask.
  • --mask-frame-break/--no-mask-frame-break: include/exclude frame-breaking indels in the mask.
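To make --cluster-window and --cluster-count concrete, the sketch below shows one way clustered SNPs could be detected for a single sequence against a reference. It rests on assumptions and is not raccoon's implementation: `snp_positions` and `clustered_snps` are hypothetical names, and the exact windowing rule raccoon applies may differ. It assumes `window` is at least 1.

```python
def snp_positions(reference, query):
    """1-based positions where query differs from reference at unambiguous bases."""
    bases = set("ACGT")
    return [
        i + 1
        for i, (r, q) in enumerate(zip(reference.upper(), query.upper()))
        if r in bases and q in bases and r != q
    ]


def clustered_snps(positions, window, min_count):
    """Return SNP positions in any group of >= min_count SNPs spanning <= window bp."""
    positions = sorted(positions)
    flagged = set()
    start = 0
    for end in range(len(positions)):
        # Advance the left edge until the span fits inside the window.
        while positions[end] - positions[start] + 1 > window:
            start += 1
        if end - start + 1 >= min_count:
            flagged.update(positions[start:end + 1])
    return sorted(flagged)


ref = "ACGTACGTACGTACGTACGT"
qry = "ACATGTGTACGTACGTATGT"
snps = snp_positions(ref, qry)
print(snps)                        # [3, 5, 6, 18]
print(clustered_snps(snps, 5, 3))  # [3, 5, 6]
```

The isolated SNP at position 18 is left alone, while the three SNPs packed into positions 3 to 6 are flagged together.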

Sequence QC:

raccoon seq-qc a.fasta b.fasta -o combined.fasta

With metadata-driven headers:

raccoon seq-qc a.fasta b.fasta -o combined.fasta \
  --metadata metadata.csv other_metadata.csv --metadata-id-field id \
  --metadata-location-field location --metadata-date-field date \
  --header-separator '|'
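As an illustration of what a metadata-driven header might look like, the sketch below joins the ID, location, and date fields with the separator. The field order and the handling of missing values are assumptions made for this example, `build_header` is a hypothetical helper rather than raccoon's API, and the metadata rows are made up.

```python
import csv
import io


def build_header(row, id_field="id", location_field="location",
                 date_field="date", separator="|"):
    """Join selected metadata fields into a FASTA header (assumed field order)."""
    parts = [
        row.get(id_field, ""),
        row.get(location_field, ""),
        row.get(date_field, ""),
    ]
    return separator.join(parts)


# Inline stand-in for metadata.csv; these rows are invented for illustration.
metadata_csv = "id,location,date\nseq1,Edinburgh,2024-01-15\nseq2,Lagos,2024-02-03\n"
headers = {
    row["id"]: build_header(row)
    for row in csv.DictReader(io.StringIO(metadata_csv))
}
print(headers["seq1"])  # seq1|Edinburgh|2024-01-15
```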

Phylogenetic QC:

raccoon tree-qc --phylogeny <treefile> -d outdir \
  --alignment <alignment.fasta> --asr-state <treefile>.state \
  --run-adar --adar-window 300 --adar-min-count 3
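The --adar-window and --adar-min-count values in this command can be read as a clustering rule over the mutations reconstructed on each branch. The sketch below shows one plausible reading, treating A>G and T>C changes as ADAR-like (A-to-I editing is read as G, or as T>C when it occurs on the complementary strand). The function name, the input format, and the exact rule are assumptions, not raccoon's implementation.

```python
# A-to-I editing is read as A>G, or T>C when it occurs on the other strand.
ADAR_LIKE = {("A", "G"), ("T", "C")}


def branch_has_adar_cluster(mutations, window=300, min_count=3):
    """Return True if >= min_count ADAR-like sites lie within `window` bp of each other.

    `mutations` is an iterable of (position, parent_base, child_base) for one branch.
    Hypothetical helper for illustration; not part of raccoon.
    """
    sites = sorted(
        pos for pos, parent, child in mutations
        if (parent.upper(), child.upper()) in ADAR_LIKE
    )
    start = 0
    for end in range(len(sites)):
        # Advance the left edge until the span is at most `window` bp.
        while sites[end] - sites[start] > window:
            start += 1
        if end - start + 1 >= min_count:
            return True
    return False


branch = [(100, "A", "G"), (220, "T", "C"), (390, "A", "G"), (5000, "C", "T")]
print(branch_has_adar_cluster(branch))  # True
```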

Key phylo options:

  • --phylogeny: tree file (Newick or Nexus)
  • --alignment: alignment used for ASR state parsing
  • --asr-state: ASR state file (defaults to <treefile>.state if present)
  • --tree-format: auto/newick/nexus
  • --run-adar: enable ADAR-like edit flagging
  • --run-apobec: enable APOBEC3-like edit flagging
  • --adar-window: max distance (bp) for ADAR clustering (default: 300)
  • --adar-min-count: min ADAR sites in window to flag a branch (default: 3)
  • --long-branch-sd: std dev threshold for long-branch flagging (default: 3.0)
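The --long-branch-sd option suggests that branches are flagged when their length is unusually large relative to the rest of the tree. A minimal sketch of that idea follows, assuming the rule is "length greater than the mean plus k standard deviations". The exact rule is not spelled out here, and `flag_long_branches` is a hypothetical name.

```python
from statistics import mean, pstdev


def flag_long_branches(branch_lengths, sd_threshold=3.0):
    """Return names of branches longer than mean + sd_threshold * SD.

    `branch_lengths` maps a branch (or tip) name to its length.
    Hypothetical helper for illustration; not part of raccoon.
    """
    lengths = list(branch_lengths.values())
    if len(lengths) < 2:
        return []
    cutoff = mean(lengths) + sd_threshold * pstdev(lengths)
    return sorted(name for name, length in branch_lengths.items() if length > cutoff)


lengths = {f"tip{i}": 0.001 for i in range(19)}
lengths["outlier"] = 0.05
print(flag_long_branches(lengths))  # ['outlier']
```

With a population standard deviation, a single outlier among n branches can reach a z-score of at most sqrt(n - 1), so a threshold of 3.0 can only ever flag anything in trees with more than ten branches under this rule.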

See full CLI details in [docs/cli.md](docs/cli.md).

## Mask notes

Mask output uses the following note values:

| Note | Meaning |
| --- | --- |
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites in a CDS whose indel length is not a multiple of three, breaking the reading frame. |
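The frame_break note can be illustrated with a short sketch that scans one aligned sequence for gap runs inside a CDS whose length is not a multiple of three. This is a simplified assumption about the check, not raccoon's implementation: `frame_breaking_gaps` is a hypothetical helper, CDS coordinates are taken as 1-based alignment columns, and insertions relative to the reference are ignored.

```python
import re


def frame_breaking_gaps(aligned_seq, cds_start, cds_end):
    """Return (start, end) 1-based gap runs inside a CDS that are not a multiple of 3 long.

    Hypothetical helper for illustration; not part of raccoon.
    """
    region = aligned_seq[cds_start - 1:cds_end]
    breaks = []
    for match in re.finditer(r"-+", region):
        if len(match.group()) % 3 != 0:
            # Convert region-relative offsets back to alignment columns.
            breaks.append((cds_start + match.start(), cds_start + match.end() - 1))
    return breaks


seq = "ATGAC--GTTTAA---GG"
print(frame_breaking_gaps(seq, 1, 18))  # [(6, 7)]
```

The two-base gap at columns 6 to 7 shifts the reading frame and is reported, while the three-base gap at columns 14 to 16 removes a whole codon and is not.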

## Example data

The [examples](examples) folder includes a constructed alignment and GenBank reference suitable for quick testing:

- [examples/constructed_alignment.fasta](examples/constructed_alignment.fasta)
- [examples/constructed_reference.gb](examples/constructed_reference.gb)
