A Python package for curating virus genome alignments and phylogenies and flagging QC issues
raccoon
Rigorous Alignment Curation: Cleanup Of Outliers and Noise
Raccoon is a lightweight toolkit for alignment and phylogenetic QC workflows. It identifies problematic sites (e.g., clustered SNPs, SNPs near Ns/gaps, and frame‑breaking indels) and produces mask files and summaries for downstream analyses.
Use cases
- Flag clustered SNPs that may indicate contamination, recombination, or misalignment.
- Detect SNPs adjacent to low-coverage regions (Ns) or gaps.
- Identify frame-breaking indels in coding regions using a GenBank reference.
- Generate mask files to exclude suspect sites prior to phylogenetic or evolutionary analyses.
Installation
From source:
pip install .
For development (editable install):
pip install -e .
Quickstart
CLI usage
Show help:
raccoon --help
Sequence QC (seq-qc)
Basic usage:
raccoon seq-qc -f a.fasta b.fasta -o combined.fasta
With metadata-driven headers:
raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
-m metadata.csv other_metadata.csv \
--metadata-id-field sample \
--metadata-location-field location \
--metadata-date-field date \
--header-separator '|'
With a custom header template:
raccoon seq-qc -f a.fasta b.fasta -o combined.fasta \
-m metadata.csv --header-fields "{id}|{country}|{date}"
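As an illustration of how a template such as `{id}|{country}|{date}` can expand against one metadata row, here is a minimal Python sketch. It is not raccoon's implementation: the function name `build_header`, the sample metadata, and the use of `str.format` are all assumptions made for the example.

```python
import csv
import io


def build_header(template: str, row: dict) -> str:
    """Fill a header template with values from one metadata row (hypothetical helper)."""
    return template.format(**row)


# Hypothetical metadata content; the column names mirror the template fields.
metadata_csv = "id,country,date\nseq1,Kenya,2023-05-01\n"
rows = {r["id"]: r for r in csv.DictReader(io.StringIO(metadata_csv))}

header = build_header("{id}|{country}|{date}", rows["seq1"])
print(header)  # seq1|Kenya|2023-05-01
```

A template field with no matching metadata column raises `KeyError` in this sketch; how raccoon itself handles missing fields is not covered here.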
Key options:
- `-m, --metadata`: metadata CSV file(s) for header harmonisation
- `--metadata-delimiter`: metadata delimiter (default: `,`; `.tsv` auto-detected)
- `--metadata-id-field`: metadata ID column (default: `sample`)
- `--metadata-location-field`: metadata location column (default: `location`)
- `--metadata-date-field`: metadata date column (default: `date`)
- `--header-fields`: template for custom headers (e.g. `{id}|{country}|{date}`)
- `--header-separator`: separator used for non-template harmonised headers (default: `|`)
- `--seq-id-delimiter`: delimiter for parsing IDs from input headers (default: `|`)
- `--seq-id-field-index`: 0-based field index for parsed sequence ID (default: `0`)
- `--min-length`: minimum sequence length to keep
- `--max-n-content`: maximum N-content proportion to keep
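The length and N-content filters can be pictured with a short Python sketch. This is an assumption-laden illustration rather than raccoon's code: the helper names `n_content` and `passes_qc` are hypothetical, and treating both thresholds as inclusive is a guess.

```python
def n_content(seq: str) -> float:
    """Proportion of N characters in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def passes_qc(seq: str, min_length: int, max_n_content: float) -> bool:
    """Keep a sequence that is long enough and not too N-rich (inclusive bounds assumed)."""
    return len(seq) >= min_length and n_content(seq) <= max_n_content


# Toy records: one passes, one is too short, one has too many Ns.
records = {"good": "ACGTACGTAC", "short": "ACGT", "n_rich": "ACNNNNNNAC"}
kept = [name for name, seq in records.items() if passes_qc(seq, 8, 0.2)]
print(kept)  # ['good']
```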
Alignment QC (aln-qc)
Basic usage:
raccoon aln-qc <alignment.fasta> -d outdir
With GenBank reference for frame-break checks:
raccoon aln-qc <alignment.fasta> -d outdir \
--genbank <reference.gb> --reference-id <ref_id>
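A frame-breaking indel is an insertion or deletion inside a coding region whose length is not a multiple of three. The Python sketch below shows that idea on a pairwise alignment. It is not raccoon's implementation: the function name `frame_breaking_indels` is hypothetical, and the CDS is assumed to be given directly as 0-based, half-open alignment columns rather than read from a GenBank file.

```python
def frame_breaking_indels(ref: str, query: str, cds_start: int, cds_end: int):
    """Return (start, end, kind) for indel runs in the CDS whose length is not a multiple of 3."""
    flagged = []
    run_start, run_kind = None, None
    # Iterate one column past the CDS end so a trailing run is closed.
    for col in range(cds_start, cds_end + 1):
        kind = None
        if col < cds_end:
            r, q = ref[col], query[col]
            if q == "-" and r != "-":
                kind = "deletion"
            elif r == "-" and q != "-":
                kind = "insertion"
        if kind != run_kind:
            # A run just ended: flag it if its length shifts the reading frame.
            if run_kind is not None and (col - run_start) % 3 != 0:
                flagged.append((run_start, col, run_kind))
            run_start, run_kind = col, kind
    return flagged


ref = "ATGAAACCCGGGTAA"
print(frame_breaking_indels(ref, "ATGAA-CCCGGGTAA", 0, 15))  # [(5, 6, 'deletion')]
print(frame_breaking_indels(ref, "ATG---CCCGGGTAA", 0, 15))  # []
```

Columns where both sequences carry a gap are ignored, since they reflect an insertion in some other sequence of the alignment.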
Disable selected flag classes:
raccoon aln-qc <alignment.fasta> -d outdir \
--no-flag-n-adjacent --no-flag-gap-adjacent
Key options:
- `--max-n-content`: N-content threshold for flagging
- `--cluster-window`: window size (bp) for clustered SNP detection
- `--cluster-count`: minimum SNPs in-window to mark as clustered
- `--no-flag-clustered`: skip clustered SNP flagging
- `--no-flag-n-adjacent`: skip N-adjacent SNP flagging
- `--no-flag-gap-adjacent`: skip gap-adjacent SNP flagging
- `--no-flag-frame-break`: skip frame-breaking indel flagging
- `--flag-removal-threshold`: mark sequence for removal above this flagged-site count
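To show how a window size and a minimum count interact, the following Python sketch flags SNPs that fall inside any window holding at least the minimum number of SNPs. It is a generic sliding-window approach, not necessarily what raccoon does internally; the function name `clustered_snps` and the exact window definition (positions spanning fewer than `window` bp) are assumptions.

```python
def clustered_snps(positions, window: int, min_count: int):
    """Return SNP positions that sit in any window of `window` bp holding >= min_count SNPs."""
    pos = sorted(positions)
    flagged = set()
    left = 0
    for right in range(len(pos)):
        # Advance the left edge until the span fits inside one window.
        while pos[right] - pos[left] >= window:
            left += 1
        if right - left + 1 >= min_count:
            flagged.update(pos[left:right + 1])
    return sorted(flagged)


print(clustered_snps([100, 104, 108, 500, 900, 905], window=10, min_count=3))
# [100, 104, 108]
```

The isolated SNP at 500 and the pair at 900 and 905 stay unflagged because neither reaches three SNPs within 10 bp.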
Apply mask (mask)
raccoon mask <alignment.fasta> \
--mask-file results/alignment_qc/mask_sites.csv \
-d results/alignment_qc
Key options:
- `--mask-file`: mask CSV file from `aln-qc`
- `--mask-character`: character to use for masking (default: `?`)
- `-o, --outfile`: output masked alignment file name
- `-d, --outdir`: output directory
- `-t, --sequence-type`: `nt` or `aa` (default: `nt`)
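Conceptually, masking replaces the characters at flagged alignment columns with the mask character. The Python sketch below illustrates this under stated assumptions: the function name `apply_mask` is hypothetical, sites are passed as a plain list of 1-based columns rather than parsed from `mask_sites.csv` (whose layout is not assumed here), and every sequence is masked at every listed column.

```python
def apply_mask(alignment: dict, sites, mask_char: str = "?") -> dict:
    """Replace the given 1-based alignment columns with mask_char in every sequence."""
    cols = {s - 1 for s in sites}  # convert to 0-based indices
    return {
        name: "".join(mask_char if i in cols else base for i, base in enumerate(seq))
        for name, seq in alignment.items()
    }


alignment = {"a": "ACGT", "b": "AC-T"}
masked = apply_mask(alignment, [2, 4])
print(masked)  # {'a': 'A?G?', 'b': 'A?-?'}
```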
Phylogenetic QC (tree-qc)
Basic usage:
raccoon tree-qc --tree <treefile> -d outdir \
--alignment <alignment.fasta> --asr-state <treefile>.state \
--run-adar --adar-window 300 --adar-min-count 3
Key options:
- `-t, --tree`: input phylogeny file (required)
- `--tree-format`: `auto`, `newick`, or `nexus`
- `--assembly-refs`: assembly/reference FASTA used for mapping
- `--outgroup-ids`: comma-separated outgroup sequence IDs
- `--mask-file`: optional mask CSV with sites to ignore
- `--tip-fields`: template for parsing tip-label fields
- `--tip-field-delimiter`: delimiter used for tip field parsing
- `--tip-date-field`: field name treated as date in tip parsing
- `--midpoint-root`: midpoint-root tree for report visualisation (ignored with `--asr-state`)
- `--long-branch-sd`: SD threshold for long-branch flagging
- `--run-apobec`: run APOBEC3 checks
- `--run-adar`: run ADAR checks
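One common way to read an SD threshold for long branches is to flag any branch longer than the mean plus that many standard deviations. The Python sketch below shows this interpretation only; whether raccoon uses the mean, the sample standard deviation, or untransformed branch lengths is not stated in this README, and the function name `long_branches` is hypothetical.

```python
from statistics import mean, stdev


def long_branches(branch_lengths: dict, sd_threshold: float):
    """Return names of branches longer than mean + sd_threshold * SD (assumed rule)."""
    values = list(branch_lengths.values())
    cutoff = mean(values) + sd_threshold * stdev(values)
    return sorted(name for name, length in branch_lengths.items() if length > cutoff)


# Toy branch lengths keyed by tip name; t5 is the obvious outlier.
lengths = {"t1": 0.010, "t2": 0.012, "t3": 0.011, "t4": 0.009, "t5": 0.200}
print(long_branches(lengths, sd_threshold=1.5))  # ['t5']
```

With few branches, a single extreme value inflates the standard deviation, so a high threshold may flag nothing at all.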
See full CLI details in [docs/cli.md](docs/cli.md).
## Mask notes
Mask output uses the following note values:
| Note | Meaning |
| --- | --- |
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites that break the CDS frame length. |
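The `N_adjacent` and `gap_adjacent` notes both describe a SNP with a problem character nearby. A minimal Python sketch of that check follows. It is an illustration, not raccoon's code: the function name `adjacent_snps`, the pairwise comparison against a reference, and the symmetric window measured in alignment columns are assumptions.

```python
def adjacent_snps(ref: str, query: str, window: int, chars: str = "N"):
    """Return 0-based columns where query has a SNP and a character in `chars` lies within `window` columns."""
    ref, query = ref.upper(), query.upper()
    flagged = []
    for i, (r, q) in enumerate(zip(ref, query)):
        if q == r or q in "N-" or r in "N-":
            continue  # identical, ambiguous, or gapped columns are not SNPs
        lo, hi = max(0, i - window), min(len(query), i + window + 1)
        if any(c in chars for c in query[lo:hi]):
            flagged.append(i)
    return flagged


ref = "ACGTACGTACGT"
print(adjacent_snps(ref, "AGGTACGANNGT", window=2, chars="N"))  # [7]
print(adjacent_snps(ref, "ACG-ACTTACGT", window=3, chars="-"))  # [6]
```

In the first call the SNP at column 1 has no N within two columns, so only the SNP at column 7 is reported.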
## Example data
The [examples](examples) folder includes a constructed alignment and GenBank reference suitable for quick testing:
- [examples/constructed_alignment.fasta](examples/constructed_alignment.fasta)
- [examples/constructed_reference.gb](examples/constructed_reference.gb)