A Python package for curating virus genome alignments and phylogenies and flagging QC issues
raccoon
Rigorous Alignment Curation: Cleanup Of Outliers and Noise
Raccoon is a lightweight toolkit for alignment and phylogenetic QC workflows. It identifies problematic sites (e.g., clustered SNPs, SNPs near Ns/gaps, and frame‑breaking indels) and produces mask files and summaries for downstream analyses.
Use cases
- Flag clustered SNPs that may indicate contamination, recombination, or misalignment.
- Detect SNPs adjacent to low-coverage regions (Ns) or gaps.
- Identify frame-breaking indels in coding regions using a GenBank reference.
- Generate mask files to exclude suspect sites prior to phylogenetic or evolutionary analyses.
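To make the "SNP adjacent to Ns or gaps" idea concrete, here is a minimal Python sketch. It is not raccoon's implementation: the function name `snps_near`, the definition of a SNP as an unambiguous base that differs from the reference, and the `window` value are all assumptions chosen for illustration.

```python
def snps_near(ref: str, seq: str, chars: str = "N", window: int = 2) -> list[int]:
    """Return 0-based SNP positions within `window` columns of any character in `chars`.

    Illustrative only: a SNP here is a column where both the reference and the
    sequence carry A/C/G/T and the two bases differ.
    """
    ref, seq = ref.upper(), seq.upper()
    bases = set("ACGT")
    flagged = []
    for i, (r, s) in enumerate(zip(ref, seq)):
        if r in bases and s in bases and r != s:
            # Look at the neighbourhood of the SNP, clipped to the sequence ends.
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            if any(c in chars for c in seq[lo:hi]):
                flagged.append(i)
    return flagged


ref = "ACGTACGTACGT"
seq_with_ns = "ACGTATNNACTT"    # SNPs at 5 and 10; Ns at 6 and 7
seq_with_gap = "ACG-ATGTACGT"   # SNP at 5; gap at 3

n_adjacent = snps_near(ref, seq_with_ns, chars="N", window=2)
gap_adjacent = snps_near(ref, seq_with_gap, chars="-", window=2)
```

With these toy sequences only the SNP at position 5 is flagged in each case; the SNP at position 10 sits too far from the N run.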
Installation
From source:
pip install .
For development (editable install):
pip install -e .
Quickstart
raccoon aln-qc examples/constructed_alignment.fasta -d outdir \
--genbank examples/constructed_reference.gb --reference-id ref
Outputs:
- mask_sites.csv
- alignment_qc_summary.txt
CLI usage
Show help:
raccoon --help
Alignment QC:
raccoon aln-qc <alignment.fasta> -d outdir
With a GenBank reference for frame‑break detection:
raccoon aln-qc <alignment.fasta> -d outdir \
--genbank <reference.gb> --reference-id <ref_id>
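As a rough illustration of what frame-break detection checks, the Python sketch below flags gap runs inside a coding region whose length is not a multiple of three. It is a simplified stand-in, not raccoon's code: the function name, the hard-coded CDS coordinates (which in practice would come from the GenBank reference), and the assumption that the reference row itself has no gaps are all hypothetical.

```python
import re


def frame_breaking_gaps(seq: str, cds_start: int, cds_end: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans of gap runs inside the CDS whose
    length is not a multiple of 3.

    Coordinates are 0-based alignment columns; only the part of a gap run
    that lies inside [cds_start, cds_end) is measured.
    """
    spans = []
    for match in re.finditer(r"-+", seq[cds_start:cds_end]):
        if (match.end() - match.start()) % 3 != 0:
            spans.append((cds_start + match.start(), cds_start + match.end()))
    return spans


# Toy aligned sequence: a 3-column gap (in frame) and a 2-column gap (frame-breaking).
aligned = "ATGAAA---CCC--GGGTAA"
breaks = frame_breaking_gaps(aligned, cds_start=0, cds_end=len(aligned))
```

Here only the two-column gap at columns 12 to 13 is reported; the three-column gap keeps the reading frame intact.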
Masking toggles (all masks are enabled by default):
raccoon aln-qc <alignment.fasta> -d outdir \
--no-mask-n-adjacent --no-mask-gap-adjacent
Key alignment options:
- --n-threshold: fraction of Ns allowed per sequence before flagging.
- --cluster-window: window size (bp) for clustered SNP detection.
- --cluster-count: minimum SNPs within a window to flag as clustered.
- --mask-clustered/--no-mask-clustered: include/exclude clustered SNPs.
- --mask-n-adjacent/--no-mask-n-adjacent: include/exclude SNPs adjacent to Ns.
- --mask-gap-adjacent/--no-mask-gap-adjacent: include/exclude SNPs adjacent to gaps.
- --mask-frame-break/--no-mask-frame-break: include/exclude frame-breaking indels.
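The interaction between a cluster window and a cluster count can be sketched with a sliding window over sorted SNP positions. This is an illustration under assumptions, not raccoon's algorithm: the function name `clustered_snps`, the exact window semantics (positions less than `window` bp apart share a window), and the example values are all hypothetical.

```python
def clustered_snps(positions: list[int], window: int, min_count: int) -> list[int]:
    """Return sorted SNP positions that fall in any stretch of `window` bp
    containing at least `min_count` SNPs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    pos = sorted(set(positions))
    flagged = set()
    start = 0
    for end in range(len(pos)):
        # Shrink the window from the left until it spans fewer than `window` bp.
        while pos[end] - pos[start] >= window:
            start += 1
        if end - start + 1 >= min_count:
            flagged.update(pos[start:end + 1])
    return sorted(flagged)


snp_positions = [10, 12, 15, 200, 500, 505]
clustered = clustered_snps(snp_positions, window=10, min_count=3)
```

In this example the three SNPs at 10, 12, and 15 are flagged, while the isolated SNP at 200 and the pair at 500 and 505 are not.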
Sequence QC:
raccoon seq-qc a.fasta b.fasta -o combined.fasta
With metadata-driven headers:
raccoon seq-qc a.fasta b.fasta -o combined.fasta \
--metadata metadata.csv other_metadata.csv --metadata-id-field id \
--metadata-location-field location --metadata-date-field date \
--header-separator '|'
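For intuition, metadata-driven headers can be thought of as joining selected metadata fields with the chosen separator. The Python sketch below shows that idea only; the field order (id, then location, then date), the function name `build_headers`, and the sample metadata are assumptions, and the actual header layout produced by raccoon may differ.

```python
import csv
import io


def build_headers(
    metadata_csv: str,
    id_field: str = "id",
    location_field: str = "location",
    date_field: str = "date",
    separator: str = "|",
) -> dict[str, str]:
    """Map each sequence id to a header built from its metadata row.

    The id/location/date ordering is an assumption made for this example.
    """
    headers = {}
    for row in csv.DictReader(io.StringIO(metadata_csv)):
        fields = [row[id_field], row[location_field], row[date_field]]
        headers[row[id_field]] = separator.join(fields)
    return headers


# Made-up metadata for illustration.
metadata = "id,location,date\nseq1,Edinburgh,2024-01-15\nseq2,Lagos,2024-02-03\n"
headers = build_headers(metadata)
```

With this input, `headers["seq1"]` is `seq1|Edinburgh|2024-01-15`.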
Phylogenetic QC:
raccoon tree-qc --phylogeny <treefile> -d outdir \
--alignment <alignment.fasta> --asr-state <treefile>.state \
--run-adar --adar-window 300 --adar-min-count 3
Key phylo options:
- --phylogeny: tree file (Newick or Nexus)
- --alignment: alignment used for ASR state parsing
- --asr-state: ASR state file (defaults to <treefile>.state if present)
- --tree-format: auto/newick/nexus
- --run-adar: enable ADAR-like edit flagging
- --run-apobec: enable APOBEC3-like edit flagging
- --adar-window: max distance (bp) for ADAR clustering (default: 300)
- --adar-min-count: min ADAR sites in window to flag a branch (default: 3)
- --long-branch-sd: std dev threshold for long-branch flagging (default: 3.0)
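The windowed ADAR check and the standard-deviation threshold for long branches can be illustrated with the Python sketch below. It is a guess at the general shape of such checks, not raccoon's implementation. ADAR editing typically appears as A>G changes (or T>C on the opposite strand); beyond that, the function names, the per-branch mutation tuples, pooling A>G with T>C in one cluster, and the "mean plus k standard deviations" rule are all assumptions.

```python
from statistics import mean, stdev

# Substitutions treated as ADAR-like in this sketch (parent base, child base).
ADAR_LIKE = {("A", "G"), ("T", "C")}


def adar_like_branch(mutations, window: int = 300, min_count: int = 3) -> bool:
    """Return True if at least `min_count` ADAR-like sites on one branch lie
    within `window` bp of each other.

    `mutations` is a list of (position, parent_base, child_base) tuples.
    """
    sites = sorted(p for p, a, b in mutations if (a.upper(), b.upper()) in ADAR_LIKE)
    start = 0
    for end in range(len(sites)):
        while sites[end] - sites[start] > window:
            start += 1
        if end - start + 1 >= min_count:
            return True
    return False


def long_branches(lengths: dict[str, float], sd_threshold: float = 3.0) -> list[str]:
    """Return names of branches longer than mean + sd_threshold * sample SD."""
    values = list(lengths.values())
    if len(values) < 2:
        return []
    cutoff = mean(values) + sd_threshold * stdev(values)
    return sorted(name for name, value in lengths.items() if value > cutoff)


branch_mutations = [(100, "A", "G"), (250, "T", "C"), (380, "A", "G"), (5000, "C", "T")]
is_adar_like = adar_like_branch(branch_mutations)  # three ADAR-like sites span 280 bp

branch_lengths = {f"b{i}": 0.01 for i in range(19)}
branch_lengths["outlier"] = 1.0
flagged_long = long_branches(branch_lengths)
```

Note that a standard-deviation rule like this needs a reasonable number of branches to work: with very few branches, no single value can sit three sample standard deviations above the mean.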
See full CLI details in [docs/cli.md](docs/cli.md).
## Mask notes
Mask output uses the following note values:
| Note | Meaning |
| --- | --- |
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites belonging to an indel that breaks the CDS reading frame. |
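As an example of how a mask file with these note values might be consumed downstream, the Python sketch below replaces masked sites with N and allows some note types to be skipped. The CSV column names `position` and `note`, the 1-based coordinates, and the function name `apply_mask` are assumptions for illustration; check the actual mask_sites.csv header before adapting this.

```python
import csv
import io


def apply_mask(seq: str, mask_csv: str, exclude_notes=frozenset()) -> str:
    """Replace masked positions in `seq` with N.

    Assumes a CSV with hypothetical columns `position` (1-based) and `note`;
    rows whose note is in `exclude_notes` are left unmasked.
    """
    chars = list(seq)
    for row in csv.DictReader(io.StringIO(mask_csv)):
        if row["note"] in exclude_notes:
            continue
        idx = int(row["position"]) - 1
        if 0 <= idx < len(chars):
            chars[idx] = "N"
    return "".join(chars)


# Made-up mask content using the note values from the table above.
mask_csv = "position,note\n3,clustered_snps\n7,N_adjacent\n10,frame_break\n"
sequence = "ACGTACGTACGT"

masked_all = apply_mask(sequence, mask_csv)
masked_without_frame_breaks = apply_mask(sequence, mask_csv, exclude_notes={"frame_break"})
```

Filtering by note makes it possible to apply, for example, only the SNP-related masks while leaving frame-break sites for manual review.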
## Example data
The [examples](examples) folder includes a constructed alignment and GenBank reference suitable for quick testing:
- [examples/constructed_alignment.fasta](examples/constructed_alignment.fasta)
- [examples/constructed_reference.gb](examples/constructed_reference.gb)
File details
Details for the file artic_raccoon-1.0.0.tar.gz.
File metadata
- Download URL: artic_raccoon-1.0.0.tar.gz
- Upload date:
- Size: 75.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cfe1eea7eb97cad58847c44ebb51fc119778dd6d35f522f7b9bca9bab8c862fa |
| MD5 | a3a920e0edae69f8a24dc6dbe7e76eae |
| BLAKE2b-256 | eae4e141519b7fd00bb935879927c0b34a9ef8905862badc5656b79b59dd891c |
File details
Details for the file artic_raccoon-1.0.0-py3-none-any.whl.
File metadata
- Download URL: artic_raccoon-1.0.0-py3-none-any.whl
- Upload date:
- Size: 75.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d8751de6804f95c7148c35b330e12c054883dd99c5e0346a4ae0b85f2a15db69 |
| MD5 | f7f1dbdd665cc67318b427058836f610 |
| BLAKE2b-256 | 39873a802326db6971a5c8779df88141036f4690376d6bdfff0527071852503f |