NucFlag misassembly identifier.
Project description
NucFlag
Fork of NucFreq
. Script for making nucleotide frequency plots and marking misassemblies.
Usage
pip install nucflag
usage: nucflag [-h] -i INPUT_BAM [-b INPUT_REGIONS] [-d OUTPUT_PLOT_DIR] [--output_cov_dir OUTPUT_COV_DIR] [-o OUTPUT_MISASM] [-s OUTPUT_STATUS] [-r [REGIONS ...]] [-t THREADS] [-p PROCESSES] [-c CONFIG] [--ignore_regions IGNORE_REGIONS]
Use per-base read coverage to classify/plot misassemblies.
options:
-h, --help show this help message and exit
-i INPUT_BAM, --input_bam INPUT_BAM
Input bam file. Must be indexed. (default: None)
-b INPUT_REGIONS, --input_regions INPUT_REGIONS
Bed file with regions to check. (default: None)
-d OUTPUT_PLOT_DIR, --output_plot_dir OUTPUT_PLOT_DIR
Output plot dir. (default: None)
--output_cov_dir OUTPUT_COV_DIR
Output coverage dir. Generates coverage bed files per region. Gzipped by default. (default: None)
-o OUTPUT_MISASM, --output_misasm OUTPUT_MISASM
Output bed file with misassembled regions. (default: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)
-s OUTPUT_STATUS, --output_status OUTPUT_STATUS
Bed file with status of contigs. With format: contig start end misassembled|good (default: None)
-r [REGIONS ...], --regions [REGIONS ...]
Regions with the format: (.+):(\d+)-(\d+) (default: None)
-t THREADS, --threads THREADS
Threads for reading bam file. (default: 4)
-p PROCESSES, --processes PROCESSES
Processes for classifying/plotting. (default: 4)
-c CONFIG, --config CONFIG
Additional threshold/params as toml file. (default: {'first': {'thr_min_peak_horizontal_distance': 1, 'thr_min_peak_width': 20,
'thr_min_valley_horizontal_distance': 1, 'thr_min_valley_width': 5, 'thr_peak_height_std_above': 3.2, 'thr_valley_height_std_below': 3,
'thr_misjoin_valley': 0.1, 'valley_group_distance': 500, 'peak_group_distance': 500}, 'second': {'thr_min_perc_first': 0.05,
'thr_peak_height_std_above': 3, 'group_distance': 30000, 'thr_min_group_size': 20, 'thr_collapse_het_ratio': 0.2}, 'gaps':
{'thr_max_allowed_gap_size': 0}})
--ignore_regions IGNORE_REGIONS
Bed file with regions to ignore. With format: contig|all start end absolute|relative (default: None)
Input
A BAM file of an alignment of PacBio HiFi reads to an assembly.
[!IMPORTANT] All assembly contigs, including contaminants, should be included. Omission of these contigs will cause misalignments of reads to elsewhere in the assembly.
Secondary and partial alignments should be removed using SAMtools flag 2308.
Configuration
Configuration can be provided in the form of a toml
file.
nucflag -i test/HG00096_hifi_test.bam -b test/test.bed -c config.toml
[first]
# Min horizontal distance between peaks.
thr_min_peak_horizontal_distance = 1
# Min width of peak to consider.
thr_min_peak_width = 20
# Min horizontal distance between valleys.
thr_min_valley_horizontal_distance = 1
# Min width of valley to consider.
thr_min_valley_width = 10
# Number of std above mean to include peak.
thr_peak_height_std_above = 4
# Number of std below mean to include valley.
thr_valley_height_std_below = 3
# Valleys with coverage below this threshold are considered misjoins.
# If float:
# * ex. 0.1 => Valley where median is less than or equal to 10% of mean
# If int:
# * ex. 2 => Valleys with median below 2
thr_misjoin_valley = 0.1
# Group consecutive positions allowing a maximum gap of x.
# Larger value groups more positions.
valley_group_distance = 500
peak_group_distance = 500
[second]
# Percent threshold of most freq base to allow second most freq base
# 10 * 0.1 = 1 so above 1 is allowed.
thr_min_perc_first = 0.1
# Number of std above mean to include peak.
thr_peak_height_std_above = 3
# Group consecutive positions allowing a maximum gap of x.
# Larger value groups more positions.
group_distance = 30_000
# Min group size.
thr_min_group_size = 5
# Het ratio to consider second group a collapse if no overlaps in peaks found.
thr_collapse_het_ratio = 0.1
[gaps]
# Allow gaps up to this length.
thr_max_allowed_gap_size = 1000
Workflow
For an end-to-end workflow, see Snakemake-NucFlag
.
Build
To build from source.
git clone git@github.com:logsdon-lab/NucFlag.git && cd NucFlag
make venv && make build && make install
Test
Test BAM filtered from merged alignment of:
- PacBio HiFi reads from HGSVC sample
HG00096
. - Verkko v1.4.1 combined assembly for HGSVC sample
HG00096
To run tests:
source venv/bin/activate
make test
Or try the test example directly.
nucflag -i test/standard/HG00096_hifi.bam -b test/standard/region.bed -c test/config.toml
haplotype2-0000133:3021508-8691473 3314093 3324276 MISJOIN
haplotype2-0000133:3021508-8691473 4126277 4142368 COLLAPSE_VAR
haplotype2-0000133:3021508-8691473 4566798 4683011 GAP
haplotype2-0000133:3021508-8691473 5737835 5747246 MISJOIN
haplotype2-0000133:3021508-8691473 6067838 6072601 COLLAPSE_VAR
haplotype2-0000133:3021508-8691473 6607947 6639102 MISJOIN
haplotype2-0000133:3021508-8691473 7997560 8069465 COLLAPSE_VAR
Test workflow using data/
dir. Requires:
bam
files of alignment of HGSVC3 HiFi reads to full assembly.bed
files of centromere + 500kbp regions.
snakemake \
-s test/workflow/Snakefile \
-j 12 \
--executor cluster-generic \
--cluster-generic-submit-cmd "bsub -q epistasis_normal -n {threads} -o /dev/null" \
--use-conda -p
Cite
- Vollger MR, Dishuck PC, Sorensen M, Welch AE, Dang V, Dougherty ML, et al. Long-read sequence and assembly of segmental duplications. Nat Methods. 2019;16: 88–94. doi:10.1038/s41592-018-0236-3
- Mc Cartney AM, Shafin K, Alonge M, Bzikadze AV, Formenti G, Fungtammasan A, et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. bioRxiv. 2021. p. 2021.07.02.450803. doi:10.1101/2021.07.02.450803
- Citing
hetDetection.R
- Citing
TODO
- Add false duplication detection.
- Colormap for
Misassembly
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file nucflag-0.0.10.tar.gz
.
File metadata
- Download URL: nucflag-0.0.10.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d542a1065d5279c0ef48534a78680f1f6cac381d58107a093ea1959ccaa528b2 |
|
MD5 | f36fea20f900258afee6683d5a4f1ae5 |
|
BLAKE2b-256 | 6cc3de4be5aa7136bfd0c02cfa91fe0f86dcbbb904e878d986ebe0e07bd5888f |
File details
Details for the file nucflag-0.0.10-py3-none-any.whl
.
File metadata
- Download URL: nucflag-0.0.10-py3-none-any.whl
- Upload date:
- Size: 14.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 36f4264c258f4c0cc7448ebf74e8c3c0ec8719321fd5417b1edf92d846f4b48d |
|
MD5 | 6dd46ea104cfd400ea8d84ca34b038ab |
|
BLAKE2b-256 | cb5ffcffe93847c42b0b9b9115750a8c7977263a43f9522e89a6c5b37aac8587 |