Skip to main content

NucFlag misassembly identifier.

Project description

NucFlag

CI PyPI - Version

Fork of NucFreq. Script for making nucleotide frequency plots and marking misassemblies.

Labeled Misassemblies

Usage

pip install nucflag
usage: nucflag [-h] -i INPUT_BAM [-b INPUT_REGIONS] [-d OUTPUT_PLOT_DIR] [--output_cov_dir OUTPUT_COV_DIR] [-o OUTPUT_MISASM] [-s OUTPUT_STATUS] [-r [REGIONS ...]] [-t THREADS] [-p PROCESSES] [-c CONFIG] [--ignore_regions IGNORE_REGIONS]

Use per-base read coverage to classify/plot misassemblies.

options:
  -h, --help            show this help message and exit
  -i INPUT_BAM, --input_bam INPUT_BAM
                        Input bam file. Must be indexed. (default: None)
  -b INPUT_REGIONS, --input_regions INPUT_REGIONS
                        Bed file with regions to check. (default: None)
  -d OUTPUT_PLOT_DIR, --output_plot_dir OUTPUT_PLOT_DIR
                        Output plot dir. (default: None)
  --output_cov_dir OUTPUT_COV_DIR
                        Output coverage dir. Generates coverage bed files per region. Gzipped by default. (default: None)
  -o OUTPUT_MISASM, --output_misasm OUTPUT_MISASM
                        Output bed file with misassembled regions. (default: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)
  -s OUTPUT_STATUS, --output_status OUTPUT_STATUS
                        Bed file with status of contigs. With format: contig start end misassembled|good (default: None)
  -r [REGIONS ...], --regions [REGIONS ...]
                        Regions with the format: (.+):(\d+)-(\d+) (default: None)
  -t THREADS, --threads THREADS
                        Threads for reading bam file. (default: 4)
  -p PROCESSES, --processes PROCESSES
                        Processes for classifying/plotting. (default: 4)
  -c CONFIG, --config CONFIG
                        Additional threshold/params as toml file. (default: {'first': {'thr_min_peak_horizontal_distance': 1, 'thr_min_peak_width': 20,
                        'thr_min_valley_horizontal_distance': 1, 'thr_min_valley_width': 5, 'thr_peak_height_std_above': 3.2, 'thr_valley_height_std_below': 3,
                        'thr_misjoin_valley': 0.1, 'valley_group_distance': 500, 'peak_group_distance': 500}, 'second': {'thr_min_perc_first': 0.05,
                        'thr_peak_height_std_above': 3, 'group_distance': 30000, 'thr_min_group_size': 20, 'thr_collapse_het_ratio': 0.2}, 'gaps':
                        {'thr_max_allowed_gap_size': 0}})
  --ignore_regions IGNORE_REGIONS
                        Bed file with regions to ignore. With format: contig|all start end absolute|relative (default: None)

Input

A BAM file of an alignment of PacBio HiFi reads to an assembly.

[!IMPORTANT] All assembly contigs, including contaminants, should be included. Omission of these contigs will cause misalignments of reads to elsewhere in the assembly.

Secondary and partial alignments should be removed using SAMtools flag 2308.

Configuration

Configuration can be provided in the form of a toml file.

nucflag -i test/HG00096_hifi_test.bam -b test/test.bed -c config.toml
[first]
# Min horizontal distance between peaks.
thr_min_peak_horizontal_distance = 1
# Min width of peak to consider.
thr_min_peak_width = 20
# Min horizontal distance between valleys.
thr_min_valley_horizontal_distance = 1
# Min width of valley to consider.
thr_min_valley_width = 10
# Number of std above mean to include peak.
thr_peak_height_std_above = 4
# Number of std below mean to include valley.
thr_valley_height_std_below = 3
# Valleys with coverage below this threshold are considered misjoins.
# If float:
# * ex. 0.1 => Valley where min is less than or equal to 10% of mean
# If int:
# * ex. 2 => Valleys with min below 2
thr_misjoin_valley = 0.1
# Group consecutive positions allowing a maximum gap of x.
# Larger value groups more positions.
valley_group_distance = 500
peak_group_distance = 500

[second]
# Percent threshold of most freq base to allow second most freq base
# 10 * 0.1 = 1 so above 1 is allowed.
thr_min_perc_first = 0.1
# Number of std above mean to include peak.
thr_peak_height_std_above = 3
# Group consecutive positions allowing a maximum gap of x.
# Larger value groups more positions.
group_distance = 30_000
# Min group size.
thr_min_group_size = 5
# Het ratio to consider second group a collapse if no overlaps in peaks found.
thr_collapse_het_ratio = 0.1

[gaps]
# Allow gaps up to this length.
thr_max_allowed_gap_size = 1000

Workflow

For an end-to-end workflow, see Snakemake-NucFlag.

Build

To build from source.

git clone git@github.com:logsdon-lab/NucFlag.git && cd NucFlag
make venv && make build && make install

Test

Test BAM filtered from merged alignment of:

To run tests:

source venv/bin/activate
make test

Or try the test example directly.

nucflag -i test/standard/HG00096_hifi.bam -b test/standard/region.bed -c test/config.toml
haplotype2-0000133:3021508-8691473      3314093 3324276 MISJOIN
haplotype2-0000133:3021508-8691473      4126277 4142368 COLLAPSE_VAR
haplotype2-0000133:3021508-8691473      4566798 4683011 GAP
haplotype2-0000133:3021508-8691473      5737835 5747246 MISJOIN
haplotype2-0000133:3021508-8691473      6067838 6072601 COLLAPSE_VAR
haplotype2-0000133:3021508-8691473      6607947 6639102 MISJOIN
haplotype2-0000133:3021508-8691473      7997560 8069465 COLLAPSE_VAR

Test workflow using data/ dir. Requires:

  1. bam files of alignment of HGSVC3 HiFi reads to full assembly.
  2. bed files of centromere + 500kbp regions.
snakemake \
-s test/workflow/Snakefile \
-j 12 \
--executor cluster-generic \
--cluster-generic-submit-cmd "bsub -q epistasis_normal -n {threads} -o /dev/null" \
--use-conda -p

Cite

  • Vollger MR, Dishuck PC, Sorensen M, Welch AE, Dang V, Dougherty ML, et al. Long-read sequence and assembly of segmental duplications. Nat Methods. 2019;16: 88–94. doi:10.1038/s41592-018-0236-3
  • Mc Cartney AM, Shafin K, Alonge M, Bzikadze AV, Formenti G, Fungtammasan A, et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. bioRxiv. 2021. p. 2021.07.02.450803. doi:10.1101/2021.07.02.450803
    • Citing hetDetection.R

TODO

  • Add false duplication detection.
  • Colormap for Misassembly

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nucflag-0.0.11.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

nucflag-0.0.11-py3-none-any.whl (14.7 kB view details)

Uploaded Python 3

File details

Details for the file nucflag-0.0.11.tar.gz.

File metadata

  • Download URL: nucflag-0.0.11.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.1

File hashes

Hashes for nucflag-0.0.11.tar.gz
Algorithm Hash digest
SHA256 22990e2a6ecdd7040da012299a42e82e4f2a6074873b9e3f345c06b3247c4bce
MD5 185f1013fd70d9bf18e558b3f4267e88
BLAKE2b-256 62d8524c20327d53372c0ca117c038b45423a2fef42f5393391b13ffefb36d7f

See more details on using hashes here.

File details

Details for the file nucflag-0.0.11-py3-none-any.whl.

File metadata

  • Download URL: nucflag-0.0.11-py3-none-any.whl
  • Upload date:
  • Size: 14.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.1

File hashes

Hashes for nucflag-0.0.11-py3-none-any.whl
Algorithm Hash digest
SHA256 3f257b9f2c5c03d1eea2885a4afdd533b5565e6aff1f3540cfe2e4204632ed61
MD5 9d1d45c44c49927dce0f61ced449b4f7
BLAKE2b-256 7bf2b3dc85e07ebde9229283a7708b803bc72f9e016bae0455a47881bc65405b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page