Find CDR locations using a bedMethyl file and CenSat annotations. (REQUIRES BEDTOOLS INSTALLED)

hmmCDR

hmmCDR is a set of Python scripts that predict CDRs within active alpha satellite arrays, using modkit bedMethyl files and CenSat annotations.

Install (NONE UPLOADED YET!)

  1. conda

hmmCDR is available through conda. Install with the following command:

conda install bioconda::hmmCDR

  2. PyPI

hmmCDR is pip installable:

pip install hmmCDR

  3. docker

hmmCDR can be run from a Docker Container:

docker run -v $(pwd):$(pwd) jmmenend/hmmCDR:0.3.0 bash

This software is designed to find Centromere Dip Regions (CDRs), subCDRs, and their boundaries within the active centromeric alpha satellite (alpha-sat) array. CDRs are uniquely hypomethylated regions within the typically hypermethylated alpha-sat array, and they are tightly associated with the histone H3 variant Centromere Protein A (CENP-A). Establishing accurate CDR and subCDR boundaries is therefore essential to studying their relationship with CENP-A. This method combines the earlier sliding-window approach to identifying CDRs with a Hidden Markov Model (HMM) that uses the sliding-window estimates as a prior. The advantage of this two-step approach shows at the edges of the CDRs: a sliding-window algorithm struggles to draw precise boundaries and to identify transitions into and out of CDRs, whereas the HMM identifies these regions much more accurately.
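The window-based prior step can be illustrated with a minimal Python sketch. It is not the package's implementation: it assumes fixed, non-overlapping windows instead of a true sliding window, a nearest-rank percentile, and the hypothetical function names `window_means`, `percentile`, and `label_windows`. The defaults mirror the `--window_size`, `--priorCDR_percent`, and `--priorTransition_percent` options.

```python
import math
from statistics import mean


def window_means(positions, fractions, window_size=1020):
    """Average methylation per fixed window, keyed by window start (assumed scheme)."""
    buckets = {}
    for pos, frac in zip(positions, fractions):
        start = (pos // window_size) * window_size
        buckets.setdefault(start, []).append(frac)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


def percentile(values, pct):
    """Nearest-rank percentile; a simple definition chosen for this sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def label_windows(means, cdr_pct=5, transition_pct=10):
    """Label each window as a prior CDR, a prior transition, or background."""
    values = list(means.values())
    cdr_cut = percentile(values, cdr_pct)
    transition_cut = percentile(values, transition_pct)
    labels = {}
    for start, value in means.items():
        if value <= cdr_cut:
            labels[start] = "CDR"
        elif value <= transition_cut:
            labels[start] = "transition"
        else:
            labels[start] = "background"
    return labels
```

In the real tool these window labels would serve only as the starting estimate that the HMM then refines.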

[include a photo of HMM improvement over sliding window]

[Include the photo I shared with Karen summarizing the HMM]

This Python package takes a BED file of aggregate 5mC methylation, preferably from modkit, and a Centromere-Satellite Annotation (CenSat) file. The aggregate methylation file is used to locate 5mC-depleted regions, and the CenSat annotation is used to subset the methylation file to only the alpha-sat array. This improves both the speed and accuracy of CDR identification, because the trend of hypermethylation is weaker outside this region. The package also processes each chromosome separately and in parallel to further improve speed.
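The subsetting step amounts to an interval intersection followed by grouping per chromosome. The package relies on bedtools for this; the pure-Python sketch below is only an illustration under that simplification, and `subset_to_regions` is a hypothetical helper name.

```python
from collections import defaultdict


def subset_to_regions(sites, regions):
    """Keep (chrom, start, end, value) sites that overlap a (chrom, start, end) region.

    Returns a dict keyed by chromosome, so each chromosome can be handled on its own.
    """
    regions_by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        regions_by_chrom[chrom].append((start, end))

    kept = defaultdict(list)
    for chrom, start, end, value in sites:
        # Half-open BED intervals overlap when each one starts before the other ends.
        if any(start < r_end and end > r_start
               for r_start, r_end in regions_by_chrom.get(chrom, [])):
            kept[chrom].append((start, end, value))
    return dict(kept)
```

Because the result is keyed by chromosome, each entry could be handed to a separate worker, which matches the per-chromosome processing described above.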

Inputs:

1. Modkit Pileup bedMethyl File:

| column | name | description | type |
|---|---|---|---|
| 1 | chrom | name of chromosome/contig | str |
| 2 | start position | 0-based start position | int |
| 3 | end position | 0-based exclusive end position | int |
| 4 | modified base code | single letter code for modified base | str |
| 5 | score | equal to Nvalid_cov | int |
| 6 | strand | '+' for positive strand, '-' for negative strand, '.' when strands are combined | str |
| 7 | start position | included for compatibility | int |
| 8 | end position | included for compatibility | int |
| 9 | color | included for compatibility, always 255,0,0 | str |
| 10 | Nvalid_cov | refer to modkit GitHub | int |
| 11 | fraction modified | Nmod / Nvalid_cov | float |
| 12 | Nmod | refer to modkit GitHub | int |
| 13 | Ncanonical | refer to modkit GitHub | int |
| 14 | Nother_mod | refer to modkit GitHub | int |
| 15 | Ndelete | refer to modkit GitHub | int |
| 16 | Nfail | refer to modkit GitHub | int |
| 17 | Ndiff | refer to modkit GitHub | int |
| 18 | Nnocall | refer to modkit GitHub | int |
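As a rough illustration of how the columns above map onto a record, the following sketch parses one bedMethyl line and applies the modification-code and coverage filters. `MethylSite` and `parse_bedmethyl_line` are hypothetical names, splitting on any whitespace is an assumption made to tolerate different delimiters, and the defaults mirror `--mod_code` and `--min_valid_cov`.

```python
from typing import NamedTuple, Optional


class MethylSite(NamedTuple):
    chrom: str
    start: int
    end: int
    mod_code: str
    valid_cov: int
    fraction_modified: float


def parse_bedmethyl_line(line: str, mod_code: str = "m",
                         min_valid_cov: int = 10) -> Optional[MethylSite]:
    """Parse an 18-column bedMethyl line; return None if it fails the filters."""
    fields = line.split()
    if len(fields) < 18:
        raise ValueError(f"expected 18 columns, found {len(fields)}")
    site = MethylSite(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        mod_code=fields[3],
        valid_cov=int(fields[9]),           # column 10: Nvalid_cov
        fraction_modified=float(fields[10]),  # column 11: fraction modified
    )
    if site.mod_code != mod_code or site.valid_cov < min_valid_cov:
        return None
    return site
```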

2. CenSat Annotation bed

| column | name | description | type |
|---|---|---|---|
| 1 | chrom | name of chromosome/contig | str |
| 2 | start position | 0-based start position | int |
| 3 | end position | 0-based exclusive end position | int |
| 4 | satellite type/name | type of satellite and, for some, a specific name in parentheses | str |
| 5 | score | unclear whether this field is ever used | int |
| 6 | strand | '+' for positive strand, '-' for negative strand, '.' if uncertain | str |
| 7 | start position | included for compatibility | int |
| 8 | end position | included for compatibility | int |
| 9 | color | color of the annotation in browser | str |
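Selecting the alpha-sat array from the CenSat file can be sketched as a filter on column 4. This assumes the satellite type is matched as a substring of the name column, which may differ from what the tool does; `alpha_sat_regions` is a hypothetical helper, and the default mirrors `--sat_type`.

```python
def alpha_sat_regions(lines, sat_type="H1L"):
    """Return (chrom, start, end, name) for CenSat rows whose name contains sat_type."""
    regions = []
    for line in lines:
        # Skip blank lines and BED header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        if sat_type in name:  # assumed substring match
            regions.append((chrom, start, end, name))
    return regions
```

For example, with the made-up rows `chrX 100 200 active_hor(S3CXH1L)` and `chrX 200 300 mon(SF1)`, only the first would be kept under the default `sat_type`.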

Usage:

usage: hmmCDR [-h] [--mod_code MOD_CODE] [--sat_type SAT_TYPE]
              [--min_valid_cov MIN_VALID_COV] [--bedgraph]
              [--window_size WINDOW_SIZE]
              [--priorCDR_percent PRIORCDR_PERCENT]
              [--priorTransition_percent PRIORTRANSITION_PERCENT]
              [--minCDR_size MINCDR_SIZE] [--enrichment]
              [--cdr_priors CDR_PRIORS | --emission_matrix EMISSION_MATRIX | --transition_matrix TRANSITION_MATRIX]
              [--use_percentiles] [--n_iter N_ITER] [-w W] [-x X] [-y Y]
              [-z Z] [--save_intermediates] [--output_label OUTPUT_LABEL]
              bedMethyl_path cenSat_path output_path

Process input files with optional parameters.

positional arguments:
  bedMethyl_path        Path to the bedMethyl file
  cenSat_path           Path to the CenSat BED file
  output_path           Output Path for the output files

optional arguments:
  -h, --help            show this help message and exit
  --mod_code MOD_CODE   Modification code to filter bedMethyl file (default:
                        "m")
  --sat_type SAT_TYPE   Satellite type/name to filter CenSat bed file.
                        (default: "H1L")
  --min_valid_cov MIN_VALID_COV
                        Minimum Valid Coverage to consider a methylation site.
                        (default: 10)
  --bedgraph            Flag indicating if the input is a bedgraph. (default:
                        False)
  --window_size WINDOW_SIZE
                        Window size to calculate prior regions. (default:
                        1020)
  --priorCDR_percent PRIORCDR_PERCENT
                        Percentile for finding priorCDR regions. (default: 5)
  --priorTransition_percent PRIORTRANSITION_PERCENT
                        Percentile for finding priorTransition regions.
                        (default: 10)
  --minCDR_size MINCDR_SIZE
                        Minimum size for CDR regions. (default: 3000)
  --enrichment          Enrichment flag. Pass in if you are looking for
                        methylation enriched regions. (default: False)
  --cdr_priors CDR_PRIORS
                        Path to the priorCDR bedfile
  --emission_matrix EMISSION_MATRIX
                        Path to the emission matrix TSV file
  --transition_matrix TRANSITION_MATRIX
                        Path to the transition matrix TSV file
  --use_percentiles     Use values for flags w,x,y,z as percentile cutoffs for
                        each category. (default: False)
  --n_iter N_ITER       Maximum number of iterations allowed for the HMM.
                        (default: 1)
  -w W                  Threshold for methylation to be classified as very low
                        (default: 0)
  -x X                  Threshold for methylation to be classified as low
                        (default: 25)
  -y Y                  Threshold for methylation to be classified as medium
                        (default: 50)
  -z Z                  Threshold for methylation to be classified as high
                        (default: 75)
  --save_intermediates  Set to true if you would like to save
                        intermediates(filtered beds+window means). (default:
                        False)
  --output_label OUTPUT_LABEL
                        Label to use for name column of hmmCDR BED file. Needs
                        to match priorCDR label. (default: "CDR")
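
One plausible reading of the `-w`, `-x`, `-y`, and `-z` options is that each value is the lower bound of a methylation category that the HMM then observes as a discrete emission. The sketch below illustrates that reading only; the exact boundary handling in hmmCDR is not documented here, and `emission_category` is a hypothetical name.

```python
from bisect import bisect_right

CATEGORY_NAMES = ("very low", "low", "medium", "high")


def emission_category(value, thresholds=(0, 25, 50, 75)):
    """Map a methylation value to a category index 0..3.

    Assumes each threshold is an inclusive lower bound for its category.
    """
    return max(0, bisect_right(thresholds, value) - 1)
```

Under this assumption a value of 10 falls in "very low", 25 in "low", 60 in "medium", and 80 in "high".
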

usage: hmmCDRprior [-h] [-m MOD_CODE] [-s SAT_TYPE] [--bedgraph]
                   [-w WINDOW_SIZE] [--priorCDR_percent PRIORCDR_PERCENT]
                   [--priorTransition_percent PRIORTRANSITION_PERCENT]
                   [--minCDR_size MINCDR_SIZE] [--enrichment]
                   [--save_intermediates] [--output_label OUTPUT_LABEL]
                   bedMethyl_path cenSat_path output_path

Process bedMethyl and CenSat BED file to produce hmmCDR priors

positional arguments:
  bedMethyl_path        Path to the bedMethyl file
  cenSat_path           Path to the CenSat BED file
  output_path           Path to the output priorCDRs BED file

optional arguments:
  -h, --help            show this help message and exit
  -m MOD_CODE, --mod_code MOD_CODE
                        Modification code to filter bedMethyl file (default:
                        "m")
  -s SAT_TYPE, --sat_type SAT_TYPE
                        Satellite type/name to filter CenSat bed file.
                        (default: "H1L")
  --bedgraph            Flag indicating if the input is a bedgraph. (default:
                        False)
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Window size to calculate prior regions. (default:
                        1020)
  --priorCDR_percent PRIORCDR_PERCENT
                        Percentile for finding priorCDR regions. (default: 5)
  --priorTransition_percent PRIORTRANSITION_PERCENT
                        Percentile for finding priorTransition regions.
                        (default: 10)
  --minCDR_size MINCDR_SIZE
                        Minimum size for CDR regions. (default: 3000)
  --enrichment          Enrichment flag. Pass in if you are looking for
                        methylation enriched regions. (default: False)
  --save_intermediates  Set to true if you would like to save
                        intermediates(filtered beds+window means). (default:
                        False)
  --output_label OUTPUT_LABEL
                        Label to use for name column of priorCDR BED file.
                        (default: "CDR")
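
The effect of `--minCDR_size` can be pictured as merging adjacent low-methylation windows and discarding merged regions that are too short. The following is a minimal sketch under that assumption, not the tool's code; `merge_windows` is a hypothetical name, and the defaults mirror `--window_size` and `--minCDR_size`.

```python
def merge_windows(starts, window_size=1020, min_size=3000):
    """Merge adjacent window starts into (start, end) regions and drop short ones."""
    regions = []
    for start in sorted(starts):
        if regions and regions[-1][1] == start:
            # This window begins where the previous region ends, so extend it.
            regions[-1][1] = start + window_size
        else:
            regions.append([start, start + window_size])
    return [(s, e) for s, e in regions if e - s >= min_size]
```

With the defaults, three adjacent windows starting at 0, 1020, and 2040 merge into a single 3060 bp region that is kept, while an isolated window is dropped.
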

usage: hmmCDRparse [-h] [--bedgraph] [--min_valid_cov MIN_VALID_COV]
                   [-m MOD_CODE] [-s SAT_TYPE]
                   bedMethyl_path cenSat_path output_prefix

Process bedMethyl and CenSat BED file to produce hmmCDR priors

positional arguments:
  bedMethyl_path        Path to the bedMethyl file
  cenSat_path           Path to the CenSat BED file
  output_prefix         Path to the output priorCDRs BED file

optional arguments:
  -h, --help            show this help message and exit
  --bedgraph            Flag indicating if the input is a bedgraph. (default:
                        False)
  --min_valid_cov MIN_VALID_COV
                        Minimum Valid Coverage to consider a methylation site.
                        (default: 10)
  -m MOD_CODE, --mod_code MOD_CODE
                        Modification code to filter bedMethyl file (default:
                        "m")
  -s SAT_TYPE, --sat_type SAT_TYPE
                        Satellite type/name to filter CenSat bed file.
                        (default: "H1L")
