Find CDR locations using a bedMethyl file and CenSat annotations. (Requires bedtools to be installed.)
hmmCDR
hmmCDR is a set of Python scripts that predict CDRs in active alpha satellite arrays, using modkit bedMethyl files and CenSat annotations.
Install (NONE UPLOADED YET!)
- conda
hmmCDR is available through conda. Install it with the following command:
conda install bioconda::hmmCDR
- pyPI
hmmCDR is pip installable:
pip install hmmCDR
- docker
hmmCDR can be run from a Docker container:
docker run -v ${PWD}:${PWD} jmmenend/hmmCDR:0.3.0 bash
This software finds Centromere Dip Regions (CDRs), subCDRs, and their boundaries within the active centromeric alpha satellite (alpha-sat) array. CDRs are uniquely hypomethylated regions within the typically hypermethylated alpha-sat array, and they are tightly associated with the histone variant Centromere Protein A (CENP-A). Accurate CDR and subCDR boundaries are therefore essential for studying their relationship with CENP-A. This method combines the earlier sliding-window approach to identifying CDRs with a Hidden Markov Model (HMM) that uses the sliding-window estimates as a prior. The advantage of this two-step approach shows at the edges of the CDRs: a sliding-window algorithm struggles to draw precise boundaries and to identify transitions into and out of CDRs, whereas the HMM greatly improves identification of these regions.
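The sliding-window step described above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the function names, the use of fixed non-overlapping windows, and the nearest-rank percentile are all assumptions. Only the idea of labelling the lowest-methylation windows as prior CDRs and the next tier as prior transitions comes from the text and the `--priorCDR_percent` / `--priorTransition_percent` options.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def window_means(sites, window_size):
    """Mean methylation per fixed window (assumed non-overlapping).

    sites: list of (position, methylation_value) on one chromosome.
    Returns sorted (window_start, window_end, mean_methylation) tuples.
    """
    bins = {}
    for pos, meth in sites:
        bins.setdefault(pos // window_size, []).append(meth)
    return [
        (idx * window_size, (idx + 1) * window_size, mean(vals))
        for idx, vals in sorted(bins.items())
    ]


def label_windows(windows, cdr_pct=5, transition_pct=10):
    """Label each window using percentile cutoffs on the window means."""
    means = [m for _, _, m in windows]
    cdr_cut = percentile(means, cdr_pct)
    transition_cut = percentile(means, transition_pct)
    labelled = []
    for start, end, m in windows:
        if m <= cdr_cut:
            label = "CDR"
        elif m <= transition_cut:
            label = "transition"
        else:
            label = "background"
        labelled.append((start, end, label))
    return labelled
```

Labels produced this way are only as precise as the window grid, which is the boundary limitation the HMM step is meant to address.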
[include a photo of HMM improvement over sliding window]
[Include the photo I shared with Karen summarizing the HMM]
This Python package takes a BED file of aggregate 5mC methylation, preferably from modkit, and a Centromere-Satellite Annotation (CenSat) file. The aggregate methylation file is used to locate 5mC-depleted regions, and the CenSat annotation is used to subset the methylation data to only the alpha-sat array. This improves both the speed and the accuracy of CDR identification, because the trend of hypermethylation is weaker outside this region. The package also processes each chromosome separately and in parallel to further improve speed.
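The subsetting and per-chromosome processing described above might look roughly like the following. This is a pure-Python stand-in under stated assumptions: the package itself requires bedtools and its actual parallel mechanism is not documented here, so `subset_by_chrom`, `mean_methylation`, and `process_in_parallel` are hypothetical names, and the thread pool is just one way to fan work out per chromosome.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def subset_by_chrom(sites, regions):
    """Keep only sites that fall inside an annotated region.

    sites:   iterable of (chrom, position, methylation_value)
    regions: iterable of (chrom, start, end), 0-based half-open
    Returns {chrom: [(position, methylation_value), ...]}.
    """
    regions_by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        regions_by_chrom[chrom].append((start, end))

    kept = defaultdict(list)
    for chrom, pos, meth in sites:
        intervals = regions_by_chrom.get(chrom, [])
        if any(start <= pos < end for start, end in intervals):
            kept[chrom].append((pos, meth))
    return dict(kept)


def mean_methylation(item):
    """Placeholder per-chromosome task: average methylation in the array."""
    chrom, chrom_sites = item
    return chrom, sum(m for _, m in chrom_sites) / len(chrom_sites)


def process_in_parallel(per_chrom):
    """Run the per-chromosome task concurrently, one task per chromosome."""
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(mean_methylation, per_chrom.items()))
```

Because each chromosome's sites are independent once grouped, any per-chromosome step (window means, HMM decoding) can be substituted for the placeholder task.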
Inputs:
1. Modkit Pileup bedMethyl File:
column | name | description | type |
---|---|---|---|
1 | chrom | name of chromosome/contig | str |
2 | start position | 0-based start position | int |
3 | end position | 0-based exclusive end position | int |
4 | modified base code | single letter code for modified base | str |
5 | score | Equal to Nvalid_cov. | int |
6 | strand | '+' for positive strand '-' for negative strand, '.' when strands are combined | str |
7 | start position | included for compatibility | int |
8 | end position | included for compatibility | int |
9 | color | included for compatibility, always 255,0,0 | str |
10 | Nvalid_cov | Refer to modkit github | int |
11 | fraction modified | Nmod / Nvalid_cov | float |
12 | Nmod | Refer to modkit github | int |
13 | Ncanonical | Refer to modkit github | int |
14 | Nother_mod | Refer to modkit github | int |
15 | Ndelete | Refer to modkit github | int |
16 | Nfail | Refer to modkit github | int |
17 | Ndiff | Refer to modkit github | int |
18 | Nnocall | Refer to modkit github | int |
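As an illustration of how the columns above map onto a record, here is a minimal parser with the two filters exposed on the command line (`--mod_code` and `--min_valid_cov`). The `MethylSite` class and both function names are hypothetical and not part of hmmCDR. The sketch splits on any whitespace, on the assumption that this copes with files where the trailing count columns are space-separated rather than tab-separated.

```python
from dataclasses import dataclass


@dataclass
class MethylSite:
    chrom: str
    start: int
    end: int
    mod_code: str
    n_valid_cov: int
    fraction_modified: float


def parse_bedmethyl_line(line):
    """Parse one bedMethyl row using the 18-column layout in the table."""
    fields = line.split()
    if len(fields) < 18:
        raise ValueError(f"expected 18 columns, got {len(fields)}")
    return MethylSite(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        mod_code=fields[3],
        n_valid_cov=int(fields[9]),          # column 10
        fraction_modified=float(fields[10]),  # column 11
    )


def keep_site(site, mod_code="m", min_valid_cov=10):
    """Mirror the documented defaults: 5mC code 'm', coverage >= 10."""
    return site.mod_code == mod_code and site.n_valid_cov >= min_valid_cov
```

For example, a row whose fourth column is `h` or whose Nvalid_cov is below 10 would be dropped before any windowing.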
2. CenSat Annotation bed
column | name | description | type |
---|---|---|---|
1 | chrom | name of chromosome/contig | str |
2 | start position | 0-based start position | int |
3 | end position | 0-based exclusive end position | int |
4 | satellite type/name | type of satellite and for some specific name in parentheses | str |
5 | score | Purpose unclear; may be unused. | int |
6 | strand | '+' for positive strand '-' for negative strand, '.' if uncertain | str |
7 | start position | included for compatibility | int |
8 | end position | included for compatibility | int |
9 | color | color of the annotation in browser | str |
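Selecting the alpha-sat array from a CenSat BED file could be sketched as below. Treating `--sat_type` as a substring match against column 4 is an assumption (the table only says the name may carry a specific label in parentheses), and `alpha_sat_regions` is a hypothetical helper, not a function exported by the package.

```python
def alpha_sat_regions(censat_lines, sat_type="H1L"):
    """Return (chrom, start, end, name) for rows whose name contains sat_type.

    Blank lines, comments, and browser 'track' header lines are skipped.
    """
    regions = []
    for line in censat_lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        if sat_type in name:  # assumed: substring match on column 4
            regions.append((chrom, start, end, name))
    return regions
```

The resulting intervals are what the methylation sites would be intersected with before any window means are computed.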
Usage:
usage: hmmCDR [-h] [--mod_code MOD_CODE] [--sat_type SAT_TYPE]
[--min_valid_cov MIN_VALID_COV] [--bedgraph]
[--window_size WINDOW_SIZE]
[--priorCDR_percent PRIORCDR_PERCENT]
[--priorTransition_percent PRIORTRANSITION_PERCENT]
[--minCDR_size MINCDR_SIZE] [--enrichment]
[--cdr_priors CDR_PRIORS | --emission_matrix EMISSION_MATRIX | --transition_matrix TRANSITION_MATRIX]
[--use_percentiles] [--n_iter N_ITER] [-w W] [-x X] [-y Y]
[-z Z] [--save_intermediates] [--output_label OUTPUT_LABEL]
bedMethyl_path cenSat_path output_path
Process input files with optional parameters.
positional arguments:
bedMethyl_path Path to the bedMethyl file
cenSat_path Path to the CenSat BED file
output_path Output Path for the output files
optional arguments:
-h, --help show this help message and exit
--mod_code MOD_CODE Modification code to filter bedMethyl file (default:
"m")
--sat_type SAT_TYPE Satellite type/name to filter CenSat bed file.
(default: "H1L")
--min_valid_cov MIN_VALID_COV
Minimum Valid Coverage to consider a methylation site.
(default: 10)
--bedgraph Flag indicating if the input is a bedgraph. (default:
False)
--window_size WINDOW_SIZE
Window size to calculate prior regions. (default:
1020)
--priorCDR_percent PRIORCDR_PERCENT
Percentile for finding priorCDR regions. (default: 5)
--priorTransition_percent PRIORTRANSITION_PERCENT
Percentile for finding priorTransition regions.
(default: 10)
--minCDR_size MINCDR_SIZE
Minimum size for CDR regions. (default: 3000)
--enrichment Enrichment flag. Pass in if you are looking for
methylation enriched regions. (default: False)
--cdr_priors CDR_PRIORS
Path to the priorCDR bedfile
--emission_matrix EMISSION_MATRIX
Path to the emission matrix TSV file
--transition_matrix TRANSITION_MATRIX
Path to the transition matrix TSV file
--use_percentiles Use values for flags w,x,y,z as percentile cutoffs for
each category. (default: False)
--n_iter N_ITER Maximum number of iteration allowed for the HMM.
(default: 1)
-w W Theshold for methylation to be classified as very low
(default: 0)
-x X Theshold for methylation to be classified as low
(default: 25)
-y Y Theshold for methylation to be classified as medium
(default: 50)
-z Z Theshold for methylation to be classified as high
(default: 75)
--save_intermediates Set to true if you would like to save
intermediates(filtered beds+window means). (default:
False)
--output_label OUTPUT_LABEL
Label to use for name column of hmmCDR BED file. Needs
to match priorCDR label. (default: "CDR")
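To make the roles of `-w`, `-x`, `-y`, `-z` and the emission/transition matrices more concrete, the sketch below bins methylation values into four categories and decodes a state path with the Viterbi algorithm. Several things here are assumptions rather than documented behaviour: the cutoffs are treated as lower bounds of each bin, the model has only two states (background and CDR), and every probability is made up for illustration. The package's own matrices and state layout may differ.

```python
import math
from bisect import bisect_right


def emission_category(meth, cutoffs=(0, 25, 50, 75)):
    """Bin a methylation value: 0=very low, 1=low, 2=medium, 3=high.

    Assumes each cutoff (w, x, y, z) is the lower bound of its bin.
    """
    return max(0, bisect_right(cutoffs, meth) - 1)


def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state path for a sequence of observed categories."""
    n_states = len(start_p)
    score = [math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in range(n_states)]
    back = []
    for o in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            # Best predecessor state for arriving in state s.
            best = max(range(n_states),
                       key=lambda p: prev[p] + math.log(trans_p[p][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans_p[best][s])
                         + math.log(emit_p[s][o]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Illustrative parameters only: state 0 = background, state 1 = CDR.
START = [0.9, 0.1]
TRANS = [[0.95, 0.05], [0.05, 0.95]]
EMIT = [[0.05, 0.05, 0.20, 0.70],   # background favours high methylation
        [0.60, 0.30, 0.08, 0.02]]   # CDR favours very low methylation

methylation = [90, 85, 10, 5, 60, 8, 12, 88, 92]
observations = [emission_category(m) for m in methylation]
states = viterbi(observations, START, TRANS, EMIT)
```

With these toy parameters the single medium-methylation site inside the dip stays part of the CDR state, which is the kind of smoothing that distinguishes the HMM from a hard per-window threshold.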
usage: hmmCDRprior [-h] [-m MOD_CODE] [-s SAT_TYPE] [--bedgraph]
[-w WINDOW_SIZE] [--priorCDR_percent PRIORCDR_PERCENT]
[--priorTransition_percent PRIORTRANSITION_PERCENT]
[--minCDR_size MINCDR_SIZE] [--enrichment]
[--save_intermediates] [--output_label OUTPUT_LABEL]
bedMethyl_path cenSat_path output_path
Process bedMethyl and CenSat BED file to produce hmmCDR priors
positional arguments:
bedMethyl_path Path to the bedMethyl file
cenSat_path Path to the CenSat BED file
output_path Path to the output priorCDRs BED file
optional arguments:
-h, --help show this help message and exit
-m MOD_CODE, --mod_code MOD_CODE
Modification code to filter bedMethyl file (default:
"m")
-s SAT_TYPE, --sat_type SAT_TYPE
Satellite type/name to filter CenSat bed file.
(default: "H1L")
--bedgraph Flag indicating if the input is a bedgraph. (default:
False)
-w WINDOW_SIZE, --window_size WINDOW_SIZE
Window size to calculate prior regions. (default:
1020)
--priorCDR_percent PRIORCDR_PERCENT
Percentile for finding priorCDR regions. (default: 5)
--priorTransition_percent PRIORTRANSITION_PERCENT
Percentile for finding priorTransition regions.
(default: 10)
--minCDR_size MINCDR_SIZE
Minimum size for CDR regions. (default: 3000)
--enrichment Enrichment flag. Pass in if you are looking for
methylation enriched regions. (default: False)
--save_intermediates Set to true if you would like to save
intermediates(filtered beds+window means). (default:
False)
--output_label OUTPUT_LABEL
Label to use for name column of priorCDR BED file.
(default: "CDR")
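The effect of `--minCDR_size` can be illustrated with a small helper that merges touching windows carrying the same label and then drops merged regions shorter than the minimum. Whether the package merges before filtering is an assumption, and `merge_and_filter` is a hypothetical name used only for this sketch.

```python
def merge_and_filter(windows, min_size=3000, label="CDR"):
    """Merge adjacent windows with the given label, then apply a size floor.

    windows: (start, end, label) tuples on one chromosome, sorted by start.
    Returns a list of (start, end) regions at least min_size bases long.
    """
    merged = []
    for start, end, window_label in windows:
        if window_label != label:
            continue
        if merged and start <= merged[-1][1]:
            # Touching or overlapping the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_size]
```

With the default window size of 1020 and minimum size of 3000, at least three consecutive labelled windows are needed for a region to survive this filter.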
usage: hmmCDRparse [-h] [--bedgraph] [--min_valid_cov MIN_VALID_COV]
[-m MOD_CODE] [-s SAT_TYPE]
bedMethyl_path cenSat_path output_prefix
Process bedMethyl and CenSat BED file to produce hmmCDR priors
positional arguments:
bedMethyl_path Path to the bedMethyl file
cenSat_path Path to the CenSat BED file
output_prefix Path to the output priorCDRs BED file
optional arguments:
-h, --help show this help message and exit
--bedgraph Flag indicating if the input is a bedgraph. (default:
False)
--min_valid_cov MIN_VALID_COV
Minimum Valid Coverage to consider a methylation site.
(default: 10)
-m MOD_CODE, --mod_code MOD_CODE
Modification code to filter bedMethyl file (default:
"m")
-s SAT_TYPE, --sat_type SAT_TYPE
Satellite type/name to filter CenSat bed file.
(default: "H1L")