Accessing modified-base data from BAM files.
Project description
Modified-base BAM to bedMethyl
Simple program to aggregate modified base counts stored in a modified-base BAM (Section 2.1) file to a bedMethyl file.
The code uses the methylation
branch of htslib from
jkbonfield.
Installation
The program is available from our conda channel, so can be installed with:
mamba create -n modbam2bed -c conda-forge -c epi2melabs modbam2bed
Packages are available for bothe Linux and MacOS.
Alternatively to install from the source code, clone the repository and then use make:
git clone --recursive <repository>
make modbam2bed
./modbam2bed
See the Makefile for more information. The code has been tested on MacOS (with dependencies from brew) and on Ubuntu 18.04.
Usage
The code requires aligned reads with the Mm
and Ml
tags, and the reference
sequence used for alignment.
Usage: modbam2bed [OPTION...] <reads.bam> <reference.fasta> >
modbam2bed -- summarise a BAM with modified base tags to bedMethyl.
General options:
-e, --extended Output extended bedMethyl including counts of
canonical, modified, and filtered bases (in that
order).
-m, --mod_base=BASE Modified base of interest, one of: 5mC, 5hmC, 5fC,
5caC, 5hmU, 5fU, 5caU, 6mA, 5oxoG, Xao.
-r, --region=chr:start-end Genomic region to process.
-t, --threads=THREADS Number of threads for BAM processing.
Base filtering options:
-a, --canon_threshold=THRESHOLD
Bases with mod. probability < THRESHOLD are
counted as canonical.
-b, --mod_threshold=THRESHOLD Bases with mod. probability > THRESHOLD are
counted as modified.
-c, --cpg Output records filtered to CpG sited.
Read filtering options:
-g, --read_group=RG Only process reads from given read group.
--haplotype=VAL Only process reads from a given haplotype.
Equivalent to --tag_name HP --tag_value VAL.
--tag_name=TN Only process reads with a given tag (see
--tag_value).
--tag_value=VAL Only process reads with a given tag value.
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
Positions absent from the methylation tags are assumed to be canonical.
Positions with modified probability between upper and lower thresholds are
removed from the counting process. Column 5 ("score") of the output is
calculated as the proportion of bases called as the canonical or modified
reference base with respect to the number of spanning reads, scaled to a
maximum of 1000. Column 11 is the percentage of reference base calls identified
as being modified.
Report bugs to chris.wright@nanoporetech.com.
Method
The htslib pileup API is used to create a matrix of per-strand base counts including modified bases and deletions. Inserted bases are not counted. Bases of an abiguous nature, as defined by the two threshold probabilities are masked and used (along with substitutions and deletions) in the definition of the "score" column of the bedMethyl file.
The description of the bedMethyl format on the ENCODE project website is rather loose. The definitions below are chosen pragmatically.
The output bedMethyl file contains per-strand modified base proportions in column 11:
100 * N_mod / (N_mod + N_canon)
where only canonical bases equivalent to the modified base are counted (substitutions and deletions are ignored). Similarly column 5 contains the modified count scaled by the total depth on the strand:
1000 * (N_mod + N_canon) / (N_mod + N_canon + N_filt + N_sub + N_del)
Note the denominator here is the total sequencing depth at the position (including deleted and ambiguous bases); the quantity therefore reflects the extent to which the decision of modified or canonical base is confounded by alternative calls.
Extended Output
As the bedMethyl definition is rather loose, the program allows for an "extended" output which includes three additional columns reporting counts of relevant bases:
- column 12: count of canonical base,
- column 13: count of modified base,
- column 14: count of filtered bases (those with a modification probability
falling between the
--low_threshold
and--high_threshold
parameters).
Limitations
The code has not been developed extensively and currently has some limitations:
- Support for motif filtering is limit to CpG sites. Without this filtering enabled all reference positions that are the canonical base (on forward or reverse strand) equivalent to the modified base under consideration are reported.
- No option to combine per-strand counts into a total count (how to do this generically depends on motif).
- Insertion columns are completely ignored for simplicitely (and avoid any heuristics).
Acknowledgements
We thank jkbonfield for developing the modified base functionality into the htslib pileup API, and Jared Simpson for testing and comparison to his independently developed code.
Help
Licence and Copyright
© 2021 Oxford Nanopore Technologies Ltd.
modbam2bed
is distributed under the terms of the Mozilla Public License 2.0.
Research Release
Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for modbampy-0.2.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 52b6a20f924a029863c27bd2b9bb77174f19ae40f77902730d00befdfdd266f0 |
|
MD5 | d074d5d3cbfa37ff3eb96b08302dd8ff |
|
BLAKE2b-256 | 01ef99be5f576ff7db0f873bc7713921bffd2742fc5ccd6ab381f35a99c41854 |
Hashes for modbampy-0.2.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 68b2f71492f7c491ce50d7652b203b4e1acfb5a39bee3d636caa1e812a2bc912 |
|
MD5 | fa2321c4a19298dfaa8dea4e84b99904 |
|
BLAKE2b-256 | 350e280e5b17b0bc460560075c77da9bd9299b6b7dcb0c061740a94f8b23c90e |
Hashes for modbampy-0.2.1-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 46d4c2c5add64b1c7fba490404ff463b777a1d3f57fe0c645aa5c4563f441263 |
|
MD5 | cf59172636980a6537fea87cdf4e259d |
|
BLAKE2b-256 | cb9201d1ec72f52cd9d18713cfd5c7bb064ea70fe031fe238e265112fcd3beb2 |