Skip to main content

Accessing modified-base data from BAM files.

Project description

Oxford Nanopore Technologies logo

We have a new bioinformatic resource that replaces the functionality of this project! See our new repository here: modkit.

This repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore: support@nanoporetech.com for help with your application if it is not possible to upgrade.


Modified-base BAM to bedMethyl

A program to aggregate modified base counts stored in a modified-base BAM (Section 2.1) file to a bedMethyl file.

A Python module is also available to obtain modified base information from BAM files in a convenient form. It is envisaged that this will eventually be replaced by an implementation in pysam.

Installation

The program is available from our conda channel, so can be installed with:

mamba create -n modbam2bed -c bioconda -c conda-forge -c epi2melabs modbam2bed

Packages are available for both Linux and MacOS.

Alternatively to install from the source code, clone the repository and then use make:

git clone --recursive https://github.com/epi2me-labs/modbam2bed.git
make modbam2bed
./modbam2bed

See the Makefile for more information. The code has been tested on MacOS (with dependencies from brew) and on Ubuntu 18.04 and 20.04.

Usage

The code requires aligned reads with the Mm and Ml tags (MM and ML also supported), and the reference sequence used for alignment.

The below is a snapshot of the command-line interface; it may not be up-to-date, please refer to the program --help option for the most accurate guidance.

Usage: modbam2bed [OPTION...] <reference.fasta> <reads.bam> [<reads.bam> ...]
modbam2bed -- summarise one or more BAM with modified base tags to bedMethyl. 

 General options:
      --aggregate            Output additional aggregated (across strand)
                             counts, requires --cpg or --chg.
      --combine              Create output with combined modified counts: i.e.
                             alternative modified bases within the same family
                             (same canonical base) are included.
  -c, --pileup               Output (full) raw base counts rather than BED
                             file.
  -e, --extended             Output extended bedMethyl including counts of
                             canonical, modified, and filtered bases (in that
                             order).
  -m, --mod_base=BASE        Modified base of interest, one of: 5mC, 5hmC, 5fC,
                             5caC, 5hmU, 5fU, 5caU, 6mA, 5oxoG, Xao. (Or modA,
                             modC, modG, modT, modU, modN for generic modified
                             base).
  -p, --prefix=PREFIX        Output file prefix. Only used when multiple output
                             filters are given.
  -r, --region=chr:start-end Genomic region to process.
  -t, --threads=THREADS      Number of threads for BAM processing.

 Base filtering options:
  -a, --canon_threshold=THRESHOLD
                             Deprecated. The option will be removed in a future
                             version. Please use --threshold.
  -b, --mod_threshold=THRESHOLD   Deprecated. The option will be removed in a
                             future version. Please use --threshold.
      --chg                  Output records filtered to CHG sites.
      --chh                  Output records filtered to CHH sites.
      --cpg                  Output records filtered to CpG sites.
  -f, --threshold=THRESHOLD  Bases with a call probability < THRESHOLD are
                             filtered from results (default 0.66).
  -k, --mask                 Respect soft-masking in reference file.

 Read filtering options:
  -d, --max_depth=DEPTH      Max. per-file depth; avoids excessive memory
                             usage.
  -g, --read_group=RG        Only process reads from given read group.
      --haplotype=VAL        Only process reads from a given haplotype.
                             Equivalent to --tag_name HP --tag_value VAL.
      --tag_name=TN          Only process reads with a given tag (see
                             --tag_value).
      --tag_value=VAL        Only process reads with a given tag value.

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Method and output format

Oxford Nanopore Technogies' sequencing chemistries and basecallers can detect any number of modified bases. Compared to traditional methods which force a false dichoctomy between say cytosine and 5-methylcytosine, this rich biology needs to be remembered when interpreting modified base calls.

The htslib pileup API is used to create a matrix of per-strand base counts including substitutions, modified bases and deletions. Inserted bases are not counted. Bases of an abiguous nature (refered to as "filtered" below), as defined by the filter threshold probabilities option -b are masked and used (along with substitutions and deletions) in the definition of the "score" (column 5) and "coverage" (column 10) entries of the bedMethyl file.

In the case of ?-style MM subtags, where a lack of a recorded call should not be taken as implying a canonical-base call, the "no call" count is incremented. The "no call" count is used in the calculation of "coverage" and also the denominator of "score".

In summary, a base is determined as being either "canonical", "modified", "filtered", or "no call". The final output includes a modification frequency and score and coverage information in order to assess the reliability of the frequency.

Call filtering

To determine the base present at a locus in a read, the query base in the BAM record is examined along with the modified base information. A "canonical" base probability is calculated as 1 - sum(P_mod), with P_mod being the set of probabilities associated with all the modifications enumerated in the BAM record. The base form with largest probability is taken as the base present subject to the user-specified threshold. If the probability is below the threshold the call is masked and contributes to the "filtered" base count rather than the "canonical" or "modified" counts.

Special Handling of alternative modified bases (--combine option)

To intepret the case of multiple modifications being listed in the BAM, modbam2bed can operate in two modes:

  • default: alternative modified bases in the same family as the requested modification are counted separatedly as "other" --- neither in the "canonical" count of the "modified" count.
  • --combine: alternative modified bases are lumped together into the "modified" count and ultimately into a single modification frequency.

A particular case where --combine is useful is when comparing to the result of bisulfite sequencing.

Output format

The description of the bedMethyl format on the ENCODE project website is rather loose. The definitions below are chosen pragmatically.

The table below describes precisely the entries in each column of the output BED file. Columns seven to nine inclusive are included for compatibility with the BED file specification, the values written are fixed and no meaning should be derived from them. Columns 5, 10, and 11 are defined in terms of counts of observed bases to agree with reasonable interpretations of the bedMethyl specifications:

  • Ncanon - canonical (unmodified) base count, (contigent on the use of --combine, see above.)
  • Nmod - modified base count.
  • Nfilt - count of bases where read does not contain a substitution or deletion with respect to the reference, but the modification status is ambiguous: these bases were filtered from the calculation of the modification frequency.
  • Nsub - count of reads with a substitution with respect to the reference.
  • Ndel - count of reads with a deletion with respect to the reference.
  • Nno call - counts of reads with an absent modification call (but not a substitution or deletion).
  • Nalt mod - counts of reads with and alternative modification call (but not a substitution or deletion).

Since these interpretations may differ from other tools an extended output is available (enabled with the -e option) which includes three additional columns with verbatim base counts.

column description
1 reference sequence name
2 0-based start position
3 0-based exclusive end position (invariably start + 1)
4 Abbreviated name of modified-base examined
5 "Score" 1000 * (Nmod + Ncanon) / (Nmod + Ncanon + Nno call + Nalt mod + Nfilt + Nsub + Ndel). The quantity reflects the extent to which the calculated modification frequency in Column 11 is confounded by the alternative calls. The denominator here is the total read coverage as given in Column 10.
6 Strand (of reference sequence). Forward "+", or reverse "-".
7-9 Ignore, included simply for compatibility.
10 Read coverage at reference position including all canonical, modified, undecided (no calls and filtered), substitutions from reference, and deletions. Nmod + Ncanon + Nno call + Nalt mod + Nfilt + Nsub + Ndel
11 Percentage of modified bases, as a proportion of canonical and modified (excluding no calls, filtered, substitutions, and deletions). 100 * Nmod / (Nmod + Nalt mod + Ncanon)
12* Ncanon
13* Nmod
14* Nfilt those bases with a modification probability falling between given thresholds.
15* Nno call those bases for which the query base was the correct canonical base for the modified base being considered, but no call was made (see the definition of the . and ? flags in the SAM tag specification).
16* Nalt mod those bases for which the query base was the correct canonical base for the modified base being considered, but and alternative modification was present.

* Included in extended output only.

Limitations

The code has not been developed extensively and currently has some limitations:

  • Support for motif filtering is limited to CpG, CHG, and CHH, sites. Without this filtering enabled all reference positions that are the canonical base (on forward or reverse strand) equivalent to the modified base under consideration are reported.
  • Insertion columns are completely ignored for simplicitly (and avoid any heuristics).
  • Second strand MM subtags (i.e. MM:C-m as compared with MM:C+m) are not supported. These are not typically used so shouldn't affect most users. If such a tag is detected and warning will be thrown and the tag ignored. These tags do come in to play for duplex basecalls.

Python package

A Python package is available on PyPI which contains basic functionality for parsing BAM files with modified-base information. It is envisaged that this will eventually be replaced by an implementation in pysam. As such the interface is supplements but does not integrate or replace pysam.

The package can be installed with:

pip install modbampy

The package contains simply to modes of use. Firstly an interface to iterate over reads in a BAM file and report modification sites:

from modbampy import ModBam
with ModBam(args.bam) as bam:
    for read in bam.reads(args.chrom, args.start, args.end):
        for pos_mod in read.mod_sites:
            print(*pos_mod)

Each line of the above reports the

  • read_id,
  • reference position,
  • query (read) position,
  • reference strand (+ or -),
  • modification strand (0 or 1, as defined in the HTSlib tag specification. This is invariable 0),
  • canonical base associated with modification,
  • modified base,
  • modified-base score (scaled to 0-255).

A second method is provided which mimics the couting procedure implemented in modbam2bed:

from modbampy import ModBam
with ModBam(args.bam) as bam:
    positions, counts = bam.pileup(
        args.chrom, args.start, args.end
        low_threshold=0.33, high_threshold=0.66, mod_base="m")

The result is two numpy arrays. The first indicates the reference positions associated with the counts in the second array. Each row of the second array (counts above) enumerates the observed counts of bases in the order:

a c g t A C G T d D m M f F n N

where uppercase letters refer to bases on the forward strand, lowercase letters relate to the reverse strand:

  • A, C, G, T are the usual DNA bases,
  • D indicates deletion counts,
  • M modified base counts,
  • F filtered counts - bases in reads with a modified-base record but which were filtered according to the thresholds provided.
  • N no call base counts.

Extras

The read iterator API also contains a minimal set of functionality mirroring properties of alignments available from pysam. See the code for further details.

Acknowledgements

We thank jkbonfield for developing the modified base functionality into the htslib pileup API, and Jared Simpson for testing and comparison to his independently developed code.

Help

Licence and Copyright

© 2021- Oxford Nanopore Technologies Ltd.

modbam2bed is distributed under the terms of the Mozilla Public License 2.0.

Research Release

Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

modbampy-0.10.0.tar.gz (1.1 MB view details)

Uploaded Source

Built Distributions

modbampy-0.10.0-cp310-cp310-manylinux_2_24_x86_64.whl (5.7 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.24+ x86-64

modbampy-0.10.0-cp39-cp39-manylinux_2_24_x86_64.whl (5.7 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.24+ x86-64

modbampy-0.10.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

modbampy-0.10.0-cp38-cp38-manylinux_2_24_x86_64.whl (5.7 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.24+ x86-64

modbampy-0.10.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

modbampy-0.10.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.9 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

modbampy-0.10.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64

modbampy-0.10.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.9 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

File details

Details for the file modbampy-0.10.0.tar.gz.

File metadata

  • Download URL: modbampy-0.10.0.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.10

File hashes

Hashes for modbampy-0.10.0.tar.gz
Algorithm Hash digest
SHA256 89d01792ce9871f113a785520d04192c5f79cde223459a36cc8301b7eeca4d42
MD5 83130b5f89d965221c0c620e8a4d14a6
BLAKE2b-256 673686cdb9b6ea509480ea85b157687d2e6ff3cbce0e747778ac9dcd0e4cf6f7

See more details on using hashes here.

File details

Details for the file modbampy-0.10.0-cp310-cp310-manylinux_2_24_x86_64.whl.

File metadata

File hashes

Hashes for modbampy-0.10.0-cp310-cp310-manylinux_2_24_x86_64.whl
Algorithm Hash digest
SHA256 1ebe97d6cae93176d05c1f19793e79f455176c0a573733a16bb0341c97de0ab6
MD5 7357f5169533e16eba63eaaffa3e325a
BLAKE2b-256 b386229ba5912dde6997837f6ab3bf9c15ff03405522563b0d7a5f24b7fbe4d1

See more details on using hashes here.

File details

Details for the file modbampy-0.10.0-cp39-cp39-manylinux_2_24_x86_64.whl.

File metadata

File hashes

Hashes for modbampy-0.10.0-cp39-cp39-manylinux_2_24_x86_64.whl
Algorithm Hash digest
SHA256 18917608c01918a3416be769f449239da734df4198c02b8a9fa3af981db9f577
MD5 6d15da32ae797a205e3aa6cce97627c0
BLAKE2b-256 67d035aac977a4f1f79d809bf93350ed432ae18972854e1cccbd1f9bcc84c569

See more details on using hashes here.

File details

Details for the file modbampy-0.10.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for modbampy-0.10.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 8581ff0ba79268ff625ad046fd76294de567762df0ba7f07f707c02761db9f10
MD5 5eb2297e3e5975e66e5e6f290d5b13e8
BLAKE2b-256 6b909d09c9805d65b024dd71fa24607818b9efda9f10808549efefcc8131e72b

See more details on using hashes here.

File details

Details for the file modbampy-0.10.0-cp38-cp38-manylinux_2_24_x86_64.whl.

File metadata

File hashes

Hashes for modbampy-0.10.0-cp38-cp38-manylinux_2_24_x86_64.whl
Algorithm Hash digest
SHA256 cc655c15e4c4afc894c96aa96612256dab673874d039429dfc548602c99cd5b0
MD5 b1ce7bcd7afb786291b2976835f3d93d
BLAKE2b-256 a6240e8b7a9fdbc6b0dc3c2d87c3801d7e502d1dcc3e489fcc37532f66d016d5

See more details on using hashes here.

File details

Details for the file modbampy-0.10.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for modbampy-0.10.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 72cedfb8ec7550cbeab123e336d736c283c2f3a1aab8699de1d4a4bfbf2774c9
MD5 aa0faa9d1aa05ea2bae0a30e59feebd2
BLAKE2b-256 2ed1ab58116b819e804f1d83d39ee0ad1a2c9c1e54c60f67ca08ceab2fb780ac

See more details on using hashes here.

File details

Details for the file modbampy-0.10.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for modbampy-0.10.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 309eb6206115b3c3b92778b59d4e668446b2713cb7652519a980eec507d31e51
MD5 7e6abcba84505fa068965e5775e924f0
BLAKE2b-256 30dbd66fdc693a0d327874fd2053c060201239984e0038f1a839ce6afc5c8b3f

See more details on using hashes here.

File details

Details for the file modbampy-0.10.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for modbampy-0.10.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5ae05f3203bd1b5e9c33d1def773f3e911e42b4e02dd48749bf6d71e077cfd3a
MD5 4df7536bb6498b37026978c6cd32be42
BLAKE2b-256 2eaa586bc4151c92386a0c8e3c47a8597facd79e34b8fe0d1d426ce3c815df69

See more details on using hashes here.

File details

Details for the file modbampy-0.10.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for modbampy-0.10.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 5c4942fb2c4c64327c872dc826101b6875bf26e177e3dcda6b73a6bdf0ced658
MD5 774fe3db9fb79faa757f3293b3b1da4a
BLAKE2b-256 d22c350c8dd5bee8a9e6d3e22f61e9d339327a3aacfc116c6d7d9a349f5caf2c

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page