modbampy

Accessing modified-base data from BAM files.

Project description

Oxford Nanopore Technologies logo

Modified-base BAM to bedMethyl

A program to aggregate modified base counts stored in a modified-base BAM (Section 2.1) file to a bedMethyl file.

A Python module is also available to obtain modified base information from BAM files in a convenient form. It is envisaged that this will eventually be replaced by an implementation in pysam.

Installation

The program is available from our conda channel, so can be installed with:

mamba create -n modbam2bed -c bioconda -c conda-forge -c epi2melabs modbam2bed

Packages are available for both Linux and MacOS.

Alternatively to install from the source code, clone the repository and then use make:

git clone --recursive <repository>
make modbam2bed
./modbam2bed

See the Makefile for more information. The code has been tested on MacOS (with dependencies from brew) and on Ubuntu 18.04.

Usage

The code requires aligned reads with the Mm and Ml tags (MM and ML also supported), and the reference sequence used for alignment.

Usage: modbam2bed [OPTION...] <reference.fasta> <reads.bam> [<reads.bam> ...]
modbam2bed -- summarise one or more BAM with modified base tags to bedMethyl.

 General options:
  -e, --extended             Output extended bedMethyl including counts of
                             canonical, modified, and filtered bases (in that
                             order).
  -m, --mod_base=BASE        Modified base of interest, one of: 5mC, 5hmC, 5fC,
                             5caC, 5hmU, 5fU, 5caU, 6mA, 5oxoG, Xao.
  -r, --region=chr:start-end Genomic region to process.
  -t, --threads=THREADS      Number of threads for BAM processing.

 Base filtering options:
  -a, --canon_threshold=THRESHOLD
                             Bases with mod. probability < THRESHOLD are
                             counted as canonical.
  -b, --mod_threshold=THRESHOLD   Bases with mod. probability > THRESHOLD are
                             counted as modified.
  -c, --cpg                  Output records filtered to CpG sites.

 Read filtering options:
  -g, --read_group=RG        Only process reads from given read group.
      --haplotype=VAL        Only process reads from a given haplotype.
                             Equivalent to --tag_name HP --tag_value VAL.
      --tag_name=TN          Only process reads with a given tag (see
                             --tag_value).
      --tag_value=VAL        Only process reads with a given (integer) tag
                             value.

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Method and output format

The htslib pileup API is used to create a matrix of per-strand base counts including modified bases and deletions. Inserted bases are not counted. Bases of an abiguous nature, as defined by the two threshold probabilities are masked and used (along with substitutions and deletions) in the definition of the "score" (column 5) and "coverage" (column 10) entries of the bedMethyl file.

The description of the bedMethyl format on the ENCODE project website is rather loose. The definitions below are chosen pragmatically.

The table below describes precisely the entries in each column of the output BED file. Columns seven to nine inclusive are included for compatibility with the BED file specification, the values written are fixed and no meaning should be derived from them. Columns 5, 10, and 11 are defined in terms of counts of observed bases to agree with reasonable interpretations of the bedMethyl specifications:

N_canon - canonical (unmodified) base count.
N_mod - modified base count.
N_filt - count of bases where read does not contain a substitution or deletion with respect to the reference, but the modification status is ambiguous: these bases were filtered from the calculation of the modification frequency.
N_sub - count of reads with a substitution with respect to the reference.
N_del - count of reads with a deletion with respect to the reference.

Since these interpretations may differ from other tools an extended output is available (enabled with the -e option) which includes three additional columns with verbatim base counts.

column	description
1	reference sequence name
2	0-based start position
3	0-based exclusive end position (invariably start + 1)
4	Abbreviated name of modified-base examined
5	"Score" 1000 * (N_mod + N_canon) / (N_mod + N_canon + N_filt + N_sub + N_del). The quantity reflects the extent to which the calculated modification frequency in Column 11 is confounded by the alternative calls. The denominator here is the total read coverage as given in Column 10.
6	Strand (of reference sequence). Forward "+", or reverse "-".
7-9	Ignore, included simply for compatibility.
10	Read coverage at reference position including all canonical, modified, undecided (filtered), substitutions from reference, and deletions. N_mod + N_canon + N_filt + N_sub + N_del
11	Percentage of modified bases, as a proportion of canonical and modified (excluding filtered, substitutions, and deletions). 100 * N_mod / (N_mod + N_canon)
12*	N_canon
13*	N_mod
14*	N_filt those bases with a modification probability falling between given thresholds.

* Included in extended output only.

Limitations

The code has not been developed extensively and currently has some limitations:

Support for motif filtering is limit to CpG sites. Without this filtering enabled all reference positions that are the canonical base (on forward or reverse strand) equivalent to the modified base under consideration are reported.
No option to combine per-strand counts into a total count (how to do this generically depends on motif).
Insertion columns are completely ignored for simplicitly (and avoid any heuristics).

Python package

A Python package is available on PyPI which contains basic functionality for parsing BAM files with modified-base information. It is envisaged that this will eventually be replaced by an implementation in pysam. As such the interface is supplements but does not integrate or replace pysam.

The package can be installed with:

pip install modbampy

The package contains simply to modes of use. Firstly an interface to iterate over reads in a BAM file and report modification sites:

from modbampy import ModBam
with ModBam(args.bam) as bam:
    for read in bam.reads(args.chrom, args.start, args.end):
        for pos_mod in read.mod_sites:
            print(*pos_mod)

Each line of the above reports the

read_id,
reference position,
query (read) position,
reference strand (+ or -),
modification strand (0 or 1, as defined in the HTSlib tag specification. This is invariable 0),
canonical base associated with modification,
modified base,
modified-base score (scaled to 0-255).

A second method is provided which mimics the couting procedure implemented in modbam2bed:

from modbampy import ModBam
with ModBam(args.bam) as bam:
    positions, counts = bam.pileup(
        args.chrom, args.start, args.end
        low_threshold=0.33, high_threshold=0.66, mod_base="m")

The result is two numpy arrays. The first indicates the reference positions associated with the counts in the second array. Each row of the second array (counts above) enumerates the observed counts of bases in the order:

a c g t A C G T d D m M f F

where uppercase letters refer to bases on the forward strand, lowercase letters relate to the reverse strand:

A, C, G, T are the usual DNA bases,
D indicates deletion counts,
M modified base counts,
F filtered counts - bases in reads with a modified-base record but which were filtered according to the thresholds provided.

Extras

The read iterator API also contains a minimal set of functionality mirroring properties of alignments available from pysam. See the code for further details.

Acknowledgements

We thank jkbonfield for developing the modified base functionality into the htslib pileup API, and Jared Simpson for testing and comparison to his independently developed code.

Help

Licence and Copyright

modbam2bed is distributed under the terms of the Mozilla Public License 2.0.

Research Release

Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resource for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.

Project details

Release history Release notifications | RSS feed

0.10.0

Aug 31, 2023

0.9.5

Apr 17, 2023

0.9.4

Mar 30, 2023

0.9.3

Mar 15, 2023

0.9.1

Feb 15, 2023

0.9.0

Feb 13, 2023

0.8.0

Feb 8, 2023

0.7.0

Jan 23, 2023

0.6.3

Sep 23, 2022

0.6.2

Jul 1, 2022

0.6.1 yanked

Jul 1, 2022

Reason this release was yanked:

Broken

0.5.4

Jun 27, 2022

This version

0.5.3

May 10, 2022

0.5.2

Apr 13, 2022

0.5.1

Mar 23, 2022

0.5.0

Mar 1, 2022

0.4.6

Feb 22, 2022

0.4.5

Feb 14, 2022

0.4.4

Feb 11, 2022

0.4.3

Feb 11, 2022

0.4.1

Jan 27, 2022

0.4.0

Jan 13, 2022

0.3.2

Oct 29, 2021

0.3.1

Sep 3, 2021

0.3.0

Aug 23, 2021

0.2.2

Aug 17, 2021

0.2.1

Aug 17, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

modbampy-0.5.3.tar.gz (894.7 kB view details)

Uploaded May 10, 2022 Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

modbampy-0.5.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.8 MB view details)

Uploaded May 10, 2022 CPython 3.9manylinux: glibc 2.12+ x86-64

modbampy-0.5.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.8 MB view details)

Uploaded May 10, 2022 CPython 3.8manylinux: glibc 2.12+ x86-64

modbampy-0.5.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.8 MB view details)

Uploaded May 10, 2022 CPython 3.7mmanylinux: glibc 2.12+ x86-64

modbampy-0.5.3-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.8 MB view details)

Uploaded May 10, 2022 CPython 3.6mmanylinux: glibc 2.12+ x86-64

File details

Details for the file modbampy-0.5.3.tar.gz.

File metadata

Download URL: modbampy-0.5.3.tar.gz
Upload date: May 10, 2022
Size: 894.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.8.10

File hashes

Hashes for modbampy-0.5.3.tar.gz
Algorithm	Hash digest
SHA256	`448a367796b0b2431f375a9574e9ddc73bec58ab8c8bf31c40b7ff2be243133a`
MD5	`314da47eb8f1fd9c44947d9b6c0b9e1b`
BLAKE2b-256	`6d42c8700b69dded483d3b86e8e2dcaa990d4a8723f1b139455c25e581b92bf9`

See more details on using hashes here.

File details

Details for the file modbampy-0.5.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

Download URL: modbampy-0.5.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Upload date: May 10, 2022
Size: 3.8 MB
Tags: CPython 3.9, manylinux: glibc 2.12+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.8.10

File hashes

Hashes for modbampy-0.5.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm	Hash digest
SHA256	`5e6cc43f41d15a1e565560dc71c1c9401d572c6e99ab7a0debaf659c0b9489bd`
MD5	`ecf7d38382ac42d04fb77462afb4c6e8`
BLAKE2b-256	`7c16cadea1077adf1a7308a0f32da398ec4d23d41eef86de8c20abfa0aea6b02`

See more details on using hashes here.

File details

Details for the file modbampy-0.5.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

Download URL: modbampy-0.5.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Upload date: May 10, 2022
Size: 3.8 MB
Tags: CPython 3.8, manylinux: glibc 2.12+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.8.10

File hashes

Hashes for modbampy-0.5.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm	Hash digest
SHA256	`5a50783774866067916286df115e2ee524a114dcfc1a0e2345dbd30c1297b274`
MD5	`81e7f651684dc58f3e4d03b97463b7f0`
BLAKE2b-256	`03459177086c167b68e4716fd564552f4db3ed18aa0e783cff508e5128597fea`

See more details on using hashes here.

File details

Details for the file modbampy-0.5.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

Download URL: modbampy-0.5.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Upload date: May 10, 2022
Size: 3.8 MB
Tags: CPython 3.7m, manylinux: glibc 2.12+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.8.10

File hashes

Hashes for modbampy-0.5.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm	Hash digest
SHA256	`fb556402d80e9fbafa82ce0b3f5b348d732f6333221a71637834ef9d644cf38d`
MD5	`742fed9c34374e6fab06f94dccf4d9bb`
BLAKE2b-256	`4889f0c3d2d7a53205339fbca8ea41c7026af6b2d9408776711aea696ddea5f5`

See more details on using hashes here.

File details

Details for the file modbampy-0.5.3-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

Download URL: modbampy-0.5.3-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Upload date: May 10, 2022
Size: 3.8 MB
Tags: CPython 3.6m, manylinux: glibc 2.12+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.8.10

File hashes

Hashes for modbampy-0.5.3-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm	Hash digest
SHA256	`c120067a6470c3fdd460205a71acdd35a9b3fb83a6aa22cce6050c17c8be2c9e`
MD5	`5c181149fe457399c949ab1ebc05a86a`
BLAKE2b-256	`059c123e3df00b1a0711e4d77f634e075dfd21b05bb5f25c68e816e081a18dea`

See more details on using hashes here.

modbampy 0.5.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

Modified-base BAM to bedMethyl

Installation

Usage

Method and output format

Limitations

Python package

Acknowledgements

Help

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distributions

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes