A package for accessing 2bit files using lib2bit

py2bit

A Python extension, written in C, for quick access to 2bit files. The extension uses lib2bit for file access.

Installation

You can install the extension directly from GitHub with:

pip install git+https://github.com/dpryan79/py2bit

Usage

Basic usage is as follows:

Load the extension

>>> import py2bit

Open a 2bit file

This will work if your working directory is the py2bit source code directory.

>>> tb = py2bit.open("test/foo.2bit")

Note that if you would like to include information about soft-masked bases, you need to manually specify that:

>>> tb = py2bit.open("test/foo.2bit", True)

Access the list of chromosomes and the lengths

TwoBit objects contain a dictionary holding the chromosome/contig lengths, which can be accessed with the chroms() method.

>>> tb.chroms()
{'chr1': 150, 'chr2': 100}

You can directly access a particular chromosome by specifying its name.

>>> tb.chroms('chr1')
150

The lengths are returned as integers. Under Python 2 they are of the "long" type and print with an L suffix (for example, 150L). If you specify a nonexistent chromosome then nothing is output.

>>> tb.chroms("foo")
>>>
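Because an unknown name produces no output, a typo in a chromosome name can go unnoticed. The following is a minimal sketch of one way to make that failure explicit. It is not part of py2bit: chrom_length is a hypothetical helper, and the dictionary literal stands in for the value returned by tb.chroms().

```python
def chrom_length(chroms, name):
    """Return the length of `name`, raising KeyError if it is absent.

    `chroms` is a dict of name -> length, such as the one tb.chroms() returns.
    """
    try:
        return chroms[name]
    except KeyError:
        known = ", ".join(sorted(chroms))
        raise KeyError(f"unknown chromosome {name!r}; known: {known}") from None


# Stand-in for tb.chroms() on the test file used above.
chroms = {"chr1": 150, "chr2": 100}
print(chrom_length(chroms, "chr1"))
```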

Print file information

The following information about and contained within a 2bit file can be accessed with the info() method:

  • file size, in bytes (file size)
  • number of chromosomes/contigs (nChroms)
  • total sequence length, in bases (sequence length)
  • total number of hard-masked (N) bases (hard-masked length)
  • total number of soft-masked (lower case) bases (soft-masked length).

Note that soft-masked length will only be present if open("file.2bit", True) is used, since handling soft-masking increases memory requirements and decreases performance.

>>> tb.info()
{'file size': 161, 'nChroms': 2, 'sequence length': 250, 'hard-masked length': 150, 'soft-masked length': 8}
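The dictionary is convenient for deriving summary figures. As a rough sketch, the hypothetical helper below (masked_fractions is not a py2bit function) computes the masked share of the total sequence and tolerates a missing soft-masked entry. The literal dictionary stands in for the value returned by tb.info().

```python
def masked_fractions(info):
    """Return the hard- and soft-masked fractions of the total sequence.

    `info` is a dict shaped like the one tb.info() returns. The soft-masked
    fraction is None when the 'soft-masked length' key is absent, i.e. when
    the file was opened without soft-masking support.
    """
    total = info["sequence length"]
    hard = info["hard-masked length"] / total
    soft = info.get("soft-masked length")
    return {"hard": hard, "soft": None if soft is None else soft / total}


# Stand-in for tb.info() on the test file used above.
info = {"file size": 161, "nChroms": 2, "sequence length": 250,
        "hard-masked length": 150, "soft-masked length": 8}
print(masked_fractions(info))
```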

Fetch a sequence

The sequence of a full or partial chromosome/contig can be fetched with the sequence() method.

>>> tb.sequence("chr1")
'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

By default, the whole chromosome/contig is returned. A specific range can also be requested.

>>> tb.sequence("chr1", 24, 74)
NNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATC

The first number is the (0-based) position on the chromosome/contig where the sequence should begin. The second number is the (1-based) position on the chromosome where the sequence should end.

If it was requested during file opening that soft-masking information be stored, then lower case bases may be present. If a nonexistent chromosome/contig is specified then a runtime error occurs.
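The start and end arguments behave like the bounds of a Python slice. The sketch below illustrates this on an in-memory copy of chr1, rebuilt by hand from the output shown above. The fetch function is a hypothetical stand-in and does not call py2bit; in particular, raising ValueError for a bad range is this sketch's own choice, not documented py2bit behavior.

```python
# chr1 from the test file, rebuilt by hand from the sequence() output above.
chr1 = ("N" * 50 + "ACGTACGTACGT" + "agctagct"
        + "GATCGATCGTAGCTAGCTAGCTAGCTGATC" + "N" * 50)


def fetch(seq, start=0, end=None):
    """Mimic sequence(chrom, start, end) on an in-memory string."""
    if end is None:
        end = len(seq)
    if not 0 <= start < end <= len(seq):
        raise ValueError(f"invalid range {start}-{end} for length {len(seq)}")
    # A half-open range [start, end) is exactly a Python slice.
    return seq[start:end]


region = fetch(chr1, 24, 74)
print(region)
print(len(region))  # end - start = 50 bases
```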

Fetch per-base statistics

It's often required to compute the percentage of 1 or more bases in a chromosome. This can be done with the bases() method.

>>> tb.bases("chr1")
{'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667}

This returns a dictionary with bases as keys and the fraction of the sequence composed of them as values. Note that this will not sum to 1 if there are any hard-masked bases (the chromosome is 2/3 N in this case). One can also request this information over a particular region.

>>> tb.bases("chr1", 24, 74)
{'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12}

The start and end position are as with the sequence() method described above.

If integer counts are preferred, then they can instead be returned.

>>> tb.bases("chr1", 24, 74, False)
{'A': 6, 'C': 6, 'T': 6, 'G': 6}
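To make the arithmetic concrete, the sketch below recomputes the same figures in pure Python from the 50-base region returned by tb.sequence("chr1", 24, 74). The base_composition function is a hypothetical illustration and is unrelated to how py2bit computes these values internally. Soft-masked (lower case) bases are counted along with upper case ones, which is what the counts above imply.

```python
from collections import Counter

# The region returned by tb.sequence("chr1", 24, 74) in the example above.
region = "N" * 26 + "ACGTACGTACGTagctagctGATC"


def base_composition(seq, fraction=True):
    """Count A/C/T/G in `seq`, ignoring case; N is excluded from the keys.

    With fraction=True each count is divided by the full length of `seq`,
    so the values sum to less than 1 when hard-masked bases are present.
    """
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq.upper())
    result = {base: counts.get(base, 0) for base in "ACTG"}
    if fraction:
        return {base: n / len(seq) for base, n in result.items()}
    return result


print(base_composition(region, fraction=False))
print(base_composition(region))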

Fetch masked blocks

There are two kinds of masking blocks that can be present in 2bit files: hard-masked and soft-masked. Hard-masked blocks are stretches of NNNN, as are commonly found near telomeres and centromeres. Soft-masked blocks are runs of lowercase A/C/T/G, typically indicating repeat elements or low-complexity stretches. It can sometimes be useful to query this information from 2bit files:

>>> tb.hardMaskedBlocks("chr1")
[(0, 50), (100, 150)]

In this (small) example, there are two stretches of hard-masked sequence, from 0 to 50 and again from 100 to 150 (see the note below about coordinates). If you would instead like to query all blocks overlapping with a specific region, you can specify the region bounds:

>>> tb.hardMaskedBlocks("chr1", 75, 101)
[(100, 150)]

If there are no overlapping regions, then an empty list is returned:

>>> tb.hardMaskedBlocks("chr1", 75, 100)
[]

Instead of hardMaskedBlocks(), one can use softMaskedBlocks() in an identical manner:

>>> tb = py2bit.open("test/foo.2bit", storeMasked=True)
>>> tb.softMaskedBlocks("chr1")
[(62, 70)]

As shown, you must specify storeMasked=True or you will receive a runtime error.
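The region queries above follow the usual overlap test for half-open intervals: a block (s, e) overlaps a query [start, end) when s < end and e > start. The sketch below applies that test to the block list from the example. Both functions are hypothetical helpers written for illustration and are not part of py2bit.

```python
def overlapping_blocks(blocks, start, end):
    """Return the half-open blocks that overlap the half-open range [start, end)."""
    return [(s, e) for s, e in blocks if s < end and e > start]


def masked_length(blocks):
    """Total number of bases covered by a list of non-overlapping blocks."""
    return sum(e - s for s, e in blocks)


# Stand-in for tb.hardMaskedBlocks("chr1") on the test file used above.
hard = [(0, 50), (100, 150)]
print(overlapping_blocks(hard, 75, 101))  # [(100, 150)]
print(overlapping_blocks(hard, 75, 100))  # []
print(masked_length(hard))                # 100
```

The second query returns nothing because the end position 100 is excluded, so the range stops one base short of the block that starts at 100.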

Close a file

A TwoBit object can be closed with the close() method.

>>> tb.close()
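To make sure close() is called even when an error occurs, the standard library's contextlib.closing can wrap any object that has a close() method. The following is a minimal sketch under that assumption. The chrom_lengths helper and its opener parameter are hypothetical; the opener is passed in so that the helper itself does not depend on how the file is opened.

```python
from contextlib import closing


def chrom_lengths(opener, path):
    """Open `path` with `opener`, copy the chromosome lengths, then close.

    `opener` is any callable returning an object with chroms() and close()
    methods, for example py2bit.open.
    """
    with closing(opener(path)) as tb:
        # Copy the dict so the result stays usable after the file is closed.
        return dict(tb.chroms())
```

With py2bit installed, this could be called as chrom_lengths(py2bit.open, "test/foo.2bit").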

A note on coordinates

0-based half-open coordinates are used by this Python module. So to access the value for the first base on chr1, one would specify the starting position as 0 and the end position as 1. Similarly, bases 100 to 115 would have a start of 99 and an end of 115. This is simply for the sake of consistency with most other bioinformatics packages.
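Converting from the 1-based inclusive positions often used in text (such as "bases 100 to 115") only requires subtracting one from the first position. A small sketch, with to_half_open as a hypothetical helper name:

```python
def to_half_open(first, last):
    """Convert 1-based inclusive positions to a 0-based half-open (start, end)."""
    if first < 1 or last < first:
        raise ValueError(f"invalid 1-based range {first}-{last}")
    # Only the start shifts; the inclusive last base equals the exclusive end.
    return first - 1, last


print(to_half_open(1, 1))      # (0, 1)
print(to_half_open(100, 115))  # (99, 115)
```

The resulting pair can be passed as the start and end arguments of sequence(), bases(), or the masked-block methods.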
