A fast, pure-Python package for reading .2bit files (used by the UCSC genome browser)

twobitreader

twobitreader is a small, fast Python package for reading UCSC .2bit genome files. It supports random access by sequence name and genomic interval, making it useful for pulling slices from large genome files without loading whole chromosomes into memory.

The package reads .2bit files only; it does not write them.

Performance in v4

Version 4 keeps decoding pure Python while reducing startup cost and speeding up common slice paths. The main changes are lazy construction of the large two-byte lookup table, faster N-block lookup with bisect, and decoded sequence buffers backed by plain Python character lists instead of deprecated array('u') buffers.
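
Two of these ideas can be illustrated in isolation. The following is a minimal sketch, not the package's actual code: the names two_byte_table and n_blocks_overlapping are hypothetical, and the sketch assumes N-blocks are stored as parallel, sorted, non-overlapping lists of starts and sizes, as in the .2bit format.

```python
from bisect import bisect_right
from functools import lru_cache

BASES = "TCAG"  # .2bit packs bases as T=0, C=1, A=2, G=3, four per byte


@lru_cache(maxsize=None)
def two_byte_table():
    """Build the 65,536-entry decode table on first use, not at import time."""
    one_byte = [
        "".join(BASES[(byte >> shift) & 3] for shift in (6, 4, 2, 0))
        for byte in range(256)
    ]
    # Index is high_byte * 256 + low_byte; each entry decodes 8 bases.
    return tuple(high + low for high in one_byte for low in one_byte)


def n_blocks_overlapping(starts, sizes, start, end):
    """Return N-block spans clipped to the half-open interval [start, end)."""
    # Jump to the last block that starts at or before `start` instead of
    # scanning every block from the beginning.
    i = max(bisect_right(starts, start) - 1, 0)
    spans = []
    while i < len(starts) and starts[i] < end:
        block_end = starts[i] + sizes[i]
        if block_end > start:
            spans.append((max(starts[i], start), min(block_end, end)))
        i += 1
    return spans
```

With many N-blocks and a short slice, the binary search touches only the few blocks near the requested interval, which is the kind of case the "10 bp slice with 50k N-blocks" benchmark measures.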

Benchmarks below compare v4.0.0 with v3.1.8 on Python 3.14.5, using synthetic 5 Mb .2bit files. The v3.1.9 tag has the same reader implementation as v3.1.8, plus release/CI packaging changes.

[Figures: v4 import performance; v4 slice speedups]

Benchmark                      v3.1.8     v4.0.0     Change
Cold import time               179.6 ms   35.6 ms    5.0x faster
Peak import memory             14.18 MB   2.22 MB    6.4x less
Plain 1 Mb slice               135.6 ms   17.3 ms    7.8x faster
10 bp slice with 50k N-blocks  0.749 ms   0.0026 ms  290x faster

Installation

Install the latest released package from PyPI:

pip install twobitreader

For local development, clone the repository and install it in editable mode:

git clone https://github.com/benjschiller/twobitreader.git
cd twobitreader
pip install -e ".[dev,docs]"
pre-commit install

Python Usage

Open a .2bit file with TwoBitFile. It behaves like a dictionary whose keys are sequence names and whose values are sliceable sequence objects.

from twobitreader import TwoBitFile

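with TwoBitFile("hg19.2bit") as genome:
    # Sequence names stored in the file
    print(genome.keys())
    # Length of chr1 in bases
    print(genome.sequence_sizes()["chr1"])

    # 50 bases starting at 0-based position 100,000
    sequence = genome["chr1"][100_000:100_050]
    print(sequence)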

Coordinates follow Python slicing and UCSC BED conventions: they are 0-based and half-open, so the end position is excluded. For example, genome["chr1"][10:20] returns the 10 bases at positions 10 through 19.
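
A short sketch of how the convention works in practice, using a plain string as a stand-in for a sequence object. The helper ucsc_position_to_slice is hypothetical and not part of twobitreader; it covers the common case of converting a 1-based, inclusive range (as displayed in the UCSC browser position box) to a Python slice.

```python
def ucsc_position_to_slice(start_1based, end_1based):
    """Convert a 1-based, inclusive range to a 0-based, half-open slice."""
    return slice(start_1based - 1, end_1based)


toy = "ACGTACGTACGTACGTACGTACGT"  # stand-in for genome["chr1"]

# BED-style interval: the length is simply end - start.
bed_start, bed_end = 10, 20
assert len(toy[bed_start:bed_end]) == bed_end - bed_start

# The 1-based range 11-20 names the same ten bases as the slice [10:20].
assert toy[ucsc_position_to_slice(11, 20)] == toy[10:20]
```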

Converting an entire chromosome to a string works, but can use a lot of memory:

with TwoBitFile("hg19.2bit") as genome:
    chr_m = str(genome["chrM"])
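
For large chromosomes, one way to keep memory bounded is to walk the sequence in fixed-size slices. The sketch below is an illustration only: iter_chunks and gc_fraction are hypothetical helpers, not part of the package, and they work on anything that supports slicing, so the length is passed in explicitly (for a real file it could come from sequence_sizes()).

```python
def iter_chunks(sequence, total_length, chunk_size=1_000_000):
    """Yield (start, bases) pairs covering [0, total_length) in order."""
    for start in range(0, total_length, chunk_size):
        end = min(start + chunk_size, total_length)
        yield start, str(sequence[start:end])


def gc_fraction(sequence, total_length, chunk_size=1_000_000):
    """Fraction of G/C among A, C, G, T bases, computed chunk by chunk."""
    gc = acgt = 0
    for _, chunk in iter_chunks(sequence, total_length, chunk_size):
        upper = chunk.upper()  # soft-masked bases are lowercase
        gc += upper.count("G") + upper.count("C")
        acgt += sum(upper.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0


# Possible use with a real file (not executed here):
#   with TwoBitFile("hg19.2bit") as genome:
#       size = genome.sequence_sizes()["chr1"]
#       print(gc_fraction(genome["chr1"], size))
```

Only one chunk is held as a string at a time, so peak memory depends on chunk_size rather than on chromosome length.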

Command-Line Usage

twobitreader can also read BED-style intervals from standard input and write FASTA records to standard output:

python -m twobitreader genome.2bit < regions.bed > regions.fa

Input lines should have at least three whitespace-separated fields:

chrom    start    end
chr1     100000   100050
chr2     250      300

Invalid regions are skipped with warnings written to standard error. Intervals that extend past the end of a sequence are truncated.
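
The behavior described above can be sketched as a small parsing function. This is an approximation for illustration, not the command-line tool's actual code: parse_bed_line is a hypothetical name, and the exact rules for what counts as an invalid region are assumptions.

```python
def parse_bed_line(line, sizes):
    """Return (chrom, start, end) clipped to the sequence, or None if invalid.

    `sizes` maps sequence names to lengths, like sequence_sizes() output.
    """
    fields = line.split()
    if len(fields) < 3:
        return None  # too few fields
    chrom = fields[0]
    try:
        start, end = int(fields[1]), int(fields[2])
    except ValueError:
        return None  # non-numeric coordinates, e.g. a header line
    if chrom not in sizes or start < 0:
        return None  # unknown sequence or negative start
    end = min(end, sizes[chrom])  # truncate intervals past the sequence end
    if start >= end:
        return None  # empty or reversed interval after truncation
    return chrom, start, end
```

A caller would skip lines for which the function returns None, writing a warning to standard error, and extract the remaining intervals with ordinary slicing.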

Downloading Genomes

The twobitreader.download module can fetch .2bit genomes from UCSC:

python -m twobitreader.download hg19

Please follow UCSC's usage guidelines and avoid excessive automated downloads.

Development

Run the full test suite with:

python3 -m unittest discover -s tests

Run the lightweight package smoke test with:

python3 test_package.py

Build the package with:

python3 -m build

Build the Sphinx documentation with:

sphinx-build -W --keep-going -b html doc doc/_build/html

Run formatting and repository checks with:

pre-commit run --all-files

The Makefile uses python in a few targets. If your environment only provides python3, run the equivalent command directly with python3.

License

twobitreader is licensed under the Perl Artistic License 2.0. See LICENSE.txt and COPYRIGHT for details.

No warranty is provided, express or implied.
