Skip to main content

Container class to represent and operate over genomic regions and annotations.

Project description

Project generated with PyScaffold PyPI-Server Unit tests

GenomicRanges

GenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's GenomicRanges.

Note: V0.4.0 is a complete overhaul of the package, as such the constructor to GenomicRanges has changed. Please refer the documentation for updated usage of the classes and the methods.

To get started, install the package from PyPI

pip install genomicranges

Some of the methods like read_ucsc require optional packages to be installed, e.g. joblib and can be installed by:

pip install genomicranges[optional]

GenomicRanges

GenomicRanges is the base class to represent and operate over genomic regions and annotations.

From Bioinformatic file formats

From biobear

Although the parsing capabilities in this package are limited, the biobear library is designed for reading and searching various bioinformatics file formats, including FASTA, FASTQ, VCF, BAM, and GFF, or from an object store like S3. Users can esily convert these representations to GenomicRanges (or read more here):

from genomicranges import GenomicRanges
import biobear as bb

session = bb.new_session()

df = session.read_gtf_file("path/to/test.gtf").to_polars()
df = df.rename({"seqname": "seqnames", "start": "starts", "end": "ends"})

gg = GenomicRanges.from_polars(df)

# do stuff w/ a genomic ranges
print(len(gg), len(df))
## output
## 77 77

UCSC or GTF file

You can easily download and parse genome annotations from UCSC or load a genome annotation from a GTF file,

import genomicranges

gr = genomicranges.read_gtf(<PATH TO GTF>)
# OR
gr = genomicranges.read_ucsc(genome="hg19")

print(gr)
## output
## GenomicRanges with 1760959 intervals & 10 metadata columns.
## ... truncating the console print ...

From IRanges (Preferred way)

If you have all relevant information to create a GenomicRanges object

from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random

gr = GenomicRanges(
    seqnames=[
        "chr1",
        "chr2",
        "chr3",
        "chr2",
        "chr3",
    ],
    ranges=IRanges(start=[x for x in range(101, 106)], width=[11, 21, 25, 30, 5]),
    strand=["*", "-", "*", "+", "-"],
    mcols=BiocFrame(
        {
            "score": range(0, 5),
            "GC": [random() for _ in range(5)],
        }
    ),
)

print(gr)
## output
GenomicRanges with 5 ranges and 5 metadata columns
    seqnames    ranges           strand     score                  GC
       <str> <IRanges> <ndarray[int64]>   <range>              <list>
[0]     chr1 101 - 112                * |       0  0.2593301003406461
[1]     chr2 102 - 123                - |       1  0.7207993213776644
[2]     chr3 103 - 128                * |       2 0.23391468067222065
[3]     chr2 104 - 134                + |       3  0.7671026589720187
[4]     chr3 105 - 110                - |       4 0.03355777784472458
------
seqinfo(3 sequences): chr1 chr2 chr3

Pandas DataFrame

A common representation in Python is a pandas DataFrame for all tabular datasets. DataFrame must contain columns "seqnames", "starts", and "ends" to represent genomic intervals. Here's an example:

from genomicranges import GenomicRanges
import pandas as pd
from random import random

df = pd.DataFrame(
    {
        "seqnames": ["chr1", "chr2", "chr1", "chr3", "chr2"],
        "starts": [101, 102, 103, 104, 109],
        "ends": [112, 103, 128, 134, 111],
        "strand": ["*", "-", "*", "+", "-"],
        "score": range(0, 5),
        "GC": [random() for _ in range(5)],
    }
)

gr = GenomicRanges.from_pandas(df)
print(gr)
## output
GenomicRanges with 5 ranges and 5 metadata columns
  seqnames    ranges           strand    score                  GC
     <str> <IRanges> <ndarray[int64]>   <list>              <list>
0     chr1 101 - 112                * |      0  0.4862658925128007
1     chr2 102 - 103                - |      1 0.27948386889389953
2     chr1 103 - 128                * |      2  0.5162697718607901
3     chr3 104 - 134                + |      3  0.5979843806415466
4     chr2 109 - 111                - |      4 0.04740781186083798
------
seqinfo(3 sequences): chr1 chr2 chr3

Polars DataFrame

Similarly, To initialize from a polars DataFrame:

from genomicranges import GenomicRanges
import polars as pl
from random import random

df = pl.DataFrame(
    {
        "seqnames": ["chr1", "chr2", "chr1", "chr3", "chr2"],
        "starts": [101, 102, 103, 104, 109],
        "ends": [112, 103, 128, 134, 111],
        "strand": ["*", "-", "*", "+", "-"],
        "score": range(0, 5),
        "GC": [random() for _ in range(5)],
    }
)

gr = GenomicRanges.from_polars(df)
print(gr)
## output
GenomicRanges with 5 ranges and 5 metadata columns
  seqnames    ranges           strand    score                  GC
     <str> <IRanges> <ndarray[int64]>   <list>              <list>
0     chr1 101 - 112                * |      0  0.4862658925128007
1     chr2 102 - 103                - |      1 0.27948386889389953
2     chr1 103 - 128                * |      2  0.5162697718607901
3     chr3 104 - 134                + |      3  0.5979843806415466
4     chr2 109 - 111                - |      4 0.04740781186083798
------
seqinfo(3 sequences): chr1 chr2 chr3

Interval Operations

GenomicRanges supports most interval based operations.

subject = genomicranges.read_ucsc(genome="hg38")

query = genomicranges.from_pandas(
    pd.DataFrame(
        {
            "seqnames": ["chr1", "chr2", "chr3"],
            "starts": [100, 115, 119],
            "ends": [103, 116, 120],
        }
    )
)

hits = subject.nearest(query, ignore_strand=True)
print(hits)
## output
[[0, 1], [1677082, 1677083, 1677084], [1003411, 1003412]]

GenomicRangesList

Just as it sounds, a GenomicRangesList is a named-list like object. If you are wondering why you need this class, a GenomicRanges object lets us specify multiple genomic elements, usually where the genes start and end. Genes are themselves made of many sub-regions, e.g. exons. GenomicRangesList allows us to represent this nested structure.

Currently, this class is limited in functionality.

To construct a GenomicRangesList

from genomicranges import GenomicRanges, GenomicRangesList
from iranges import IRanges
from biocframe import BiocFrame

gr1 = GenomicRanges(
    seqnames=["chr1", "chr2", "chr1", "chr3"],
    ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
    strand=["-", "+", "*", "+"],
    mcols=BiocFrame({"score": [1, 2, 3, 4]}),
)

gr2 = GenomicRanges(
    seqnames=["chr2", "chr4", "chr5"],
    ranges=IRanges([3, 6, 4], [30, 50, 60]),
    strand=["-", "+", "*"],
    mcols=BiocFrame({"score": [2, 3, 4]}),
)
grl = GenomicRangesList(ranges=[gr1, gr2], names=["gene1", "gene2"])
print(grl)
## output
GenomicRangesList with 2 ranges and 2 metadata columns

Name: gene1
GenomicRanges with 4 ranges and 4 metadata columns
    seqnames    ranges           strand    score
       <str> <IRanges> <ndarray[int64]>   <list>
[0]     chr1    1 - 11                - |      1
[1]     chr2    3 - 33                + |      2
[2]     chr1    2 - 52                * |      3
[3]     chr3    4 - 64                + |      4
------
seqinfo(3 sequences): chr1 chr2 chr3

Name: gene2
GenomicRanges with 3 ranges and 3 metadata columns
    seqnames    ranges           strand    score
       <str> <IRanges> <ndarray[int64]>   <list>
[0]     chr2    3 - 33                - |      2
[1]     chr4    6 - 56                + |      3
[2]     chr5    4 - 64                * |      4
------
seqinfo(3 sequences): chr2 chr4 chr5

Further information

Note

This project has been set up using PyScaffold 4.1.1. For details and usage information on PyScaffold see https://pyscaffold.org/.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genomicranges-0.4.32.tar.gz (68.9 kB view details)

Uploaded Source

Built Distribution

GenomicRanges-0.4.32-py3-none-any.whl (37.2 kB view details)

Uploaded Python 3

File details

Details for the file genomicranges-0.4.32.tar.gz.

File metadata

  • Download URL: genomicranges-0.4.32.tar.gz
  • Upload date:
  • Size: 68.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for genomicranges-0.4.32.tar.gz
Algorithm Hash digest
SHA256 0b05d4b07c758a9f5cb63a6389758dd78295ed4c1ced34f42803d6d4144a9dc5
MD5 ee5f9386e86f6b6ddd36717abbaff201
BLAKE2b-256 759a7f039ffd1093947aca8dc21ca18afd52ecf7626c518a7f740ef4c6392652

See more details on using hashes here.

File details

Details for the file GenomicRanges-0.4.32-py3-none-any.whl.

File metadata

File hashes

Hashes for GenomicRanges-0.4.32-py3-none-any.whl
Algorithm Hash digest
SHA256 a80deed03c55a17bab189b9d6a507049f3c73aa49762af977e8bb94e238b92cd
MD5 6364fb7eee312f396fcb488e1d0a0833
BLAKE2b-256 98b835e403b2bafde4dd84a2501fc53a708fd67e5b3e3c22d8b731fb54623b40

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page