Container class to represent and operate over genomic regions and annotations.
GenomicRanges
GenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's GenomicRanges.
To get started, install the package from PyPI:
pip install genomicranges
Some methods, such as read_ucsc, require optional packages (e.g. joblib). Install them with:
pip install genomicranges[optional]
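Before calling a method that depends on an optional package, it can help to check whether that package is importable. The sketch below uses only the standard library; the helper name has_module is hypothetical and not part of genomicranges.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be imported."""
    return find_spec(name) is not None


# joblib is the optional dependency named above; has_module is an
# illustrative helper, not a genomicranges function.
if not has_module("joblib"):
    print("joblib is missing; install it with: pip install genomicranges[optional]")
```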
GenomicRanges
GenomicRanges is the base class to represent and operate over genomic regions and annotations.
From Bioinformatic file formats
[!NOTE] When reading genomic formats, ends are expected to be inclusive, consistent with Bioconductor representations (and GFF). If they are not, we recommend subtracting 1 from the ends.
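The adjustment described in the note can be sketched in plain Python. This is a minimal illustration that assumes the input ends are exclusive (half-open intervals); the function name to_inclusive_ends is hypothetical, and start coordinates are left untouched.

```python
def to_inclusive_ends(ends):
    """Convert exclusive (half-open) end coordinates to inclusive ones.

    Subtracts 1 from every end, as recommended in the note above.
    """
    return [end - 1 for end in ends]


# Hypothetical exclusive ends, e.g. parsed from a half-open format.
exclusive_ends = [112, 103, 128]
inclusive_ends = to_inclusive_ends(exclusive_ends)
print(inclusive_ends)  # [111, 102, 127]
```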
From biobear
Although the parsing capabilities in this package are limited, the biobear library is designed for reading and searching various bioinformatics file formats, including FASTA, FASTQ, VCF, BAM, and GFF, or from an object store like S3. Users can easily convert these representations to GenomicRanges (or read more here):
from genomicranges import GenomicRanges
import biobear as bb
session = bb.new_session()
df = session.read_gtf_file("path/to/test.gtf").to_polars()
df = df.rename({"seqname": "seqnames", "start": "starts", "end": "ends"})
gg = GenomicRanges.from_polars(df)
# work with the resulting GenomicRanges object
print(len(gg), len(df))
## output
## 77 77
UCSC or GTF file
You can easily download and parse genome annotations from UCSC, or load a genome annotation from a GTF file:
import genomicranges
gr = genomicranges.read_gtf("<PATH TO GTF>")  # replace with the path to your GTF file
# OR
gr = genomicranges.read_ucsc(genome="hg19")
print(gr)
## output
## GenomicRanges with 1760959 intervals & 10 metadata columns.
## ... truncating the console print ...
From IRanges (Preferred way)
If you have all the relevant information, you can create a GenomicRanges object directly:
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
gr = GenomicRanges(
    seqnames=[
        "chr1",
        "chr2",
        "chr3",
        "chr2",
        "chr3",
    ],
    ranges=IRanges(start=[x for x in range(101, 106)], width=[11, 21, 25, 30, 5]),
    strand=["*", "-", "*", "+", "-"],
    mcols=BiocFrame(
        {
            "score": range(0, 5),
            "GC": [random() for _ in range(5)],
        }
    ),
)
print(gr)
## output
GenomicRanges with 5 ranges and 5 metadata columns
seqnames ranges strand score GC
<str> <IRanges> <ndarray[int64]> <range> <list>
[0] chr1 101 - 111 * | 0 0.2593301003406461
[1] chr2 102 - 122 - | 1 0.7207993213776644
[2] chr3 103 - 127 * | 2 0.23391468067222065
[3] chr2 104 - 133 + | 3 0.7671026589720187
[4] chr3 105 - 109 - | 4 0.03355777784472458
------
seqinfo(3 sequences): chr1 chr2 chr3
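The ranges printed above follow from the start and width values passed to IRanges: each displayed end equals start + width - 1. The plain-Python sketch below reproduces that arithmetic for the same inputs; it does not use the iranges package and only illustrates the relationship seen in the output.

```python
# Same starts and widths as in the IRanges(...) call above.
starts = list(range(101, 106))
widths = [11, 21, 25, 30, 5]

# Inclusive end of each range: the last position covered by the interval.
inclusive_ends = [start + width - 1 for start, width in zip(starts, widths)]

for start, end in zip(starts, inclusive_ends):
    print(f"{start} - {end}")
```

The printed pairs match the ranges column of the output (101 - 111, 102 - 122, and so on).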
Pandas DataFrame
A pandas DataFrame is a common representation for tabular data in Python. To convert one to GenomicRanges, the DataFrame must contain the columns "seqnames", "starts", and "ends", which represent the genomic intervals. Here's an example:
from genomicranges import GenomicRanges
import pandas as pd
from random import random
df = pd.DataFrame(
    {
        "seqnames": ["chr1", "chr2", "chr1", "chr3", "chr2"],
        "starts": [101, 102, 103, 104, 109],
        "ends": [112, 103, 128, 134, 111],
        "strand": ["*", "-", "*", "+", "-"],
        "score": range(0, 5),
        "GC": [random() for _ in range(5)],
    }
)
gr = GenomicRanges.from_pandas(df)
print(gr)
## output
GenomicRanges with 5 ranges and 5 metadata columns
seqnames ranges strand score GC
<str> <IRanges> <ndarray[int64]> <list> <list>
0 chr1 101 - 111 * | 0 0.4862658925128007
1 chr2 102 - 102 - | 1 0.27948386889389953
2 chr1 103 - 127 * | 2 0.5162697718607901
3 chr3 104 - 133 + | 3 0.5979843806415466
4 chr2 109 - 110 - | 4 0.04740781186083798
------
seqinfo(3 sequences): chr1 chr2 chr3
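Since the conversion requires the "seqnames", "starts", and "ends" columns, a quick check before calling from_pandas can surface naming problems early. The helper below is a hypothetical sketch, not part of genomicranges; it accepts any iterable of column names, such as the keys of a dict or the columns of a DataFrame.

```python
# Column names required for the conversion, as stated above.
REQUIRED_COLUMNS = ("seqnames", "starts", "ends")


def missing_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# A table with the expected names passes the check.
print(missing_columns(["seqnames", "starts", "ends", "strand"]))  # []

# GTF-style names ("start", "end") would need renaming first.
print(missing_columns(["seqnames", "start", "end"]))  # ['starts', 'ends']
```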
Polars DataFrame
Similarly, to initialize from a polars DataFrame:
from genomicranges import GenomicRanges
import polars as pl
from random import random
df = pl.DataFrame(
    {
        "seqnames": ["chr1", "chr2", "chr1", "chr3", "chr2"],
        "starts": [101, 102, 103, 104, 109],
        "ends": [112, 103, 128, 134, 111],
        "strand": ["*", "-", "*", "+", "-"],
        "score": range(0, 5),
        "GC": [random() for _ in range(5)],
    }
)
gr = GenomicRanges.from_polars(df)
print(gr)
## output
GenomicRanges with 5 ranges and 5 metadata columns
seqnames ranges strand score GC
<str> <IRanges> <ndarray[int64]> <list> <list>
0 chr1 101 - 112 * | 0 0.4862658925128007
1 chr2 102 - 103 - | 1 0.27948386889389953
2 chr1 103 - 128 * | 2 0.5162697718607901
3 chr3 104 - 134 + | 3 0.5979843806415466
4 chr2 109 - 111 - | 4 0.04740781186083798
------
seqinfo(3 sequences): chr1 chr2 chr3
Interval Operations
GenomicRanges supports most interval-based operations. The example below finds the nearest ranges between a subject and a query:
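import genomicranges
import pandas as pd

subject = genomicranges.read_ucsc(genome="hg38")
# Build the query with the same classmethod used in the pandas section above.
query = genomicranges.GenomicRanges.from_pandas(
    pd.DataFrame(
        {
            "seqnames": ["chr1", "chr2", "chr3"],
            "starts": [100, 115, 119],
            "ends": [103, 116, 120],
        }
    )
)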
hits = subject.nearest(query, ignore_strand=True, select="all")
print(hits)
## output
BiocFrame with 3 rows and 2 columns
query_hits self_hits
<ndarray[int32]> <ndarray[int32]>
[0] 0 0
[1] 1 1677082
[2] 2 1003411
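To make the idea of a nearest search concrete, here is a brute-force sketch in plain Python. It is not the algorithm used by genomicranges: it assumes inclusive coordinates, ignores strand, only compares intervals on the same sequence, and reports all ties. The function names and the toy data are hypothetical.

```python
def gap(a_start, a_end, b_start, b_end):
    """Number of positions strictly between two inclusive intervals.

    Returns 0 when the intervals overlap or are directly adjacent.
    """
    return max(0, max(a_start, b_start) - min(a_end, b_end) - 1)


def nearest_brute_force(subject, query):
    """Return (query_index, subject_index) pairs for the nearest subject intervals.

    Both arguments are lists of (seqname, start, end) tuples.
    Queries with no subject on the same sequence produce no hits.
    """
    hits = []
    for qi, (q_seq, q_start, q_end) in enumerate(query):
        candidates = [
            (gap(q_start, q_end, s_start, s_end), si)
            for si, (s_seq, s_start, s_end) in enumerate(subject)
            if s_seq == q_seq
        ]
        if not candidates:
            continue
        best = min(distance for distance, _ in candidates)
        hits.extend((qi, si) for distance, si in candidates if distance == best)
    return hits


subject = [("chr1", 1, 10), ("chr1", 50, 60), ("chr2", 5, 8)]
query = [("chr1", 12, 15), ("chr2", 100, 110), ("chr3", 1, 2)]
print(nearest_brute_force(subject, query))  # [(0, 0), (1, 2)]
```

This scan compares every query with every subject interval, so it is only suitable for small inputs; it is meant to show what the hits table represents, not how to compute it efficiently.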
CompressedGenomicRangesList
As the name suggests, a CompressedGenomicRangesList is a named, list-like object. Why is this class needed? A GenomicRanges object specifies multiple genomic elements, usually where genes start and end. Genes themselves consist of many sub-regions, e.g. exons. CompressedGenomicRangesList represents this nested structure.
Currently, this class is limited in functionality.
To construct a CompressedGenomicRangesList:
from genomicranges import GenomicRanges, CompressedGenomicRangesList
from iranges import IRanges
from biocframe import BiocFrame
gr1 = GenomicRanges(
    seqnames=["chr1", "chr2", "chr1", "chr3"],
    ranges=IRanges([1, 3, 2, 4], [10, 30, 50, 60]),
    strand=["-", "+", "*", "+"],
    mcols=BiocFrame({"score": [1, 2, 3, 4]}),
)
gr2 = GenomicRanges(
    seqnames=["chr2", "chr4", "chr5"],
    ranges=IRanges([3, 6, 4], [30, 50, 60]),
    strand=["-", "+", "*"],
    mcols=BiocFrame({"score": [2, 3, 4]}),
)
grl = CompressedGenomicRangesList.from_list(lst=[gr1, gr2], names=["gene1", "gene2"])
print(grl)
## output
CompressedGenomicRangesList with 2 ranges and 2 metadata columns
Name: gene1
GenomicRanges with 4 ranges and 4 metadata columns
seqnames ranges strand score
<str> <IRanges> <ndarray[int64]> <list>
[0] chr1 1 - 10 - | 1
[1] chr2 3 - 32 + | 2
[2] chr1 2 - 51 * | 3
[3] chr3 4 - 63 + | 4
------
seqinfo(3 sequences): chr1 chr2 chr3
Name: gene2
GenomicRanges with 3 ranges and 3 metadata columns
seqnames ranges strand score
<str> <IRanges> <ndarray[int64]> <list>
[0] chr2 3 - 32 - | 2
[1] chr4 6 - 55 + | 3
[2] chr5 4 - 63 * | 4
------
seqinfo(3 sequences): chr2 chr4 chr5
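The nested gene-to-ranges structure can be pictured with a small plain-Python sketch. It assumes that "compressed" refers to the layout used by Bioconductor's CompressedList classes, where all elements live in one flat container and partition boundaries mark where each group ends. This is an assumption for illustration and not a description of the class's actual internals; all names below are hypothetical.

```python
from itertools import accumulate

# Ranges per gene as (seqname, start, inclusive end), taken from the output above.
genes = {
    "gene1": [("chr1", 1, 10), ("chr2", 3, 32), ("chr1", 2, 51), ("chr3", 4, 63)],
    "gene2": [("chr2", 3, 32), ("chr4", 6, 55), ("chr5", 4, 63)],
}

names = list(genes)

# One flat list holding every range, in group order.
flat = [interval for name in names for interval in genes[name]]

# Cumulative group sizes: group i occupies flat[ends[i - 1]:ends[i]].
ends = list(accumulate(len(genes[name]) for name in names))


def get_group(name):
    """Slice the ranges for one named group out of the flat list."""
    i = names.index(name)
    start = ends[i - 1] if i > 0 else 0
    return flat[start:ends[i]]


print(ends)                # [4, 7]
print(get_group("gene2"))  # the three ranges of gene2
```

Keeping a single flat container means operations over all ranges can run once over flat, with the boundaries used afterwards to split results back into groups.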
Performance
The table below compares the performance of the Python and R GenomicRanges implementations. The query dataset contains approximately 564,000 intervals, while the subject dataset contains approximately 71 million intervals.
| Operation | Python/GenomicRanges | Python/GenomicRanges (5 threads) | R/GenomicRanges |
|---|---|---|---|
| Overlap | 2.80s | 2.06s | 4.40s |
| Overlap (single chromosome) | 6.73s | 5.19s | 10.06s |
| Nearest | 2.27s | 1.5s | 42.16s |
| Nearest (single chromosome) | 4.7s | 4.67s | 11.01s |
[!NOTE] The single chromosome benchmark ignores chromosome/sequence information and performs overlap operations solely on intervals.
For details, see the scripts in the benchmark directory.
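The timings in the table can be turned into relative speedups by dividing the R time by the single-threaded Python time. The snippet below only does that arithmetic on the numbers reported above; it does not rerun any benchmark.

```python
# (Python single-thread seconds, R seconds) copied from the table above.
timings = {
    "Overlap": (2.80, 4.40),
    "Overlap (single chromosome)": (6.73, 10.06),
    "Nearest": (2.27, 42.16),
    "Nearest (single chromosome)": (4.7, 11.01),
}

# Speedup > 1 means the Python implementation was faster in that benchmark.
speedups = {
    operation: r_seconds / py_seconds
    for operation, (py_seconds, r_seconds) in timings.items()
}

for operation, factor in speedups.items():
    print(f"{operation}: {factor:.1f}x")
```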
Further information
Note
This project has been set up using PyScaffold 4.1.1. For details and usage information on PyScaffold see https://pyscaffold.org/.