Add a short description here!
Project description
Gene set selections in Python
Overview
The gesel package provides a Python interface to the Gesel database for client-side gene set searches. The idea is to use HTTP range requests to serve the gene-to-set mappings to a client without downloading the entire database or implementing custom server logic. In this manner, we can execute a variety of interesting gene set queries with high scalability across users and minimal backend maintenance.
Quick start
To get started, install the package from PyPI:
pip install gesel
Then we can find overlaps between our genes of interest and the gene sets in the Gesel database. (Note that the exact numbers are subject to change, pending updates to the version of the Gesel database.)
import gesel
my_genes = ["SNAP25", "NEUROD6", "GAD1", "GAD2", "RELN"]
# First, mapping our gene names to Gesel's internal gene indices.
gene_idx = gesel.search_genes("9606", my_genes) # list of lists of gene indices.
print(gene_idx)
## [[4639], [12767], [1758], [1759], [3912]]
gene_idx = sum(gene_idx, []) # collapsing it to a list of integers, for simplicity.
print(gesel.fetch_all_genes("9606")[gene_idx,:]) # double-checking that we got it right.
## BiocFrame with 5 rows and 3 columns
## symbol entrez ensembl
## <list> <list> <list>
## [0] ['SNAP25'] ['6616'] ['ENSG00000132639']
## [1] ['NEUROD6'] ['63974'] ['ENSG00000164600']
## [2] ['GAD1'] ['2571'] ['ENSG00000128683']
## [3] ['GAD2'] ['2572'] ['ENSG00000136750']
## [4] ['RELN'] ['5649'] ['ENSG00000189056']
# Now finding all sets with one or more overlaps to `my_genes`.
overlaps, present = gesel.find_overlapping_sets("9606", gene_idx, counts_only=False)
print(overlaps) # set index and the identities of overlapping genes.
## BiocFrame with 1163 rows and 2 columns
## set genes
## <list> <list>
## [0] 2420 [4639, 1758, 1759, 3912]
## [1] 2521 [4639, 1758, 1759, 3912]
## [2] 21748 [4639, 12767, 1758, 1759]
## ... ...
## [1160] 40562 [3912]
## [1161] 40597 [3912]
## [1162] 40599 [3912]
# Actually getting the identities of the top sets:
set_info = gesel.fetch_some_sets("9606", overlaps["set"][:10])
print(set.info)
## BiocFrame with 10 rows and 5 columns
## name description size collection number
## <list> <list> <list> <list> <list>
## [0] GO:0005737 cytoplasm 5010 0 2420
## [1] GO:0005886 plasma membrane 4778 0 2521
## [2] BLALOCK_ALZHEIMERS_D... http://www.gsea-msig... 1248 2 2515
## [3] MANNO_MIDBRAIN_NEURO... http://www.gsea-msig... 1106 14 92
## [4] GO:0005515 protein binding 12505 0 2310
## [5] GO:0007268 chemical synaptic tr... 248 0 3331
## [6] MIKKELSEN_MEF_HCP_WI... http://www.gsea-msig... 591 2 1896
## [7] REACTOME_NEUROTRANSM... http://www.gsea-msig... 51 4 29
## [8] REACTOME_TRANSMISSIO... http://www.gsea-msig... 270 4 32
## [9] REACTOME_NEURONAL_SYSTEM http://www.gsea-msig... 411 4 33
# As well as the collections from which they were derived.
print(gesel.fetch_some_collections("9606", [0, 2, 4, 14]))
## BiocFrame with 4 rows and 6 columns
## title description maintainer source start size
## <list> <list> <list> <list> <list> <list>
## [0] Gene ontology Gene sets defined fr... Aaron Lun https://github.com/L... 0 18933
## [1] MSigDB chemical and ... Gene sets that repre... Aaron Lun https://github.com/L... 19233 3405
## [2] MSigDB canonical pat... Reactome gene sets a... Aaron Lun https://github.com/L... 22834 1654
## [3] MSigDB cell type sig... Gene sets that conta... Aaron Lun https://github.com/L... 39774 830
Check out the reference documentation for more details.
Searching on text
We can also search for gene sets based on the text in their names or descriptions.
chits = gesel.search_set_text("9606", "cancer")
print(gesel.fetch_some_sets("9606", chits[:10]))
## BiocFrame with 10 rows and 5 columns
## name description size collection number
## <list> <list> <list> <list> <list>
## [0] SOGA_COLORECTAL_CANC... http://www.gsea-msig... 71 2 1
## [1] SOGA_COLORECTAL_CANC... http://www.gsea-msig... 82 2 2
## [2] WATANABE_RECTAL_CANC... http://www.gsea-msig... 113 2 64
## [3] LIU_PROSTATE_CANCER_UP http://www.gsea-msig... 99 2 66
## [4] BERTUCCI_MEDULLARY_V... http://www.gsea-msig... 207 2 68
## [5] WATANABE_COLON_CANCE... http://www.gsea-msig... 29 2 78
## [6] WATANABE_COLON_CANCE... http://www.gsea-msig... 69 2 79
## [7] SOTIRIOU_BREAST_CANC... http://www.gsea-msig... 53 2 81
## [8] CHARAFE_BREAST_CANCE... http://www.gsea-msig... 52 2 124
## [9] DOANE_BREAST_CANCER_... http://www.gsea-msig... 33 2 125
ihits = gesel.search_set_text("9606", "innate immun*")
print(gesel.fetch_some_sets("9606", ihits[:10]))
## BiocFrame with 10 rows and 5 columns
## name description size collection number
## <list> <list> <list> <list> <list>
## [0] REACTOME_INNATE_IMMU... http://www.gsea-msig... 1118 4 229
## [1] REACTOME_REGULATION_... http://www.gsea-msig... 15 4 552
## [2] REACTOME_SARS_COV_1_... http://www.gsea-msig... 41 4 1562
## [3] REACTOME_SARS_COV_2_... http://www.gsea-msig... 126 4 1580
## [4] WP_SARSCOV2_B117_VAR... http://www.gsea-msig... 9 5 185
## [5] WP_SARS_CORONAVIRUS_... http://www.gsea-msig... 31 5 189
## [6] WP_SARSCOV2_INNATE_I... http://www.gsea-msig... 66 5 318
## [7] WP_PATHWAYS_OF_NUCLE... http://www.gsea-msig... 16 5 729
## [8] GO:0002218 activation of innate... 32 0 813
## [9] GO:0002220 innate immune respon... 1 0 814
thits = gesel.search_set_text("9606", "cd? t cell")
print(gesel.fetch_some_sets("9606", thits[:10]))
## name description size collection number
## <list> <list> <list> <list> <list>
## [0] HOFT_CD4_POSITIVE_AL... http://www.gsea-msig... 47 13 49
## [1] HOFT_CD4_POSITIVE_AL... http://www.gsea-msig... 28 13 131
## [2] HOFT_CD4_POSITIVE_AL... http://www.gsea-msig... 40 13 204
## [3] HOFT_CD4_POSITIVE_AL... http://www.gsea-msig... 16 13 252
## [4] HOFT_CD4_POSITIVE_AL... http://www.gsea-msig... 24 13 296
## [5] HOFT_CD4_POSITIVE_AL... http://www.gsea-msig... 42 13 316
## [6] HOFT_CD4_POSITIVE_AL... http://www.gsea-msig... 41 13 320
## [7] HOFT_CD4_POSITIVE_AL... http://www.gsea-msig... 6 13 323
## [8] QI_CD4_POSITIVE_ALPH... http://www.gsea-msig... 9 13 327
## [9] QI_NAIVE_T_CELL_ZOST... http://www.gsea-msig... 7 13 328
Users can construct powerful queries by intersecting the sets recovered from search_set_text() with those from find_overlapping_sets().
import biocutils
cancer_sets = biocutils.intersect(chits, overlaps["set"])
info = gesel.fetch_some_sets("9606", cancer_sets)
m = biocutils.match(cancer_sets, overlaps["set"])
info = info.set_column("count", [len(overlaps["genes"][r]) for r in m])
# We'll just use the proportion of enriched genes for ranking here;
# a more sophisticated analysis might compute a hypergeometric p-value.
prop = [info["count"][i] / info["size"][i] for i in range(info.shape[0])]
ordered = info[biocutils.order(prop, decreasing=True),:]
print(ordered)
## BiocFrame with 14 rows and 6 columns
## name description size collection number count
## <list> <list> <list> <list> <list> <list>
## [0] LOPES_METHYLATED_IN_... http://www.gsea-msig... 28 2 903 1
## [1] SCHLESINGER_H3K27ME3... http://www.gsea-msig... 28 2 2864 1
## [2] WATANABE_COLON_CANCE... http://www.gsea-msig... 29 2 78 1
## ... ... ... ... ... ...
## [11] ACEVEDO_LIVER_CANCER_DN http://www.gsea-msig... 540 2 3150 1
## [12] SMID_BREAST_CANCER_L... http://www.gsea-msig... 587 2 991 1
## [13] LIU_OVARIAN_CANCER_T... http://www.gsea-msig... 1713 2 2003 1
Fetching all data
gesel is designed around partial extraction from the database files, but it may be more efficient to pull all of the data into memory at once. This is most useful for the gene set details, which can be retrieved en masse:
set_info = gesel.fetch_all_sets("9606")
print(set_info)
## BiocFrame with 40654 rows and 5 columns
## name description size collection number
## <list> <list> <list> <list> <list>
## [0] GO:0000002 mitochondrial genome... 11 0 0
## [1] GO:0000003 reproduction 4 0 1
## [2] GO:0000009 alpha-1,6-mannosyltr... 2 0 2
## ... ... ... ... ...
## [40651] HALLMARK_KRAS_SIGNAL... http://www.gsea-msig... 200 15 47
## [40652] HALLMARK_KRAS_SIGNAL... http://www.gsea-msig... 200 15 48
## [40653] HALLMARK_PANCREAS_BE... http://www.gsea-msig... 40 15 49
The set indices returned by other functions like find_overlapping_sets() can then be used to directly subset the set_info data frame by row.
In fact, calling fetch_some_sets() after fetch_all_sets() will automatically use the data frame created by the latter,
instead of attempting another request to the database.
The same approach can be used to extract collection information, via fetch_all_collections();
gene set membership, via fetch_genes_for_all_sets();
and the sets containing each gene, via fetch_sets_for_all_genes().
Advanced use
gesel uses a lot of in-memory caching to reduce the number of requests to the database files within a single Python session. On rare occasions, the cache may become outdated, e.g., if the database files are updated while an Python session is running. Users can prompt gesel to re-acquire all data by flusing the cache:
gesel.flush_memory_cache()
Applications can specify their own functions for obtaining files (or byte ranges thereof) by passing a custom config= in each gesel function.
For example, on a shared HPC filesystem, we could point gesel towards a directory of Gesel database files.
This provides a performant alternative to HTTP requests for an institutional collection of gene sets.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file gesel-0.1.0.tar.gz.
File metadata
- Download URL: gesel-0.1.0.tar.gz
- Upload date:
- Size: 45.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fdf31988da5362a65236c95ff0cf6b791bf1d3f1e3bf49d718d2040e3c469fb1
|
|
| MD5 |
0ead1b1635631b1939b3a48a644201d6
|
|
| BLAKE2b-256 |
cc251817b10faaa04def329836e75e47388744015d3aa732228ba7aef06a94d3
|
Provenance
The following attestation bundles were made for gesel-0.1.0.tar.gz:
Publisher:
publish-pypi.yml on gesel-inc/gesel-py
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
gesel-0.1.0.tar.gz -
Subject digest:
fdf31988da5362a65236c95ff0cf6b791bf1d3f1e3bf49d718d2040e3c469fb1 - Sigstore transparency entry: 1241130318
- Sigstore integration time:
-
Permalink:
gesel-inc/gesel-py@3e17128719023e3f1d238faaaf0812c36822ba53 -
Branch / Tag:
refs/tags/0.1.0 - Owner: https://github.com/gesel-inc
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@3e17128719023e3f1d238faaaf0812c36822ba53 -
Trigger Event:
push
-
Statement type:
File details
Details for the file gesel-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gesel-0.1.0-py3-none-any.whl
- Upload date:
- Size: 33.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d79c3d417c85b3bcfeb41dbda8f240337f227486cbe01c5ae21424db14a782e9
|
|
| MD5 |
c13cb4b3de69264c366a5236cb30e4a9
|
|
| BLAKE2b-256 |
1770e6ab57c85f6ebe9d8ab7a279673b3378525865ef2e4d95d7e1fbffa36fb7
|
Provenance
The following attestation bundles were made for gesel-0.1.0-py3-none-any.whl:
Publisher:
publish-pypi.yml on gesel-inc/gesel-py
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
gesel-0.1.0-py3-none-any.whl -
Subject digest:
d79c3d417c85b3bcfeb41dbda8f240337f227486cbe01c5ae21424db14a782e9 - Sigstore transparency entry: 1241130364
- Sigstore integration time:
-
Permalink:
gesel-inc/gesel-py@3e17128719023e3f1d238faaaf0812c36822ba53 -
Branch / Tag:
refs/tags/0.1.0 - Owner: https://github.com/gesel-inc
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@3e17128719023e3f1d238faaaf0812c36822ba53 -
Trigger Event:
push
-
Statement type: