BioRAG: A tool for textual and gene set search against ARCHS4 data

This is the repository for BioRAG[^1]. BioRAG can identify relevant studies within the ARCHS4 database, using a text-based query or a gene set.

Installation

BioRAG has been tested on Python 3.9, 3.10, and 3.11.

Install BioRAG via PyPI with the following command:

pip install biorag

Additional files

The following additional files must also be downloaded. The table below describes each file.

File name               Description                  Size   Checksum
semantic_db.h5ad        Semantic vector store        58 MB  bd23f8835032dcd709f8ca05915038b3
transcriptomic_db.h5ad  Transcriptomic vector store  38 GB  d26e564653424ed6e189aea7d0be5d4d
human_gene_v2.2.h5      ARCHS4[^2] HDF5 database     37 GB  f546fdecc0ba20bf2a5571628f132ca5
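
Given the size of these downloads, it may be worth checking each file against its checksum before use. The sketch below is one way to do that. It assumes the checksums are MD5 digests, which fits their length (32 hexadecimal characters) but is not stated in this README. `file_md5` and `verify_download` are hypothetical helper names and are not part of BioRAG.

```python
import hashlib
from pathlib import Path

# Checksums copied from the table above. MD5 is an assumption based on
# their length (32 hex characters); the README does not name the algorithm.
EXPECTED_CHECKSUMS = {
    "semantic_db.h5ad": "bd23f8835032dcd709f8ca05915038b3",
    "transcriptomic_db.h5ad": "d26e564653424ed6e189aea7d0be5d4d",
    "human_gene_v2.2.h5": "f546fdecc0ba20bf2a5571628f132ca5",
}


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's digest matches the expected checksum for its name."""
    expected = EXPECTED_CHECKSUMS[Path(path).name]
    return file_md5(path) == expected
```

Calling `verify_download("semantic_db.h5ad")` returns `True` only when the computed digest equals the value listed in the table. Reading in chunks keeps memory use flat even for the 38 GB file.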

Both vector stores are AnnData[^3] objects that contain the embedding matrices and the indices linking each embedding vector to an experiment in the ARCHS4 database. The transcriptomic vector store additionally contains derived count data for "representative transcriptomes" in the ARCHS4 database.

Usage

To load BioRAG, follow these steps:

  1. Download the additional files described above.

  2. Instantiate a Query_DB object to perform BioRAG searches, as follows:

    from biorag.biorag import Query_DB
    
    new_query_db = Query_DB(SEMANTIC_VECTOR_STORE_PATH, TRANSCRIPTOME_VECTOR_STORE_PATH, ARCHS4_HDF5_DATABASE_PATH)
    
  3. Run BioRAG with a gene set query and/or a textual query. The gene set is a Python list, and the textual query is a Python string.

    text_query = "This text query usually describes the experiment"
    
    geneset_query = ["IFNG", "IRF1", "IRF2"]
    
    result = new_query_db.search(geneset = geneset_query, text_query = text_query)
    
  4. The output is a Results object composed of three pandas dataframes, which can be accessed via dot notation:

  • "seed_studies" holds a dataframe containing study metadata from the search step.
  • "expansion_studies" holds a dataframe of studies from the expansion step.
  • "samples" holds a dataframe of the relevant samples, with metadata derived from the ARCHS4 database.

    result = new_query_db.search(geneset = geneset_query, text_query = text_query)

    # save the results as csv files
    result.seed_studies.to_csv("relevant_seed_studies.csv")
    result.expansion_studies.to_csv("relevant_expansion_studies.csv")
    result.samples.to_csv("relevant_samples.csv")
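
Since Query_DB takes three file paths, a quick check that all of them exist can catch an incomplete download before anything is loaded. The following is a minimal sketch: `missing_files` is a hypothetical helper, not part of BioRAG, and the `data/` locations are placeholders.

```python
from pathlib import Path


def missing_files(*paths):
    """Return the paths that do not point to an existing regular file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Placeholder locations; replace with wherever the files were downloaded.
SEMANTIC_VECTOR_STORE_PATH = "data/semantic_db.h5ad"
TRANSCRIPTOME_VECTOR_STORE_PATH = "data/transcriptomic_db.h5ad"
ARCHS4_HDF5_DATABASE_PATH = "data/human_gene_v2.2.h5"

missing = missing_files(
    SEMANTIC_VECTOR_STORE_PATH,
    TRANSCRIPTOME_VECTOR_STORE_PATH,
    ARCHS4_HDF5_DATABASE_PATH,
)
if missing:
    print("Download these files before creating Query_DB:", ", ".join(missing))
```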

Modifying searches using different inputs

All BioRAG searches should contain at least a text query or a gene set query. Users can choose one of several search strategies, with examples illustrated below:

  • Strategy 1: Text only as input

    result = new_query_db.search(text_query = text_query, geneset = None)
    
  • Strategy 2: Text and gene set as input

    result = new_query_db.search(text_query = text_query, geneset = geneset_query)
    
  • Strategy 3: Gene set only as input

    result = new_query_db.search(text_query = None, geneset = geneset_query)
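
Because every search needs at least one of the two inputs, it can be convenient to clean and check them before calling `search`. The sketch below is one possible approach. `prepare_inputs` is a hypothetical helper, not part of BioRAG, and the clean-up rules (trimming whitespace, dropping empty and duplicate gene symbols) are assumptions about what a caller might want.

```python
def prepare_inputs(text_query=None, geneset=None):
    """Normalise the two query inputs and require at least one of them.

    Returns a (text_query, geneset) tuple in which an empty input is None.
    """
    text = text_query.strip() if text_query else None
    genes = None
    if geneset:
        # Strip whitespace, drop empty entries and remove duplicates
        # while keeping the original order.
        cleaned = [g.strip() for g in geneset if g and g.strip()]
        genes = list(dict.fromkeys(cleaned)) or None
    if not text and not genes:
        raise ValueError("Provide a text query, a gene set, or both.")
    return (text or None), genes
```

The returned pair can then be passed on, for example `text_query, geneset_query = prepare_inputs(text_query, geneset_query)` followed by `new_query_db.search(text_query = text_query, geneset = geneset_query)`.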
    

Modifying searches using different expansion strategies

By default, BioRAG performs semantic search, followed by transcriptomic expansion. However, this search strategy can be modified. For instance, one can use semantic search and then perform the expansion step using semantic similarity. In this case, the transcriptome vector store is not queried.

The "search" and "expand" parameters each accept either "transcriptome" or "semantic".

If transcriptome search is performed as the initial step, the count data from "representative transcriptomes" in the transcriptome vector store is used to search for enriched samples, using ssGSEA (single sample gene set enrichment analysis) to rank the most relevant samples. In semantic search, BioRAG retrieves the most semantically similar studies to the text query, using cosine distance as the metric.
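
To illustrate what ranking by cosine distance means, the toy sketch below orders a few made-up embedding vectors by their distance to a query vector. It is not BioRAG's implementation: the function names, study labels and 3-dimensional vectors are all invented for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_cosine_distance(query, embeddings):
    """Return study identifiers ordered from the smallest to the largest distance."""
    return sorted(embeddings, key=lambda name: cosine_distance(query, embeddings[name]))


# Made-up 3-dimensional embeddings; real embeddings have many more dimensions.
toy_embeddings = {
    "study_A": [1.0, 0.0, 0.0],
    "study_B": [0.7, 0.7, 0.0],
    "study_C": [0.0, 0.0, 1.0],
}
toy_query = [0.9, 0.1, 0.0]
ranking = rank_by_cosine_distance(toy_query, toy_embeddings)
```

Here `study_A` points in almost the same direction as the query and ranks first, while `study_C` is orthogonal to it (distance 1) and ranks last.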

  1. Example 1 - perform semantic search, followed by transcriptome expansion.

    result = new_query_db.search(text_query = text_query, geneset = geneset_query, search = "semantic", expand = "transcriptome")
    
  2. Example 2 - perform transcriptome search, followed by semantic expansion.

    result = new_query_db.search(text_query = text_query, geneset = geneset_query, search = "transcriptome", expand = "semantic")
    
  3. Example 3 - perform semantic search, followed by semantic expansion, using only a text query.

    result = new_query_db.search(text_query = text_query, geneset = None, search = "semantic", expand = "semantic")
    

The types of searches that can be performed depend on the input. For instance, if only a gene set is supplied, the search step defaults to "transcriptome", and the user can select either "transcriptome" or "semantic" for the expansion step.
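
The defaults described in this section can be summarised in a small helper. This is a sketch of the documented behaviour only. How BioRAG resolves these options internally is not shown here, and `resolve_strategy` is a hypothetical name rather than part of the package.

```python
VALID_STEPS = {"semantic", "transcriptome"}


def resolve_strategy(text_query=None, geneset=None, search=None, expand=None):
    """Return the (search, expand) pair implied by the inputs and any overrides."""
    if not text_query and not geneset:
        raise ValueError("Provide a text query, a gene set, or both.")
    if search is None:
        # Gene set only: the search step defaults to "transcriptome".
        # Otherwise the documented default is semantic search.
        search = "semantic" if text_query else "transcriptome"
    if expand is None:
        # Documented default: transcriptomic expansion.
        expand = "transcriptome"
    for name, value in (("search", search), ("expand", expand)):
        if value not in VALID_STEPS:
            raise ValueError(f"{name} must be one of {sorted(VALID_STEPS)}, got {value!r}")
    return search, expand
```

For a gene-set-only query with semantic expansion, `resolve_strategy(geneset = geneset_query, expand = "semantic")` yields the pair `("transcriptome", "semantic")`.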

Optional single sample gene set enrichment analysis (ssGSEA)

To further refine the set of samples and studies returned by a BioRAG search, ssGSEA can be performed on all samples returned by the query. Use the "perform_enrichment" parameter to specify whether ssGSEA should be performed on all samples. If so, the returned dataframe will contain enrichment scores, p-values and FDRs. The ssGSEA results are stored as a dataframe in the "samples" attribute of the Results object.

    result = new_query_db.search(text_query = text_query, geneset = None, search = "semantic", expand = "semantic", perform_enrichment = True)

    # save enrichment results from the sample dataframe
    result.samples.to_csv("samples_with_ssgsea_results.csv")
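
Once the enrichment results are saved, the samples can be narrowed down by FDR. The sketch below reads the saved CSV with the standard library and keeps rows at or below a threshold. The name of the FDR column is not documented here, so it is passed in as a parameter, and `rows_below_fdr` is a hypothetical helper. The same filter can also be applied directly to the dataframe.

```python
import csv


def rows_below_fdr(csv_path, fdr_column, threshold=0.05):
    """Return rows of a CSV file whose FDR value is at or below the threshold."""
    kept = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            value = row.get(fdr_column, "")
            try:
                fdr = float(value)
            except (TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric FDR
            if fdr <= threshold:
                kept.append(row)
    return kept
```

Check the column headers of the saved file first, then pass the matching name as `fdr_column`.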

License

Creative Commons Attribution 4.0 International. The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

References

[^1]: Chin WL, Lassmann T. Language models improve the discovery of public RNA-seq data. bioRxiv. 2024.

[^2]: Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma'ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun. 2018 Apr 10;9(1):1366. doi: 10.1038/s41467-018-03751-6. PMID: 29636450; PMCID: PMC5893633.

[^3]: [AnnData](https://anndata.readthedocs.io/en/latest/)
