SampleExplorer: A tool for textual and gene set search against ARCHS4 data

This is the repository for SampleExplorer. SampleExplorer can identify relevant studies within the ARCHS4 database, using a text-based query or a gene set.

Quickstart

The easiest way to access the main functions of SampleExplorer is via a containerized Streamlit app. If you have Docker installed, run:

docker run -p 8501:8501 wlc27/streamlit_sample_explorer:0.1.9

Once the container starts, it will expose the Streamlit app on port 8501 of your local machine. Open your browser and navigate to:

http://localhost:8501

Note: On first run, the container may take 5-10 minutes to initialize. This includes downloading the BERT model for semantic queries. Please allow the process to complete without interruption. Subsequent starts from the same container will be significantly faster.

You can run gene set enrichment on this output as described in "Running gene set enrichment using the containerized application" below.

Installation

To use the full functionality of SampleExplorer, install the Python package.

SampleExplorer has been tested on Python 3.9, 3.10, and 3.11.

Install SampleExplorer via PyPI with the following command:

pip install sample-explorer

Additional files

SampleExplorer also needs the following data files, which must be downloaded separately. The table below describes each file and lists its size and checksum.

| File name | Description | Size | Checksum |
|---|---|---|---|
| semantic_db.h5ad | semantic vector store | 58 MB | bd23f8835032dcd709f8ca05915038b3 |
| transcriptomic_db.h5ad | transcriptomic vector store | 38 GB | d26e564653424ed6e189aea7d0be5d4d |
| human_gene_v2.2.h5 | ARCHS4[^1] HDF5 database | 37 GB | f546fdecc0ba20bf2a5571628f132ca5 |

Note: The semantic and transcriptomic vector stores are both AnnData[^2] objects. Each contains an embedding matrix and the indices that link each embedding vector to an experiment in the ARCHS4 database. The transcriptomic vector store additionally contains derived count data for the "representative transcriptomes" in the ARCHS4 database.

A download script is provided for convenience.
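
Because two of these files are tens of gigabytes, it can be worth confirming that each download completed intact before loading it. The sketch below assumes the checksums in the table are MD5 digests (they are 32 hexadecimal characters long, which is consistent with MD5, but the table does not name the algorithm). The function names and the `data_dir` argument are illustrative and are not part of SampleExplorer.

```python
import hashlib
from pathlib import Path

# Checksums copied from the table above (assumed to be MD5 digests).
EXPECTED_MD5 = {
    "semantic_db.h5ad": "bd23f8835032dcd709f8ca05915038b3",
    "transcriptomic_db.h5ad": "d26e564653424ed6e189aea7d0be5d4d",
    "human_gene_v2.2.h5": "f546fdecc0ba20bf2a5571628f132ca5",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for the 38 GB file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(data_dir, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    status = {}
    for name, checksum in expected.items():
        path = Path(data_dir) / name
        status[name] = md5_of_file(path) == checksum if path.is_file() else None
    return status


if __name__ == "__main__":
    # "." is a placeholder for the directory that holds the downloads.
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads(".").items():
        print(name, labels[ok])
```

Hashing the larger files takes a while, so this is best run once, right after downloading.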

Advanced Usage

To load SampleExplorer, follow these steps:

  1. Download the additional files described above.

  2. Instantiate a Query_DB object, passing the paths to the three files:

    from sample_explorer.sample_explorer import Query_DB
    
    new_query_db = Query_DB(SEMANTIC_VECTOR_STORE_PATH, TRANSCRIPTOME_VECTOR_STORE_PATH, ARCHS4_HDF5_DATABASE_PATH)
    
  3. Run SampleExplorer with a gene set query, a textual query, or both. The gene set is a Python list of gene symbols, and the textual query is a Python string.

    text_query = "This text query usually describes the experiment"
    
    geneset_query = ["IFNG", "IRF1", "IRF2"]
    
    result = new_query_db.search(geneset = geneset_query, text_query = text_query)
    
  4. The output is a Results object, which holds three pandas dataframes that can be accessed via dot notation:

  • The "seed_studies" attribute holds a dataframe containing study metadata from the search step.
  • The "expansion_studies" attribute holds a dataframe of studies from the expansion step.
  • The "samples" attribute holds a dataframe of the relevant samples, with metadata derived from the ARCHS4 database.

    result = new_query_db.search(geneset = geneset_query, text_query = text_query)

    # save the results as csv files
    result.seed_studies.to_csv("relevant_seed_studies.csv")
    result.expansion_studies.to_csv("relevant_expansion_studies.csv")
    result.samples.to_csv("relevant_samples.csv")
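
When several searches are run in one session, the three `to_csv` calls above can be wrapped in a small helper that writes each table into its own output folder. This is a sketch only: `save_results`, `out_dir` and `prefix` are hypothetical names and not part of SampleExplorer. The helper relies solely on the three attribute names described above and on each table exposing `to_csv`. Skipping an attribute that is `None` is a defensive assumption, not documented behaviour.

```python
from pathlib import Path

# Attribute names taken from the description of the Results object above.
RESULT_TABLES = ("seed_studies", "expansion_studies", "samples")


def save_results(result, out_dir, prefix="relevant_"):
    """Write each results table to <out_dir>/<prefix><name>.csv; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in RESULT_TABLES:
        table = getattr(result, name, None)
        if table is None:
            continue  # assumption: a table may be absent for some searches
        path = out_dir / f"{prefix}{name}.csv"
        table.to_csv(path)
        written.append(path)
    return written
```

For example, `save_results(result, "ifn_query_output")` would write `relevant_seed_studies.csv`, `relevant_expansion_studies.csv` and `relevant_samples.csv` into that folder.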

Modifying searches using different inputs

Every SampleExplorer search requires a text query, a gene set query, or both. Users can choose one of several search strategies, illustrated below:

  • Strategy 1: Text only as input

    result = new_query_db.search(text_query = text_query, geneset = None)
    
  • Strategy 2: Text and gene set as input

    result = new_query_db.search(text_query = text_query, geneset = geneset_query)
    
  • Strategy 3: Gene set only as input

    result = new_query_db.search(text_query = None, geneset = geneset_query)
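
Since at least one of the two inputs is required, a short pre-check can catch an empty query before any of the large data files are touched. The helper below is a hypothetical convenience, not part of SampleExplorer. Trimming whitespace and removing duplicate gene symbols are assumptions about sensible input hygiene rather than documented requirements.

```python
def build_search_inputs(text_query=None, geneset=None):
    """Normalize the two query inputs and insist that at least one is present."""
    text = text_query.strip() if isinstance(text_query, str) else None
    genes = None
    if geneset is not None:
        # Drop blank entries and duplicates while preserving the original order.
        genes = list(dict.fromkeys(g.strip() for g in geneset if g and g.strip()))
    if not text and not genes:
        raise ValueError("Provide a text query, a gene set, or both.")
    # Empty values are returned as None, matching the strategy examples above.
    return {"text_query": text or None, "geneset": genes or None}
```

The returned dictionary can be unpacked straight into the call, for example `new_query_db.search(**build_search_inputs(text_query, geneset_query))`.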
    

Modifying searches using different expansion strategies

By default, SampleExplorer performs a semantic search followed by transcriptomic expansion, but this strategy can be modified. For instance, a semantic search can be followed by an expansion step that uses semantic similarity, in which case the transcriptome vector store is not queried.

The "search" and "expand" parameters each accept either "transcriptome" or "semantic".

If transcriptome search is performed as the initial step, the count data from "representative transcriptomes" in the transcriptome vector store is used to search for enriched samples, using ssGSEA (single sample gene set enrichment analysis) to rank the most relevant samples. In semantic search, SampleExplorer retrieves the most semantically similar studies to the text query, using cosine distance as the metric.
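
To make the cosine-distance ranking concrete, here is a minimal pure-Python sketch. It is not SampleExplorer's implementation. The study IDs and three-dimensional vectors are made up for illustration, whereas real embeddings come from the semantic vector store and have many more dimensions.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_cosine_distance(query_vec, study_vecs, top_k=3):
    """Return the top_k (study_id, distance) pairs, closest first."""
    scored = [(sid, cosine_distance(query_vec, vec)) for sid, vec in study_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


# Toy embeddings with made-up study IDs.
toy_studies = {
    "study_A": [1.0, 0.0, 0.0],
    "study_B": [0.0, 1.0, 0.0],
    "study_C": [1.0, 1.0, 0.0],
}
ranking = rank_by_cosine_distance([1.0, 0.1, 0.0], toy_studies, top_k=2)
print(ranking)  # study_A ranks first, then study_C
```

A smaller distance means the study embedding points in nearly the same direction as the query embedding, regardless of vector length.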

  1. Example 1 - perform semantic search, followed by transcriptome expansion.

    result = new_query_db.search(text_query = text_query, geneset = geneset_query, search = "semantic", expand = "transcriptome")
    
  2. Example 2 - perform transcriptome search, followed by semantic expansion.

    result = new_query_db.search(text_query = text_query, geneset = geneset_query, search = "transcriptome", expand = "semantic")
    
  3. Example 3 - perform semantic search, followed by semantic expansion, using only a text query.

    result = new_query_db.search(text_query = text_query, geneset = None, search = "semantic", expand = "semantic")
    

The available search types depend on the input. For instance, if only a gene set is supplied, the search step defaults to "transcriptome", and the user can select either "transcriptome" or "semantic" for the expansion step.
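
The rules above can be summarised in a small pre-check that picks compatible modes before calling the library. This sketch is hypothetical and does not reproduce SampleExplorer's internal logic. The keyword names "search" and "expand" follow the examples above. The two requirements that a semantic search needs a text query and that a transcriptome search needs a gene set are assumptions, inferred from how each step is described.

```python
VALID_MODES = ("semantic", "transcriptome")


def resolve_modes(text_query=None, geneset=None, search=None, expand=None):
    """Pick search/expand modes that are compatible with the supplied inputs."""
    if text_query is None and geneset is None:
        raise ValueError("Provide a text query, a gene set, or both.")

    if search is None:
        # Documented defaults: semantic search, or transcriptome search when
        # only a gene set is supplied.
        search = "semantic" if text_query is not None else "transcriptome"
    if expand is None:
        expand = "transcriptome"  # documented default expansion step

    for label, mode in (("search", search), ("expand", expand)):
        if mode not in VALID_MODES:
            raise ValueError(f"{label} must be one of {VALID_MODES}, got {mode!r}")

    # Assumption: each search step needs the matching kind of input.
    if search == "semantic" and text_query is None:
        raise ValueError("semantic search needs a text query")
    if search == "transcriptome" and geneset is None:
        raise ValueError("transcriptome search needs a gene set")
    return {"search": search, "expand": expand}
```

For instance, `resolve_modes(None, geneset_query, expand="semantic")` yields a transcriptome search followed by semantic expansion.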

Optional single sample gene set enrichment analysis

To further refine the samples and studies returned by a SampleExplorer search, ssGSEA can be performed on all samples returned by the query. Set the "perform_enrichment" parameter to True to enable this. The ssGSEA results, comprising enrichment scores, p-values, and FDRs, are then stored in the dataframe held by the "samples" attribute of the Results object.

    result = new_query_db.search(text_query = text_query, geneset = geneset_query, search = "semantic", expand = "semantic", perform_enrichment = True)

    # save enrichment results from the sample dataframe
    result.samples.to_csv("samples_with_ssgsea_results.csv")
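
Once the enrichment results are saved, a common next step is to keep only the samples that pass a significance threshold. The sketch below reads the CSV written above using only the standard library. The column name "fdr" and the 0.05 cutoff are assumptions, since the source does not give the exact column names. Inspect the header of your CSV and pass the real name via `fdr_column`.

```python
import csv


def filter_samples_by_fdr(csv_path, fdr_column="fdr", max_fdr=0.05):
    """Return rows of a saved samples CSV whose FDR is at or below max_fdr."""
    kept = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if fdr_column not in (reader.fieldnames or []):
            raise KeyError(f"column {fdr_column!r} not found; available: {reader.fieldnames}")
        for row in reader:
            try:
                fdr = float(row[fdr_column])
            except (TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric value
            if fdr <= max_fdr:
                kept.append(row)
    return kept
```

Each returned row is a dictionary keyed by column name, so the remaining metadata for every retained sample stays available.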

Running gene set enrichment using the containerized application

You can use the containerized environment to perform single-sample gene set enrichment analysis (ssGSEA). When a gene set is specified in the containerized application together with a natural language query, it generates a custom Python script tailored for running ssGSEA on retrieved samples.

To execute gene set enrichment with the prebuilt environment, use Docker with a volume mount. This will map the container's working directory to a local folder that contains both the custom script and the ARCHS4 HDF5 database.

Steps:

  1. Ensure that your custom gene_set.py script and the human_gene_v2.2.h5 database are placed in a local folder.

  2. Run the following Docker command, replacing /path/to/local/folder with the actual path to your local folder:

    docker run --rm -v /path/to/local/folder:/app wlc27/streamlit_sample_explorer:0.1.9 python /app/gene_set.py
    

This command will execute the gene_set.py script inside the Docker container, utilizing the local HDF5 database for gene set enrichment analysis. The results will be output as a CSV file in your local directory, containing the single-sample gene set enrichment results.
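
For scripted runs, the same Docker invocation can be assembled from Python. The sketch below only builds the argument list and checks that the two required files are present. The helper names are hypothetical, while the image tag, mount point and file names are taken from the steps above. The folder is resolved to an absolute path so that Docker treats it as a bind mount rather than a named volume.

```python
from pathlib import Path

IMAGE = "wlc27/streamlit_sample_explorer:0.1.9"  # image tag from the command above


def build_ssgsea_command(local_folder, script="gene_set.py", image=IMAGE):
    """Build the docker argv that mounts local_folder at /app and runs the script."""
    folder = Path(local_folder).expanduser().resolve()  # absolute host path for -v
    return [
        "docker", "run", "--rm",
        "-v", f"{folder}:/app",
        image,
        "python", f"/app/{script}",
    ]


def missing_inputs(local_folder, required=("gene_set.py", "human_gene_v2.2.h5")):
    """List the required files that are not present in local_folder."""
    folder = Path(local_folder).expanduser()
    return [name for name in required if not (folder / name).is_file()]
```

If `missing_inputs(folder)` returns an empty list, the command can be executed with `subprocess.run(build_ssgsea_command(folder), check=True)`.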

Downloading the transcriptome embeddings file

For users who need only the default use case (natural language search followed by transcriptome expansion), an embeddings-only vector store can be used in place of the transcriptomic vector store above. This embeddings-only file is smaller (1 GB) but does not contain the representative transcriptomes described in the sections above. Hence, transcriptome search (using a gene set) as the initial step is not possible with it.

License

SampleExplorer is published under the MIT License.

Database creation, benchmarking workflows, tests, and continuous integration

The scripts for generating the embedding databases and performing benchmarking are located in the workflow folder. The repository includes a set of test data and testing scripts. The testing framework utilizes pytest.

References

[^1]: Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma'ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun. 2018 Apr 10;9(1):1366. doi: 10.1038/s41467-018-03751-6. PMID: 29636450; PMCID: PMC5893633.

[^2]: AnnData
