Diverse Genomic Embedding Benchmark

Maintainers

jlkravitz

Installation | Usage | Leaderboard | Citing

DGEB is a benchmark for evaluating biological sequence models on functional and evolutionary information.

DGEB is designed to evaluate model embeddings using:

Diverse sequences across the tree of life.
Diverse tasks that capture different aspects of biological function.
Both amino acid and nucleotide sequences.

The current version of DGEB consists of 18 datasets covering all three domains of life (Bacteria, Archaea and Eukarya). DGEB evaluates embeddings using six different embedding tasks: Classification, BiGene mining, Evolutionary Distance Similarity (EDS), Pair Classification, Clustering, and Retrieval.

We welcome contributions of new tasks and datasets.

Installation

Install DGEB using pip.

pip install dgeb

Usage

Launch an evaluation from the command line (see cli.py):

dgeb --model facebook/esm2_t6_8M_UR50D

To see all supported models and tasks:

dgeb --help

Using the Python API:

import dgeb

model = dgeb.get_model("facebook/esm2_t6_8M_UR50D")
tasks = dgeb.get_tasks_by_modality(dgeb.Modality.PROTEIN)
evaluation = dgeb.DGEB(tasks=tasks)
# Writes results to `output_folder`, and returns a list of TaskResult.
# You can disable writing to json by setting `output_folder=None`.
results = evaluation.run(model, output_folder="results")
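
Since the run above writes JSON results to the output folder, it can be handy to load them back for inspection. The following is a minimal sketch, not part of dgeb: the helper name load_result_files is hypothetical, and it makes no assumption about the file layout or schema that DGEB writes. It simply collects every JSON file found under the folder.

```python
import json
from pathlib import Path


def load_result_files(output_folder: str) -> dict[str, object]:
    """Load every JSON file under output_folder, keyed by its relative path.

    Hypothetical helper: it does not depend on dgeb or on the structure
    of the result files, only on them being valid JSON.
    """
    root = Path(output_folder)
    loaded = {}
    # rglob walks subdirectories too, in case results are nested per model.
    for path in sorted(root.rglob("*.json")):
        with path.open() as handle:
            loaded[path.relative_to(root).as_posix()] = json.load(handle)
    return loaded
```

Calling load_result_files("results") after the evaluation returns a dictionary that maps each result file to its parsed contents.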

Using a custom model

Custom models should subclass the dgeb.models.BioSeqTransformer abstract class and specify the modality, number of layers, and embedding dimension. See models.py for additional examples of custom model loading and inference.

import dgeb
from dgeb.models import BioSeqTransformer
from dgeb.tasks.tasks import Modality

class MyModel(BioSeqTransformer):

    @property
    def modality(self) -> Modality:
        return Modality.PROTEIN

    @property
    def num_layers(self) -> int:
        return self.config.num_hidden_layers

    @property
    def embed_dim(self) -> int:
        return self.config.hidden_size


model = MyModel(model_name='path_to/huggingface_model')
tasks = dgeb.get_tasks_by_modality(model.modality)
evaluation = dgeb.DGEB(tasks=tasks)
evaluation.run(model)

Evaluating on a custom dataset

We strongly encourage users to contribute their custom datasets to DGEB. Please open a PR adding your dataset so that the community can benefit!

To evaluate on a custom dataset, first upload your dataset to the Huggingface Hub. Then define a Task subclass with TaskMetadata that points to your huggingface dataset. For example, a classification task on a custom dataset can be defined as follows:

import dgeb
from dgeb.models import BioSeqTransformer
from dgeb.tasks import Dataset, Task, TaskMetadata, TaskResult
from dgeb.tasks.classification_tasks import run_classification_task

class MyCustomTask(Task):
    metadata = TaskMetadata(
        id="my_custom_classification",
        display_name="...",
        description="...",
        type="classification",
        modality=dgeb.Modality.PROTEIN,
        datasets=[
            Dataset(
                path="path_to/huggingface_dataset",
                revision="...",
            )
        ],
        primary_metric_id="f1",
    )

    def run(self, model: BioSeqTransformer) -> TaskResult:
        return run_classification_task(model, self.metadata)

model = dgeb.get_model("facebook/esm2_t6_8M_UR50D")
evaluation = dgeb.DGEB(tasks=[MyCustomTask])
evaluation.run(model)

Leaderboard

To add your submission to the DGEB leaderboard, follow these steps.

Fork the DGEB repository by following GitHub's forking workflow instructions.
Add your submission .json file to the leaderboard/submissions/<HF_MODEL_NAME>/ directory.

mv /path/to/<SUBMISSION_FILE>.json /path/to/DGEB/leaderboard/submissions/<HF_MODEL_NAME>/

Update your fork with the new submission:

git add leaderboard/submissions/<HF_MODEL_NAME>/<SUBMISSION_FILE>.json
git commit -m "Add submission for <HF_MODEL_NAME>"
git push

Open a pull request to the main branch of the repository via the GitHub interface.
Once the PR is reviewed and merged, your submission will be added to the leaderboard!
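
The file-placement step above can also be scripted. The sketch below is one possible way to do it and is not part of the DGEB repository: the function name stage_submission is hypothetical. It copies the submission into leaderboard/submissions/<HF_MODEL_NAME>/ and creates that directory first, which a plain mv does not do when the directory is missing. The model name is used as the directory name exactly as given; how the repository expects names containing a slash to be laid out is not specified here.

```python
import shutil
from pathlib import Path


def stage_submission(submission_file: str, repo_root: str, hf_model_name: str) -> Path:
    """Copy a submission JSON into leaderboard/submissions/<hf_model_name>/.

    Hypothetical helper: repo_root is the path to your local DGEB fork.
    Returns the path of the copied file.
    """
    source = Path(submission_file)
    if source.suffix != ".json":
        raise ValueError(f"expected a .json submission file, got {source.name}")
    target_dir = Path(repo_root) / "leaderboard" / "submissions" / hf_model_name
    # Create the per-model directory if this is the first submission for it.
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target_dir / source.name))
```

The returned path can then be passed to git add before committing and pushing to your fork.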

Acknowledgements

DGEB follows the design of the text embedding benchmark MTEB developed by Huggingface 🤗. The evaluation code is adapted from the MTEB codebase.

Citing

DGEB was introduced in "Diverse Genomic Embedding Benchmark for Functional Evaluation Across the Tree of Life". Feel free to cite:

@article{WestRoberts2024,
  title = {Diverse Genomic Embedding Benchmark for functional evaluation across the tree of life},
  url = {http://dx.doi.org/10.1101/2024.07.10.602933},
  DOI = {10.1101/2024.07.10.602933},
  publisher = {Cold Spring Harbor Laboratory},
  author = {West-Roberts, Jacob and Kravitz, Joshua and Jha, Nishant and Cornman, Andre and Hwang, Yunha},
  year = {2024},
  month = jul
}

Project details

Release history

0.2.0 (this version): Sep 3, 2024
0.1.1: Sep 3, 2024
0.1.0: Jul 11, 2024

Download files

Source Distribution

dgeb-0.2.0.tar.gz (162.5 kB)

Uploaded Sep 3, 2024 Source

Built Distribution

dgeb-0.2.0-py3-none-any.whl (296.4 kB)

Uploaded Sep 3, 2024 Python 3

File details

Details for the file dgeb-0.2.0.tar.gz.

File metadata

Download URL: dgeb-0.2.0.tar.gz
Upload date: Sep 3, 2024
Size: 162.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for dgeb-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`b17ef4de7fada6051eeeca2f9e4b030fe2f4efd533f68f2ace6fd7c5f2d399a7`
MD5	`b895e1d2420c8def419ec4e0e865873c`
BLAKE2b-256	`d6df684e6c09b30131c792d03d3c14453b99c23b0db70cb48d030f9950dde9c3`

File details

Details for the file dgeb-0.2.0-py3-none-any.whl.

File metadata

Download URL: dgeb-0.2.0-py3-none-any.whl
Upload date: Sep 3, 2024
Size: 296.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for dgeb-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ffe81955787ec5ed7914ee4a546f8d2c762feee50f9dc95c36a2ff947da7de67`
MD5	`18181f1d266f72f5d7847fa2ce1369c9`
BLAKE2b-256	`5e7fc18bb59d43dfe7d37148336780e1e3f00bf9a8af3ff704392b56ffe4e709`
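
The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the Python standard library. The function names sha256_of and verify_download are illustrative, and the digests are copied from the tables above.

```python
import hashlib
from pathlib import Path

# SHA256 digests as published in the file details for dgeb 0.2.0.
PUBLISHED_SHA256 = {
    "dgeb-0.2.0.tar.gz": "b17ef4de7fada6051eeeca2f9e4b030fe2f4efd533f68f2ace6fd7c5f2d399a7",
    "dgeb-0.2.0-py3-none-any.whl": "ffe81955787ec5ed7914ee4a546f8d2c762feee50f9dc95c36a2ff947da7de67",
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str) -> bool:
    """Compare a downloaded file against the published digest for its name.

    Raises KeyError if the file name is not one of the two listed files.
    """
    expected = PUBLISHED_SHA256[Path(path).name]
    return sha256_of(path) == expected
```

For example, verify_download("dgeb-0.2.0.tar.gz") returns True only if the local file matches the published digest.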
