bio-datasets
Open-source collection of biology datasets and pre-trained embeddings.
Description
bio-datasets is a collaborative framework for fetching publicly available, sequence-based protein datasets. Pre-trained contextual embeddings are also available for these datasets.
Installation
Install the package and its dependencies from PyPI (the distribution is named bio-datasets, while the import name is biodatasets):
pip install bio-datasets
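To confirm that the installation worked, the following minimal sketch checks both names involved. It assumes the distribution name is bio-datasets (as in the release file names) and the import name is biodatasets (as in the usage example); the helper names installed_version and is_importable are hypothetical and not part of the package.

```python
from importlib import metadata, util

DIST_NAME = "bio-datasets"   # assumed pip/distribution name, taken from the release file names
IMPORT_NAME = "biodatasets"  # import name, as used in the usage example


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("installed version:", installed_version(DIST_NAME))
    print("importable:", is_importable(IMPORT_NAME))
```

If the first line prints None, the package is not installed in the active environment.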
How it works
from biodatasets import list_datasets, load_dataset
print(list_datasets())
pathogen = load_dataset("pathogen")
X, y = pathogen.to_npy_arrays(input_names=["sequence"], target_names=["class"])
embeddings = pathogen.get_embeddings("sequence", "protbert", "cls")
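Because load_dataset takes a dataset name, it can help to check that name against list_datasets() first and fail with a readable message. The sketch below is one way to do that. It assumes that list_datasets() returns an iterable of dataset names, which the source does not confirm; load_checked is a hypothetical helper, and the two library functions are passed in as arguments so the helper can be tried without network access.

```python
def load_checked(name, list_fn, load_fn):
    """Load a dataset by name after checking that the name is listed.

    list_fn: callable returning the available dataset names
             (assumed behaviour of biodatasets.list_datasets).
    load_fn: callable that loads a dataset by name
             (e.g. biodatasets.load_dataset).
    """
    available = sorted(list_fn())
    if name not in available:
        raise ValueError(
            f"Unknown dataset {name!r}; available datasets: {available}"
        )
    return load_fn(name)


# Intended usage with the real package (requires it to be installed):
#   from biodatasets import list_datasets, load_dataset
#   pathogen = load_checked("pathogen", list_datasets, load_dataset)
```

Passing the functions in rather than importing them inside the helper keeps the check independent of the package and easy to exercise with stand-in functions.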
Download files
Source Distribution
bio-datasets-0.0.3.tar.gz (5.4 kB)
Built Distribution
Hashes for bio_datasets-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4b325975c1c6dc0b45c6d9e2a433fc9192c4c2442bf9e2c1453d21c4fd8a97c8
MD5 | c90dde9380e9644ecd89bc779ae3ede6
BLAKE2b-256 | 29511caf7437416603cefd9cc63e7cce1fe4a972c74d69e8b748ddab9ee30e6c