Open-source collection of biology datasets and pre-trained embeddings.
bio-datasets
Description
bio-datasets is a collaborative framework that allows the user to fetch publicly available sequence-based protein datasets. For these datasets, pre-trained contextual embeddings are also available.
Installation
Install the required dependencies:
pip install -r requirements.txt
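After installing, it can help to confirm that the package is importable before running anything else. The following is a minimal sketch using only the standard library. The helper name is_installed is hypothetical and not part of bio-datasets, and the import name biodatasets is taken from the usage snippet in this README.

```python
import importlib.util


def is_installed(module_name="biodatasets"):
    """Return True if a top-level module with this name can be found.

    Hypothetical helper for illustration; it only checks that the module
    is discoverable on the current Python path and does not import it.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed():
        print("biodatasets is available")
    else:
        print("biodatasets was not found; check your installation")
```

Checking with find_spec avoids importing the package just to test for its presence.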
How it works
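from biodatasets import list_datasets, load_dataset
# Print the datasets that can be loaded
print(list_datasets())
# Load a dataset by name
my_dataset = load_dataset('test')
# Convert the selected input and target variables to NumPy arrays
X, y = my_dataset.to_npy_arrays(input_names=['peptide'], target_names=['target'])
# Get the pre-trained ProtBert CLS embeddings for the "peptide" variable
embeddings = my_dataset.get_embeddings(variable_name="peptide", model_name="protbert", embeddings_type="cls")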
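A small guard around the calls above can give a clearer error when a dataset name is mistyped. The sketch below assumes that list_datasets() returns an iterable of dataset name strings, which this README does not confirm. load_checked is a hypothetical helper, not part of bio-datasets. The two library functions are passed in as arguments so the helper can be tried with stand-ins.

```python
def load_checked(name, list_fn, load_fn):
    """Load `name` with load_fn after checking it is listed by list_fn().

    Hypothetical helper, not part of bio-datasets. Assumes list_fn()
    returns an iterable of dataset name strings.
    """
    available = sorted(list_fn())
    if name not in available:
        raise ValueError(f"Unknown dataset {name!r}; available: {available}")
    return load_fn(name)
```

With the package installed, and if the assumption about list_datasets() holds, the call would look like load_checked('test', list_datasets, load_dataset).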