Bioinformatics datasets and tools
Project description
BioSets: Dataset Creation for Biological Research
Please note that this project is in the early stages of development. The documentation and features are subject to change.
Overview
BioSets is a library built on top of the datasets
library for loading, manipulating,
and processing biological datasets for machine learning purposes. It supports genomics,
transcriptomics, proteomics, metabolomics, and other types of biological data.
This repository contains tools and documentation for creating biological datasets using BioSets. The library loads biological data from local files, creates custom datasets, and handles large volumes of biological information. BioSets is intended for researchers and data scientists in bioinformatics, systems biology, and biotechnology.
Features
🧬 Loading sample metadata and feature metadata: BioSets loads both sample metadata and feature metadata.
🧬 Support for various biological data types: Includes predefined classes for genomic variants, gene expression data, clinical trial data, and OTU tables.
🧬 Automatic Sample/Batch Detection: Automatically detects sample and batch information from the loaded data to handle batch effects and confounding factors.
🧬 Custom dataset creation: Create custom datasets with specific features, metadata, and labels.
🧬 Integration with datasets library: BioSets builds on the datasets
library's
functionality. Note that if path
is not a value found in
biosets.list_experiment_types()
, it acts like Huggingface's datasets
library.
Getting Started
To use the BioSets library, clone the repository and install the necessary dependencies. After setting up your environment, create your dataset by following the steps below.
Installation
Install BioSets using pip:
pip install biosets
Creating a Biological Dataset
To create a dataset for biological research using BioSets, follow these steps:
-
Organize Your Data: Prepare your biological data in a structured format that BioSets can process (e.g., directory of relevant files).
-
Load Your Data with Metadata: Use
load_dataset()
to load your data along with sample metadata and feature metadata:from biosets import load_dataset dataset = load_dataset( "snp", data_files="/path/to/snp_data.csv", sample_metadata_files="/path/to/sample_metadata.csv", feature_metadata_files="/path/to/feature_metadata.csv", )
-
Utilize Metadata for Analysis: The loaded dataset allows you to access and use metadata in downstream analyses. For example, you can handle abundance data differently based on its type:
from biosets.features import Abundance for k, v in dataset.features.items(): if isinstance(v, Abundance): print(f"Processing abundance feature: {k}")
Dataset Examples
Loading Specific Experiments
Use specific experiment types for loading data, such as otu
, maldi
, rna
, or snp
to ensure the appropriate configuration is applied:
🧬 OTU Data
dataset = load_dataset("otu", data_files="/path/to/otu_data.csv")
🧬 RNA Data
dataset = load_dataset("rna", data_files="/path/to/rna_data.csv")
🧬 SNP Data
dataset = load_dataset("snp", data_files="/path/to/snp_data.csv")
Next Steps
After creating your biological dataset, you can use BioSets for feature extraction, model training, or data visualization.
For more advanced usage, refer to the dataset loading documentation. For building custom datasets, refer to the custom dataset creation documentation.
For any additional information, refer to the datasets library documentation.
Contributing
Contributions are welcome! If you have suggestions for improvements or new features, open an issue or submit a pull request. For major changes, open an issue first to discuss it.
License
This project contains portions derived from various sources under the Apache License, Version 2.0. For full details, please refer to the LICENSE file.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.