Bioinformatics datasets and tools

Biosets

Biosets is a specialized library that extends 🤗 Datasets for bioinformatics data, providing the following main features:

  • Bioinformatics Specialization: Streamlines data management specific to bioinformatics, such as handling samples, features, batches, and associated metadata.
  • Automatic Column Detection: Infers sample, batch, input features, and target columns, simplifying downstream preprocessing.
  • Custom Data Classes: Leverages specialized data classes (ValueWithMetadata, Sample, Batch, RegressionTarget, etc.) to manage metadata-rich bioinformatics data.
  • Polars Integration: Optional Polars integration enables high-performance data manipulation, ideal for large datasets.
  • Flexible Task Support: Native support for binary classification, multiclass classification, multiclass-to-binary classification, and regression, adapting to diverse bioinformatics tasks.
  • Integration with 🤗 Datasets: The load_dataset function loads bioinformatics data from formats such as CSV, JSON, and NPZ, and can attach sample and feature metadata.
  • Arrow File Caching: Uses Apache Arrow for on-disk caching, so large datasets can be read quickly without being loaded fully into memory.
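
To make the multiclass-to-binary task in the list above concrete, here is a minimal sketch in plain Python. It does not use Biosets and is not its implementation; the function name `multiclass_to_binary` and the example labels are hypothetical, chosen only to show the idea of collapsing several classes into a positive and a negative group.

```python
from typing import Hashable, Iterable, List


def multiclass_to_binary(
    labels: Iterable[Hashable], positive: Iterable[Hashable]
) -> List[int]:
    """Map each label to 1 if it belongs to the positive group, else 0.

    Hypothetical helper for illustration; not part of the Biosets API.
    """
    positive_set = set(positive)
    return [1 if label in positive_set else 0 for label in labels]


# Illustrative labels: two disease classes are merged into one positive class.
labels = ["healthy", "crohns", "colitis", "healthy"]
binary = multiclass_to_binary(labels, positive=["crohns", "colitis"])
print(binary)
```

Any label that is not listed as positive falls into the negative class, so the caller only has to name one side of the split.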

Biosets helps bioinformatics researchers focus on analysis rather than data handling, with seamless compatibility with 🤗 Datasets.

Installation

With pip

You can install Biosets from PyPI:

pip install biosets

With conda

Install Biosets via conda:

conda install -c patrico49 biosets
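
After either install route, one way to confirm that the package is visible to the current interpreter is to query its installed version through the standard library. This is a generic sketch using `importlib.metadata`; the helper name `installed_version` is an assumption for illustration, and it assumes the distribution is registered under the name `biosets`, as in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("biosets")
if version is None:
    print("biosets is not installed in this environment")
else:
    print(f"biosets {version} is installed")
```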

Usage

Biosets provides a straightforward API for handling bioinformatics datasets with integrated metadata management. Here's a quick example:

from biosets import load_dataset

bio_data = load_dataset(
    data_files="data_with_samples.csv",
    sample_metadata_files="sample_metadata.csv",
    feature_metadata_files="feature_metadata.csv",
    target_column="metadata1",
    experiment_type="metagenomics",
    batch_column="batch",
    sample_column="sample",
    metadata_columns=["metadata1", "metadata2"],
    drop_samples=False
)["train"]

For further details, check the advanced usage documentation.
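
The quick example expects a data file and a sample metadata file that share a sample column. The sketch below builds two tiny CSV files with the standard library and joins them on that column, to show the kind of relationship the `sample_column` argument describes. The file layout, the feature names (`taxon_a`, `taxon_b`), and the values are assumptions made for illustration, and the join is written by hand rather than taken from Biosets.

```python
import csv
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())


def write_csv(path, header, rows):
    """Write a header and rows to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Assumed layout: one row per sample, one column per measured feature.
write_csv(
    tmp / "data_with_samples.csv",
    ["sample", "taxon_a", "taxon_b"],
    [["s1", 10, 0], ["s2", 3, 7]],
)
# Assumed layout: one row per sample, with batch and metadata columns.
write_csv(
    tmp / "sample_metadata.csv",
    ["sample", "batch", "metadata1", "metadata2"],
    [["s1", "b1", "case", 34], ["s2", "b2", "control", 51]],
)

# Index the metadata by sample ID, then attach it to each data row.
metadata_by_sample = {
    row["sample"]: row for row in read_rows(tmp / "sample_metadata.csv")
}
joined = [
    {**metadata_by_sample[row["sample"]], **row}
    for row in read_rows(tmp / "data_with_samples.csv")
]
print(joined[0])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns would still need converting before analysis.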

Main Differences Between Biosets and 🤗 Datasets

  • Bioinformatics Focus: While 🤗 Datasets is a general-purpose library, Biosets is tailored for the bioinformatics domain.
  • Seamless Metadata Integration: Biosets is built for datasets with metadata dependencies, like sample and feature metadata.
  • Automatic Column Detection: Reduces preprocessing time with automatic inference of sample, batch, feature, and label columns.
  • Specialized Data Classes: Biosets introduces custom classes (e.g., Sample, Batch, ValueWithMetadata) to enable richer data representation.
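
As a rough picture of what automatic column detection could look like, the sketch below assigns roles to columns by matching their names against a small table of hints. This is a hypothetical heuristic written for illustration only: the hint table `NAME_HINTS`, the function `detect_columns`, and the matching rule are assumptions, and Biosets may infer columns in a different way.

```python
from typing import Dict, List

# Hypothetical name hints per role; not taken from Biosets.
NAME_HINTS = {
    "sample": ("sample", "sample_id"),
    "batch": ("batch", "batch_id"),
    "target": ("target", "label", "labels"),
}


def detect_columns(columns: List[str]) -> Dict[str, object]:
    """Guess the sample, batch, and target columns from names alone.

    Columns not claimed by any role are treated as input features.
    """
    lowered = {name.lower(): name for name in columns}
    roles: Dict[str, object] = {}
    for role, hints in NAME_HINTS.items():
        # Take the first hint that matches a column name, ignoring case.
        roles[role] = next((lowered[h] for h in hints if h in lowered), None)
    claimed = {name for name in roles.values() if name is not None}
    roles["features"] = [name for name in columns if name not in claimed]
    return roles


roles = detect_columns(["Sample", "batch", "taxon_a", "taxon_b", "label"])
print(roles)
```

A name-based rule like this is easy to read but brittle, which is one reason to pass explicit column arguments when the names in a dataset are unusual.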

Disclaimers

Biosets may run Python code from custom dataset scripts to handle specific data formats. For security, users should:

  • Inspect dataset scripts prior to execution.
  • Use pinned versions for any repository dependencies.
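
One reading of "pinned" is that each dependency is fixed to an exact version with `==`. Under that assumption, the sketch below checks requirement lines for an exact pin. The helper `is_pinned` and its regular expression are illustrative and simplified; they do not cover every form allowed in requirement files, such as URLs or hashes.

```python
import re

# Simplified pattern: name, optional extras, then "==" and a version
# without wildcards. An assumption for illustration, not a full parser.
PINNED = re.compile(
    r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,._-]+\])?==[A-Za-z0-9.!+_-]+$"
)


def is_pinned(requirement: str) -> bool:
    """Return True if a requirement line fixes one exact version."""
    # Drop trailing comments and environment markers, then remove spaces.
    spec = requirement.split("#", 1)[0].split(";", 1)[0].replace(" ", "")
    return bool(PINNED.match(spec))


for line in ["biosets==1.2.1", "biosets>=1.2", "biosets"]:
    print(line, "->", "pinned" if is_pinned(line) else "not pinned")
```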

If you manage a dataset and wish to update or remove it, please open a discussion or pull request on the Community tab of 🤗's datasets page.

BibTeX

If you'd like to cite Biosets, please use the following:

@misc{smyth2024biosets,
    title = {psmyth94/biosets: 1.1.0},
    author = {Patrick Smyth},
    year = {2024},
    url = {https://github.com/psmyth94/biosets},
    note = {A library designed to support bioinformatics data with custom features, metadata integration, and compatibility with 🤗 Datasets.}
}

Download files

Source Distribution

biosets-1.2.1.tar.gz (83.7 kB)

Built Distribution

biosets-1.2.1-py3-none-any.whl (93.2 kB)

File details

Details for the file biosets-1.2.1.tar.gz.

File metadata

  • Download URL: biosets-1.2.1.tar.gz
  • Size: 83.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for biosets-1.2.1.tar.gz:

  • SHA256: 7adf64679a07aa52c96dd1bcdaea2f222e16db04de4907c90b557059a0bf865b
  • MD5: f53a865c784e06d1515b4f38ae03be19
  • BLAKE2b-256: 3e484af7a13ae318f51c9c0dc8644b9b08217b5d8dac5c275ef331d52342941c

File details

Details for the file biosets-1.2.1-py3-none-any.whl.

File metadata

  • Download URL: biosets-1.2.1-py3-none-any.whl
  • Size: 93.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for biosets-1.2.1-py3-none-any.whl:

  • SHA256: 3dde3ac1e7e7ae754e725084b6d7a8cab3f420706e182a3fa537a585f0060180
  • MD5: e82587f0a14b73da24527371fd0bec92
  • BLAKE2b-256: 8d8d1ec0df3d5de9423f2c76e2246de55fa68e58ddf0ce1fbabae090ddfc4030
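
The SHA256 digests listed for the two files can be checked against a local download with the standard library. The sketch below is one way to do that; the helper names `sha256_of` and `verify` are illustrative, and it assumes the downloaded files keep the file names shown above.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details listed above.
EXPECTED_SHA256 = {
    "biosets-1.2.1.tar.gz": (
        "7adf64679a07aa52c96dd1bcdaea2f222e16db04de4907c90b557059a0bf865b"
    ),
    "biosets-1.2.1-py3-none-any.whl": (
        "3dde3ac1e7e7ae754e725084b6d7a8cab3f420706e182a3fa537a585f0060180"
    ),
}


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected


# Only runs the check if the archive is present in the working directory.
candidate = Path("biosets-1.2.1.tar.gz")
if candidate.exists():
    print("digest matches" if verify(candidate) else "digest mismatch")
```

pip can enforce the same check at install time through its hash-checking mode, which avoids a separate verification step.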
