
Suite of tools for analysing the independence between training and evaluation biosequence datasets and for generating new generalisation-evaluating hold-out partitions


Hestia

Computational tool for generating generalisation-evaluating evaluation sets.



Installation

Installing in a conda environment is recommended. To create the environment, run:

conda create -n autopeptideml python
conda activate autopeptideml

1. Python Package

1.1. From PyPI

pip install hestia-ood

1.2. Directly from source

pip install git+https://github.com/IBM/Hestia-OOD

2. Optional dependencies

2.1. Molecular similarity

RDKit is required for calculating molecular similarities:

pip install rdkit

2.2. Sequence alignment

To use MMSeqs2 as the alignment algorithm, install it in the environment:

conda install -c bioconda mmseqs2

To use Needleman-Wunsch:

conda install -c bioconda emboss

If you are not installing in a conda environment, check the installation instructions for your platform:

  • Linux:

    wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
    tar xvfz mmseqs-linux-avx2.tar.gz
    export PATH=$(pwd)/mmseqs/bin/:$PATH
    
    sudo apt install emboss
    
  • Windows: Download binaries from EMBOSS and MMSeqs2-latest

  • Mac:

    sudo port install emboss
    brew install mmseqs2
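
After installing the optional alignment tools, it can be useful to confirm that they are visible from Python. The following is a minimal sketch, not part of Hestia: `find_missing_tools` is a hypothetical helper, and it assumes the MMSeqs2 executable is named `mmseqs` and the EMBOSS Needleman-Wunsch program is named `needle`.

```python
import shutil


def find_missing_tools(executables):
    """Return the names in `executables` that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


# Assumed executable names: `mmseqs` (MMSeqs2) and `needle` (EMBOSS).
missing = find_missing_tools(["mmseqs", "needle"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All alignment tools found.")
```

If a tool is reported as missing after a manual install, the usual cause is that its `bin` directory was not added to `PATH` in the current shell.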
    

Documentation

1. DatasetGenerator

The HestiaDatasetGenerator generates training/validation/evaluation partitions at different similarity thresholds, which makes it possible to estimate a model's generalisation capabilities. It also calculates the ABOID (area between the out-of-distribution similarity-performance curve and the in-distribution performance).

from hestia.dataset_generator import HestiaDatasetGenerator, SimilarityArguments
import pandas as pd

# Load the dataset and initialise the generator for the DataFrame
df = pd.read_csv('example.csv')
generator = HestiaDatasetGenerator(df)

# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)
args = SimilarityArguments(
    data_type='protein', field_name='sequence',
    similarity_metric='mmseqs2+prefilter', verbose=3
)

# Calculate the similarity
generator.calculate_similarity(args)

# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)

# Save partitions
generator.save_precalculated('precalculated_partitions.gz')

# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')

# Training code

for threshold, partition in generator.get_partitions():
    train = df.iloc[partition['train']]
    valid = df.iloc[partition['valid']]
    test = df.iloc[partition['test']]

# ...

# Calculate AU-GOOD
generator.calculate_augood(results, 'test_mcc')

# Plot GOOD
generator.plot_good(results, 'test_mcc')

# Compare two models
results = {'model A': [values_A], 'model B': [values_B]}
generator.compare_models(results, statistical_test='wilcoxon')
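
To make the area-based summary described at the start of this section more concrete, here is a minimal sketch of an area between a constant in-distribution score and an out-of-distribution performance curve, integrated over the similarity threshold with the trapezoidal rule. This is an illustration under stated assumptions, not Hestia's implementation: `area_between_curves` is a hypothetical name, and the sign convention (in-distribution minus out-of-distribution) is an assumption.

```python
def area_between_curves(thresholds, ood_scores, id_score):
    """Trapezoidal area between a constant in-distribution score and the OOD curve.

    thresholds: similarity thresholds at which the model was evaluated.
    ood_scores: out-of-distribution performance at each threshold.
    id_score:   in-distribution performance, treated as a flat baseline (assumption).
    """
    if len(thresholds) != len(ood_scores) or len(thresholds) < 2:
        raise ValueError("need at least two (threshold, score) pairs")

    # Sort by threshold so the integration runs left to right.
    points = sorted(zip(thresholds, ood_scores))

    area = 0.0
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        gap0 = id_score - s0
        gap1 = id_score - s1
        area += 0.5 * (gap0 + gap1) * (t1 - t0)
    return area


# Toy numbers, for illustration only.
print(area_between_curves([0.3, 0.4, 0.5], [0.5, 0.6, 0.7], id_score=0.8))
```

Under this convention a smaller area means the model loses less performance as the test set becomes less similar to the training set.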

2. Similarity calculation

Calculating pairwise similarity between the entities within a DataFrame df_query or between two DataFrames df_query and df_target can be achieved through the calculate_similarity function:

from hestia.similarity import calculate_similarity
import pandas as pd

df_query = pd.read_csv('example.csv')

# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.

sim_df = calculate_similarity(df_query, species='protein', similarity_metric='mmseqs+prefilter',
                              field_name='sequence')

More details about similarity calculation can be found in the Similarity calculation documentation.
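
For intuition about what a pairwise similarity table contains, the sketch below builds one in pure Python. It is not how Hestia computes similarity: it swaps in `difflib.SequenceMatcher` from the standard library in place of an aligner such as MMSeqs2, and the function name `pairwise_similarity` and the `query`/`target`/`metric` keys are hypothetical choices made for the illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(sequences, threshold=0.0):
    """Return one row per unordered pair of sequences whose score is >= threshold.

    The score is SequenceMatcher.ratio(), a stand-in for a real alignment identity.
    Rows refer to sequences by their position in the input list.
    """
    rows = []
    for i, j in combinations(range(len(sequences)), 2):
        matcher = SequenceMatcher(None, sequences[i], sequences[j], autojunk=False)
        score = matcher.ratio()
        if score >= threshold:
            rows.append({"query": i, "target": j, "metric": score})
    return rows


for row in pairwise_similarity(["ACDE", "ACDF", "WWWW"], threshold=0.5):
    print(row)
```

Storing only the pairs above a threshold keeps the table sparse, which matters because the number of pairs grows quadratically with the dataset size.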

3. Clustering

Clustering the entities within a DataFrame df can be achieved through the generate_clusters function:

from hestia.similarity import calculate_similarity
from hestia.clustering import generate_clusters
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = calculate_similarity(df, species='protein', similarity_metric='mmseqs+prefilter',
                              field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithm='CDHIT')

There are three clustering algorithms currently supported: CDHIT, greedy_cover_set, or connected_components. More details about clustering can be found in the Clustering documentation.
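
As an illustration of the idea behind connected-components clustering, the sketch below groups entities so that any two linked by a chain of similarities at or above a threshold end up in the same cluster. It is a generic union-find implementation written for this example, not Hestia's code; the function name and the `(i, j, score)` input format are assumptions.

```python
def connected_components(n_items, pairs, threshold):
    """Label each item with the smallest index in its connected component.

    pairs: iterable of (i, j, score) tuples; an edge exists when score >= threshold.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, score in pairs:
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Keep the smaller index as the root so labels are deterministic.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    return [find(i) for i in range(n_items)]


pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.2)]
print(connected_components(5, pairs, threshold=0.5))
```

Items 0 and 2 are never compared directly in this toy input, yet they share a cluster because both are similar to item 1.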

4. Partitioning

Partitioning the entities within a DataFrame df into training and evaluation subsets can be achieved through four different functions: ccpart, graph_part, reduction_partition, and random_partition. An example of how ccpart would be used is:

from hestia.partition import ccpart
import pandas as pd

df = pd.read_csv('example.csv')
train, test = ccpart(df, species='protein', similarity_metric='mmseqs+prefilter',
                     field_name='sequence', threshold=0.3, test_size=0.2)

train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
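
The key property of a cluster-based split is that no cluster is divided between training and evaluation. The sketch below shows one way to obtain such a split from precomputed cluster labels. It is an assumption-laden illustration, not the behaviour of ccpart: `split_by_cluster` is a hypothetical helper, and filling the test set with the smallest clusters first is a design choice made for this example.

```python
from collections import defaultdict


def split_by_cluster(labels, test_size=0.2):
    """Split item indices into (train, test) without dividing any cluster.

    labels:    one cluster label per item.
    test_size: target fraction of items in the test set (an upper bound here).
    """
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)

    budget = int(test_size * len(labels))
    train, test = [], []
    # Smallest clusters first (assumption): they fit the budget most easily.
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= budget:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


labels = [0, 0, 0, 3, 4, 0, 0, 0, 0, 0]
train, test = split_by_cluster(labels, test_size=0.2)
print(train, test)
```

Because whole clusters are moved, the test set can end up smaller than `test_size` when the remaining clusters are too large to fit.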

License

Hestia is open-source software licensed under the MIT License. Check the details in the LICENSE file.

