Independent evaluation set construction for trustworthy ML models in biochemistry

Hestia-GOOD

Computational tool for generating evaluation sets that measure model generalisation.

Installation

Installing in a conda environment is recommended. To create the environment, run:

conda create -n hestia python
conda activate hestia

1. Python Package

1.1. From PyPI

pip install hestia-good

1.2. Directly from source

pip install git+https://github.com/IBM/Hestia-GOOD

2. Optional dependencies

2.1. Molecular similarity

RDKit is required for calculating molecular similarities:

pip install rdkit

2.2. Sequence alignment

To use MMSeqs2, install the build that matches your system:

# static build with AVX2 (fastest) (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# static build with SSE4.1  (check using: cat /proc/cpuinfo | grep sse4)
wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# static build with SSE2 (slowest, for very old systems)  (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# MacOS
brew install mmseqs2  
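
The choice between the three Linux builds above depends on the CPU flags reported in /proc/cpuinfo. A minimal sketch of that decision follows. The helper names read_cpu_flags and pick_mmseqs_build are hypothetical and not part of Hestia or MMSeqs2; only the archive names are taken from the commands above.

```python
def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU flags listed in a Linux cpuinfo file (empty set if unavailable)."""
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith("flags") and ":" in line:
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


def pick_mmseqs_build(cpu_flags):
    """Pick the fastest static build supported by the given flags (hypothetical helper)."""
    if "avx2" in cpu_flags:
        return "mmseqs-linux-avx2.tar.gz"
    if "sse4_1" in cpu_flags:
        return "mmseqs-linux-sse41.tar.gz"
    if "sse2" in cpu_flags:
        return "mmseqs-linux-sse2.tar.gz"
    return None  # no matching x86 build; see the MMSeqs2 site for other platforms


if __name__ == "__main__":
    print(pick_mmseqs_build(read_cpu_flags()))
```

The function returns None when none of the three flags is present, for example on a non-x86 machine.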

To use Needleman-Wunsch, either:

conda install -c bioconda emboss

or

sudo apt install emboss

2.3. Structure alignment

To use Foldseek, install the build that matches your system:

# Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

# Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

# Linux ARM64 build
wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

# MacOS
wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
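
After installing the optional tools, it can help to confirm that their executables are visible on PATH before running any alignment. The sketch below uses only the Python standard library. The helper name missing_tools is hypothetical, and it assumes the executables are named mmseqs, needle (the EMBOSS Needleman-Wunsch program), and foldseek.

```python
import shutil


def missing_tools(tools=("mmseqs", "needle", "foldseek")):
    """Return the names of the executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All optional alignment tools were found.")
```

Only the tools needed for the chosen similarity method have to be present, so a non-empty result is not necessarily an error.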

Documentation

1. DatasetGenerator

The HestiaGenerator allows for the easy generation of training/validation/evaluation partitions at different similarity thresholds, enabling the estimation of model generalisation capabilities. It also allows for the calculation of the AU-GOOD (Area Under the Generalization Out-Of-Distribution curve). More information is available in the Dataset Generator docs.

from hestia.dataset_generator import HestiaGenerator, SimArguments

# Initialise the generator for a DataFrame
generator = HestiaGenerator(df)

# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)

# Similarity arguments for protein similarity
prot_args = SimArguments(
    data_type='sequence', field_name='sequence',
    alignment_algorithm='mmseqs2+prefilter', verbose=3
)

# Similarity arguments for molecular similarity
mol_args = SimArguments(
    data_type='small molecule', field_name='SMILES',
    fingerprint='mapc', radius=2, bits=2048
)

# Calculate the similarity
generator.calculate_similarity(prot_args)

# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)

# Save partitions
generator.save_precalculated('precalculated_partitions.gz')

# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')

# Training code (discard partitions whose test set contains less than 18.5% of the total data)

for threshold, partition in generator.get_partitions(filter=0.185):
    train = df.iloc[partition['train']]
    valid = df.iloc[partition['valid']]
    test = df.iloc[partition['test']]

# ...

# Calculate AU-GOOD
generator.calculate_augood(results, 'test_mcc')

# Plot GOOD
generator.plot_good(results, 'test_mcc')

# Compare two models
results = {'model A': [values_A], 'model B': [values_B]}
generator.compare_models(results, statistical_test='wilcoxon')
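
To give an intuition for what an area under a performance-versus-threshold curve summarises, here is a minimal sketch that integrates a metric over similarity thresholds with the trapezoidal rule and normalises by the threshold range. It is an illustration only and is not Hestia's calculate_augood implementation, which may weight thresholds differently. The function name area_under_good_curve and the example values are hypothetical.

```python
def area_under_good_curve(thresholds, scores):
    """Trapezoidal area under a score-vs-threshold curve, normalised by the threshold range."""
    points = sorted(zip(thresholds, scores))
    if len(points) < 2:
        raise ValueError("at least two (threshold, score) points are needed")
    span = points[-1][0] - points[0][0]
    if span == 0:
        raise ValueError("thresholds must not all be identical")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / span


if __name__ == "__main__":
    # Made-up scores for three thresholds, purely for illustration.
    print(area_under_good_curve([0.3, 0.4, 0.5], [0.5, 0.6, 0.7]))
```

Normalising by the threshold range keeps the summary on the same scale as the metric, so a model whose score is constant across thresholds gets exactly that score.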

2. Similarity calculation

Calculating pairwise similarity between the entities within a DataFrame df_query or between two DataFrames df_query and df_target can be achieved through the calculate_similarity function. More details about similarity calculation can be found in the Similarity calculation documentation.

from hestia.similarity import sequence_similarity_mmseqs
import pandas as pd

df_query = pd.read_csv('example.csv')

# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.

sim_df = sequence_similarity_mmseqs(df_query, field_name='sequence', prefilter=True)
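
To illustrate the idea of a pairwise similarity table, the sketch below scores every pair of sequences with a Jaccard index over shared k-mers. This is a deliberately simple stand-in and not the alignment-based identity that MMSeqs2 computes. The function names and the query/target/metric keys are assumptions made for this example, not a documented Hestia output format.

```python
from itertools import combinations


def kmers(sequence, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def pairwise_similarity(sequences, k=3):
    """Score every unordered pair of sequences with a k-mer Jaccard index."""
    kmer_sets = [kmers(seq, k) for seq in sequences]
    rows = []
    for i, j in combinations(range(len(sequences)), 2):
        union = kmer_sets[i] | kmer_sets[j]
        shared = kmer_sets[i] & kmer_sets[j]
        similarity = len(shared) / len(union) if union else 0.0
        rows.append({"query": i, "target": j, "metric": similarity})
    return rows


if __name__ == "__main__":
    for row in pairwise_similarity(["ABCDE", "ABCDF", "VWXYZ"]):
        print(row)
```

Each row refers to entities by their position in the input, which is the kind of index-based table that clustering and partitioning steps can consume.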

3. Clustering

Clustering the entities within a DataFrame df can be achieved through the generate_clusters function. There are three clustering algorithms currently supported: CDHIT, greedy_cover_set, or connected_components. More details about clustering can be found in the Clustering documentation.

from hestia.similarity import sequence_similarity_mmseqs
from hestia.clustering import generate_clusters
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithm='CDHIT')
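
Of the three algorithms, connected components is the simplest to picture: any two entities whose similarity reaches the threshold end up in the same cluster, directly or through a chain of neighbours. The following sketch shows that idea with a union-find structure. It is not Hestia's implementation, and the function name and the (i, j, similarity) input format are assumptions for this example.

```python
def connected_components(n_items, pairs, threshold):
    """Label items so that pairs with similarity >= threshold share a cluster.

    pairs is an iterable of (i, j, similarity) tuples with integer indices.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, similarity in pairs:
        if similarity >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n_items)]


if __name__ == "__main__":
    pairs = [(0, 1, 0.9), (1, 2, 0.5), (3, 4, 0.2)]
    print(connected_components(5, pairs, threshold=0.3))
```

Items 0, 1, and 2 are linked through item 1, while items 3 and 4 stay apart because their similarity is below the threshold.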

4. Partitioning

Partitioning the entities within a DataFrame df into training and evaluation subsets can be achieved through 4 different functions: ccpart, graph_part, reduction_partition, and random_partition. More details about partitioning algorithms can be found in the Partitioning documentation. An example of how ccpart would be used is:

from hestia.similarity import sequence_similarity_mmseqs
from hestia.partition import ccpart
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
# Call the function under the same name it was imported with
train, test, partition_labs = ccpart(df, threshold=0.3, test_size=0.2, sim_df=sim_df)

train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
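
The general idea behind a cluster-aware split can be sketched as follows: whole clusters are moved to the test set, smallest first, until the requested test fraction would be exceeded, and the result is then checked for train/test pairs that are still too similar. This is a simplified illustration and not the algorithm implemented by ccpart. The function names and input formats are assumptions for this example.

```python
from collections import defaultdict


def partition_by_cluster(cluster_labels, test_size=0.2):
    """Assign whole clusters to the test set, smallest first, up to test_size."""
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    target = test_size * len(cluster_labels)
    test = []
    for label in sorted(members, key=lambda lab: (len(members[lab]), lab)):
        if len(test) + len(members[label]) > target:
            break  # every remaining cluster is at least as large
        test.extend(members[label])

    test_set = set(test)
    train = [i for i in range(len(cluster_labels)) if i not in test_set]
    return train, sorted(test)


def leaking_pairs(train, test, pairs, threshold):
    """Return (i, j, similarity) pairs at or above threshold that straddle train and test."""
    train_set, test_set = set(train), set(test)
    return [
        (i, j, sim) for i, j, sim in pairs
        if sim >= threshold
        and ((i in train_set and j in test_set) or (i in test_set and j in train_set))
    ]


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 2, 0, 0, 3, 3, 0]
    train, test = partition_by_cluster(labels, test_size=0.2)
    print(train, test)
```

Because only whole clusters are moved, the test set can end up smaller than the requested fraction when the remaining clusters are too large to fit.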

License

Hestia is open-source software licensed under the MIT License. Check the details in the LICENSE file.
