Suite of tools for analysing the independence between training and evaluation biosequence datasets and for generating new generalisation-evaluating hold-out partitions
Project description
- Documentation: https://ibm.github.io/Hestia-OOD
- Source Code: https://github.com/IBM/Hestia-OOD
- Webserver: http://peptide.ucd.ie/Hestia
- Paper Pre-print: https://www.biorxiv.org/content/10.1101/2024.03.14.584508v1
Installation
Installing in a conda environment is recommended. To create and activate the environment, run:
conda create -n hestia python
conda activate hestia
1. Python Package
1.1. From PyPI
pip install hestia-ood
1.2. Directly from source
pip install git+https://github.com/IBM/Hestia-OOD
2. Third-party dependencies
To use MMseqs2 as the alignment algorithm, it needs to be installed in the environment:
conda install -c bioconda mmseqs2
To use the Needleman-Wunsch algorithm (EMBOSS):
conda install -c bioconda emboss
If not installing within a conda environment, please check the installation instructions for your particular platform:
- Linux:
  wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
  tar xvfz mmseqs-linux-avx2.tar.gz
  export PATH=$(pwd)/mmseqs/bin/:$PATH
  sudo apt install emboss
- Windows: Download binaries from EMBOSS and MMSeqs2-latest
- Mac:
  sudo port install emboss
  brew install mmseqs2
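After installation, a quick way to confirm that the external binaries are visible is a small Python check like the one below. This is only an illustrative sketch: mmseqs and needle are the executables shipped by MMseqs2 and EMBOSS respectively, and are not part of Hestia itself.
import shutil
# Check that the external alignment tools are on the PATH
for binary in ('mmseqs', 'needle'):
    path = shutil.which(binary)
    print(f"{binary}: {'found at ' + path if path else 'NOT found'}")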
Documentation
1. DatasetGenerator
The HestiaDatasetGenerator allows for the easy generation of training/validation/evaluation partitions at different similarity thresholds, enabling the estimation of model generalisation capabilities. It also allows for the calculation of the ABOID (Area Between the Out-of-distribution similarity-performance curve and the In-Distribution performance).
from hestia.dataset_generator import HestiaDatasetGenerator, SimilarityArguments
import pandas as pd
# Load the dataset; it needs a column with the entities (sequence, SMILES, or path to structure)
df = pd.read_csv('example.csv')
# Initialise the generator for a DataFrame
generator = HestiaDatasetGenerator(df)
# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)
args = SimilarityArguments(
data_type='protein', field_name='sequence',
similarity_metric='mmseqs2+prefilter', verbose=3,
save_alignment=True
)
# Calculate the similarity
generator.calculate_similarity(args)
# Load pre-calculated similarities
generator.load_similarity(args.filename + '.csv.gz')
# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
threshold_step=0.05,
test_size=0.2, valid_size=0.1)
# Save partitions
generator.save_precalculated('precalculated_partitions.gz')
# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')
# Training code
# ...
# Calculate ABOID
generator.calculate_aboid(results, 'test_mcc')
# Plot ABOID
generator.plot_aboid(results, 'test_mcc')
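The results object passed to calculate_aboid holds the model's performance at each similarity threshold produced by the partitioning step; its exact expected structure is described in the documentation page. Purely as an illustrative sketch (the layout and the metric values below are arbitrary placeholders, not the library's required format), such a table could look like this:
import pandas as pd
# Hypothetical per-threshold performance table; thresholds mirror the
# calculate_partitions call above, values are placeholders only.
results = pd.DataFrame({
    'threshold': [0.3, 0.35, 0.4, 0.45, 0.5],
    'test_mcc':  [0.55, 0.58, 0.62, 0.66, 0.70],
})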
2. Similarity calculation
Calculating pairwise similarity between the entities within a DataFrame df_query, or between two DataFrames df_query and df_target, can be achieved through the calculate_similarity function:
from hestia.similarity import calculate_similarity
import pandas as pd
df_query = pd.read_csv('example.csv')
# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.
sim_df = calculate_similarity(df_query, species='protein', similarity_metric='mmseqs+prefilter',
field_name='sequence')
More details about similarity calculation can be found in the Similarity calculation documentation.
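For the two-DataFrame case mentioned above, a minimal sketch would look like the following. It assumes df_target is passed alongside df_query with the same column layout, and the file name reference.csv is purely hypothetical; check the documentation for the exact signature.
from hestia.similarity import calculate_similarity
import pandas as pd
# Query and target sets sharing a `sequence` column
df_query = pd.read_csv('example.csv')
df_target = pd.read_csv('reference.csv')  # hypothetical second dataset
sim_df = calculate_similarity(df_query, df_target, species='protein',
                              similarity_metric='mmseqs+prefilter',
                              field_name='sequence')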
3. Clustering
Clustering the entities within a DataFrame df can be achieved through the generate_clusters function:
from hestia.similarity import calculate_similarity
from hestia.clustering import generate_clusters
import pandas as pd
df = pd.read_csv('example.csv')
sim_df = calculate_similarity(df, species='protein', similarity_metric='mmseqs+prefilter',
field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
cluster_algorithms='CDHIT')
There are three clustering algorithms currently supported: CDHIT, greedy_cover_set, or connected_components. More details about clustering can be found in the Clustering documentation.
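To use one of the graph-based alternatives instead of CDHIT, only the algorithm name changes. Reusing df and sim_df from the snippet above, and assuming the same keyword arguments, a sketch would be:
# Cluster by connected components of the similarity graph instead of CDHIT
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithms='connected_components')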
4. Partitioning
Partitioning the entities within a DataFrame df into training and evaluation subsets can be achieved through 4 different functions: ccpart, graph_part, reduction_partition, and random_partition. An example of how ccpart would be used is:
from hestia.partition import ccpart
import pandas as pd
df = pd.read_csv('example.csv')
train, test = ccpart(df, species='protein', similarity_metric='mmseqs+prefilter',
field_name='sequence', threshold=0.3, test_size=0.2)
train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
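Since train and test are row indices into df (as used with df.iloc above), the resulting subsets can be written out with standard pandas for later training and evaluation runs, for example:
# Persist the partitioned subsets
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)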
License
Hestia is open-source software licensed under the MIT License. Check the details in the LICENSE file.