Suite of tools for analysing the independence between training and evaluation biosequence datasets and for generating new hold-out partitions that evaluate generalisation
- Documentation: https://ibm.github.io/Hestia-OOD
- Source Code: https://github.com/IBM/Hestia-OOD
- Webserver: http://peptide.ucd.ie/Hestia
Installation
Installing in a conda environment is recommended. To create the environment, run:
conda create -n autopeptideml python
conda activate autopeptideml
1. Python Package
1.1. From PyPI
pip install hestia-ood
1.2. Directly from source
pip install git+https://github.com/IBM/Hestia-OOD
2. Optional dependencies
2.1. Molecular similarity
RDKit is required for calculating molecular similarities:
pip install rdkit
2.2. Sequence alignment
To use MMSeqs2 as the alignment algorithm, install it in the environment:
conda install -c bioconda mmseqs2
To use Needleman-Wunsch:
conda install -c bioconda emboss
If you are not installing in a conda environment, check the installation instructions for your platform:
- Linux:
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvfz mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
sudo apt install emboss
- Windows: Download binaries from EMBOSS and MMSeqs2-latest
- Mac:
sudo port install emboss
brew install mmseqs2
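MMSeqs2 and EMBOSS are external binaries rather than Python packages, so it can be worth confirming that they are visible on PATH before running an alignment. The helper below is a minimal sketch and is not part of Hestia; the function name find_tools is hypothetical, and the executable names mmseqs and needle (the EMBOSS Needleman-Wunsch program) are assumptions about a standard install.

```python
import shutil


def find_tools(names=("mmseqs", "needle")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Report which of the assumed executables can be found
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

If a tool is reported as missing after installation, the usual cause is that the install directory was not added to PATH in the current shell.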
Documentation
1. DatasetGenerator
The HestiaDatasetGenerator allows for the easy generation of training/validation/evaluation partitions at different similarity thresholds, enabling the estimation of model generalisation capabilities. It also allows for the calculation of the ABOID (area between the out-of-distribution similarity-performance curve and the in-distribution performance).
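from hestia.dataset_generator import HestiaDatasetGenerator, SimilarityArguments
import pandas as pd
# Load the dataset; it needs a column describing the entities (here, 'sequence')
df = pd.read_csv('example.csv')
# Initialise the generator for a DataFrame
generator = HestiaDatasetGenerator(df)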
# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)
args = SimilarityArguments(
    data_type='protein', field_name='sequence',
    similarity_metric='mmseqs2+prefilter', verbose=3,
    save_alignment=True
)
# Calculate the similarity
generator.calculate_similarity(args)
# Alternatively, load pre-calculated similarities instead of recalculating them
generator.load_similarity(args.filename + '.csv.gz')
# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)
# Save partitions
generator.save_precalculated('precalculated_partitions.gz')
# Load pre-calculated partitions (for example, in a later session)
generator.from_precalculated('precalculated_partitions.gz')
# Training code
# ...
# Calculate ABOID
generator.calculate_aboid(results, 'test_mcc')
# Plot ABOID
generator.plot_aboid(results, 'test_mcc')
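To make the ABOID description concrete, the sketch below integrates the gap between a single in-distribution score and the out-of-distribution scores measured at each similarity threshold, using the trapezoid rule. It is an illustration under stated assumptions and not Hestia's implementation: the function name area_between is hypothetical, and any normalisation or sign convention that calculate_aboid applies may differ.

```python
def area_between(thresholds, ood_scores, id_score):
    """Trapezoidal area between a constant in-distribution score and the
    out-of-distribution scores measured at each similarity threshold."""
    # Sort by threshold so the integration runs left to right
    points = sorted(zip(thresholds, ood_scores))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Average gap over the interval times the interval width
        area += (x1 - x0) * ((id_score - y0) + (id_score - y1)) / 2.0
    return area
```

Under this convention, a model whose out-of-distribution performance matches its in-distribution performance at every threshold yields an area of zero, and larger values indicate a larger drop in performance on dissimilar data.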
2. Similarity calculation
Calculating pairwise similarity between the entities within a DataFrame df_query or between two DataFrames df_query and df_target can be achieved through the calculate_similarity function:
from hestia.similarity import calculate_similarity
import pandas as pd
df_query = pd.read_csv('example.csv')
# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.
sim_df = calculate_similarity(df_query, species='protein', similarity_metric='mmseqs+prefilter',
                              field_name='sequence')
More details about similarity calculation can be found in the Similarity calculation documentation.
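As a rough illustration of what a pairwise similarity table contains, the sketch below scores every pair of sequences and keeps those above a threshold. It does not use Hestia or MMSeqs2: difflib.SequenceMatcher from the standard library is swapped in as a crude stand-in for alignment identity, the function name toy_pairwise_similarity is hypothetical, and the query/target/metric keys are an assumed layout rather than a documented one.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_pairwise_similarity(sequences, threshold=0.0):
    """Return one row per pair of sequences whose similarity is >= threshold."""
    rows = []
    for i, j in combinations(range(len(sequences)), 2):
        # ratio() is 2 * matches / total length; it is NOT an alignment identity
        matcher = SequenceMatcher(None, sequences[i], sequences[j], autojunk=False)
        score = matcher.ratio()
        if score >= threshold:
            rows.append({"query": i, "target": j, "metric": score})
    return rows
```

Comparing all pairs scales quadratically with the number of sequences, which is presumably why a prefiltered aligner is the metric used in the examples here.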
3. Clustering
Clustering the entities within a DataFrame df can be achieved through the generate_clusters function:
from hestia.similarity import calculate_similarity
from hestia.clustering import generate_clusters
import pandas as pd
df = pd.read_csv('example.csv')
sim_df = calculate_similarity(df, species='protein', similarity_metric='mmseqs+prefilter',
                              field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithm='CDHIT')
There are three clustering algorithms currently supported: CDHIT, greedy_cover_set, or connected_components. More details about clustering can be found in the Clustering documentation.
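The idea behind connected-components clustering can be sketched with a small union-find: any two entities linked by a pair above the similarity threshold end up in the same cluster, directly or through intermediates. This is a generic illustration, not the code behind generate_clusters; the function name connected_components and its inputs (an item count plus a list of index pairs) are hypothetical.

```python
def connected_components(n_items, pairs):
    """Label each item with the smallest index in its connected component."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the root, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one so labels stay minimal
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_items)]
```

Items that appear in no pair keep their own index as label, so they form singleton clusters.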
4. Partitioning
Partitioning the entities within a DataFrame df into training and evaluation subsets can be achieved through four different functions: ccpart, graph_part, reduction_partition, and random_partition. An example of how ccpart would be used is:
from hestia.partition import ccpart
import pandas as pd
df = pd.read_csv('example.csv')
train, test = ccpart(df, species='protein', similarity_metric='mmseqs+prefilter',
                     field_name='sequence', threshold=0.3, test_size=0.2)
train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
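One way to picture a cluster-aware split is to keep every cluster whole and fill the evaluation set with the smallest clusters until the requested size is reached. The sketch below does exactly that; it is an assumption about the general approach and not a reproduction of ccpart, and the function name partition_by_cluster is hypothetical.

```python
from collections import defaultdict


def partition_by_cluster(labels, test_size=0.2):
    """Split item indices into train/test without ever splitting a cluster."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    target = test_size * len(labels)
    train, test = [], []
    # Visit the smallest clusters first; they go to test while they still fit
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)
```

The returned index lists can be used with df.iloc in the same way as the train and test outputs above. Because clusters are never split, the evaluation set may end up smaller than test_size when the remaining clusters are too large to fit.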
License
Hestia is open-source software licensed under the MIT License. Check the details in the LICENSE file.