Suite of tools for analysing the independence between training and evaluation biosequence datasets and for generating new generalisation-evaluating hold-out partitions

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Hestia-GOOD

Computational tool for generating generalisation-evaluating evaluation sets.

Documentation: https://ibm.github.io/Hestia-OOD
Source Code: https://github.com/IBM/Hestia-OOD
Paper pre-print: https://www.biorxiv.org/content/10.1101/2024.03.14.584508

Table of Contents

Installation Guide
Documentation
Examples
License

Installation

Installing in a conda environment is recommended. To create and activate the environment, run:

conda create -n hestia python
conda activate hestia

1. Python Package

1.1. From PyPI

pip install hestia-ood

1.2. Directly from source

pip install git+https://github.com/IBM/Hestia-OOD

3. Optional dependencies

3.1. Molecular similarity

RDKit is a dependency necessary for calculating molecular similarities:

pip install rdkit

3.2. Sequence alignment

To use MMSeqs2 (https://github.com/steineggerlab/mmseqs2), install the build that matches your system:

# static build with AVX2 (fastest) (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# static build with SSE4.1  (check using: cat /proc/cpuinfo | grep sse4)
wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# static build with SSE2 (slowest, for very old systems)  (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

# MacOS
brew install mmseqs2
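
The three Linux builds above differ only in the CPU instruction set they require. As a minimal sketch, the manual `grep` checks from the comments can be automated as shown below. The helper name `pick_mmseqs_build` is hypothetical and not part of MMSeqs2 or Hestia; it only parses text in the format of `/proc/cpuinfo` and returns the suffix used in the archive names listed above.

```python
def pick_mmseqs_build(cpuinfo_text: str) -> str:
    """Return 'avx2', 'sse41' or 'sse2' based on the CPU flags found.

    Hypothetical helper: it mirrors the manual `grep` checks and the
    suffixes of the archive names listed above.
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux the relevant line looks like "flags : fpu vme ... avx2"
        if line.lower().startswith("flags"):
            flags.update(line.split(":", 1)[-1].split())
    if "avx2" in flags:
        return "avx2"
    if "sse4_1" in flags:  # spelling used by the kernel for SSE4.1
        return "sse41"
    if "sse2" in flags:
        return "sse2"
    raise RuntimeError("No supported instruction set found in CPU flags")


# Sample text in the format of /proc/cpuinfo (made up for illustration)
sample = "processor : 0\nflags : fpu sse sse2 sse4_1 sse4_2 avx avx2\n"
build = pick_mmseqs_build(sample)
print(f"mmseqs-linux-{build}.tar.gz")
```

On a real Linux machine the text would come from reading `/proc/cpuinfo`; the file does not exist on macOS or Windows.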

To use Needleman-Wunsch, install EMBOSS through either conda or apt:

conda install -c bioconda emboss

sudo apt install emboss

Windows: Download binaries from EMBOSS and MMSeqs2-latest

3.3. Structure alignment

To use Foldseek (https://github.com/steineggerlab/foldseek):

# Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

# Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

# Linux ARM64 build
wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

# MacOS
wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
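
After installing the optional dependencies from sections 3.2 and 3.3, it can be useful to confirm that the executables are reachable from the current `PATH`. The sketch below uses only the standard library. The executable names are assumptions: `mmseqs` and `foldseek` follow from the `bin/` directories exported above, and `needle` is the EMBOSS program for Needleman-Wunsch alignment. The function `find_optional_tools` is illustrative and not part of Hestia.

```python
import shutil

# Executable names are assumptions (see the paragraph above)
OPTIONAL_TOOLS = {
    "MMSeqs2": "mmseqs",
    "EMBOSS needle": "needle",
    "Foldseek": "foldseek",
}


def find_optional_tools(tools=OPTIONAL_TOOLS):
    """Map each tool label to its resolved path, or None if not on PATH."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


for label, path in find_optional_tools().items():
    print(f"{label}: {path or 'not found on PATH'}")
```

A `None` entry usually means the `export PATH=...` line was run in a different shell session, since that change does not persist unless it is added to the shell profile.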

Documentation

1. DatasetGenerator

The HestiaDatasetGenerator allows for the easy generation of training/validation/evaluation partitions at different similarity thresholds, enabling the estimation of model generalisation capabilities. It also allows for the calculation of the ABOID (Area Between the similarity-performance curve (Out-of-distribution) and the In-Distribution performance).

from hestia.dataset_generator import HestiaDatasetGenerator, SimilarityArguments

# Initialise the generator for a pandas DataFrame `df` that has already been loaded
generator = HestiaDatasetGenerator(df)

# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)

# Similarity arguments for protein similarity
prot_args = SimilarityArguments(
    data_type='sequence', field_name='sequence',
    alignment_algorithm='mmseqs2+prefilter', verbose=3
)

# Similarity arguments for molecular similarity
mol_args = SimilarityArguments(
    data_type='small molecule', field_name='SMILES',
    fingerprint='mapc', radius=2, bits=2048
)

# Calculate the similarity
generator.calculate_similarity(prot_args)

# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)

# Save partitions
generator.save_precalculated('precalculated_partitions.gz')

# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')

# Training code

for threshold, partition in generator.get_partitions():
    train = df.iloc[partition['train']]
    valid = df.iloc[partition['valid']]
    test = df.iloc[partition['test']]

# ...

# Calculate AU-GOOD
generator.calculate_augood(results, 'test_mcc')

# Plot GOOD
generator.plot_good(results, 'test_mcc')

# Compare two models
results = {'model A': [values_A], 'model B': [values_B]}
generator.compare_models(results, statistical_test='wilcoxon')
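
As a rough illustration of the area-based idea described at the start of this section, the sketch below integrates the gap between a constant in-distribution score and an out-of-distribution similarity-performance curve with the trapezoidal rule. This is an assumption about how such an area could be computed, not Hestia's implementation of `calculate_augood`; the function name `area_between_id_and_ood` and all numbers are made up for illustration.

```python
def area_between_id_and_ood(thresholds, ood_scores, id_score):
    """Trapezoidal area between a flat ID score and the OOD curve.

    thresholds: increasing similarity thresholds (x axis)
    ood_scores: model performance at each threshold (y axis)
    id_score:   performance on an in-distribution (e.g. random) split
    """
    if len(thresholds) != len(ood_scores) or len(thresholds) < 2:
        raise ValueError("Need at least two (threshold, score) pairs")
    area = 0.0
    points = list(zip(thresholds, ood_scores))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        gap0, gap1 = id_score - y0, id_score - y1
        area += 0.5 * (gap0 + gap1) * (x1 - x0)  # trapezoid on this segment
    return area


# Illustrative values only: performance is lower at lower similarity thresholds
thresholds = [0.3, 0.5, 0.7, 0.9]
ood_scores = [0.40, 0.50, 0.60, 0.70]
area = area_between_id_and_ood(thresholds, ood_scores, id_score=0.70)
print(round(area, 3))
```

Under this reading, a smaller area means the model loses less performance as the test set becomes less similar to the training set.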

2. Similarity calculation

Calculating pairwise similarity between the entities within a DataFrame df_query, or between two DataFrames df_query and df_target, can be achieved through the functions in hestia.similarity, such as sequence_similarity_mmseqs:

from hestia.similarity import sequence_similarity_mmseqs
import pandas as pd

df_query = pd.read_csv('example.csv')

# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.

sim_df = sequence_similarity_mmseqs(df_query, field_name='sequence', prefilter=True)

More details about similarity calculation can be found in the Similarity calculation documentation.
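
To make the shape of a pairwise similarity table concrete, here is a minimal sketch that uses only the standard library. It relies on `difflib.SequenceMatcher` as a stand-in similarity measure, which is not what MMSeqs2 computes, and the `query`/`target`/`metric` keys are an assumed layout rather than a documented guarantee about `sim_df`. The function `pairwise_similarity` is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(sequences, min_threshold=0.0):
    """Return a list of {'query', 'target', 'metric'} rows.

    'query' and 'target' are positional indices into `sequences`;
    'metric' is a similarity ratio in [0, 1].
    Pairs scoring below `min_threshold` are dropped.
    """
    rows = []
    for i, j in combinations(range(len(sequences)), 2):
        matcher = SequenceMatcher(None, sequences[i], sequences[j], autojunk=False)
        ratio = matcher.ratio()
        if ratio >= min_threshold:
            rows.append({"query": i, "target": j, "metric": ratio})
    return rows


# Two near-identical toy sequences and one unrelated sequence
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSGGGGS"]
rows = pairwise_similarity(seqs, min_threshold=0.5)
for row in rows:
    print(row)
```

Storing only the pairs above a minimum threshold keeps the table sparse, which matters because the number of pairs grows quadratically with the number of entities.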

3. Clustering

Clustering the entities within a DataFrame df can be achieved through the generate_clusters function:

from hestia.similarity import sequence_similarity_mmseqs
from hestia.clustering import generate_clusters
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithm='CDHIT')

There are three clustering algorithms currently supported: CDHIT, greedy_cover_set, and connected_components. More details about clustering can be found in the Clustering documentation.
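
To illustrate the idea behind connected-components clustering, the sketch below groups items that are linked, directly or through a chain, by pairs whose similarity exceeds a threshold. It is a generic union-find implementation written for this example, not the code behind `generate_clusters`; the function name and the input format (a list of index pairs) are assumptions.

```python
def connected_components(n_items, similar_pairs):
    """Cluster items 0..n_items-1 given (i, j) pairs considered similar.

    Returns a list where position i holds the cluster label of item i.
    Uses a union-find (disjoint set) structure with path halving.
    """
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in similar_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i  # merge the two clusters

    # Relabel roots as consecutive cluster ids in order of first appearance
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n_items)]


# Items 0-1-2 are chained by similarity, 3-4 form a second cluster, 5 is a singleton
pairs = [(0, 1), (1, 2), (3, 4)]
print(connected_components(6, pairs))  # prints [0, 0, 0, 1, 1, 2]
```

Note that items 0 and 2 end up in the same cluster even though they are not directly similar, because both are similar to item 1.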

4. Partitioning

Partitioning the entities within a DataFrame df into training and evaluation subsets can be achieved through four different functions: ccpart, graph_part, reduction_partition, and random_partition. An example of how ccpart would be used is:

from hestia.similarity import sequence_similarity_mmseqs
from hestia.partition import ccpart
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
train, test, partition_labs = ccpart(df, threshold=0.3, test_size=0.2, sim_df=sim_df)

train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
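
The general idea of a similarity-aware split can be sketched without Hestia: group items into connected components of the "similar above threshold" graph, then move whole components into the test set so that no similar pair ends up on both sides. The code below is a simplified illustration under that assumption and is not the implementation of `ccpart`; the function name `component_partition`, the smallest-first heuristic, and the pair-list input are all choices made for this example.

```python
from collections import defaultdict


def component_partition(n_items, similar_pairs, test_size=0.2):
    """Split indices so that no similar pair straddles train and test.

    Whole connected components are moved to the test set, smallest first,
    as long as they fit within roughly `test_size` of the items.
    """
    neighbours = defaultdict(set)
    for i, j in similar_pairs:
        neighbours[i].add(j)
        neighbours[j].add(i)

    # Collect connected components with an iterative depth-first search
    seen, components = set(), []
    for start in range(n_items):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        components.append(sorted(component))

    target = round(test_size * n_items)
    test = []
    for component in sorted(components, key=len):
        if len(test) + len(component) > target:
            continue  # this component would overshoot the requested test size
        test.extend(component)
    test_set = set(test)
    train = [i for i in range(n_items) if i not in test_set]
    return train, sorted(test)


# Pairs above the similarity threshold (illustrative indices)
pairs = [(0, 1), (1, 2), (3, 4), (5, 6), (7, 8)]
train_idx, test_idx = component_partition(9, pairs, test_size=0.25)
print(train_idx, test_idx)
```

Because components are never split, the resulting test set can be smaller than requested when the remaining components are all too large to fit.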

License

Hestia is open-source software licensed under the MIT License. Check the details in the LICENSE file.

Project details

Release history

Version	Release date
0.0.36 (this version)	Jan 13, 2025
0.0.35	Dec 5, 2024
0.0.34	Nov 11, 2024
0.0.33	Nov 11, 2024
0.0.32	Nov 11, 2024
0.0.31	Oct 28, 2024
0.0.30	Oct 7, 2024
0.0.29	Oct 2, 2024
0.0.28	Oct 2, 2024
0.0.27	Sep 24, 2024
0.0.26	Sep 19, 2024
0.0.25	Sep 19, 2024
0.0.24	Sep 19, 2024
0.0.23	Sep 18, 2024
0.0.22	Sep 17, 2024
0.0.21	Sep 13, 2024
0.0.19	Aug 20, 2024
0.0.18	Aug 14, 2024
0.0.17	Aug 2, 2024
0.0.16	Jul 25, 2024
0.0.15	Jul 17, 2024
0.0.14	Jul 17, 2024
0.0.12	Jul 17, 2024
0.0.11	May 24, 2024
0.0.9	May 17, 2024
0.0.8	May 3, 2024
0.0.6	Mar 13, 2024
0.0.5	Mar 12, 2024
0.0.4	Mar 12, 2024
0.0.3	Mar 6, 2024
0.0.2	Mar 6, 2024

Download files

Download the file for your platform.

Source Distribution

hestia_ood-0.0.36.tar.gz (33.2 kB)

Uploaded Jan 13, 2025 Source

Built Distribution

hestia_ood-0.0.36-py3-none-any.whl (32.3 kB)

Uploaded Jan 13, 2025 Python 3

File details

Details for the file hestia_ood-0.0.36.tar.gz.

File metadata

Download URL: hestia_ood-0.0.36.tar.gz
Upload date: Jan 13, 2025
Size: 33.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.9.21

File hashes

Hashes for hestia_ood-0.0.36.tar.gz
Algorithm	Hash digest
SHA256	`a4dc021f6f696095ca149c802afbb12dba85981b92c4ebbdb4c72092a59efc70`
MD5	`8080407092c07a2fa6e89c94bd0811ac`
BLAKE2b-256	`8f9abcad8d715a2aaf04110973e140b7b8a9d3241bcf0e21682deb166948f0ce`

File details

Details for the file hestia_ood-0.0.36-py3-none-any.whl.

File metadata

Download URL: hestia_ood-0.0.36-py3-none-any.whl
Upload date: Jan 13, 2025
Size: 32.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.9.21

File hashes

Hashes for hestia_ood-0.0.36-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7675162937e08c931fd61e1608b1e74ca6936705f1cf462c5c81d2df09552371`
MD5	`2f6e08d0889f5bb5fd1c66c77ff879ad`
BLAKE2b-256	`cabdf3800b2ee1ecfe813c4fa457ee843eede55b45b8632f1fd07abc2288c8cf`
