No project description provided

These details have not been verified by PyPI

Project links

Project description

Squidly

Squidly_mascot

Squidly, is a tool that employs a biologically informed contrastive learning approach to accurately predict catalytic residues from enzyme sequences. We offer Squidly as ensembled with Blast to achieve high accuracy at low and high sequence homology settings.

If you use squidly in your work please cite our preprint: https://www.biorxiv.org/content/10.1101/2025.06.13.659624v1

Also if you have any issues installing, please post an issue! We have tested this on ubuntu.

Requirements

Squidly is dependant on the ESM2 3B or 15B protein language model (so requires GPU). Running Suidly will automatically attempt to download each model. The Smaller 3B model is lighter, runs faster and requires less VRAM. The 3B and 15B models expect roughly 25GB and 74GB of VRAM, respectively. Our tests on 100 sequences with an average length of 400+ took about 40 seconds and 3 minutes for the 3B and 15B model.

Squidly can only predict sequences of less than 1024 residues!

Currently we expect GPU access but if you require a CPU only version please let us know and we can update this!

Note you can run the blast version without GPU, just run the --database reviewed_sprot_08042025.csv version described below. Then BLAST will run before squidly and you can use these files.

Installation

conda create --name squidly
conda activate squidly
conda install -c conda-forge huggingface_hub
conda install -c bioconda -c conda-forge diamond -y
conda update diamond
pip install squidly
squidly install
wget https://raw.githubusercontent.com/WRiegs/Squidly/main/data/reviewed_sprot_08042025.csv.zip
unzip reviewed_sprot_08042025.csv.zip
conda install sbl::clustalomega

Running squidly install should automatically download all models from huggingface. Now you can run squidly (see Usage below).

Note if you get the below error:

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject you may need to update numpy and pandas.

Usage

See the colab notebook (SquidlyExampleColab.ipynb) for an example, it also shows how the input files are formatted.

By default squidly runs the ensembled squidly model. It's optimised to run as fast as the single model.

For example to run the 3B model with a fasta file

squidly run example.fasta esm2_t36_3B_UR50D

In the folder you have run that, you can now see the generated files from squidly e.g. squidly_input_fasta_squidly_ensemble_predictions_only.csv

Squidly can also be further ensembled with BLAST (you need to pass the database as well)

squidly run example.fasta esm2_t36_3B_UR50D output_folder/ --database reviewed_sprot_08042025.csv

Where reviewed_sprot_08042025.csv is the example database (i.e. a csv file with the following columns)

You can see ours which is zipped in the data folder (you will need to unzip to use this.)

Entry	Sequence	Residue
A0A009IHW8	MSLEQKKGADIIS	207
A0A023I7E1	MRFQVIVAAATITMIY	499\|577\|581
A0A024B7W1	MKNPKKKSGGFRIV	1552\|1576\|1636\|2580\|2665\|2701\|2737
A0A024RXP8	MYRKLAVISAFL	228\|233

A threshold can be selected for sequence identity between the query and the blast database, such that Squidly will be used under a certain threshold. This is because BLAST tends to outperform all currently available ML models at high sequence identities

Running a single Squidly model (non ensembled)

squidly run example.fasta esm2_t36_3B_UR50D output_folder/ --single-model --cr-model-as squidly/models/3B/CataloDB_esm2_t36_3B_UR50D_CR_1.pt --lstm-model-as squidly/models/3B/CataloDB_esm2_t36_3B_UR50D_LSTM_1.pth

Squidly args

 squidly --help

 Usage: squidly [OPTIONS] COMMAND [ARGS]...

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.                                                                                              │
│ --show-completion             Show completion for the current shell, to copy it or customize the installation.                                                       │
│ --help                        Show this message and exit.                                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ install   Install the models for the package.                                                                                                                        │
│ run       Find catalytic residues using Squidly and BLAST.                                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

squidly run --help
 Usage: squidly run [OPTIONS] FASTA_FILE ESM2_MODEL [OUTPUT_FOLDER] [RUN_NAME]

 Find catalytic residues using Squidly and BLAST.
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    fasta_file         TEXT             Full path to query fasta (note have simple IDs otherwise we'll remove all funky characters.) [required]                                                                                                                                                                                                                       │
│ *    esm2_model         TEXT             Name of the esm2_model, esm2_t36_3B_UR50D or esm2_t48_15B_UR50D [required]                                                                                                                                                                                                                                                    │
│      output_folder      [OUTPUT_FOLDER]  Where to store results (full path!) [default: Current Directory]                                                                                                                                                                                                                                                              │
│      run_name           [RUN_NAME]       Name of the run [default: squidly]                                                                                                                                                                                                                                                                                            │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --single-model       --no-single-model             Whether or not to use single model instead of the ensemble. We recommend the ensemble. It is faster than the single model version. [default: no-single-model]                                                                                                                                                       │
│ --model-folder                            TEXT     Full path to the model folder.                                                                                                                                                                                                                                                                                      │
│ --database                                TEXT     Full path to database csv (if you want to do the ensemble), needs 3 columns: 'Entry', 'Sequence', 'Residue' where residue is a | separated list of residues. See default DB provided by Squidly. [default: None]                                                                                                    │
│ --cr-model-as                             TEXT     Contrastive learning model for the catalytic residue prediction when not using the ensemble. Ensure it matches the esm model.                                                                                                                                                                                       │
│ --lstm-model-as                           TEXT     LSTM model for the catalytic residue prediction when not using the ensemble. Ensure it matches the esm model.                                                                                                                                                                                                       │
│ --as-threshold                            FLOAT    When using the single squidly models, you must specify a prediction threshold. We found >0.9 to work best in practice, depending on the model. [default: 0.95]                                                                                                                                                      │
│ --blast-threshold                         FLOAT    Sequence identity with which to use Squidly over BLAST defualt 0.3 (meaning for seqs with < 0.3 identity in the DB use Squidly). [default: 0.3]                                                                                                                                                                     │
│ --chunk                                   INTEGER  Max chunk size for the dataset. This is useful for when running Squidly on >50000 sequences as memory is storing intermediate results during inference. [default: 0]                                                                                                                                                │
│ --mean-prob                               FLOAT    Mean probability threshold used in the ensemble. [default: 0.6]                                                                                                                                                                                                                                                     │
│ --mean-var                                FLOAT    Mean variance cutoff used in the ensemble. [default: 0.225]                                                                                                                                                                                                                                                         │
│ --filter-blast       --no-filter-blast             Only run on the ones that didn't have a BLAST residue. [default: filter-blast]                                                                                                                                                                                                                                      │
│ --help                                             Show this message and exit.                                                                                                                                                                                                                                                                                         │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Threshold selection:

The below figure showcases Squidly's performance on the CataloDB benchmark at varying thresholds for probability and variance in the ensemble model.

Optimal thresholds for specific uses may vary. Lower probability thresholds have been found to work in practice when preicting certain EC numbers: perEC_threshold_figure

Data Availability

All datasets used in the paper are available here https://zenodo.org/records/15541320.

Reproducing Squidly

We developed reproduction scripts for each benchmark training/test scenario.

AEGAN and Common Benchmarks: Trained on Uni14230 (AEGAN), and tested on Uni3175 (AEGAN), HA_superfamily, NN, PC, and EF datasets.
CataloDB: Trained on a curated training and test set with structural/sequence ID filtering to less than 30% identity.

The corresponding scripts can be found in the reproduction_run directory.

Before running them, download the datasets.zip file from zenodo and place them and unzip it in the base directory of Squidly.

Datasets: https://zenodo.org/records/15541320

Model weights: https://huggingface.co/WillRieger/Squidly

python reproduction_scripts/reproduce_squidly_CataloDB.py --scheme 2 --sample_limit 16000 --esm2_model esm2_t36_3B_UR50D --reruns 1

You must choose the pair scheme for the Squidly models:

Scheme 2 and 3 had the sample limit parameter set to 16000, and scheme 1 at 4000000.

You must also correctly specify the ESM2 model used. You can either use esm2_t36_3B_UR50D or esm2_t48_15B_UR50D. The scripts will automatically download these if specified like so. You may also instead provide your own path to the models if you have them downloaded somewhere.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Jan 10, 2026

0.0.9

Nov 11, 2025

0.0.8

Nov 11, 2025

0.0.7

Oct 24, 2025

0.0.6

Oct 9, 2025

0.0.5

Aug 9, 2025

0.0.4

Aug 9, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

squidly-0.1.0.tar.gz (23.3 kB view details)

Uploaded Jan 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

squidly-0.1.0-py3-none-any.whl (25.7 kB view details)

Uploaded Jan 10, 2026 Python 3

File details

Details for the file squidly-0.1.0.tar.gz.

File metadata

Download URL: squidly-0.1.0.tar.gz
Upload date: Jan 10, 2026
Size: 23.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for squidly-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`624865f66b3f9b172d33627ddfe9fa9e98d55b6b359fd628f37e358608630893`
MD5	`611da6af369e0e6cd549c6ed9a807270`
BLAKE2b-256	`894ea358c5cf284a74f0c7120718f84a8cfb66f694881ea0d752def2d56f9c97`

See more details on using hashes here.

File details

Details for the file squidly-0.1.0-py3-none-any.whl.

File metadata

Download URL: squidly-0.1.0-py3-none-any.whl
Upload date: Jan 10, 2026
Size: 25.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for squidly-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`722e16a503260e756a0bc6f8d68effe6ee97ca3b38485a18022770ff89044e20`
MD5	`3ea08a9a53fc35a70bcd9a2d85a3b554`
BLAKE2b-256	`b50ce0bad2cc13354c1ffd902a5e23b6b6e4b2c03db7e570234a74918221dca5`

See more details on using hashes here.

squidly 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Squidly

Requirements

Installation

Usage

Squidly args

Threshold selection:

Data Availability

Reproducing Squidly

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes