No project description provided

These details have not been verified by PyPI

Project links

Project description

A pipeline for enzyme engineering

Enzyme-tk is a collection of tools for enzyme engineering, setup as interoperable modules that act on dataframes. These modules are designed to be imported into pipelines for specific function. For this reason, steps as each module is called (e.g. finding similar proteins with BLAST would be considered a step) are designed to be as light as possible. An example of a pipeline is the annotate-e ` pipeline, this acts to annotate a fasta with an ensemble of methods (each is designated as an Enzyme-tk step).

If you have any issues installing, let me know - this has been tested only on Linux/Ubuntu. Please post an issue!

Installation

Install base package to import modules

conda create --name enzymetk python==3.12 -y
pip install enzymetk
# Install torch for your specific cuda version
pip install torch torchvision #--index-url https://download.pytorch.org/whl/cu130

If you're at the bleeding edge, and going to use older models e.g. chemBERTa2 you may need to run

pip uninstall transformers -y
pip install "transformers<5"

For each module run install the first time you're running it

This will install as a venv where possible and conda where the tools don't allow for venvs. See specific tools for info.

bm = BLAST(id_col, seq_col, label_col)
bm.install() # by default will create a venv or if needed a conda env

Note if you want to use your specific environment you can install externally and override the installed venv or conda env e.g.

bm = BLAST(id_col, seq_col, label_col)
bm.conda = 'blast_env' # an already installed env on your computer
bm.venv = None # so it knows to use conda i.e. forces it not to use venv

Modules requiring conda

CREEP [not tested again]
CLEAN [not tested again]
ProteInfer [not tested again]

Modules able to run in venv

BLAST [cpu, tested with both, see notebook]
ChemBERTA [cpu, colab]
Boltz
Chai: conda install -c conda-forge pdbfixer
esm2/3 [cpu, see notebook]
foldseek [tested and works]
ligandmpnn
mmseqs [can get working...]
msa []
reaction_similarity [good, cpu]
rxnfp [needs specific python version so not easy in colab] hence install is with enzymetk install rxnfp requires conda
substrate_similarity [good, cpu]
tree
unimol [good, cpu]

Docko git@github.com:ArianeMora/docko.git ValueError: CCD component ALA not found! boltz predict boltz.fasta --use_msa_server --cache ./mol

srun -p gpu --qos=normal --gres=gpu:1 --pty --mem=64G --time=000:30:00 bash

pipelines: reads --> poreChop --> Flye --> Prokka --> Squidly --> Foldseek --> Boltz --> Chai pipelines: seqs --> BLAST --> Proteinfer --> Foldseek --> MMseqs --> ClustalOmega --> FastTree pipelines: reactions --> rxnFP --> selformer --> uniMol --> chemBERTa2 --> RDkit reaction similarity

Module	Name	Description	Colab ipynb
Metagenomics	PoreChop	Used to filter adapters for nanopore sequences in metagenomics pipeline.	y
Metagenomics	Flye	Used to assemble the metagenomes.	?
Metagenomics	Prokka	Annotation of genes within the genome.	?
Function prediction	Proteinfer	Annotation of genes to function (GO or EC class) using ML.	33
Function prediction	CLEAN	Annotation of genes to EC class using ML.	11
Function prediction	CREEP	Annotation of genes to EC class using ML.	13
Function prediction	Func-e	Annotation of genes to reaction using ML.	This study.
Function prediction	Squidly	Annotation of catalytic residues using ML.	36
Embedding generation	ESM2 & 3	Conversion of amino acid sequence to a numerical embedding using a PLM.	46,47
Embedding generation	RxnFP	Conversion of reaction smiles to a numerical embedding using a language model.	48
Embedding generation	Selformer	Conversion of reaction selfies to a numerical embedding using a language model.	49
Embedding generation	Uni-mol	Conversion of molecule smiles to a numerical embedding using a language model.	50
Embedding generation	ChemBERTa2	Conversion of reaction smiles to a numerical embedding using a language model.	51
Docking	Chai	Diffusion based folding of a protein and ligand.	42
Docking	Boltz	Diffusion based folding of a protein and ligand.	52
Similarity	Diamond	Sequence similarity calculation using basic local alignment search.	53
Similarity	Foldseek	Fast structure similarity search.	54
Similarity	MMseqs	Fast sequence clustering.	55
Docking	StructureZyme	Alignment and calculation of structure metrics.	56
Oligo design	Oligopoolio	Calculation of oligo fragments for gene assembly.	This study.
Sequencing	LevSeq	Sequence verification of protein variants.	34
MSA generation	ClustalOmega	Creation of multiple sequence alignments (MSA).	57
Phylogenetic tree generation	FastTree	Creation of multiple phylogenetic trees.	58

Install only the specific requirements you need (recomended)

For this clone the repo and then install the requirements for the specific modules you use

git clone git@github.com:ArianeMora/enzyme-tk.git
cd enzymetk/conda_envs/ # would recommend looking at thes
# e.g. to install all from within that folder you would do
source install_all.sh

Usage

If you have any issues at all just email me using my caltech email: amora at caltech . edu

This is a work-in progress! e.g. some tools (e.g. proteInfer and CLEAN) require extra data to be downloaded in order to run (like model weights.) I'm working on integrating these atm, buzz me if you need this!

Here are some of the tools that have been implemented to be chained together as a pipeline:

boltz2 mmseqs2
foldseek
diamond
proteinfer
CLEAN
chai
chemBERTa2
SELFormer
rxnfp
clustalomega
CREEP
esm
LigandMPNN
vina
Uni-Mol
fasttree
Porechop
prokka

Things to note

All the tools use the conda env of enzymetk by default.

If you want to use a different conda env, you can do so by passing the env_name argument to the constructor of the step.

For example:

proteinfer = ProteInfer(env_name='proteinfer')

Arguments

All the arguments are passed to the constructor of the step, the ones that are required are passed as arguments to the constructor and the ones that are optional are passed as a list to the args argument, this needs to be a list as one would normally pass arguments to a command line tool.

For example:

proteinfer = ProteInfer(env_name='proteinfer', args=['--num_threads', '10'])

For those wanting to use specific arguments, check the individual tools for specifics.

Steps

The steps are the main building blocks of the pipeline. They are responsible for executing the individual tools.

BLAST

BLAST is a tool for searching a database of sequences for similar sequences. Here you can either pass a database that you have already created or pass the sequences as part of your dataframe and pass the label column (this needs to have two values: reference and query) reference refers to sequences that you want to search against and query refers to sequences that you want to search for.

Note you can install 2 ways, with a conda env by command line:

enzymetk install_diamond

id_col = 'Entry'
seq_col = 'Sequence'
label_col = 'label'
rows = [['AXE2_TALPU', 'query', 'MHSKFFAASLLGLGAAAIPLEGVMEKRSCPAIHVFGARETTASPGYGSSSTVVNGVLSAYPGSTAEAINYPACGGQSSCGGASYSSSVAQGIAAVASAVNSFNSQCPSTKIVLVGYSQGGEIMDVALCGGGDPNQGYTNTAVQLSSSAVNMVKAAIFMGDPMFRAGLSYEVGTCAAGGFDQRPAGFSCPSAAKIKSYCDASDPYCCNGSNAATHQGYGSEYGSQALAFVKSKLG'],
        ['AXE2_TALPU', 'reference', 'MHSKFFAASLLGLGAAAIPLEGVMEKRSCPAIHVFGARETTASPGYGSSSTVVNGVLSAYPGSTAEAINYPACGGQSSCGGASYSSSVAQGIAAVASAVNSFNSQCPSTKIVLVGYSQGGEIMDVALCGGGDPNQGYTNTAVQLSSSAVNMVKAAIFMGDPMFRAGLSYEVGTCAAGGFDQRPAGFSCPSAAKIKSYCDASDPYCCNGSNAATHQGYGSEYGSQALAFVKSKLG'],
        ['AXE2_GEOSE', 'reference', 'MKIGSGEKLLFIGDSITDCGRARPEGEGSFGALGTGYVAYVVGLLQAVYPELGIRVVNKGISGNTVRDLKARWEEDVIAQKPDWVSIMIGINDVWRQYDLPFMKEKHVYLDEYEATLRSLVLETKPLVKGIILMTPFYIEGNEQDPMRRTMDQYGRVVKQIAEETNSLFVDTQAAFNEVLKTLYPAALAWDRVHPSVAGHMILARAFLREIGFEWVRSR'], 
        ['AXE7A_XYLR2', 'referece', 'MFNFAPKQTTEMKKLLFTLVFVLGSMATALAENYPYRADYLWLTVPNHADWLYKTGERAKVEVSFCLYGMPQNVEVAYEIGPDMMPATSSGKVTLKNGRAVIDMGTMKKPGFLDMRLSVDGKYQHHVKVGFSPELLKPYTKNPQDFDAFWKANLDEARKTPVSVSCNKVDKYTTDAFDCYLLKIKTDRRHSIYGYLTKPKKAGKYPVVLCPPGAGIKTIKEPMRSTFYAKNGFIRLEMEIHGLNPEMTDEQFKEITTAFDYENGYLTNGLDDRDNYYMKHVYVACVRAIDYLTSLPDWDGKNVFVQGGSQGGALSLVTAGLDPRVTACVANHPALSDMAGYLDNRAGGYPHFNRLKNMFTPEKVNTMAYYDVVNFARRITCPVYITWGYNDNVCPPTTSYIVWNLITAPKESLITPINEHWTTSETNYTQMLWLKKQVK'], 
        ['A0A0B8RHP0_LISMN', 'reference', 'MKKLLFLGDSVTDAGRDFENDRELGHGYVKIIADQLEQEDVTVINRGVSANRVADLHRRIEADAISLQPDVVTIMIGINDTWFSFSRWEDTSVTAFKEVYRVILNRIKTETNAELILMEPFVLPYPEDRKEWRGDLDPKIGAVRELAAEFGATLIPLDGLMNALAIKHGPTFLAEDGVHPTKAGHEAIASTWLEFTK']]
df = pd.DataFrame(rows, columns=[id_col, label_col, seq_col])
df << (BLAST(id_col, seq_col, label_col) >> Save('tmp/blast_test.pkl'))

ActiveSitePred

ActiveSitePred is a tool for predicting the active site of an enzyme. This returns a dataframe with the active site prediction for each sequence, and the probability of the active site. Note we use a zero index for the active site prediction while UniProt uses a one index.

squidly_dir = '/disk1/share/software/AS_inference/' # This should be where you downloaded the data from zotero, there is a folder in there called AS_inference
num_threads = 1
id_col = 'Entry'
seq_col = 'Sequence'
rows = [['AXE2', 'MKIGSGEKLLFIGDSITDCGRARPEGEGSFGALGTGYVAYVVGLLQAVYPELGIRVVNKGISGNTVRDLKARWEEDVIAQKPDWVSIMIGINDVWRQYDLPFMKEKHVYLDEYEATLRSLVLETKPLVKGIILMTPFYIEGNEQDPMRRTMDQYGRVVKQIAEETNSLFVDTQAAFNEVLKTLYPAALAWDRVHPSVAGHMILARAFLREIGFEWVRSR'], 
        ['H7C0D0', 'XRAHREIKDIFYKAIQKRRQSQEKIDDILQTLLDATYKDGRPLTDDEVAGMLIGLLLAGQHTSSTTSAWMGFFLARDKTLQKKCYLEQKTVCGENLPPLTYDQLKDLNLLDRCIKETLRLRPPIMIMMRMARTPQTVAGYTIPPGHQDNPASGEKFAYVPFGAGRHRCIGENFAYVQIKTIWSTMLRLYEFDLIDGYFPTVNYTTMIHTPENPVIRYKRRSK']]
df = pd.DataFrame(rows, columns=[id_col, seq_col])
print(df)
df << (ActiveSitePred(id_col, seq_col, squidly_dir, num_threads) >> Save('tmp/squidly_as_pred.pkl'))

Boltz2

Boltz2 is a model for predicting structures. Note you need docko installed as I run via that.

Below is an example using boltz with 4 threads, and uses a cofactor (intermediate in this case). Just set to be None for a single substrate version.

import sys
from enzymetk.dock_boltz_step import Boltz
from enzymetk.save_step import Save
import pandas as pd
import os
os.environ['MKL_THREADING_LAYER'] = 'GNU'

output_dir = 'tmp/'
num_threads = 4
id_col = 'Entry'
seq_col = 'Sequence'
substrate_col = 'Substrate'
intermediate_col = 'Intermediate'

rows = [['P0DP23_boltz_8999', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]'], 
        ['P0DP24_boltz_p1', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]'],
        ['P0DP23_boltz_p2', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]'], 
        ['P0DP24_boltz_p3', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]'],
        ['P0DP24_boltz_p4', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC', 'CC1=C(C2=CC3=C(C(=C([N-]3)C=C4C(=C(C(=N4)C=C5C(=C(C(=N5)C=C1[N-]2)C)C=C)C)C=C)C)CCC(=O)[O-])CCC(=O)[O-].[Fe]']]
df = pd.DataFrame(rows, columns=[id_col, seq_col, substrate_col, intermediate_col])
df << (Boltz(id_col, seq_col, substrate_col, intermediate_col, f'{output_dir}', num_threads) >> Save(f'{output_dir}test.pkl'))

Chai

Chai is a tool for predicting the structure of a protein and a ligand, this tool outputs the data to a new folder and creates directories based on the id that is passed. We return the paths to the specific structure for each id in the returned dataframe.

Requres the docko conda environment to be created.

output_dir = 'tmp/'
num_threads = 1
id_col = 'Entry'
seq_col = 'Sequence'
substrate_col = 'Substrate'
rows = [['P0DP23', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC'], 
        ['AXE2', 'MKIGSGEKLLFIGDSITDCGRARPEGEGSFGALGTGYVAYVVGLLQAVYPELGIRVVNKGISGNTVRDLKARWEEDVIAQKPDWVSIMIGINDVWRQYDLPFMKEKHVYLDEYEATLRSLVLETKPLVKGIILMTPFYIEGNEQDPMRRTMDQYGRVVKQIAEETNSLFVDTQAAFNEVLKTLYPAALAWDRVHPSVAGHMILARAFLREIGFEWVRSR', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC']]
df = pd.DataFrame(rows, columns=[id_col, seq_col, substrate_col])
print(df)
df << (Chai(id_col, seq_col, substrate_col, f'{output_dir}', num_threads) >> Save(f'{output_dir}test.pkl'))

ChemBERTa

ChemBERTa2 encodes reactions and SMILES strings into a vector space. Note this requires the base environment, i.e. enzymetk conda env.

from enzymetk.embedchem_chemberta_step import ChemBERT
from enzymetk.save_step import Save

output_dir = 'tmp/'
num_threads = 1
id_col = 'Entry'
seq_col = 'Sequence'
substrate_col = 'Substrate'
rows = [['P0DP23', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC'], 
        ['P0DP24', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC']]
df = pd.DataFrame(rows, columns=[id_col, seq_col, substrate_col])
new_df = (df << (ChemBERT(id_col, substrate_col, num_threads) >> Save(f'{output_dir}chemberta.pkl')))

CLEAN

CLEAN is a tool for predicting the EC number of an enzyme.

output_dir = 'tmp/'
num_threads = 1
id_col = 'Entry'
seq_col = 'Sequence'
substrate_col = 'Substrate'
rows = [['P0DP23', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC'], 
        ['AXE2', 'MKIGSGEKLLFIGDSITDCGRARPEGEGSFGALGTGYVAYVVGLLQAVYPELGIRVVNKGISGNTVRDLKARWEEDVIAQKPDWVSIMIGINDVWRQYDLPFMKEKHVYLDEYEATLRSLVLETKPLVKGIILMTPFYIEGNEQDPMRRTMDQYGRVVKQIAEETNSLFVDTQAAFNEVLKTLYPAALAWDRVHPSVAGHMILARAFLREIGFEWVRSR', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC']]
df = pd.DataFrame(rows, columns=[id_col, seq_col, substrate_col])
# This should be relative to the location of the script if you installed via the install_all.sh script
# Note you need to have downloaded their predictive models (ToDo )
clean_dir = 'software/CLEAN/app/'
df << (CLEAN(id_col, seq_col, clean_dir, num_threads=num_threads) >> Save(f'clean_missing_EC_seqs.pkl'))

ClustalOmega

ClustalOmega is a tool for aligning a set of sequences. This gets installed to the system (expecting a linux machine) and added to the bash path. You need to have installed it first (check out the conda_envs directory in enzymetk.)

from enzymetk.generate_msa_step import ClustalOmega
from enzymetk.save_step import Save
import pandas as pd

id_col = 'Entry'
seq_col = 'Sequence'
label_col = 'label'
rows = [['AXE2_TALPU', 'query', 'MHSKFFAASLLGLGAAAIPLEGVMEKRSCPAIHVFGARETTASPGYGSSSTVVNGVLSAYPGSTAEAINYPACGGQSSCGGASYSSSVAQGIAAVASAVNSFNSQCPSTKIVLVGYSQGGEIMDVALCGGGDPNQGYTNTAVQLSSSAVNMVKAAIFMGDPMFRAGLSYEVGTCAAGGFDQRPAGFSCPSAAKIKSYCDASDPYCCNGSNAATHQGYGSEYGSQALAFVKSKLG'],
        ['AXE2_TALPU', 'reference', 'MHSKFFAASLLGLGAAAIPLEGVMEKRSCPAIHVFGARETTASPGYGSSSTVVNGVLSAYPGSTAEAINYPACGGQSSCGGASYSSSVAQGIAAVASAVNSFNSQCPSTKIVLVGYSQGGEIMDVALCGGGDPNQGYTNTAVQLSSSAVNMVKAAIFMGDPMFRAGLSYEVGTCAAGGFDQRPAGFSCPSAAKIKSYCDASDPYCCNGSNAATHQGYGSEYGSQALAFVKSKLG'],
        ['AXE2_GEOSE', 'reference', 'MKIGSGEKLLFIGDSITDCGRARPEGEGSFGALGTGYVAYVVGLLQAVYPELGIRVVNKGISGNTVRDLKARWEEDVIAQKPDWVSIMIGINDVWRQYDLPFMKEKHVYLDEYEATLRSLVLETKPLVKGIILMTPFYIEGNEQDPMRRTMDQYGRVVKQIAEETNSLFVDTQAAFNEVLKTLYPAALAWDRVHPSVAGHMILARAFLREIGFEWVRSR'], 
        ['AXE7A_XYLR2', 'referece', 'MFNFAPKQTTEMKKLLFTLVFVLGSMATALAENYPYRADYLWLTVPNHADWLYKTGERAKVEVSFCLYGMPQNVEVAYEIGPDMMPATSSGKVTLKNGRAVIDMGTMKKPGFLDMRLSVDGKYQHHVKVGFSPELLKPYTKNPQDFDAFWKANLDEARKTPVSVSCNKVDKYTTDAFDCYLLKIKTDRRHSIYGYLTKPKKAGKYPVVLCPPGAGIKTIKEPMRSTFYAKNGFIRLEMEIHGLNPEMTDEQFKEITTAFDYENGYLTNGLDDRDNYYMKHVYVACVRAIDYLTSLPDWDGKNVFVQGGSQGGALSLVTAGLDPRVTACVANHPALSDMAGYLDNRAGGYPHFNRLKNMFTPEKVNTMAYYDVVNFARRITCPVYITWGYNDNVCPPTTSYIVWNLITAPKESLITPINEHWTTSETNYTQMLWLKKQVK'], 
        ['A0A0B8RHP0_LISMN', 'reference', 'MKKLLFLGDSVTDAGRDFENDRELGHGYVKIIADQLEQEDVTVINRGVSANRVADLHRRIEADAISLQPDVVTIMIGINDTWFSFSRWEDTSVTAFKEVYRVILNRIKTETNAELILMEPFVLPYPEDRKEWRGDLDPKIGAVRELAAEFGATLIPLDGLMNALAIKHGPTFLAEDGVHPTKAGHEAIASTWLEFTK']]
df = pd.DataFrame(rows, columns=[id_col, label_col, seq_col])
df << (ClustalOmega(id_col, seq_col) >> Save('tmp/clustalomega_test.pkl'))

CREEP

CREEP is a tool for predicting the EC number of a reaction. At the moment it only supports reactions to EC however we are extending this to other modalities.

from enzymetk.annotateEC_CREEP_step import CREEP
from enzymetk.save_step import Save
import pandas as pd

# CREEP expects you to have downloaded the data from the zotero page and put it in the data/CREEP folder
output_dir = 'tmp/'
df = pd.DataFrame({'EC number': ['1.1.1.1', '1.1.1.2'], 
                   'Sequence': ['MALWMRLLPLLALLALWGPDPAAA', 'MALWMRLLPLLALLALWGPDPAAA'], 
                   'Reaction': ['O=P(OC1=CC=CC=C1)(OC2=CC=CC=C2)OC3=CC=CC=C3>>O=P(O)(OC4=CC=CC=C4)OC5=CC=CC=C5.OC6=CC=CC=C6',
                                'O=P(OC1=CC=CC=C1)(OC2=CC=CC=C2)OC3=CC=CC=C3>>O=P(O)(OC4=CC=CC=C4)OC5=CC=CC=C5.OC6=CC=CC=C6']})
id_col = 'Entry'
reaction_col = 'Reaction'

df << (CREEP(id_col, reaction_col, CREEP_cache_dir='/disk1/share/software/CREEP/data/', CREEP_dir='/disk1/share/software/CREEP/',
            modality='reaction', reference_modality='protein') >> Save(f'{output_dir}CREEP_test_protein.pkl'))

EmbedESM

EmbedESM is a tool for embedding a set of sequences using ESM2.

Either in your own conda env: pip install esm-fair or you can run:

id_col = 'Entry'
seq_col = 'Sequence'
label_col = 'ActiveSite'
esm = EmbedESM(id_col, seq_col, extraction_method='mean', tmp_dir='tmp', rep_num=36) # i.e. the representation number you want usually the last layer 
esm.install() # And follow the instructions to activate the env

from enzymetk.embedprotein_esm_step import EmbedESM
from enzymetk.save_step import Save
import pandas as pd

id_col = 'Entry'
seq_col = 'Sequence'
label_col = 'ActiveSite'
rows = [['AXE2_TALPU', '10', 'MHSKFFAASLLGLGAAAIPLEGVMEKRSCPAIHVFGARETTASPGYGSSSTVVNGVLSAYPGSTAEAINYPACGGQSSCGGASYSSSVAQGIAAVASAVNSFNSQCPSTKIVLVGYSQGGEIMDVALCGGGDPNQGYTNTAVQLSSSAVNMVKAAIFMGDPMFRAGLSYEVGTCAAGGFDQRPAGFSCPSAAKIKSYCDASDPYCCNGSNAATHQGYGSEYGSQALAFVKSKLG'],
        ['AXE2_GEOSE', '1|2', 'MKIGSGEKLLFIGDSITDCGRARPEGEGSFGALGTGYVAYVVGLLQAVYPELGIRVVNKGISGNTVRDLKARWEEDVIAQKPDWVSIMIGINDVWRQYDLPFMKEKHVYLDEYEATLRSLVLETKPLVKGIILMTPFYIEGNEQDPMRRTMDQYGRVVKQIAEETNSLFVDTQAAFNEVLKTLYPAALAWDRVHPSVAGHMILARAFLREIGFEWVRSR'], 
        ['AXE7A_XYLR2', '1', 'MFNFAPKQTTEMKKLLFTLVFVLGSMATALAENYPYRADYLWLTVPNHADWLYKTGERAKVEVSFCLYGMPQNVEVAYEIGPDMMPATSSGKVTLKNGRAVIDMGTMKKPGFLDMRLSVDGKYQHHVKVGFSPELLKPYTKNPQDFDAFWKANLDEARKTPVSVSCNKVDKYTTDAFDCYLLKIKTDRRHSIYGYLTKPKKAGKYPVVLCPPGAGIKTIKEPMRSTFYAKNGFIRLEMEIHGLNPEMTDEQFKEITTAFDYENGYLTNGLDDRDNYYMKHVYVACVRAIDYLTSLPDWDGKNVFVQGGSQGGALSLVTAGLDPRVTACVANHPALSDMAGYLDNRAGGYPHFNRLKNMFTPEKVNTMAYYDVVNFARRITCPVYITWGYNDNVCPPTTSYIVWNLITAPKESLITPINEHWTTSETNYTQMLWLKKQVK'], 
        ['A0A0B8RHP0_LISMN', '2', 'MKKLLFLGDSVTDAGRDFENDRELGHGYVKIIADQLEQEDVTVINRGVSANRVADLHRRIEADAISLQPDVVTIMIGINDTWFSFSRWEDTSVTAFKEVYRVILNRIKTETNAELILMEPFVLPYPEDRKEWRGDLDPKIGAVRELAAEFGATLIPLDGLMNALAIKHGPTFLAEDGVHPTKAGHEAIASTWLEFTK']]
df = pd.DataFrame(rows, columns=[id_col, label_col, seq_col])
df << (EmbedESM(id_col, seq_col, extraction_method='mean', tmp_dir='tmp/') >> Save('tmp/esm2_test.pkl'))
# You can also extract the active site embedding in addition to the mean embedding
df << (EmbedESM(id_col, seq_col, extraction_method='active_site', active_site_col='ActiveSite', tmp_dir='tmp/') >> Save('tmp/esm2_test_active_site.pkl'))

FoldSeek

See: FoldSeek

FoldSeek does a similarity search against a database of structures, it runs in the enzyme-tk environment. Similarly to the diamond blast, you can either create databases yourself before hand using the foldseek documentation or you can create a database on the fly by passing the dataframe with a column called label that has two values: reference and query. If you pass a database, you need to pass the path to the database.

The columns expect a path to a pdb file i.e. the output from the Chai step.

from enzymetk.similarity_foldseek_step import FoldSeek
from enzymetk.save_step import Save
import pandas as pd

# id_col: str, seq_col: str, proteinfer_dir: str,
output_dir = 'tmp/'
rows = [['tmp/P0DP24/chai/P0DP24_3.cif'],
        ['tmp/P0DP24/chai/P0DP24_1.cif']]
df = pd.DataFrame(rows, columns=['pdbs'])
# foldseek_dir: str, pdb_column_name: str, reference_database: str
pdb_column_name = 'pdbs'
# The foldseek database was created using the folldwing command in this location:
# foldseek databases PDB pdb tmp 
reference_database = '/disk1/share/software/foldseek/structures/pdb/pdb'
df << (FoldSeek(pdb_column_name, reference_database) >> Save(f'{output_dir}pdb_files.pkl'))

LigandMPNN

LigandMPNN is a tool for inpainting the sequence for a protein backbone that has been generated by a generative model.

See: LigandMPNN

from steps.inpaint_ligandMPNN_step import LigandMPNN
from steps.save_step import Save
import pandas as pd

# id_col: str, seq_col: str, proteinfer_dir: str,
# This needs to be the full path to the file since LigandMPNN requires the full path (otherwise it will save to the ligandmpnn directory)
output_dir = '/disk1/ariane/vscode/enzyme-tk/examples/tmp/'
# These have to be the full path to the file since LigandMPNN requires the full path.
rows = [['/disk1/ariane/vscode/enzyme-tk/examples/tmp/P0DP24/chai/P0DP24_3.cif'],
        ['/disk1/ariane/vscode/enzyme-tk/examples/tmp/P0DP24/chai/P0DP24_1.cif']]
df = pd.DataFrame(rows, columns=['pdbs'])
# foldseek_dir: str, pdb_column_name: str, reference_database: str
pdb_column_name = 'pdbs'
ligand_mpnn_dir = '/disk1/share/software/LigandMPNN/'
# See how you need to enclose the fixed residues in quotes make sure any spaces are closed in double quotes!
args = ['--fixed_residues', '"A19 A20 A21 A59 A60 A61 A90 A91 A92"', '--checkpoint_path_sc', f'{ligand_mpnn_dir}model_params/ligandmpnn_sc_v_32_002_16.pt']
df << (LigandMPNN(pdb_column_name, ligand_mpnn_dir, output_dir,args=args) >> Save(f'{output_dir}ligandmpnn_inpainted.pkl'))

Proteinfer

Proteinfer is a tool for predicting the EC number of an enzyme.

output_dir = 'tmp/'
num_threads = 1
id_col = 'Entry'
seq_col = 'Sequence'
substrate_col = 'Substrate'
rows = [['P0DP23', 'MALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAAMALWMRLLPLLALLALWGPDPAAA', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC'], 
        ['AXE2', 'MKIGSGEKLLFIGDSITDCGRARPEGEGSFGALGTGYVAYVVGLLQAVYPELGIRVVNKGISGNTVRDLKARWEEDVIAQKPDWVSIMIGINDVWRQYDLPFMKEKHVYLDEYEATLRSLVLETKPLVKGIILMTPFYIEGNEQDPMRRTMDQYGRVVKQIAEETNSLFVDTQAAFNEVLKTLYPAALAWDRVHPSVAGHMILARAFLREIGFEWVRSR', 'CCCCC(CC)COC(=O)C1=CC=CC=C1C(=O)OCC(CC)CCCC']]
df = pd.DataFrame(rows, columns=[id_col, seq_col, substrate_col])
# This should be relative to the location of the script if you installed via the install_all.sh script
# Note you need to have downloaded their predictive models (ToDo )
proteinfer_dir = 'software/proteinfer/'
df << (ProteInfer(id_col, seq_col, proteinfer_dir, num_threads=num_threads) >> Save(f'proteinfer.pkl'))

Tools and references

Being a toolkit this is a collection of other tools, which means if you use any of these tools then cite the ones relevant to your work:

mmseqs2
foldseek
diamond
proteinfer
CLEAN
chai
chemBERTa2
SELFormer
rxnfp
clustalomega
CREEP
esm
LigandMPNN
vina
Uni-Mol
fasttree
Porechop
prokka

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.0.9

Mar 3, 2026

This version

0.0.8

Feb 20, 2026

0.0.7

Feb 2, 2026

0.0.6

Dec 22, 2025

0.0.4

Aug 10, 2025

0.0.3

Aug 9, 2025

0.0.2

Apr 23, 2025

0.0.1

Apr 9, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

enzymetk-0.0.8.tar.gz (45.5 kB view details)

Uploaded Feb 20, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

enzymetk-0.0.8-py3-none-any.whl (57.4 kB view details)

Uploaded Feb 20, 2026 Python 3

File details

Details for the file enzymetk-0.0.8.tar.gz.

File metadata

Download URL: enzymetk-0.0.8.tar.gz
Upload date: Feb 20, 2026
Size: 45.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for enzymetk-0.0.8.tar.gz
Algorithm	Hash digest
SHA256	`286352a81844bfedfa85ddec593d8c6c209f42cabad24fec1577495dc6a4d72d`
MD5	`6248e90da7b801dd5cf12ce699c4bc83`
BLAKE2b-256	`5d4530ce7d99b50c4bbc7e7227523614246e72d7609edb1ed4952ad9b447bfad`

See more details on using hashes here.

File details

Details for the file enzymetk-0.0.8-py3-none-any.whl.

File metadata

Download URL: enzymetk-0.0.8-py3-none-any.whl
Upload date: Feb 20, 2026
Size: 57.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for enzymetk-0.0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e8b740569c51d81d8a5c0a36f44bcd0db8b25873f4e8204506a0e6ebd5b5de8c`
MD5	`366c31da7646895d8142ae362504be28`
BLAKE2b-256	`278f0a2dd8004b328f5b9f703092f2ea4b7e3e9389f799537484dc34248ec5bf`

See more details on using hashes here.

enzymetk 0.0.8

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

A pipeline for enzyme engineering

Installation

Install base package to import modules

If you're at the bleeding edge, and going to use older models e.g. chemBERTa2 you may need to run

For each module run install the first time you're running it

Modules requiring conda

Modules able to run in venv

Install only the specific requirements you need (recomended)

Usage

Things to note

Arguments

Steps

BLAST

ActiveSitePred

Boltz2

Chai

ChemBERTa

CLEAN

ClustalOmega

CREEP

EmbedESM

FoldSeek

LigandMPNN

Proteinfer

Tools and references

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes