A pipeline for enzyme engineering
Enzyme-tk is a collection of tools for enzyme engineering, set up as interoperable modules that act on dataframes. These modules are designed to be imported into pipelines built for a specific function. For this reason, each step (one call to a module, e.g. finding similar proteins with BLAST) is designed to be as light as possible. An example is the annotate-e pipeline, which annotates a FASTA file with an ensemble of methods, each of which is an Enzyme-tk step.
Quick Start Colab notebook
If you want to try a colab notebook here is an example: (colab)
Data link: git clone https://huggingface.co/datasets/arianemora/enzyme-tk
Moving to a new home:
Since I started at AITHYRA, this project is migrating to a new home at moragroup/enzyme-tk and will be maintained there.
If you have any issues installing, let me know - this has been tested only on Linux/Ubuntu. Please post an issue!
Installation
Install base package to import modules
conda create --name enzymetk python==3.10 -y
# Install torch for your specific cuda version
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install enzymetk==0.0.7
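After installing, it can help to confirm which versions actually ended up in the active environment. The snippet below is a minimal sketch that uses only the Python standard library; the helper name installed_version is made up for this illustration and is not part of enzymetk.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report the two packages installed by the commands above.
    for name in ("enzymetk", "torch"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'not installed'}")
```

Run it inside the enzymetk conda env; a "not installed" line means the corresponding pip command did not land in that environment.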
Install only the specific requirements you need (recommended)
For this, clone the repo and then install the requirements for the specific modules you use:
git clone git@github.com:ArianeMora/enzyme-tk.git
cd enzymetk/conda_envs/ # would recommend looking at these
# e.g. to install all from within that folder you would do
source install_all.sh
For more extensive installation instructions check out the wiki.
Usage
If you have any issues at all, just email me: amora at aithyra . ac . at
This is a work in progress! Some tools (e.g. proteInfer and CLEAN) require extra data, such as model weights, to be downloaded before they can run. I'm working on integrating these at the moment, so let me know if you need this!
Here are some of the tools that have been implemented to be chained together as a pipeline:
boltz2
mmseqs2
foldseek
diamond
proteinfer
CLEAN
chai
chemBERTa2
SELFormer
rxnfp
clustalomega
CREEP
esm
LigandMPNN
vina
Uni-Mol
fasttree
Porechop
prokka
Things to note
All the tools use the conda env of enzymetk by default.
If you want to use a different conda env, you can do so by passing the env_name argument to the constructor of the step.
For example:
proteinfer = ProteInfer(env_name='proteinfer')
Arguments
All arguments are passed to the constructor of the step. Required ones are passed as regular constructor arguments, and optional ones are passed as a list to the args argument. The list holds the arguments as you would normally pass them to a command line tool.
For example:
proteinfer = ProteInfer(env_name='proteinfer', args=['--num_threads', '10'])
For those wanting to use specific arguments, check the individual tools for specifics.
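To make the list form of args concrete, here is a minimal sketch of how a step could turn env_name and args into a command. It is an illustration under stated assumptions, not the enzymetk implementation: build_command, some_tool, --input and seqs.fasta are hypothetical names, and running the tool through conda run -n is an assumption about how a named env might be used.

```python
import shlex
from typing import List, Optional


def build_command(
    executable: str,
    required: List[str],
    args: Optional[List[str]] = None,
    env_name: str = "enzymetk",
) -> List[str]:
    """Assemble an argv list that runs a tool inside a named conda env."""
    # `conda run -n <env>` executes a command in that environment.
    command = ["conda", "run", "-n", env_name, executable]
    command += required
    # Optional arguments are appended unchanged, one list item per token.
    command += list(args or [])
    return command


# A command-line string can be split into the expected list form.
extra = shlex.split("--num_threads 10")

# `some_tool` and `--input` are placeholders, not a real tool's interface.
cmd = build_command("some_tool", ["--input", "seqs.fasta"], args=extra, env_name="proteinfer")
print(cmd)
```

Keeping each token as its own list item avoids shell quoting problems, which is the usual reason command line wrappers ask for a list rather than a single string.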
Steps
The steps are the main building blocks of the pipeline. They are responsible for executing the individual tools.
Syntax
We use the operator >> to pass the output of one tool to the next, and << to feed the input dataframe into the chain. All steps expect a dataframe as input and produce a dataframe as output. You can capture the final dataframe by assigning it with the = sign, or save it with a Save step.
For example:
df = df << (ActiveSitePred(id_col, seq_col, num_threads, tmp_dir='tmp/')
            >> EmbedESM(id_col, seq_col, extraction_method='mean', tmp_dir='tmp/', rep_num=36)
            >> Save('tmp/esm2_test_active_site.pkl'))
This runs Squidly to predict the active sites first, then passes the sequences to ESM2, and finally saves the resulting dataframe.
You can chain most steps together. Some steps drop columns such as the sequence when they are not needed, so if you find a step that can't be chained but would like to use it as part of a pipeline, either let me know or just make a pull request!
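For readers curious how this kind of chaining can work in Python, the sketch below shows one way to support both operators with the special methods __rshift__ and __rlshift__. It is a toy illustration, not the enzymetk source: the classes Step, Pipeline, SeqLength and MinLength are hypothetical, and a list of dicts stands in for a dataframe so that the example runs without extra dependencies.

```python
from typing import Any, Dict, List

Rows = List[Dict[str, Any]]  # stand-in for a dataframe: one dict per row


class Step:
    """Hypothetical base class: a step maps a table to a new table."""

    def execute(self, rows: Rows) -> Rows:
        raise NotImplementedError

    def __rshift__(self, other: "Step") -> "Pipeline":
        # step_a >> step_b builds a pipeline that runs step_a, then step_b.
        return Pipeline([self, other])

    def __rlshift__(self, rows: Rows) -> Rows:
        # rows << step: the left operand does not know how to handle a Step,
        # so Python falls back to this reflected method on the right operand.
        return self.execute(rows)


class Pipeline(Step):
    """An ordered list of steps that itself behaves like a step."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)

    def execute(self, rows: Rows) -> Rows:
        for step in self.steps:
            rows = step.execute(rows)
        return rows

    def __rshift__(self, other: Step) -> "Pipeline":
        # Extend the chain instead of nesting pipelines.
        return Pipeline(self.steps + [other])


class SeqLength(Step):
    """Toy step: add a column with the length of each sequence."""

    def __init__(self, seq_col: str, out_col: str = "length") -> None:
        self.seq_col = seq_col
        self.out_col = out_col

    def execute(self, rows: Rows) -> Rows:
        return [{**row, self.out_col: len(row[self.seq_col])} for row in rows]


class MinLength(Step):
    """Toy step: keep only rows whose length column reaches a minimum."""

    def __init__(self, length_col: str, minimum: int) -> None:
        self.length_col = length_col
        self.minimum = minimum

    def execute(self, rows: Rows) -> Rows:
        return [row for row in rows if row[self.length_col] >= self.minimum]


rows = [{"id": "p1", "seq": "MKT"}, {"id": "p2", "seq": "MKTAYIAK"}]
result = rows << (SeqLength("seq") >> MinLength("length", 5))
print(result)
```

Each step returns a new table rather than modifying its input, so the original rows stay available after the pipeline has run. This mirrors the df = df << (...) pattern above, where the result is captured by assignment.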
Tools and references
As a toolkit, this is a collection of other tools, so if you use any of them, please cite the ones relevant to your work:
mmseqs2
foldseek
diamond
proteinfer
CLEAN
chai
chemBERTa2
SELFormer
rxnfp
clustalomega
CREEP
esm
LigandMPNN
vina
Uni-Mol
fasttree
Porechop
prokka