☄️ molfeat-hype - A molfeat plugin that leverages the most hyped LLM models in NLP for molecular featurization.
Overview
molfeat-hype is an extension to molfeat that investigates how well embeddings from various LLMs, trained without explicit molecular context, perform for molecular modelling. It leverages some of the most hyped LLMs in NLP to answer the following question:
Is it necessary to pretrain/finetune LLMs on molecular context to obtain good molecular representations?
To find an answer to this question, check out the benchmarks.
Spoilers
NO! Understanding of molecular context, structure, and properties is key to building good molecular featurizers.
LLMs
molfeat-hype supports two types of LLM embeddings:
- Classic Embeddings: These are classical embeddings provided by foundation models (or any LLMs). The models available in this tool include OpenAI's openai/text-embedding-ada-002 model, llama, and several embedding models accessible through sentence-transformers.
- Instruction-based Embeddings: These are models that have been trained to follow instructions (thus acting like ChatGPT) or are conditional models that require a prompt.
  - Prompt-based instruction: A model (like ChatGPT: openai/gpt-3.5-turbo) is asked to act like an all-knowing AI assistant for drug discovery and provide the best molecular representation for the input list of molecules. Here, we parse the representation from the chat agent output.
  - Conditional embeddings: A model trained for conditional text embeddings that takes an instruction as additional input. Here, the embedding is the model's underlying representation of the molecule, conditioned on the instructions it received. For more information, see instructor-embedding.
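To make the prompt-based approach more concrete, the parsing step can be sketched in plain Python. This is only an illustration under assumptions: the helper name parse_vector, the bracketed-list reply format, and the sample reply are hypothetical and are not taken from molfeat-hype's actual implementation.

```python
import json
import re
from typing import List, Optional

# Hypothetical helper (not part of molfeat-hype): pull the first bracketed
# list of numbers out of a free-form chat reply.
_VECTOR_PATTERN = re.compile(r"\[[^\[\]]*\]")


def parse_vector(reply: str, expected_dim: Optional[int] = None) -> Optional[List[float]]:
    """Return the first numeric list found in `reply`, or None if there is none."""
    for match in _VECTOR_PATTERN.finditer(reply):
        try:
            values = json.loads(match.group(0))
        except json.JSONDecodeError:
            # Bracketed text that is not valid JSON, e.g. "[see note]".
            continue
        # Reject empty lists and lists with non-numeric entries (bool excluded).
        if not values or not all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        ):
            continue
        # Optionally enforce a fixed embedding size.
        if expected_dim is not None and len(values) != expected_dim:
            continue
        return [float(v) for v in values]
    return None


# Made-up reply text, used only to exercise the parser.
reply = "Sure! For CCO a possible representation is [0.12, -0.5, 3, 0.0]."
print(parse_vector(reply))
```

Returning None rather than raising leaves it to the caller to decide how to handle a reply with no usable vector, which matters because chat output is not guaranteed to be consistent.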
Installation
You can install molfeat-hype using either of the following commands:
mamba install -c conda-forge molfeat-hype
or
pip install molfeat-hype
molfeat-hype mostly depends on molfeat and langchain. For a complete list of dependencies, please see the env.yml file.
Acknowledgements
Check out the following projects that made molfeat-hype possible:
- To learn more about molfeat, please visit https://molfeat.datamol.io/. To learn more about the plugin system of molfeat, please see extending molfeat.
- Please refer to the langchain documentation for any questions related to langchain.
Usage
The following example shows how the molfeat-hype plugin package can be used once it is installed. All scenarios highlighted in this example are valid:
- Using this package directly:
from molfeat_hype.trans.llm_embeddings import LLMEmbeddingsTransformer
mol_transf = LLMEmbeddingsTransformer(kind="openai/text-embedding-ada-002")
- Enabling autodiscovery as a plugin in molfeat, which adds all embedding classes as importable attributes to the entry point group molfeat.trans.pretrained:
# put this somewhere in your code (e.g. in the root __init__ file)
from molfeat.plugins import load_registered_plugins
load_registered_plugins(add_submodules=True, plugins=["molfeat_hype"])
# this is now possible everywhere
from molfeat.trans.pretrained import LLMEmbeddingsTransformer
mol_transf = LLMEmbeddingsTransformer(kind="openai/text-embedding-ada-002")
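One way to check whether the plugin has been picked up is to list what is registered under the entry point group named above. The following is a minimal sketch using only the standard library. It assumes plugins register under the group molfeat.trans.pretrained, as described in this section; the helper name list_plugin_entry_points is hypothetical and not part of molfeat.

```python
from importlib.metadata import entry_points
from typing import List, Tuple


def list_plugin_entry_points(group: str = "molfeat.trans.pretrained") -> List[Tuple[str, str]]:
    """Return sorted (name, target) pairs for entry points registered in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the returned object supports select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


# The list is empty when no installed package registers anything in the group.
for name, target in list_plugin_entry_points():
    print(f"{name} -> {target}")
```

If nothing is printed, no installed distribution declares entry points in that group, which may point to the plugin not being installed in the active environment.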
Changelog
See the latest changelogs at CHANGELOG.rst.
Maintainers
- @maclandrol
Contributing
As an open-source project in a rapidly developing field, we are extremely open to contributions, whether in the form of new features, improved infrastructure, or better documentation. For detailed information on how to contribute, see our contribution guide.
Disclaimer
This repository contains an experimental investigation of LLM embeddings for molecules. Please note that the consistency and usefulness of the returned molecular embeddings are not guaranteed. This project is meant for fun and exploratory purposes only and should not be used as a demonstration of LLM capabilities for molecular embeddings. Any statements made in this repository are the opinions of the authors and do not necessarily reflect the views of any affiliated organizations or individuals. Use at your own risk.
License
Released under the Apache-2.0 license. See LICENSE for details.