☄️ molfeat-hype

A molfeat plugin that leverages the most hyped LLMs in NLP for molecular featurization.

Overview

molfeat-hype is an extension of molfeat that investigates how well embeddings from various LLMs, trained without explicit molecular context, perform for molecular modeling. It leverages some of the most hyped LLMs in NLP to answer the following question:

Is it necessary to pretrain/finetune LLMs on molecular context to obtain good molecular representations?

To find an answer to this question, check out the benchmarks.

Spoiler: YES! Understanding molecular context, structure, and properties is key to building good molecular featurizers.

LLMs

molfeat-hype supports two types of LLM embeddings:

  1. Classic Embeddings: These are classical embeddings provided by foundation models (or any LLMs). The models available in this tool include OpenAI's openai/text-embedding-ada-002 model, llama, and several embedding models accessible through sentence-transformers.

  2. Instruction-based Embeddings: These are models that have been trained to follow instructions (thus acting like ChatGPT) or are conditional models that require a prompt.

    • Prompt-based instruction: A model (like ChatGPT: openai/gpt-3.5-turbo) is asked to act as an all-knowing AI assistant for drug discovery and to provide the best molecular representation for the input list of molecules. The representation is then parsed from the chat agent's output.
    • Conditional embeddings: A model trained for conditional text embeddings that takes an instruction as additional input. Here, the embedding is the model's underlying representation of the molecule, conditioned on the instruction it received. For more information, see instructor-embedding.
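
The prompt-based approach depends on recovering a numeric vector from free-form chat text. The following is a minimal sketch of that parsing step, using only the standard library. The helper name parse_embedding, the bracketed-list reply format, and the zero-padding policy are all assumptions made for illustration; they are not taken from molfeat-hype's actual implementation.

```python
import re
from typing import List

# Hypothetical helper, not part of molfeat-hype: it extracts a fixed-length
# numeric vector from a free-form chat reply.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def parse_embedding(reply: str, dim: int) -> List[float]:
    """Return the numbers in the first [...] group of `reply`,
    truncated or zero-padded to exactly `dim` values."""
    match = re.search(r"\[([^\[\]]*)\]", reply)
    if match is None:
        raise ValueError("no bracketed vector found in the reply")
    values = [float(tok) for tok in _NUMBER.findall(match.group(1))]
    if not values:
        raise ValueError("the bracketed group contains no numbers")
    values = values[:dim]
    # Assumed policy: pad short vectors with zeros so every molecule
    # ends up with the same dimensionality.
    return values + [0.0] * (dim - len(values))


# Example reply text (made up for illustration).
reply = "Sure! Here is the representation for CCO: [0.12, -1.5, 3e-2, 7]"
vector = parse_embedding(reply, dim=6)
print(vector)
```

Chat output is not guaranteed to follow any format, so a parser along these lines has to decide what to do with missing or surplus values; the padding and truncation here are one simple choice among several.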

Installation

You can install molfeat-hype using pip. A conda package is planned.

pip install molfeat-hype

molfeat-hype mostly depends on molfeat and langchain. Please see the env.yml file for a complete list of dependencies.
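
To confirm that the install succeeded, one option is to query the installed distribution metadata. This is a small sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical, and the distribution names for molfeat and langchain are assumed to match their PyPI names.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumed to match the PyPI project names.
for name in ("molfeat-hype", "molfeat", "langchain"):
    print(name, installed_version(name) or "not installed")
```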

Acknowledgements

Check out the following projects that made molfeat-hype possible:

Usage

Since molfeat-hype is a molfeat plugin, it follows the same integration principle as with any other molfeat plugin.

The following examples show how to use the molfeat-hype plugin once it is installed.

  1. Using this package directly:
from molfeat_hype.trans.llm_embeddings import LLMTransformer

mol_transf = LLMTransformer(kind="sentence-transformers/all-mpnet-base-v2")
  2. Enabling autodiscovery as a molfeat plugin, which adds all embedding classes as importable attributes of the entry point group molfeat.trans.pretrained:
# Put this somewhere in your code (e.g., in the root __init__ file).
# The plugins list may contain any substring of 'molfeat_hype'.
from molfeat.plugins import load_registered_plugins
load_registered_plugins(add_submodules=True, plugins=["hype"])
# This is now possible everywhere.
from molfeat.trans.pretrained import LLMTransformer
mol_transf = LLMTransformer(kind="sentence-transformers/all-mpnet-base-v2")
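
If autodiscovery does not pick up the plugin, it can help to inspect which entry points are registered in your environment. The sketch below uses only importlib.metadata and handles both the pre-3.10 and the newer entry point APIs. The helper name list_entry_points is hypothetical, and the group name is the one mentioned above; this shows the generic Python entry point mechanism, not molfeat's own loading code.

```python
from importlib import metadata
from typing import List

GROUP = "molfeat.trans.pretrained"  # group name as given in the text


def list_entry_points(group: str) -> List[str]:
    """Return 'name = target' strings for every entry point in `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later expose a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict mapping group -> entry points.
        selected = eps.get(group, [])
    return sorted(f"{ep.name} = {ep.value}" for ep in selected)


for line in list_entry_points(GROUP):
    print(line)
```

An empty result means no installed distribution registers anything under that group, which usually points to a missing or broken install rather than a problem in your own code.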

Once you have defined your molecule transformer, use it like any molfeat MoleculeTransformer:

import datamol as dm
smiles = dm.freesolv()["smiles"].values[:5]
mol_transf(smiles)

Changelog

See the latest changelogs at CHANGELOG.rst.

Maintainers

  • @maclandrol

Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether in the form of new features, improved infrastructure, or better documentation. For detailed information on how to contribute, see our contribution guide.

Disclaimer

This repository contains an experimental investigation of LLM embeddings for molecules. Please note that the consistency and usefulness of the returned molecular embeddings are not guaranteed. This project is meant for fun and exploratory purposes only and should not be used as a demonstration of LLM capabilities for molecular embeddings. Any statements made in this repository are the opinions of the authors and do not necessarily reflect the views of any affiliated organizations or individuals. Use at your own risk.

License

Under the Apache-2.0 license. See LICENSE for details.

Download files

Source Distribution

molfeat-hype-0.1.0.tar.gz (70.0 MB)

Built Distribution

molfeat_hype-0.1.0-py3-none-any.whl (17.5 kB, Python 3)
