An SDK for the computational chemistry LLM hackathon

QDX ChemLLMHack SDK

Welcome to the QDX Computational Chemistry and Large Language Model (LLM) Hackathon! This SDK is built specifically for the hackathon and integrates with the RUSH computational cloud platform. It helps the community develop and apply LLM and Artificial Intelligence (AI) technologies in computational chemistry.

Installation

To install the QDX ChemLLMHack SDK, simply run the following command:

pip install chemllmhack

Features

Get help information about the Rex language, and retrieve specific expressions used in RUSH modules from the Rex database.
Download the paper dataset and the Chroma vector database. The paper dataset is a collection of scientific papers sourced from open-access databases including arXiv, bioRxiv, chemRxiv, and medRxiv.
Submit your Rex expression to the RUSH platform.
Retrieve the results of your submitted Rex expression and its statistics against the benchmark.

Statistics for the paper dataset:

Tool	Number of Papers
MMseqs2	1048
PLIP	2199
Gina	1055
RDock	1772
Auto3d	371
BTK	1187
ColabFold	2138
P2Rank	1369

QDX Hackathon Use Guide

Prerequisites

Before you begin, make sure you have a Google account; you need it to register for the QDX Hackathon. You also need an OPENAI_API_KEY set in your environment. You will be granted a RUSH token; set it in your environment as well:

TENGU_TOKEN=<your-rush-token>
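
Before calling the SDK, it can help to confirm that both variables are visible to your Python process. The sketch below is a minimal check using only the standard library; the helper name missing_env_vars is hypothetical and is not part of chemllmhack.

```python
import os

# Variable names taken from the prerequisites above.
REQUIRED_VARS = ("OPENAI_API_KEY", "TENGU_TOKEN")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not provided by the SDK.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```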

Registration

Getting Help Information for Rex Language

To get help information about the Rex language, use the following command:

chemllmhack --rex-help language

Retrieving Specific Rex Expressions and other information

To retrieve a specific Rex expression associated with a module, use the command below:

chemllmhack --rex-help expression -rex <module_name>

Replace <module_name> with the actual name of the module you're interested in. Alternatively, you can query the Rex expression from Python:

from chemllmhack import get_rex_expression
get_rex_expression('module_name')

The following keywords are supported:

'help_info'
Module Name: 'prepare_protein', 'auto3d', 'gnina', 'gmx'
Module Parameters: 'auto3d_parameter', 'gnina_parameter'
'hackthon_task'
'comprehensive_example'

Querying with natural language

The SDK also provides an LLM-friendly way to query. To query with natural language, use the following command:

from chemllmhack import query
query('your-natural-language-query')

Downloading Datasets

You can download the necessary datasets:

Paper Dataset
Chroma Vector Database

The Vector Database collections are separated by the following modules:

mmseqs2
PLIP
gina
RDock
auto3d
BTK
ColabFold
P2Rank

Configuring Google Cloud CLI

To interact with Cloud Storage using the Google Cloud CLI, follow these steps:

Run the following commands to authenticate:

sudo gcloud auth login

sudo gcloud auth application-default login

Provide the path to your credentials file. Typically, it is located at:

/Users/<your_user_name>/.config/gcloud/application_default_credentials.json

Replace <your_user_name> with your actual username on your system. Make sure you grant appropriate permissions to the JSON file. For more information, refer to the Google Cloud CLI documentation.
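
If you are unsure where the credentials file ended up, the following sketch builds the typical path under your home directory and reports whether the file exists. It assumes the macOS/Linux layout shown above (the location differs on Windows), and the helper name default_adc_path is hypothetical, not part of the SDK.

```python
from pathlib import Path


def default_adc_path(home=None):
    """Return the typical application-default credentials path.

    Assumes the ~/.config/gcloud layout used on macOS and Linux.
    """
    home = Path.home() if home is None else Path(home)
    return home / ".config" / "gcloud" / "application_default_credentials.json"


if __name__ == "__main__":
    path = default_adc_path()
    print(path, "(found)" if path.is_file() else "(not found)")
```

The resulting path can then be passed as the credential_path argument when downloading the vector database.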

Downloading Datasets

from chemllmhack import download_vector_db
download_vector_db(credential_path='your-credential-path', destination_file_name='your-destination-file-name')

Submitting Rex Expressions to RUSH

To submit a Rex expression to the RUSH platform, use the following command:

from chemllmhack import submit_rex_expression
submit_rex_expression('your-rex-expression')

This function automatically creates a new project ID and saves it for future use. It returns the run ID and status of the submitted run. You can view your submitted run on the RUSH website.

Querying run status

You can query the run status with the following command:

from chemllmhack import query_run_status
query_run_status('your-run-id')

This function returns the status of the submitted run and the list of result path IDs.
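
Runs may take a while to finish, so you will likely want to check the status repeatedly. The sketch below is a generic polling loop, not an SDK feature: wait_for_run is a hypothetical helper, and because this guide does not specify the exact return shape or status values of query_run_status, the "finished" test is passed in as a function you write yourself.

```python
import time


def wait_for_run(check_status, run_id, is_finished,
                 interval_s=10.0, max_checks=60, sleep=time.sleep):
    """Call check_status(run_id) until is_finished(result) is true.

    check_status: a callable such as query_run_status (assumption).
    is_finished:  a callable you supply that inspects the returned value.
    Raises TimeoutError if the run is not finished after max_checks calls.
    """
    for attempt in range(max_checks):
        result = check_status(run_id)
        if is_finished(result):
            return result
        if attempt < max_checks - 1:
            sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"run {run_id!r} did not finish after {max_checks} checks")
```

For example, you could pass query_run_status as check_status together with a small function that recognizes whatever status value your finished runs report.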

Download run result

You can download the run result with the following command:

from chemllmhack import get_rex_result
get_rex_result(['path-id'], 'your-destination-file-name')

Compare result with benchmark

You can compare your result with a benchmark using the functions below.

Compare with affinity value benchmark

This function calculates the Pearson, Spearman, and Kendall correlation coefficients between the predicted affinity values and the benchmark affinity values.

from chemllmhack import affinity_benchmark
affinity_benchmark('your-result-file', benchmark_name='BTK')

benchmark_name can be set to one of the following:

BTK
BTK_mutant

The result file must be in the following format:

{
    "<SMILEs_1>": <affinity_value_1>,
    "<SMILEs_2>": <affinity_value_2>
}

For example:

{
    "CC1=CC(=O)C(=C(C1=O)C)C": 0.1,
    "CC1=CC(=O)C(=C(C1=O)C)": 0.2
}
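
One way to produce such a file is to collect your predictions in a dictionary and serialize it. The sketch below assumes the result file is plain JSON, as the format above suggests; write_result_file is a hypothetical helper and is not part of chemllmhack.

```python
import json
import math


def write_result_file(predictions, path):
    """Write a {SMILES: affinity} mapping to `path` as JSON.

    Performs basic validation first: keys must be non-empty strings and
    values must be finite numbers. Returns the cleaned mapping.
    """
    cleaned = {}
    for smiles, value in predictions.items():
        if not isinstance(smiles, str) or not smiles:
            raise ValueError(f"SMILES keys must be non-empty strings, got {smiles!r}")
        number = float(value)
        if not math.isfinite(number):
            raise ValueError(f"affinity for {smiles!r} is not finite")
        cleaned[smiles] = number
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(cleaned, handle, indent=4)
    return cleaned
```

The written file can then be passed to affinity_benchmark as the result file.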

Compare with RMSD value benchmark

This function will calculate the RMSD value between the predicted protein structure and the benchmark protein structure.

from chemllmhack import rmsd_benchmark
rmsd_benchmark('your-simulated_protein-file')

RAG Toolkit

The SDK also provides a RAG toolkit to help you build your AI experiment system. The toolkit functions are described below.

Multiple Query with RAG

This function, multi_query_rag, facilitates querying a Retrieval-Augmented Generation (RAG) model with a given question. It dynamically generates multiple versions of the input question to probe different perspectives and improve the breadth of the search in the vector database. This method is designed to address the inherent limitations of distance-based similarity searches by diversifying the questions.

Usage

To use the function, import it and provide the necessary arguments:

from chemllmhack import multi_query_rag

# Example usage:
answers = multi_query_rag(
    question="How to set auto3d parameters?",
    module_name="auto3d",
    vectordb_path="/path/to/vector/database"
)
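
To make the idea concrete, here is a minimal sketch of the multi-query pattern in isolation: retrieve documents for the original question and for each rephrasing, then merge the results without duplicates. It is not the SDK's implementation. The function multi_query_retrieve and its two callable parameters (generate_variants, which would normally be an LLM call, and retrieve, which would normally be a vector-database search) are hypothetical stand-ins.

```python
def multi_query_retrieve(question, generate_variants, retrieve):
    """Retrieve documents for a question and its rephrasings, deduplicated.

    generate_variants(question) -> iterable of alternative phrasings
    retrieve(query)             -> iterable of hashable documents
    Documents are returned in first-seen order.
    """
    queries = [question] + list(generate_variants(question))
    seen = set()
    merged = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without an LLM or a vector database.
    toy_index = {
        "How to set auto3d parameters?": ["doc-a", "doc-b"],
        "Which options does auto3d accept?": ["doc-b", "doc-c"],
    }
    docs = multi_query_retrieve(
        "How to set auto3d parameters?",
        generate_variants=lambda q: ["Which options does auto3d accept?"],
        retrieve=lambda q: toy_index.get(q, []),
    )
    print(docs)
```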

Query Decomposition with RAG

The decompose_query_rag function is designed to enhance the querying process by decomposing a complex question into multiple simpler sub-questions. This approach allows for a more targeted retrieval of documents and information, facilitating a thorough understanding and response to the main query.

Usage

Import the function and specify the necessary parameters to begin decomposing and answering complex questions:

from chemllmhack import decompose_query_rag

# Example usage:
answer = decompose_query_rag(
    question="How to set auto3d parameters?",
    module_name="auto3d",
    vectordb_path="/path/to/vector/database"
)
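
The decomposition pattern can be sketched independently of any LLM. In the hypothetical helper below, each sub-question is answered in turn, earlier question/answer pairs are passed along as context, and a final step combines everything. The three callables (decompose, answer_one, combine) are assumptions standing in for LLM and retrieval calls; this is not how decompose_query_rag is necessarily implemented.

```python
def answer_by_decomposition(question, decompose, answer_one, combine):
    """Answer a complex question via simpler sub-questions.

    decompose(question)             -> iterable of sub-questions
    answer_one(sub_question, pairs) -> answer, given earlier (q, a) pairs
    combine(question, pairs)        -> final answer built from all pairs
    """
    pairs = []
    for sub_question in decompose(question):
        # Pass a copy so answer_one cannot mutate the running history.
        answer = answer_one(sub_question, list(pairs))
        pairs.append((sub_question, answer))
    return combine(question, pairs)
```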

Step Back Query with RAG

The step_back_query_rag function is specifically designed to reframe complex questions into simpler, more general inquiries. This approach allows for broader retrieval of relevant documents, potentially capturing a wider range of information that may be pertinent to the original question.

Usage

To use this function, import it from chemllmhack and provide the necessary arguments as follows:

from chemllmhack import step_back_query_rag

# Example usage:
answer = step_back_query_rag(
    question="How to set auto3d parameters?",
    module_name="auto3d",
    vectordb_path="/path/to/vector/database"
)

Your Task

Task 1: Correlation Coefficient Calculation

The dataset we provide contains a specific BTK protein, its mutant, and the corresponding ligands (in SMILES format). Your task is to run the in silico experimental protocols on the RUSH platform and record the final simulated affinity values or other potential reference values. Then use the provided benchmark functionality to calculate the correlation coefficient between the simulated affinity values and the experimental affinity values in the dataset, which assesses the accuracy of the simulation.
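
For intuition about what is being measured, the sketch below computes Pearson and Spearman coefficients in pure Python. It is an illustration only, not the SDK's benchmark code: the function names are hypothetical, and the numbers in the usage section are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    simulated = [0.1, 0.4, 0.35, 0.8]     # made-up predicted affinities
    experimental = [0.2, 0.5, 0.3, 0.9]   # made-up reference affinities
    print("Pearson:", pearson(simulated, experimental))
    print("Spearman:", spearman(simulated, experimental))
```

Pearson measures linear agreement, while Spearman only measures whether the ranking of ligands agrees, so a simulation can score well on Spearman even if its affinity scale differs from experiment.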

Task 2: RMSD Calculation

We will provide a set of protein structures (without ligands) and a series of ligands (provided in SMILES format). Participants are required to predict the protein-ligand complex structures by running in silico experimental protocols on the RUSH platform and generating simulated PDB files. The provided benchmark function automatically aligns the generated protein structures with the actual crystal structures and calculates the RMSD value, which assesses the consistency of the structure prediction.
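
As a reminder of what the metric means, RMSD is the square root of the mean squared distance between matched atom positions. The sketch below computes it for two coordinate lists that are assumed to be already aligned and in the same atom order. It does not perform the alignment step that the benchmark function handles, and the helper name rmsd is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Each list holds (x, y, z) tuples in the same atom order. No
    superposition is performed; inputs are assumed to be pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        total += sum((p - q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(coords_a))
```

A value of 0 means the two structures coincide exactly; larger values indicate a larger average displacement per atom.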

Demo

You can find a hackathon demo in the demo folder.

Contributing

We welcome contributions from the community. If you'd like to contribute, please fork the repository and use a feature branch. Pull requests are warmly welcome.

Contact Information

For any questions or comments, please email bowen.zhang@qdx.co. Alternatively, you can open an issue in this repository's issue tracker.

Acknowledgments

Thanks to everyone participating in the development and use of this SDK. We hope it serves you well in the QDX Hackathon.

Release history

Version	Release date
0.1.9 (this version)	Jul 26, 2024
0.1.8	Jul 26, 2024
0.1.7	Jul 24, 2024
0.1.6	Jul 24, 2024
0.1.5	Jul 19, 2024
0.1.4	Jul 18, 2024
0.1.2	Jul 15, 2024
0.1.1	Jul 15, 2024
0.1.0	Jul 15, 2024

Download files

Source Distribution

chemllmhack-0.1.9.tar.gz (473.7 kB)

Uploaded Jul 26, 2024 Source

Built Distribution

chemllmhack-0.1.9-py3-none-any.whl (476.2 kB)

Uploaded Jul 26, 2024 Python 3

File details

Details for the file chemllmhack-0.1.9.tar.gz.

File metadata

Download URL: chemllmhack-0.1.9.tar.gz
Upload date: Jul 26, 2024
Size: 473.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.11.9 Darwin/23.5.0

File hashes

Hashes for chemllmhack-0.1.9.tar.gz
Algorithm	Hash digest
SHA256	`9a006dc5e0820a397b5a0b5fbfbcc810cb5b9dbf36c827a06161786237968a6a`
MD5	`963dd8e87cef1d3d557faf8b48ce63a6`
BLAKE2b-256	`94122cd3db2bee0d57e819694ac397dc5203db759d63863a6d59d92cb04b4eb1`

File details

Details for the file chemllmhack-0.1.9-py3-none-any.whl.

File metadata

Download URL: chemllmhack-0.1.9-py3-none-any.whl
Upload date: Jul 26, 2024
Size: 476.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.3 CPython/3.11.9 Darwin/23.5.0

File hashes

Hashes for chemllmhack-0.1.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3564f6f23b8718a3c35e81adad417cc2028495c059d800765a1fb88d9723c48a`
MD5	`56134f9d906b185663456d96b0f6343e`
BLAKE2b-256	`88e696b333c2cfbe3f69977fc8a8676c92f360ab2e76ad7f493be1da0a8e0f8b`
