ET-Flow: Equivariant Flow Matching for Molecular Conformer Generation

Implementation of Equivariant Flow Matching for Molecular Conformer Generation by M. Hassan, N. Shenoy, J. Lee, H. Stark, S. Thaler and D. Beaini.

ET-Flow is a state-of-the-art generative model that produces small-molecule conformations using equivariant transformers and flow matching.

Install ET-Flow

ET-Flow is available on PyPI. Install the package using the following command:

pip install etflow

Set Up Dev Environment

Run the following commands to set up the environment:

conda env create -n etflow -f env.yml
conda activate etflow
# install the etflow package in editable mode
python3 -m pip install -e .

Generating Conformations for Custom SMILES

A sample notebook (generate_confs.ipynb) shows how to generate conformations for custom SMILES input. It requires the config path and the corresponding checkpoint path as additional inputs.

The model config and checkpoint can also be loaded with automatic download and caching. See tutorial.ipynb, or use the following snippet to load the model and generate conformations for custom SMILES input.

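from etflow import BaseFlow

# Load a pre-trained model; the config and checkpoint are downloaded and cached automatically
model = BaseFlow.from_default(model="drugs-o3")
# Generate 3 conformations for caffeine
model.predict(['CN1C=NC2=C1C(=O)N(C(=O)N2C)C'], num_samples=3, as_mol=True)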

We currently support the following configurations and checkpoints:

  • drugs-o3
  • qm9-o3
  • drugs-so3
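Since only the three names above are listed, it can help to check the name before triggering a download. The following is a minimal sketch under that assumption: load_pretrained and SUPPORTED_MODELS are hypothetical helpers, not part of etflow, and only BaseFlow.from_default(model=...) comes from the snippet above.

```python
# Names listed in this README; the tuple itself is a hypothetical convenience.
SUPPORTED_MODELS = ("drugs-o3", "qm9-o3", "drugs-so3")


def load_pretrained(name: str):
    """Check the model name against the README list, then load it."""
    if name not in SUPPORTED_MODELS:
        raise ValueError(
            f"Unknown model {name!r}; expected one of {', '.join(SUPPORTED_MODELS)}"
        )
    # Imported lazily so the name check runs even where etflow is not installed.
    from etflow import BaseFlow

    return BaseFlow.from_default(model=name)
```

A call such as load_pretrained("qm9-o3") would then behave like the snippet above, while a misspelled name fails early with a readable error.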

Preprocessing Data

To preprocess the data, perform the following steps:

  1. Download the raw GEOM data and extract the archive using the following commands:
wget https://dataverse.harvard.edu/api/access/datafile/4327252 -O <output_folder_path/rdkit_folder.tar>
tar -zxvf <output_folder_path/rdkit_folder.tar>
  2. Process the data for ET-Flow training. First, set the DATA_DIR environment variable; all preprocessed data will be written inside this directory.
export DATA_DIR=</path_to_data>
python scripts/prepare_data.py -p /path/to/geom/rdkit-raw-folder
  3. Download the splits from Zenodo (https://zenodo.org/records/13870058). Once the files are downloaded, extract the zip files to the respective folders inside $DATA_DIR:
unzip QM9.zip -d $DATA_DIR
unzip DRUGS.zip -d $DATA_DIR
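Because the preprocessing steps rely on DATA_DIR, a quick sanity check before running them can save a failed run. The sketch below is illustrative only: resolve_data_dir is a hypothetical helper, not part of the repository, and it checks nothing beyond the variable being set and pointing at an existing directory.

```python
import os
from pathlib import Path


def resolve_data_dir(environ=None) -> Path:
    """Return DATA_DIR as a Path, or raise if it is unset or missing on disk."""
    env = os.environ if environ is None else environ
    value = env.get("DATA_DIR")
    if not value:
        raise RuntimeError("DATA_DIR is not set; run `export DATA_DIR=...` first")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"DATA_DIR points to a missing directory: {path}")
    return path
```

After the unzip commands, sorted(p.name for p in resolve_data_dir().iterdir()) lists what actually landed inside the directory.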

Training

We provide configs for training on the GEOM-DRUGS and GEOM-QM9 datasets in various configurations. Once the datasets are preprocessed and the environment is set up, run the following command:

python etflow/train.py -c configs/drugs-base.yaml

The following two configs from the configs/ directory can be used for replicating paper results:

  • drugs-base.yaml: ET-Flow trained on GEOM-DRUGS dataset
  • qm9-base.yaml: ET-Flow trained on GEOM-QM9 dataset

Evaluation

Before running evaluation with any checkpoint, create an evaluation CSV (saved at $DATA_DIR/processed/geom.csv) using the following script:

python scripts/prepare_eval_csv.py -p /path/to/geom/rdkit-raw-folder

Evaluation happens in two steps:

  1. Generating Conformations. To run the evaluation on either GEOM or QM9 given a config and a checkpoint, run the following command:
# --nsteps: number of inference steps for flow matching
python etflow/eval.py --config=<config-path> --checkpoint=<checkpoint-path> --dataset_type=qm9 --nsteps=50

To run the evaluation on GEOM-XL (a test set containing much larger molecules), run the following command:

python etflow/eval_xl.py --config=<config-path> --checkpoint=<checkpoint-path> --batch_size=16 --nsteps=50
  2. Evaluating Conformations with RMSD Metrics. The sample generation script above should create a generated_files.pkl at the following path: logs/samples/<config-path>/<data-time>/flow_nsteps_{value-passed-above}/generated_files.pkl. Given this path, compute the RMSD metrics using:
python etflow/eval_cov_mat.py --path=<path-to-generated-files.pkl> --num_workers=10
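To give some intuition for what the number of inference steps controls, here is a toy sketch of fixed-step Euler integration of a velocity field from t=0 to t=1, which is a common way to sample from a flow matching model. It is not ET-Flow's sampler: the README does not state which solver is used, and euler_sample and make_toy_velocity are hypothetical names. The toy field simply pulls every coordinate toward a fixed target, standing in for the learned network.

```python
def euler_sample(x0, velocity, nsteps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with nsteps Euler steps."""
    x = list(x0)
    dt = 1.0 / nsteps
    for k in range(nsteps):
        t = k * dt  # current time; always strictly below 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy field (assumption): move straight toward `target`, arriving at t=1."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]      # stand-in for a sample from the prior
    target = [1.0, -2.0, 0.5]    # stand-in for a data point
    print(euler_sample(start, make_toy_velocity(target), nsteps=50))
```

With a learned, nonlinear velocity field, more steps generally mean a more accurate integration at a higher inference cost, which is the trade-off the --nsteps flag exposes.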

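For readers unfamiliar with RMSD-based conformer metrics, the sketch below shows how coverage (COV) and average minimum RMSD (MAT) are commonly defined in the conformer generation literature, for both recall (per reference conformer) and precision (per generated conformer). That eval_cov_mat.py computes exactly these quantities is an assumption based on its name; coverage_and_mat, the RMSD values, and the threshold are illustrative only.

```python
def coverage_and_mat(rmsd, threshold):
    """Compute (COV-R, MAT-R, COV-P, MAT-P) from a pairwise RMSD matrix.

    rmsd[i][j] is the RMSD between reference conformer i and generated
    conformer j, for a single molecule.
    """
    n_ref = len(rmsd)
    n_gen = len(rmsd[0])
    # Recall: closest generated conformer for each reference conformer.
    ref_best = [min(row) for row in rmsd]
    # Precision: closest reference conformer for each generated conformer.
    gen_best = [min(rmsd[i][j] for i in range(n_ref)) for j in range(n_gen)]
    cov_r = sum(d < threshold for d in ref_best) / n_ref
    mat_r = sum(ref_best) / n_ref
    cov_p = sum(d < threshold for d in gen_best) / n_gen
    mat_p = sum(gen_best) / n_gen
    return cov_r, mat_r, cov_p, mat_p


if __name__ == "__main__":
    # Made-up matrix: 3 reference conformers x 2 generated conformers.
    example = [[0.2, 1.0], [0.9, 0.8], [0.4, 0.6]]
    print(coverage_and_mat(example, threshold=0.75))
```

Higher coverage and lower MAT are better; recall measures how well the generated set covers the reference conformers, while precision measures how close each generated conformer is to some reference.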
Loading a Pre-Trained Checkpoint

Coming Soon!

Acknowledgements

Our codebase is built using the following open-source contributions,

Contact

For further questions, feel free to raise an issue.

Citation

@misc{hassan2024etflow,
      title={ET-Flow: Equivariant Flow-Matching for Molecular Conformer Generation},
      author={Majdi Hassan and Nikhil Shenoy and Jungyoon Lee and Hannes Stark and Stephan Thaler and Dominique Beaini},
      year={2024},
      eprint={2410.22388},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM},
      url={https://arxiv.org/abs/2410.22388},
}
