A package for the paper "Learning Molecular Representation in a Cell"

Package for Learning Molecular Representation in a Cell

InfoAlign learns molecular representations from bottleneck information derived from molecular structures, cell morphology, and gene expression. For details, see our paper (arXiv:2406.12056).

Requirements

This project was developed and tested with the following versions:

  • Python: 3.11.7
  • PyTorch: 2.2.0+cu118
  • Torch Geometric: 2.6.1

All dependencies are listed in the requirements.txt file.

Setup Instructions

  1. Create a Conda Environment:

    conda create --name infoalign python=3.11.7
    
  2. Activate the Environment:

    conda activate infoalign
    
  3. Install Dependencies:

    pip install -r requirements.txt
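
After installation, it can help to confirm that the environment matches the tested versions listed under Requirements. The following is a minimal sketch, not part of the package: it assumes the pip distribution names are torch and torch-geometric, and the helper names installed_version and find_mismatches are hypothetical.

```python
import platform
from importlib import metadata

# Versions the Requirements section says the project was tested with.
# Assumption: these are the distribution names that pip installs.
TESTED_VERSIONS = {
    "torch": "2.2.0+cu118",
    "torch-geometric": "2.6.1",
}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(installed, tested=TESTED_VERSIONS):
    """List readable differences between installed and tested versions."""
    problems = []
    for name, wanted in tested.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed (tested with {wanted})")
        elif found != wanted:
            problems.append(f"{name}: found {found}, tested with {wanted}")
    return problems


print(f"Python {platform.python_version()} (tested with 3.11.7)")
installed = {name: installed_version(name) for name in TESTED_VERSIONS}
for line in find_mismatches(installed) or ["Checked packages match the tested versions."]:
    print(line)
```

A mismatch is not necessarily an error; it only flags where the environment differs from the tested one.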
    

Usage

Fine-tuning

A pretrained checkpoint is available for download from Hugging Face. Use the following commands for fine-tuning and inference. By default, the pretrained model is downloaded automatically to ckpt/pretrain.pt.

python main.py --model-path ckpt/pretrain.pt --dataset finetune-chembl2k
python main.py --model-path ckpt/pretrain.pt --dataset finetune-broad6k
python main.py --model-path ckpt/pretrain.pt --dataset finetune-biogenadme
python main.py --model-path ckpt/pretrain.pt --dataset finetune-moltoxcast
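
To run all four fine-tuning jobs in sequence, the commands above can be generated from a list of dataset names. This is a sketch under assumptions: build_finetune_commands is a hypothetical helper, not part of the repository, and it only uses the --model-path and --dataset flags shown above.

```python
import shlex
import sys

# Dataset names copied from the fine-tuning commands above.
FINETUNE_DATASETS = [
    "finetune-chembl2k",
    "finetune-broad6k",
    "finetune-biogenadme",
    "finetune-moltoxcast",
]


def build_finetune_commands(model_path="ckpt/pretrain.pt", python=sys.executable):
    """Return one argument list per dataset, ready for subprocess.run."""
    return [
        [python, "main.py", "--model-path", model_path, "--dataset", dataset]
        for dataset in FINETUNE_DATASETS
    ]


# Print the commands instead of running them, so the sketch has no side effects.
for command in build_finetune_commands(python="python"):
    print(shlex.join(command))
```

To execute the jobs, each argument list can be passed to subprocess.run(command, check=True) from the repository root, so that a failed run stops the sequence.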

Alternatively, you can manually download the model weights and place the pretrain.pt file under the ckpt folder along with its corresponding YAML configuration file.
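
When placing the files manually, a quick check of the folder layout can catch a missing file before a run starts. The sketch below is illustrative only: check_checkpoint_dir is a hypothetical helper, and because the configuration file name is not stated here, it assumes any file with a .yaml or .yml extension in the folder counts.

```python
from pathlib import Path


def check_checkpoint_dir(ckpt_dir="ckpt", weights_name="pretrain.pt"):
    """Return a list of problems found in a manually prepared checkpoint folder."""
    folder = Path(ckpt_dir)
    if not folder.is_dir():
        return [f"missing folder: {folder}"]
    problems = []
    weights = folder / weights_name
    if not weights.is_file():
        problems.append(f"missing weights file: {weights}")
    # Assumption: the configuration file uses a .yaml or .yml extension.
    configs = sorted(folder.glob("*.yaml")) + sorted(folder.glob("*.yml"))
    if not configs:
        problems.append(f"no YAML configuration file found in {folder}")
    return problems


# An empty result means both expected files are present.
for problem in check_checkpoint_dir():
    print(problem)
```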

Note: To use the cell morphology and gene expression features of the ChEMBL2K and Broad6K datasets for baseline evaluation, download them from our Hugging Face repository.

Pretraining

To pretrain the model from scratch, execute the following command:

python main.py --model-path "ckpt/pretrain.pt" --lr 1e-4 --wdecay 1e-8 --batch-size 3072

This will automatically download the pretraining dataset from Hugging Face. If you prefer to download the dataset manually, place all pretraining data files in the raw_data/pretrain/raw folder.

The pretrained model will be saved in the ckpt folder as pretrain.pt.


Data source

For readers interested in data collection, here are the sources:

  1. Cell Morphology Data

    • JUMP dataset: The data are from "JUMP Cell Painting dataset: morphological impact of 136,000 chemical and genetic perturbations" and can be downloaded here. The dataset includes chemical and genetic perturbations for cell morphology features.
    • Bray's dataset: The data are from "A dataset of images and morphological profiles of 30,000 small-molecule treatments using the Cell Painting assay" and can be downloaded from GigaDB. A processed version is available on Zenodo.
  2. Gene Expression Data

    • LINCS L1000 gene expression data from the paper "Drug-induced adverse events prediction with the LINCS L1000 data": Data.
  3. Relationships

    • Gene-gene, gene-compound relationships from Hetionet: Data.

Citation

If you find this repository useful, please cite our paper:

@article{liu2024learning,
  title={Learning Molecular Representation in a Cell},
  author={Liu, Gang and Seal, Srijit and Arevalo, John and Liang, Zhenwen and Carpenter, Anne E and Jiang, Meng and Singh, Shantanu},
  journal={arXiv preprint arXiv:2406.12056},
  year={2024}
}

Acknowledgement

Template adapted from: https://github.com/lwaekfjlk/python-project-template. Thanks to the authors for their open-source contribution.

Download files

Source Distribution

infoalign-0.1.0.tar.gz (5.4 kB)

Uploaded Source

Built Distribution

infoalign-0.1.0-py3-none-any.whl (4.2 kB)

Uploaded Python 3

File details

Details for the file infoalign-0.1.0.tar.gz.

File metadata

  • Download URL: infoalign-0.1.0.tar.gz
  • Upload date:
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.7

File hashes

Hashes for infoalign-0.1.0.tar.gz
Algorithm Hash digest
SHA256 40d98c36d865279c723a6b759c27b04f9cd67b94e0562fc655ed8b01b96384fb
MD5 1ea24327914ceafeaf369314d89593ba
BLAKE2b-256 a8ce1f59267ed92665fe72d76fbca88a673245c1f6652f369e324833ec52b034

File details

Details for the file infoalign-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: infoalign-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 4.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.7

File hashes

Hashes for infoalign-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2cc8d916d653d84a47a49636c11dfdea19dd46d2ab01ecd801bb7c501d490f6e
MD5 125a60ec8728a855dc6c478604c27f09
BLAKE2b-256 1211d07783528afc4d3328e1557879369caeaabc4a3a4f597a19586e8047b8bc
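
The SHA256 digests listed for the two release files can be checked against a local download with Python's standard hashlib module. This is a minimal sketch: the helper names sha256_of and matches_published_hash are hypothetical, and the digests are copied from the listings above.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file listings above.
EXPECTED_SHA256 = {
    "infoalign-0.1.0.tar.gz":
        "40d98c36d865279c723a6b759c27b04f9cd67b94e0562fc655ed8b01b96384fb",
    "infoalign-0.1.0-py3-none-any.whl":
        "2cc8d916d653d84a47a49636c11dfdea19dd46d2ab01ecd801bb7c501d490f6e",
}


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path):
    """Return True only if the file name is listed and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

For example, matches_published_hash("infoalign-0.1.0.tar.gz") returns True only when the downloaded archive is byte-for-byte identical to the published file.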
