
An image feature extractor with self-supervised learning


cytoself


A rotating 3D UMAP

Self-supervised deep learning encodes high-resolution features of protein subcellular localization

cytoself is a self-supervised model that we developed for learning features of protein subcellular localization from microscopy images. The model is described in detail in our paper [1]. The image representations derived from cytoself capture highly specific features that can yield functional insights about proteins on the sole basis of their localization.

Applying cytoself to images of endogenously labeled proteins from the recently released OpenCell database creates a highly resolved protein localization atlas [2].

[1] Kobayashi, Hirofumi, et al. "Self-Supervised Deep-Learning Encodes High-Resolution Features of Protein Subcellular Localization." Nature Methods (2022). https://www.nature.com/articles/s41592-022-01541-z
[2] Cho, Nathan H., et al. "OpenCell: Endogenous tagging for the cartography of human cellular organization." Science 375.6585 (2022): eabi6983. https://www.science.org/doi/10.1126/science.abi6983

How cytoself works

cytoself uses images, together with the identity information (ID) associated with each image as a label, to learn the localization patterns of proteins. When applied to OpenCell, we used cell images in which a single protein is endogenously tagged per image. For each image we know which protein is tagged, and that protein is the ID. The model implicitly learns to ignore differences between images that share the same ID and to tell apart images that have different IDs. In practice, cytoself can resolve very fine textural differences between image classes while ignoring very complex sources of image variability such as cell shape and cell state.
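As a minimal sketch of how per-image protein IDs can be turned into labels for such an identification task, the snippet below maps ID strings to integer classes and one-hot vectors with NumPy. The function name `encode_ids` and the example protein names are illustrative assumptions; the label handling inside cytoself itself may differ.

```python
import numpy as np


def encode_ids(protein_ids):
    """Map protein ID strings to class names, integer indices and one-hot vectors.

    Hypothetical helper for illustration; not part of the cytoself API.
    """
    # np.unique returns the sorted unique IDs and, for each input entry,
    # the index of its ID within that sorted list.
    classes, indices = np.unique(np.asarray(protein_ids), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i.
    onehot = np.eye(len(classes), dtype=np.float32)[indices]
    return classes, indices, onehot


# One ID per image (made-up example values).
ids = ["ACTB", "LMNB1", "ACTB", "TOMM20"]
classes, indices, onehot = encode_ids(ids)
print(classes, indices, onehot.shape)
```

Images that share an ID receive the same label, so a classifier trained on these labels is pushed to ignore within-ID variability and to separate different IDs.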

A schematic of cytoself

What's in this repository

This repository offers three main components: DataManager, cytoself.models, and Analytics.

DataManager is a simple module to handle train, validate and test data. You may want to modify it to adapt to your own data structure. This module is in cytoself.data_loader.data_manager.

cytoself.models contains modules for three variants of the cytoself model: a model without split-quantization, a model without the pretext task, and the 'full' model (refer to our paper [1] for details about these variants). Each variant has its own submodule, which provides methods for constructing, compiling, and training the model (all models are built with TensorFlow).

Analytics is a simple module to perform analytic processes such as dimension reduction and plotting. You may want to modify it too to perform your own analysis. This module is in cytoself.analysis.analytics.
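To give a rough idea of the kind of dimension reduction meant here, the sketch below projects a matrix of embedding vectors onto its first two principal components using plain NumPy. PCA is used only as a simple stand-in technique, `pca_project` is a hypothetical helper, and the random input is placeholder data; none of this is the Analytics API.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project row vectors onto their top principal components (plain PCA)."""
    x = np.asarray(embeddings, dtype=np.float64)
    # Center each feature so the SVD captures variance, not the mean.
    x = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T


# Placeholder embeddings: 200 vectors with 32 dimensions each.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 32))
coords = pca_project(fake_embeddings, n_components=2)
print(coords.shape)
```

The resulting two-column array can be passed to any scatter-plot routine, for example with points colored by protein label.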

Pre-trained model weights are included in the example script.

Note: cytoself will migrate to a PyTorch implementation in the near future.

Installation

Recommended: create a new environment and install cytoself into it from PyPI.

conda create -y -n cytoself python=3.7
conda activate cytoself
pip install cytoself

(Optional) Install TensorFlow GPU

If your computer has GPUs that support TensorFlow 1.15, you can install tensorflow-gpu to use them. Install the following packages before installing cytoself, or uninstall the existing CPU versions and reinstall the GPU versions with conda.

conda install -y h5py=2.10.0 tensorflow-gpu=1.15
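After installing, one way to check whether TensorFlow can see a GPU is the short sketch below. It relies on `tf.test.is_gpu_available()`, which exists in TensorFlow 1.15 (it is deprecated in TensorFlow 2.x); the wrapper function `gpu_available` is a hypothetical helper, and it simply returns False when TensorFlow is not installed.

```python
def gpu_available():
    """Return True if TensorFlow is installed and reports a usable GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return False
    # Available in TensorFlow 1.15; deprecated (but still present) in 2.x.
    return bool(tf.test.is_gpu_available())


print("GPU available:", gpu_available())
```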

For developers

You can also install cytoself from this GitHub repository.

git clone https://github.com/royerlab/cytoself.git
cd cytoself
pip install .

Troubleshooting

If the installation fails with errors, run the following command inside the cytoself folder to install the dependencies manually.

pip install -r requirements.txt

For reference, a snapshot of a complete working environment, including all dependencies, can be found in environment.yml.

Example script (How to use cytoself)

A minimal example script is in example/simple_training.py. You can also learn how to use cytoself through the Colab notebook (Open In Colab).

To test whether the package runs on your computer, use the command

python examples/simple_example.py

Computational resources

It is highly recommended to use a GPU to run cytoself. For example, a full model with image shape of (100, 100, 2) and batch size 64 can take ~9GB of GPU memory.

Tested Environment

Google Colab (CPU/GPU/TPU)

macOS 10.14.6, RAM 32GB (CPU)

Windows 10 Pro 64bit, GTX 1080Ti, CUDA 11.6 (CPU/GPU)

Ubuntu 18.04.6 LTS, RTX 2080Ti, CUDA 11.2 (CPU/GPU)

Data Availability

Pretrained model

These are the pre-trained models used in the paper. Please follow the example script or the Colab notebook (Open In Colab) to learn how to use a pre-trained model.

model_protein_nucleardistance.h5 : The model trained on target protein and nuclear distance.
model_protein.h5 : The model trained on target protein alone.
model_protein_nucleus.h5 : The model trained on target protein and nucleus.

The full data of image and protein label used in this work can be found here. The image data have the shape of [batch, 100, 100, 4], in which the last channel dimension corresponds to [target protein, nucleus, nuclear distance, nuclear segmentation].
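The channel layout above can be used to pick the inputs for a given model variant, for example target protein plus nuclear distance for a two-channel input of shape (100, 100, 2). The following is a minimal sketch with NumPy on placeholder data; `select_channels` and the `CHANNELS` tuple are hypothetical names chosen for this illustration, not part of cytoself.

```python
import numpy as np

# Channel order as described for the released image data.
CHANNELS = ("target protein", "nucleus", "nuclear distance", "nuclear segmentation")


def select_channels(images, names):
    """Return only the requested channels from a [batch, H, W, 4] array."""
    idx = [CHANNELS.index(name) for name in names]
    return images[..., idx]


# Placeholder batch: channel c is filled with the constant value c.
batch = np.zeros((8, 100, 100, 4), dtype=np.float32)
for c in range(4):
    batch[..., c] = c

x = select_channels(batch, ["target protein", "nuclear distance"])
print(x.shape)
```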

Embeddings

The embedding vectors of global representations and their labels are available from the following links. Due to their large size, only embeddings extracted from test data are provided.

Global_representation.npy In the shape of 114,806 images x 9,216 latent dimensions. (3.9 GB)
label.csv 114,806 rows x 7 columns. (6.2 MB)
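Because the embedding file is several gigabytes, it can help to check the expected size and to memory-map the array instead of reading it into RAM at once. The sketch below assumes the values are stored as 32-bit floats (an assumption, although it is consistent with the stated size of about 3.9 GB) and demonstrates `np.load(..., mmap_mode="r")` on a small stand-in file rather than on the real download.

```python
import os
import tempfile

import numpy as np

N_IMAGES, N_DIMS = 114_806, 9_216


def expected_bytes(n_rows, n_cols, dtype=np.float32):
    """Raw array size in bytes, ignoring the small .npy header."""
    return n_rows * n_cols * np.dtype(dtype).itemsize


size = expected_bytes(N_IMAGES, N_DIMS)  # float32 is an assumption
print(f"expected size: {size / 2**30:.2f} GiB")

# Demonstrate memory-mapped loading on a tiny stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Global_representation_demo.npy")
    np.save(path, np.arange(12, dtype=np.float32).reshape(3, 4))
    emb = np.load(path, mmap_mode="r")  # data stays on disk until sliced
    first_rows = np.array(emb[:2])      # copies only the requested rows
    del emb                             # release the memory map before cleanup

print(first_rows.shape)
```

With the real file, slicing the memory-mapped array in the same way reads only the requested rows from disk.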

Image and label data

Due to its large size, the whole dataset is split into 10 files. The files are intended to be concatenated to form one large numpy file or one large csv file.

Image_data00.npy
Image_data01.npy
Image_data02.npy
Image_data03.npy
Image_data04.npy
Image_data05.npy
Image_data06.npy
Image_data07.npy
Image_data08.npy
Image_data09.npy
Label_data00.csv
Label_data01.csv
Label_data02.csv
Label_data03.csv
Label_data04.csv
Label_data05.csv
Label_data06.csv
Label_data07.csv
Label_data08.csv
Label_data09.csv
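One possible way to concatenate the parts is sketched below with NumPy and the standard csv module. It runs on small stand-in files created in a temporary directory; the helper names, the column names, and the assumption that every label file starts with a header row are illustrative and should be checked against the downloaded files.

```python
import csv
import os
import tempfile

import numpy as np


def concat_npy(paths):
    """Load each .npy part and stack them along the first (batch) axis."""
    return np.concatenate([np.load(p) for p in paths], axis=0)


def concat_csv(paths, has_header=True):
    """Merge CSV parts; has_header=True assumes each part repeats the header."""
    header, rows = None, []
    for p in paths:
        with open(p, newline="") as f:
            reader = csv.reader(f)
            if has_header:
                file_header = next(reader)
                header = header or file_header  # keep the first header only
            rows.extend(reader)
    return header, rows


# Build three tiny stand-in parts that mimic the file naming scheme.
with tempfile.TemporaryDirectory() as tmp:
    npy_paths, csv_paths = [], []
    for i in range(3):
        npy = os.path.join(tmp, f"Image_data{i:02d}.npy")
        np.save(npy, np.full((2, 4, 4, 4), i, dtype=np.float32))
        npy_paths.append(npy)
        lab = os.path.join(tmp, f"Label_data{i:02d}.csv")
        with open(lab, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["name", "index"])  # hypothetical column names
            writer.writerows([[f"protein{i}", 2 * i], [f"protein{i}", 2 * i + 1]])
        csv_paths.append(lab)

    images = concat_npy(sorted(npy_paths))
    header, labels = concat_csv(sorted(csv_paths))

print(images.shape, header, len(labels))
```

Sorting the paths keeps image parts and label parts in the same order, so row i of the labels still describes image i. Note that concatenating the real image files this way loads the whole dataset into memory.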

