State is a machine learning model that predicts cellular perturbation response across diverse contexts.
Train State transition models or pretrain State embedding models. See the State paper.
See the Google Colab to train STATE for the Virtual Cell Challenge.
Getting started
- Train an ST model for genetic perturbation prediction using the Replogle-Nadig dataset: Colab
- Perform inference using an ST model trained on Tahoe-100M: Colab
- Embed and annotate a new dataset using SE: Colab
- Train STATE for the Virtual Cell Challenge: Colab
Installation
Installation from PyPI
This package is distributed on PyPI; install it with uv:
uv tool install arc-state
Installation from Source
git clone git@github.com:ArcInstitute/state.git
cd state
uv run state
When developing State itself, install an editable version with the -e flag.
git clone git@github.com:ArcInstitute/state.git
cd state
uv tool install -e .
CLI Usage
If installed via uv tool install, run state .... From source, run uv run state ....
Use state --help (or state tx --help, state emb --help) to see available subcommands.
State Transition Model (ST)
Use state tx to train and run perturbation prediction models.
preprocess_train
Prepares training h5ad files by normalizing counts, applying log1p, and selecting highly variable genes (HVGs).
The HVG matrix is stored in .obsm["X_hvg"], and the output .h5ad is written to --output.
state tx preprocess_train \
--adata /path/to/raw_data.h5ad \
--output /path/to/preprocessed_training_data.h5ad \
--num_hvgs 2000
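Conceptually, the preprocessing normalizes each cell's counts, applies log1p, and keeps the most variable genes. A simplified NumPy sketch of the idea (the CLI's actual HVG selection is more sophisticated, e.g. mean-variance modeling; shapes and the variance-ranking criterion here are illustrative assumptions):

```python
import numpy as np

# Toy counts matrix: 50 cells x 500 genes (illustrative only).
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 500)).astype(np.float64)

# Normalize each cell to the median total count, then log1p.
totals = counts.sum(axis=1, keepdims=True)
norm = counts / totals * np.median(totals)
logged = np.log1p(norm)

# Rank genes by variance and keep the top "HVGs" (simplified stand-in
# for real HVG selection).
num_hvgs = 100
hvg_idx = np.argsort(logged.var(axis=0))[::-1][:num_hvgs]
x_hvg = logged[:, hvg_idx]
print(x_hvg.shape)  # (50, 100)
```

The resulting matrix plays the role of the .obsm["X_hvg"] layer that preprocess_train stores in the output h5ad.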
train
Trains an ST model using Hydra overrides. Point data.kwargs.toml_config_path at a TOML file that defines datasets and splits.
output_dir and name define the run directory (output_dir/name).
state tx train \
data.kwargs.toml_config_path=examples/fewshot.toml \
data.kwargs.embed_key=X_hvg \
data.kwargs.pert_col=target_gene \
data.kwargs.cell_type_key=cell_type \
training.max_steps=40000 \
training.batch_size=8 \
model=state \
output_dir="$HOME/state" \
name="test"
predict
Evaluates a trained run with cell-eval metrics (or just runs prediction with --predict-only).
--output-dir should point to the run directory that contains config.yaml and checkpoints/.
state tx predict --output-dir $HOME/state/test --checkpoint final.ckpt
Use --toml to evaluate on a different dataset/split config than the one saved in the run.
infer
Runs inference on new data (not necessarily in the training TOML). Provide the run directory via --model-dir
and an input AnnData via --adata. Use --embed-key X_hvg if you trained on HVG features.
state tx infer \
--model-dir /path/to/run \
--checkpoint /path/to/run/checkpoints/final.ckpt \
--adata /path/to/preprocessed_data.h5ad \
--pert-col gene \
--embed-key X_hvg \
--output /path/to/output.h5ad
If --output ends with .npy, only the predictions matrix is written (no .h5ad).
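The .npy output is a plain NumPy array and can be loaded directly; a self-contained sketch (the shape and dtype are illustrative assumptions, not guaranteed by the CLI):

```python
import numpy as np

# Simulate a predictions matrix like one `state tx infer` might write
# with an .npy --output (shape/dtype here are illustrative assumptions).
preds = np.random.default_rng(0).normal(size=(100, 2000)).astype(np.float32)
np.save("output.npy", preds)

# Load it back for downstream analysis.
loaded = np.load("output.npy")
print(loaded.shape)  # (100, 2000)
```

Row order follows the cells in the input AnnData.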
ST TOML configuration
The TOML file referenced by data.kwargs.toml_config_path defines dataset paths and splits.
Required sections:
- [datasets]: map dataset names to directories of h5ad files
- [training]: select which datasets participate in training
Optional sections:
- [zeroshot]: hold out entire cell types for val/test
- [fewshot]: hold out perturbations within specific cell types
Minimal example:
[datasets]
replogle = "/data/replogle/"
[training]
replogle = "train"
[zeroshot]
"replogle.jurkat" = "test"
[fewshot]
[fewshot."replogle.k562"]
val = ["AARS"]
test = ["NUP107"]
Notes:
- Use the "dataset.cell_type" format in [zeroshot] and [fewshot].
- Anything not listed in [zeroshot]/[fewshot] defaults to training.
- Dataset paths should point to directories containing h5ad files.
See examples/zeroshot.toml and examples/fewshot.toml for more.
State Embedding Model (SE)
Use state emb to pretrain cell embeddings or embed new datasets.
preprocess (build a profile for your data)
This scans your train/val datasets and creates a new embedding profile. It:
- expects CSVs with species,path,names columns (each names value becomes a dataset id in the config)
- auto-detects the gene-name field in var (checks _index, gene_name, gene_symbols, feature_name, gene_id, symbol)
- uses --all-embeddings if provided (a torch dict of {gene_name: embedding}); otherwise builds one-hot embeddings
- writes all_embeddings_{profile}.pt, ds_emb_mapping_{profile}.torch, valid_genes_masks_{profile}.torch, plus train_{profile}.csv and val_{profile}.csv in --output-dir
- updates the embeddings/dataset entries in the config file; it does not modify your h5ad files
By default it updates src/state/configs/state-defaults.yaml unless you pass --config-file.
Training on your own data (recommended workflow):
cp src/state/configs/state-defaults.yaml my_state.yaml
uv run state emb preprocess \
--profile-name my_data \
--train-csv /path/train.csv \
--val-csv /path/val.csv \
--output-dir /path/state_emb_profile \
--config-file my_state.yaml \
--all-embeddings /path/gene_embeddings.pt \
--num-threads 8
Example CSV format:
species,path,names
human,/path/ds1.h5ad,ds1
human,/path/ds2.h5ad,ds2
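With many datasets it can be convenient to generate the CSV programmatically; a stdlib sketch (paths are placeholders):

```python
import csv

# Placeholder dataset paths; each "names" value becomes a dataset id.
rows = [
    {"species": "human", "path": "/path/ds1.h5ad", "names": "ds1"},
    {"species": "human", "path": "/path/ds2.h5ad", "names": "ds2"},
]
with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["species", "path", "names"])
    writer.writeheader()
    writer.writerows(rows)

# Verify the header matches what preprocess expects.
with open("train.csv") as f:
    header = next(csv.reader(f))
print(header)  # ['species', 'path', 'names']
```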
Tip: --config-file should point to a full config (copy src/state/configs/state-defaults.yaml), since preprocess only updates
the embeddings/dataset sections.
fit
Trains the embedding model. You must set embeddings.current and dataset.current to the profile created above.
uv run state emb fit --conf my_state.yaml \
embeddings.current=my_data \
dataset.current=my_data
You can override any training setting via Hydra (e.g., experiment.num_gpus_per_node=4).
transform
Computes embeddings for a new dataset using a checkpoint. Provide either --checkpoint or --model-folder.
--output is required unless you write directly to LanceDB.
state emb transform \
--model-folder /path/to/SE-600M \
--checkpoint /path/to/SE-600M/se600m_epoch15.ckpt \
--input /path/to/input.h5ad \
--output /path/to/output.h5ad
If --output ends with .npy, only the embeddings matrix is written (no .h5ad is saved).
Gene names are auto-detected from var (or var.index) based on overlap with the model's embeddings.
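One way to picture the auto-detection is scoring each candidate var column by its overlap with the model's gene vocabulary and picking the best match; this is an illustrative sketch only, not the actual implementation:

```python
# Toy model vocabulary and var columns (illustrative names and values).
model_genes = {"TP53", "GAPDH", "ACTB", "MYC"}

var_columns = {
    "_index": ["ENSG000001", "ENSG000002", "ENSG000003"],   # Ensembl ids: no overlap
    "gene_name": ["TP53", "GAPDH", "ACTB"],                  # symbols: full overlap
}

def best_gene_column(columns, vocab):
    """Pick the column whose values overlap most with the vocabulary."""
    scores = {name: sum(g in vocab for g in vals) / len(vals)
              for name, vals in columns.items()}
    return max(scores, key=scores.get)

print(best_gene_column(var_columns, model_genes))  # gene_name
```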
Vector Database (optional)
Install optional dependencies:
uv tool install ".[vectordb]"
If working off a previous installation, you may need:
uv sync --extra vectordb
Build a vector database:
state emb transform \
--model-folder /path/to/SE-600M \
--input /path/to/dataset.h5ad \
--lancedb tmp/state_embeddings.lancedb
Query the database:
state emb query \
--lancedb tmp/state_embeddings.lancedb \
--input /path/to/query.h5ad \
--output tmp/similar_cells.csv \
--k 3
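The exact columns of the query output depend on the State/LanceDB version; assuming a CSV of nearest neighbors per query cell, it can be inspected with the stdlib csv module (the column names below are hypothetical, for illustration only):

```python
import csv

# Hypothetical query output with made-up column names, written here so
# the snippet is self-contained.
with open("similar_cells.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query_cell", "neighbor_cell", "distance"])
    writer.writerow(["cell_0", "cell_42", "0.12"])
    writer.writerow(["cell_0", "cell_7", "0.19"])

# Read it back as dicts keyed by the header row.
with open("similar_cells.csv") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["neighbor_cell"])  # cell_42
```

Check the real header row of your similar_cells.csv before relying on any particular column name.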
Singularity
Containerization for STATE is available via the singularity.def file.
Build the container:
singularity build state.sif singularity.def
Run the container:
singularity run state.sif --help
Example run of state emb transform:
singularity run --nv -B /large_storage:/large_storage \
state.sif emb transform \
--model-folder /large_storage/ctc/userspace/aadduri/SE-600M \
--checkpoint /large_storage/ctc/userspace/aadduri/SE-600M/se600m_epoch15.ckpt \
--input /large_storage/ctc/datasets/replogle/rpe1_raw_singlecell_01.h5ad \
--output test_output.h5ad
Licenses
State code is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).
The model weights and output are licensed under the Arc Research Institute State Model Non-Commercial License and subject to the Arc Research Institute State Model Acceptable Use Policy.
Any publication that uses this source code or model parameters should cite the State paper.