Cryo-EM data processing tools for deep learning (e.g., cryo-IEF).

cryodata

Cryo-EM data processing tools for deep learning. This package provides a full pipeline for converting raw cryo-EM particle data from cryoSPARC into PyTorch-ready datasets, as used by cryo-IEF, CryoDECO and CryoWizard.

Features

Preprocessing pipeline — resize, normalize, and window-mask cryo-EM particles from cryoSPARC jobs
LMDB dataset creation — fast multi-process conversion of MRC stacks into LMDB databases for efficient training I/O
PyTorch dataset & sampler — CryoEMDataset and CryoMetaData classes with support for balanced resampling
Fourier-space representations — optional Hartley-transform (real-valued Fourier) outputs alongside real-space images
Format conversion — convert cryoSPARC .cs files to RELION .star format

Installation

pip install cryodata

For development:

git clone https://github.com/yanyang1998/cryoief-data
cd cryoief-data
pip install -e .

Quick Start

from cryodata.data_preprocess.mrc_preprocess import raw_data_preprocess
from cryodata.cryoemDataset import CryoEMDataset, CryoMetaData
import torch

raw_data_path = 'path/to/cryosparc/particles/job'
processed_data_path = 'path/to/processed/data'

# Step 1: Preprocess raw cryoSPARC particle data
new_cs_data = raw_data_preprocess(
    raw_data_path,
    processed_data_path,
    resize=224,          # resize particles to 224×224
    save_raw_data=False, # skip saving unprocessed images
    save_FT_data=False,  # skip saving Fourier-space images
    is_to_int8=True,     # convert to uint8 for storage efficiency
)

# Step 2: Load the dataset
meta_data = CryoMetaData(processed_data_path=processed_data_path)
cryodataset = CryoEMDataset(metadata=meta_data)

# Step 3: Create a DataLoader for training
dataloader = torch.utils.data.DataLoader(cryodataset, batch_size=32, shuffle=True)

API Reference

Preprocessing

`raw_data_preprocess`

from cryodata.data_preprocess.mrc_preprocess import raw_data_preprocess

new_cs_data = raw_data_preprocess(
    raw_dataset_dir,
    dataset_save_dir,
    resize=224,
    is_to_int8=True,
    save_raw_data=True,
    save_FT_data=True,
    use_lmdb=True,
    num_processes=8,
)

The main entry point for the preprocessing pipeline. Reads cryoSPARC .cs metadata and associated MRC particle stacks from raw_dataset_dir, applies the selected transforms, and writes the output to dataset_save_dir. Internally it calls raw_csdata_process_from_cryosparc_dir to locate and merge the correct .cs files, then builds an LMDB database (when use_lmdb=True) or individual pickle files (when use_lmdb=False). Returns the merged cryoSPARC Dataset object.

Parameter	Type	Default	Description
`raw_dataset_dir`	`str`	—	Path to a cryoSPARC job output directory (e.g., a particle extraction job)
`dataset_save_dir`	`str`	—	Directory where processed data and metadata will be saved
`resize`	`int`	`224`	Target image size in pixels (square); uses FFT-based downsampling when reducing, bicubic otherwise
`is_to_int8`	`bool`	`True`	Normalize each particle to [0, 255] and cast to `uint8` for compact storage
`save_raw_data`	`bool`	`True`	Save unprocessed raw particles alongside the processed ones (only applies when `use_lmdb=False`)
`save_FT_data`	`bool`	`True`	Compute and save Hartley-transform (real-valued Fourier) representations (only applies when `use_lmdb=False`)
`use_lmdb`	`bool`	`True`	Write output to an LMDB database for fast I/O during training (recommended)
`num_processes`	`int`	`8`	Number of worker processes for parallel MRC file processing

`raw_csdata_process_from_cryosparc_dir`

from cryodata.data_preprocess.mrc_preprocess import raw_csdata_process_from_cryosparc_dir

cs_data, mrc_dir = raw_csdata_process_from_cryosparc_dir(raw_data_path)

Scans a cryoSPARC job directory and locates the relevant .cs particle file and the corresponding MRC stack directory. Handles various cryoSPARC job types (extraction, import, restack, downsampling). When both a particles .cs file and a passthrough file are found, they are merged via an inner join. Returns the Dataset object and the path (or list of paths) to the MRC stacks.
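The inner join described above can be illustrated with plain NumPy structured arrays keyed on `uid`. This is only a sketch: the package performs the merge on cryoSPARC Dataset objects, and the helper name `inner_join_on_uid` and the example field values below are hypothetical.

```python
import numpy as np

def inner_join_on_uid(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Keep only rows whose uid appears in both arrays and combine their fields."""
    # Indices of the shared uids in each array (result is sorted by uid).
    _, ia, ib = np.intersect1d(a["uid"], b["uid"], return_indices=True)
    # Fields that exist only in the passthrough array.
    extra = [name for name in b.dtype.names if name not in a.dtype.names]
    dtype = a.dtype.descr + [(name, b.dtype[name]) for name in extra]
    out = np.empty(len(ia), dtype=dtype)
    for name in a.dtype.names:
        out[name] = a[name][ia]
    for name in extra:
        out[name] = b[name][ib]
    return out

# Illustrative data only: three particles, three passthrough rows, two shared uids.
particles = np.array(
    [(1, 0.5), (2, 0.7), (3, 0.9)],
    dtype=[("uid", "<u8"), ("blob/psize_A", "<f4")],
)
passthrough = np.array(
    [(3, 10), (1, 11), (4, 12)],
    dtype=[("uid", "<u8"), ("ctf/exp_group_id", "<u4")],
)
merged = inner_join_on_uid(particles, passthrough)
```

Rows with `uid` 2 and 4 are dropped because each appears in only one of the two inputs.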

`mrcs_resize`

from cryodata.data_preprocess.mrc_preprocess import mrcs_resize

resized = mrcs_resize(mrcs, width, height=None, is_freqs=True)

Resizes a 2D image or a batch of images. Accepts a NumPy array or a PIL Image.

Parameter	Type	Default	Description
`mrcs`	`np.ndarray` or `PIL.Image`	—	Single image `(H, W)` or image stack `(N, H, W)`
`width`	`int`	—	Target width (and height, if `height` is not given) in pixels
`height`	`int`	`None`	Target height; defaults to `width` for square output
`is_freqs`	`bool`	`True`	When `True` and target is smaller than source, downsample in the Fourier domain (FFT crop); otherwise use bicubic spatial interpolation
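As a rough illustration of what Fourier-domain downsampling means, the sketch below crops the centered spectrum of a square image and transforms back. It is not the code used by `mrcs_resize`; the function name `fourier_crop_downsample` and the intensity rescaling are assumptions made for this example.

```python
import numpy as np

def fourier_crop_downsample(img: np.ndarray, size: int) -> np.ndarray:
    """Downsample a square 2D image to (size, size) by cropping its centered FFT."""
    n = img.shape[-1]
    if img.ndim != 2 or img.shape[0] != n or size > n:
        raise ValueError("expects a square 2D image and size <= original size")
    # Centered spectrum: the zero frequency sits at index n // 2 on each axis.
    spec = np.fft.fftshift(np.fft.fft2(img))
    start = n // 2 - size // 2
    cropped = spec[start:start + size, start:start + size]
    # Back to real space; rescale so the mean intensity is preserved.
    out = np.fft.ifft2(np.fft.ifftshift(cropped)).real
    return out * (size / n) ** 2

# A constant image stays constant after downsampling.
small = fourier_crop_downsample(np.full((8, 8), 3.0), 4)
```

Cropping in the Fourier domain discards high frequencies instead of interpolating pixels, which avoids aliasing when reducing the box size.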

`mrcs_to_int8`

from cryodata.data_preprocess.mrc_preprocess import mrcs_to_int8

uint8_stack = mrcs_to_int8(mrcs)

Normalizes each image in a batch to [0, 255] and converts to uint8. Accepts both NumPy arrays and PyTorch tensors of shape (N, H, W). Applies to_int8 to every image in the batch independently.

`to_int8`

from cryodata.data_preprocess.mrc_preprocess import to_int8

img_uint8 = to_int8(mrcdata)

Normalizes a single 2D image to [0, 255] and converts to uint8. For NumPy input, returns a PIL Image (grayscale). For PyTorch tensor input, returns a uint8 tensor.
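The per-image normalization can be sketched with NumPy alone. This is a minimal stand-in under stated assumptions: it uses min-max scaling with rounding, returns an array instead of a PIL Image, and the names `to_uint8_sketch` and `stack_to_uint8_sketch` are hypothetical. The exact scaling used by `to_int8` may differ.

```python
import numpy as np

def to_uint8_sketch(img: np.ndarray) -> np.ndarray:
    """Min-max scale one 2D image to [0, 255] and cast to uint8."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # A flat image has no contrast to scale; map it to zeros.
        return np.zeros(img.shape, dtype=np.uint8)
    scaled = (img - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

def stack_to_uint8_sketch(stack: np.ndarray) -> np.ndarray:
    """Apply the per-image scaling independently to every image in (N, H, W)."""
    return np.stack([to_uint8_sketch(im) for im in stack])

rng = np.random.default_rng(0)
stack = rng.standard_normal((3, 16, 16)).astype(np.float32)
stack_u8 = stack_to_uint8_sketch(stack)
```

Because each image is scaled on its own, every non-flat image in the output spans the full 0 to 255 range regardless of its original intensity scale.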

`window_mask`

from cryodata.data_preprocess.mrc_preprocess import window_mask

mask = window_mask(resolution, in_rad, out_rad=0.99)

Generates a 2D radial cosine-edge windowing mask of shape (resolution, resolution). The mask is 1.0 inside in_rad and tapers smoothly to 0.0 at out_rad. Useful for suppressing edge artifacts before computing FFTs.

Parameter	Type	Default	Description
`resolution`	`int`	—	Image size in pixels; must be even
`in_rad`	`float`	—	Inner radius as a fraction of the image half-width (e.g. `0.85` means 85% of the half-width)
`out_rad`	`float`	`0.99`	Outer radius where the mask reaches 0
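A radial mask with a cosine edge can be sketched as follows. This is an illustration of the idea, not the package's implementation: the name `cosine_window_sketch`, the coordinate grid on [-1, 1), and the exact raised-cosine taper are assumptions.

```python
import numpy as np

def cosine_window_sketch(resolution: int, in_rad: float, out_rad: float = 0.99) -> np.ndarray:
    """Return a (resolution, resolution) mask: 1 inside in_rad, 0 beyond out_rad."""
    if resolution % 2 != 0:
        raise ValueError("resolution must be even")
    if not 0.0 <= in_rad < out_rad:
        raise ValueError("requires 0 <= in_rad < out_rad")
    # Coordinates in units of the image half-width; the center pixel is at 0.
    coords = np.linspace(-1.0, 1.0, resolution, endpoint=False, dtype=np.float32)
    x, y = np.meshgrid(coords, coords)
    r = np.sqrt(x**2 + y**2)
    # t runs from 0 at in_rad to 1 at out_rad and is clamped outside that band.
    t = np.clip((r - in_rad) / (out_rad - in_rad), 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * t))

mask = cosine_window_sketch(64, in_rad=0.85)
```

Multiplying an image by such a mask fades the corners to zero, so the periodic boundary assumed by the FFT does not introduce sharp discontinuities.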

`sample_and_evaluate`

from cryodata.data_preprocess.mrc_preprocess import sample_and_evaluate

mean_len = sample_and_evaluate(
    path_list, save_path,
    num_stacks=50, num_particles=20000,
    window=False, window_r=0.85, needs_FT=False,
)

Estimates dataset statistics by randomly sampling MRC stacks. Saves means_stds_raw.data, means_stds_FT.data, and img_dim.data to save_path. Returns the average number of particles per stack (mean_imgs_len), which is used to estimate the LMDB map size.

Parameter	Type	Default	Description
`path_list`	`list[str]`	—	List of MRC file paths to sample from
`save_path`	`str`	—	Directory to save the computed statistics
`num_stacks`	`int`	`50`	Number of MRC stacks to randomly sample
`num_particles`	`int`	`20000`	Total number of particles to sample across all stacks
`window`	`bool`	`False`	Apply a radial window mask before computing statistics
`window_r`	`float`	`0.85`	Inner radius for the window mask
`needs_FT`	`bool`	`False`	Also compute and save Hartley-transform statistics
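The sampling strategy can be sketched on in-memory arrays. This is a simplified stand-in: the real function reads MRC files and writes its results to disk, while the hypothetical `sample_stats_sketch` below only returns the values. Splitting `num_particles` evenly across the chosen stacks is an assumption.

```python
import numpy as np

def sample_stats_sketch(stacks, num_stacks=50, num_particles=20000, seed=0):
    """Estimate pixel mean/std from a random subset of stacks.

    stacks: list of arrays shaped (N_i, H, W). Returns (mean, std, mean_stack_len).
    """
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(stacks), size=min(num_stacks, len(stacks)), replace=False)
    # Assumed policy: spread the particle budget evenly over the chosen stacks.
    per_stack = max(1, num_particles // len(chosen))
    samples, lengths = [], []
    for i in chosen:
        stack = stacks[i]
        lengths.append(len(stack))
        take = min(per_stack, len(stack))
        idx = rng.choice(len(stack), size=take, replace=False)
        samples.append(stack[idx])
    pixels = np.concatenate(samples)
    return float(pixels.mean()), float(pixels.std()), float(np.mean(lengths))

demo_stacks = [np.ones((10, 4, 4)), np.ones((20, 4, 4))]
mean, std, mean_len = sample_stats_sketch(demo_stacks, num_stacks=2, num_particles=12)
```

The third return value plays the role of the average stack length that the text above says is used to size the LMDB map.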

Dataset

`CryoMetaData`

from cryodata.cryoemDataset import CryoMetaData

meta_data = CryoMetaData(processed_data_path='path/to/processed/data')

Loads and stores all metadata for a preprocessed cryo-EM dataset. Automatically detects whether the dataset uses LMDB storage or individual pickle files. Only processed_data_path is required; all other parameters are optional.

Parameter	Type	Default	Description
`processed_data_path`	`str`	—	Path to the directory produced by `raw_data_preprocess`
`emfile_path`	`str`	`None`	Optional path to a `.star` or `.cs` particle file for selection/filtering
`selected_emfile_path`	`str`	`None`	Optional path to a second particle file specifying selected particles
`ctf_correction_averages`	`bool`	`False`	Load CTF-corrected class-average paths if available
`ctf_correction_inference`	`bool`	`False`	Load CTF-corrected particle paths for inference if available

`CryoEMDataset`

import torch
from cryodata.cryoemDataset import CryoEMDataset

dataset = CryoEMDataset(metadata=meta_data)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

A torch.utils.data.Dataset that loads preprocessed cryo-EM particles from an LMDB database or pickle files. Images larger than 384 pixels are treated as micrographs; smaller images are treated as particles. Supports optional on-the-fly transforms passed at construction time.

Resampling

`MyResampleSampler`

from cryodata import MyResampleSampler

sampler = MyResampleSampler(
    data=dataset,
    id_index_dict_pos=pos_dict,   # {class_id: [indices]} for high-quality particles
    id_index_dict_mid=mid_dict,   # {class_id: [indices]} for medium-quality particles
    id_index_dict_neg=neg_dict,   # {class_id: [indices]} for low-quality particles
    resample_num_pos=500,         # max particles per class from the positive set
    resample_num_mid=200,         # max particles per class from the medium set
    resample_num_neg=100,         # max particles per class from the negative set
)

A torch.utils.data.Sampler designed for fine-tuning scenarios where particles have been labelled as positive, mid, or negative quality. At each epoch it resamples each class up to the specified cap, then concatenates the three groups into a single index list. Shuffle behaviour is controlled by shuffle_type ('all', 'class', or 'batch').
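The capped per-class resampling can be sketched in plain Python. This is an illustration of the idea only: `resample_capped` and `build_epoch_indices` are hypothetical names, and the final shuffle corresponds only loosely to `shuffle_type='all'`.

```python
import random

def resample_capped(id_index_dict, cap, rng):
    """Draw at most `cap` indices from each class in {class_id: [indices]}."""
    picked = []
    for indices in id_index_dict.values():
        if len(indices) > cap:
            picked.extend(rng.sample(indices, cap))
        else:
            picked.extend(indices)
    return picked

def build_epoch_indices(pos, mid, neg, caps=(500, 200, 100), seed=0):
    """Resample the three quality groups and concatenate them into one index list."""
    rng = random.Random(seed)
    indices = []
    for group, cap in zip((pos, mid, neg), caps):
        indices.extend(resample_capped(group, cap, rng))
    rng.shuffle(indices)  # assumed: shuffle across all groups
    return indices

pos = {0: list(range(10))}
mid = {0: list(range(10, 14))}
neg = {1: list(range(14, 20))}
epoch = build_epoch_indices(pos, mid, neg, caps=(3, 2, 2))
```

Calling `build_epoch_indices` with a different seed each epoch yields a fresh subset of the over-represented classes while small classes are kept whole.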

`MyResampleSampler_pretrain`

from cryodata import MyResampleSampler_pretrain

sampler = MyResampleSampler_pretrain(
    id_index_dict=id_index_dict,     # {class_id: [indices]}
    batch_size_all=256,              # total batch size across all processes
    max_number_per_sample=1000,      # max particles sampled per class per epoch
    shuffle_type='class',            # 'all', 'class', or 'batch' (int)
    shuffle_mix_up_ratio=0.2,        # fraction of each class used for cross-class mixing
    bad_particles_ratio=0.1,         # fraction of slots given to low-quality particles
)

A torch.utils.data.Sampler for pre-training with large multi-class datasets. Resamples each class up to max_number_per_sample and optionally mixes a fraction of particles across classes to improve generalisation. Supports multi-process training via num_processes.

Format Conversion

`cs2star`

from cryodata.cs_star_translate.cs2star import cs2star

# Single CS file
cs2star('particles.cs', 'output.star')

# CS file with a passthrough file
cs2star(['particles.cs', 'passthrough_particles.cs'], 'output.star')

Converts a cryoSPARC .cs file to a RELION-compatible STAR file. When multiple input paths are provided, the first is the primary .cs file and the rest are passthrough files whose columns are merged in. The output STAR file includes RELION 3.1 optics group metadata.

Parameter	Type	Default	Description
`input`	`str` or `list[str]`	—	Path(s) to `.cs` file(s); if a list, first entry is the primary file and the rest are passthroughs
`output`	`str`	—	Path for the output `.star` file
`minphic`	`float`	`None`	Minimum posterior probability threshold for class assignment
`boxsize`	`int`	`None`	Override particle box size in the output
`noswapxy`	`bool`	`False`	Disable the default X/Y coordinate swap
`invertx`	`bool`	`False`	Invert X coordinates
`inverty`	`bool`	`False`	Invert Y coordinates

FFT Utilities

The fft module provides centered Fourier and Hartley transforms for 2D cryo-EM images. All functions accept NumPy arrays; fft2_center and ht2_center additionally accept PyTorch tensors via tensor=True.

from cryodata.data_preprocess import fft

Function	Description
`fft2_center(img, tensor=False)`	2D centered FFT. Returns a complex array of the same shape as `img`.
`fftn_center(img)`	N-D centered FFT.
`ht2_center(img, tensor=False)`	2D Hartley transform: `Re(FFT) - Im(FFT)`. Real-valued output, same shape as `img`.
`htn_center(img)`	N-D Hartley transform.
`iht2_center(img)`	Inverse 2D Hartley transform.
`ihtn_center(vol)`	Inverse N-D Hartley transform.
`symmetrize_ht(ht)`	Adds a wrap-around row and column to a Hartley-transformed image or batch `(N, H, W)`, making it `(N, H+1, W+1)`. Required before saving FFT data for downstream use.
`symmetrize_ht_torch(ht)`	Same as `symmetrize_ht` but for PyTorch tensors.

Example — compute a real-valued Fourier representation:

import numpy as np
from cryodata.data_preprocess import fft
from cryodata.data_preprocess.mrc_preprocess import window_mask

image = np.random.randn(224, 224).astype(np.float32)

# Apply a window mask to reduce edge ringing
mask = window_mask(224, in_rad=0.85)
image_windowed = image * mask

# 2D Hartley transform
ht = fft.ht2_center(image_windowed)

# Add the wrap-around border required by downstream models
ht_sym = fft.symmetrize_ht(ht)  # shape: (225, 225)
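For readers without the package installed, the same steps can be approximated with NumPy alone. The functions below are sketches with hypothetical names; the centering convention (ifftshift before the FFT, fftshift after) is an assumption and may differ from the `fft` module.

```python
import numpy as np

def ht2_center_sketch(img: np.ndarray) -> np.ndarray:
    """Centered 2D Hartley transform computed as Re(FFT) - Im(FFT)."""
    f = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(img)))
    return f.real - f.imag

def iht2_center_sketch(ht: np.ndarray) -> np.ndarray:
    """The Hartley transform is its own inverse up to a 1/(H*W) factor."""
    return ht2_center_sketch(ht) / ht.size

def symmetrize_ht_sketch(ht: np.ndarray) -> np.ndarray:
    """Append a wrap-around row and column: (H, W) -> (H+1, W+1)."""
    h, w = ht.shape
    sym = np.empty((h + 1, w + 1), dtype=ht.dtype)
    sym[:h, :w] = ht
    sym[h, :w] = ht[0, :]   # last row repeats the first row
    sym[:, w] = sym[:, 0]   # last column repeats the first column
    return sym

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
ht_demo = ht2_center_sketch(img)
restored = iht2_center_sketch(ht_demo)
sym_demo = symmetrize_ht_sketch(ht_demo)
```

The round trip through `ht2_center_sketch` and `iht2_center_sketch` recovers the input, which is a quick sanity check that the output is a real-valued but lossless representation of the image.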

Dependencies

Category	Packages
Deep learning	`torch`, `torchvision`, `accelerate`
Scientific computing	`numpy`, `scipy`, `numba`, `pyFFTW`
Data handling	`pandas`, `lmdb`, `mrcfile`
Cryo-EM	`cryosparc_tools`
ML utilities	`scikit-learn`, `annoy`
Visualization	`matplotlib`, `seaborn`, `Pillow`

Related Projects

cryo-IEF — the deep learning model this package was built to support
CryoRanker — a deep learning model for cryo-EM particle ranking
CryoDECO — an ab initio heterogeneous reconstruction algorithm that leverages Cryo-IEF priors
CryoWizard — integrates CryoRanker into a fully automated single-particle cryo-EM processing pipeline
cryoSPARC — upstream software for particle extraction and reconstruction

Citation

Please cite the following paper if this work is useful for your research:

@article{yan_comprehensive_2025,
	title = {A comprehensive foundation model for cryo-{EM} image processing},
	issn = {1548-7105},
	url = {https://doi.org/10.1038/s41592-025-02916-8},
	doi = {10.1038/s41592-025-02916-8},
	abstract = {Cryogenic electron microscopy (cryo-EM) has become a premier technique for determining high-resolution structures of biological macromolecules. However, its broad application is constrained by the demand for specialized expertise. Here, to address this limitation, we introduce the Cryo-EM Image Evaluation Foundation (Cryo-IEF) model, a versatile tool pre-trained on {\textasciitilde}65 million cryo-EM particle images through unsupervised learning. Cryo-IEF performs diverse cryo-EM processing tasks, including particle classification by structure, pose-based clustering and image quality assessment. Building on this foundation, we developed CryoWizard, a fully automated single-particle cryo-EM processing pipeline enabled by fine-tuned Cryo-IEF for efficient particle quality ranking. CryoWizard resolves high-resolution structures across samples of varied properties and effectively mitigates the prevalent challenge of preferred orientation in cryo-EM.},
	journal = {Nature Methods},
	author = {Yan, Yang and Fan, Shiqi and Yuan, Fajie and Shen, Huaizong},
	month = nov,
	year = {2025},
}
