An imputation method for multiplexed DNA FISH data
Project description
SnapFISH-IMPUTE: an imputation method for multiplexed DNA FISH data
SnapFISH-IMPUTE fills in the missing 3D coordinates of imaging loci in multiplexed DNA FISH data.
Installation
SnapFISH-IMPUTE is available on PyPI and can be installed through pip
. To create a virtual environment before installation, use either conda
:
conda create --name sfimpute_env python==3.9.1
conda activate sfimpute_env
pip install sfimpute
or if on a computing cluster:
module load python/3.9.1
python -m venv /PATH/TO/ENV
source /PATH/TO/ENV/bin/activate
pip install sfimpute
Although MPI
and mpi4py
are needed to run the imputation module, the package on PyPI
is independent of MPI
, so you can still call functions in SnapFISH-IMPUTE even if MPI is not available.
To install mpi4py
, please follow the instructions in this link.
Usage
Imputation
To run the imputation module on a computing cluster, download run_impute.py
and include the following command when submitting the job:
mpiexec -np 50 python run_impute.py -o $OUT/DIRE -d $COORPATH -a $ANNPATH -s $SUF
where
OUT/DIRE
: the directory to store imputation resultsCOORPATH
: the path of the 3D coordinates file (.txt file separated by\t
) with the following columns- region: imaging region ID
- haploid: haploid ID (unique within each imaging region)
- pos: locus ID (starts from 1)
- x, y, z: 3D coordinates in nm (missing values are replaced by NaN)
ANNPATH
: the path of the annotation file (.txt file separated by\t
) with the following columns- region: imaging region ID
- pos: locus ID (starts from 1)
- start: starting 1D genomic location of the imaging locus
- end: ending 1D genomic location of the imaging locus
SUF
: file suffix
The jupyter notebook preprcess.ipynb
shows how to convert the imaging data to the desired form.
The output will be
$ tree OUT/DIRE
OUT/DIRE/
├── linear_coor_SUF.txt # linear imputation output
└── recover_coor_SUF.txt # SnapFISH-IMPUTE result
Normalization of Pairwise Distances
SnapFISH-IMPUTE includes a normalization module to remove 1D genomic distance bias from the data and transform the distribution to approximately $N(0,1)$. This module can be called in Python by
data = sfimpute.preprocess.read_data("PATH/TO/COOR.txt")
ann = sfimpute.preprocess.read_data("PATH/TO/ann.txt")
pdist_df = sfimpute.impute.to_dist_df(data, ann)
norm_df = sfimpute.impute.normalize_pdist_by1d(pdist_df)
For a more detailed tutorial about how pairwise distances are calculated and stored, please check pdistdemo.ipynb
.
Example
Use the 5kb chromatin tracing data of mESCs (Huang et al. 2021) as an example. The formatted data is in the folder data
. Using 50 processes:
mpiexec -np 50 python run_impute.py -o output
-d data/mESC_Sox2_coor_wnan.txt
-a data/mESC_Sox2_ann.txt
-s mESCs_5kb
The program will print how many values are still unavailable in each iteration:
Region 129 initial: 590610 NaN values
Region 129 resized by a factor of 2
Region 129: 56640 NaN values
Region 129 resized by a factor of 2
Region 129: 56640 NaN values
Region CAST initial: 547912 NaN values
Region CAST resized by a factor of 2
Region CAST: 0 NaN values
After the program finished, a directory named output with the following files will be generated
$ tree output
output/
├── linear_coor_mESCs_5kb.txt # linear imputation output
└── recover_coor_mESCs_5kb.txt # SnapFISH-IMPUTE result
Contact Us
If you have any question regarding SnapFISH-IMPUTE, you can send an email to Hongyu Yu (hongyuyu@unc.edu).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.