A user-friendly package for processing single cell HiC data
Project description
scHiCTools
Summary
A computational toolbox for analyzing single cell Hi-C (high-throughput sequencing for 3C) data which includes functions for:
- Load single-cell HiC datasets
- Smoothing the contact maps with linear convolution, random walk or network enhancing
- Calculating embeddings for single cell HiC datasets efficiently with reproducibility measures include InnerProduct, fastHiCRep and Selfish
Installation
Required Python Packages
- Python (version >= 3.6)
- numpy (version >= 1.15.4)
- scipy (version >= 1.0)
Install from GitHub
You can install the package with following command:
$ git clone https://github.com/liu-bioinfo-lab/scHiCTools.git
$ cd scHiCTools
$ python setup.py install
Usage
Supported Formats
-
Pre-processed Matrices: If the data is already processed into matrices for intra-chromosomal contacts, the chromosome from the same cell must be stored in the same folder with chromosome names as file names (e.g., scHiC/cell_1/chr1.txt). You only need to provide the folder name for a cell (e.g., scHiC/cell_1).
- npy: numpy.array / numpy.matrix
- npz: scipy.sparse.coo_matrix
- matrix: matrix stored as pure text
- matrix_txt: matrix stored as .txt file
- HiCRep: the format required by HiCRep package
-
Edge List
For all formats below:
str - strand (forward / reverse)
chr - chromosome
pos - position
score - contact reads
frag - fragments (will be ignored)
mapq - map quality- Shortest
<chr1> <pos1> <chr2> <pos2>
- Shortest_Score
<chr1> <pos1> <chr2> <pos2> <score>
- Short
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
- Short_Score
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <score>
- Medium
<readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
- Long
<str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <cigar1> <sequence1> <mapq2> <cigar2> <sequence2> <readname1> <readname2>
- 4DN
## pairs format v1.0 #columns: readID chr1 position1 chr2 position2 strand1 strand2
-
.hic format: we adapted "straw" from JuiceTools.
-
.mcool format: we adapted "dump" from cool.
-
Other formats: simply give the indices (start from 1) in the order of
"chromosome1 - position1 - chromosome2 - position2 - score" or
"chromosome1 - position1 - chromosome2 - position2" or
"chromosome1 - position1 - chromosome2 - position2 - mapq1 - mapq2".
For example, you can provide "2356" or [2, 3, 5, 6] if the file takes this format:
<name> <chromosome1> <position1> <frag1> <chromosome2> <position2> <frag2> <strand1> <strand2>
contact_1 chr1 3000000 1 chr1 3001000 1 + -
Import Package
>>>import scHiCTools
Load scHiC data
The scHiC data is stored in a series of files, with each of the files corresponding to one cell. You need to specify the list of scHiC file paths.
Only intra-chromosomal interactions are counted.
>>>from scHiCTools import scHiCs
>>>files = ['./cell_1', './cell_2', './cell_3']
>>>loaded_data = scHiCs(
... files, reference_genome='mm9',
... resolution=500000, keep_n_strata=10,
... format='customized', adjust_resolution=True,
... customized_format=12345, header=0, chromosomes='except Y',
... operations=['OE_norm', 'convolution']
... )
- reference genome (dict or str): now supporting 'mm9', 'mm10', 'hg19', 'hg38'. If your reference genome is not in ['mm9', 'mm10', 'hg19', 'hg38'], you need to provide the lengths of chromosomes you are going to use with a Python dict. e.g. {'chr1': 150000000, 'chr2': 130000000, 'chr3': 200000000}
- resolution (int): the resolution to separate genome into bins. If using .hic file format, the given resolution must match with the resolutions in .hic file.
- keep_n_strata (None or int): only store contacts within n strata near the diagonal. Default: None. If 'None', it will not store strata
- store_full_map (bool): whether store full contact maps in numpy matrices or scipy sparse matrices,If False, it will save memory.
- sparse (bool): whether to use sparse matrices
- format (str): file format, supported formats: 'shortest', 'shortest_score', 'short', 'short_score' , 'medium', 'long', '4DN', '.hic', '.mcool', 'npy', 'npz', 'matrix', 'HiCRep', 'matrix_txt' and 'customized'. Default: 'customized'.
- customized_format (int or str or list): the column indices in the order of "chromosome 1 - position 1 - chromosome 2 - position 2 - contact reads" or "chromosome 1 - position 1 - chromosome 2 - position 2" or "chromosome 1 - position 1 - chromosome 2 - position 2 - map quality 1 - map quality 2". e.g. if the line is "chr1 5000000 chr2 3500000 2", the format should be '12345' or [1, 2, 3, 4, 5]; if there is no column indicating number of reads, you can just provide 4 numbers like '1234', and contact read will be set as 1. Default: '12345'.
- adjust_resolution: whether to adjust resolution for the input file. Sometimes the input file is already in the proper resolution (e.g. position 3000000 has already been changed to 6 with 500kb resolution). For this situation you can set adjust_resolution=False. Default: True.
- map_filter (float): keep all contacts with mapq higher than this threshold. Default: 0.0
- header (int): how many header lines does the file have. Default: 0.
- chromosomes (list or str): chromosomes to use, eg. ['chr1', 'chr2'], or just 'except Y', 'except XY', 'all'. Default: 'all', which means chr 1-19 + XY for mouse and chr 1-22 + XY for human.
- operations (list or None): the operations use for pre-processing or smoothing the maps given in a list. The operations will happen in the given order. Supported operations: 'logarithm', 'power', 'convolution', 'random_walk', 'network_enhancing', 'OE_norm', 'VC_norm', 'VC_SQRT_norm', 'KR_norm'。 Default: None.
- For preprocessing and smoothing operations, sometimes you need additional arguments (introduced in next sub-section).
You can also skip pre-processing and smoothing in loading step (operations=None), and do them in next lines:
>>>loaded_data.processing(['random_walk', 'network_enhancing'])
But if you didn't store full map (i.e. store_full_map=False), processing is not doable in a separate step.
Pre-processing and Smoothing Operations
-
logarithm: new_W_ij = log_(base) (W_ij + epsilon). Additional arguments:
- log_base: default: e
- epsilon: default: 1
-
power: new_W_ij = (W_ij)^pow. Additional argument:
- pow: default: 0.5 (i.e., sqrt(W_ij))
-
VC_norm: VC normalization - each value divided by the sum of corresponding row then divided by the sum of corresponding column
-
VC_SQRT_norm: VC_SQRT normalization - each value divided by the sqrt of the sum of corresponding row then divided by the sqrt of the sum of corresponding column
-
KR_norm: KR normalization - iterating until the sum of each row / column is one Argument:
- maximum_error_rate (float): iteration ends when max error is smaller than (maximum_error_rate). Default: 1e-4
-
OE_norm: OE normalization - each value divided by the average of its corresponding strata (diagonal line)
-
convolution: smoothing with a N by N convolution kernel, with each value equal to 1/N^2. Argument:
- kernel_shape: an integer. e.g. kernel_shape=3 means a 3*3 matrix with each value = 1/9. Default: 3.
-
Random walk: multiply by a transition matrix (also calculated from contact map itself). Argument:
- random_walk_ratio: a value between 0 and 1, e.g. if ratio=0.9, the result will be 0.9 * matrix_after_random_walk + 0.1 * original_matrix. Default: 1.0.
-
Network enhancing: transition matrix only comes from k-nearest neighbors of each line. Arguments:
- kNN: value 'k' in kNN. Default: 20.
- iterations: number of iterations for network enhancing. Default: 1
- alpha: similar with random_walk_ratio. Default: 0.9
Learn Embeddings
>>>embs = loaded_data.learn embedding(
... dim=2, similarity_method='inner_product', embedding_method='MDS',
... n_strata=None, aggregation='median', return_distance=False
... )
This function will return the embeddings in the format of a numpy array with shape (#_of_cells, #_of_dimensions).
- dim (int): the dimension for the embedding
- similarity_method (str): reproducibility measure, 'InnerProduct', 'fastHiCRep' or 'Selfish'. Default: 'InnerProduct'
- embedding_method (str): 'MDS', 'tSNE' or 'UMAP'
- n_strata (int): only consider contacts within this genomic distance. Default: None. If it is None, it will use the all strata kept (the argument keep_n_strata from previous loading process). Thus n_strata and keep_n_strata (loading step) cannot be None at the same time.
- aggregation (str): method to aggregate different chromosomes, 'mean' or 'median'. Default: 'median'.
- return_distance (bool): if True, return (embeddings, distance_matrix); if False, only return embeddings. Default: False.
- Some additional argument for Selfish:
- n_windows (int): split contact map into n windows, default: 10
- sigma (float): sigma in the Gaussian-like kernel: default: 1.6
Citation
Fan Feng, and Jie Liu. "scHiCTools: a computational toolbox for analyzing single cell Hi-C data." bioRxiv (2019): 769513.
References
A. R. Ardakany, F. Ay, and S. Lonardi. Selfish: Discovery of differential chromatininteractions via a self-similarity measure.bioRxiv, 2019.
N. C. Durand, J. T. Robinson, M. S. Shamim, I. Machol, J. P. Mesirov, E. S. Lander, and E. Lieberman Aiden. "Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom." Cell Systems 3(1), 2016.
J. Liu, D. Lin, G. Yardimci, and W. S. Noble. Unsupervised embedding of single-cellHi-C data.Bioinformatics, 34:96–104, 2018.
T. Nagano, Y. Lubling, C. Várnai, C. Dudley, W. Leung, Y. Baran, N. M. Cohen,S. Wingett, P. Fraser, and A. Tanay. Cell-cycle dynamics of chromosomal organization at single-cell resolution.Nature, 547:61–67, 2017.
B. Wang, A. Pourshafeie, M. Zitnik, J. Zhu, C. D. Bustamante, S. Batzoglou, andJ. Leskovec. Network enhancement as a general method to denoise weighted biological networks.Nature Communications, 9(1):3108, 2018.
T. Yang, F. Zhang, G. G. Y. mcı, F. Song, R. C. Hardison, W. S. Noble, F. Yue, andQ. Li. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient.Genome Research, 27(11):1939–1949, 2017.
G. G. Yardımcı, H. Ozadam, M. E. Sauria, O. Ursu, K. K. Yan, T. Yang,A. Chakraborty, A. Kaul, B. R. Lajoie, F. Song, et al. Measuring the reproducibilityand quality of hi-c data.Genome Biology, 20(1):57, 2019.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.