Skip to main content

A user-friendly package for processing single cell HiC data

Project description

scHiCTools

Summary

A computational toolbox for analyzing single cell Hi-C (high-throughput sequencing for 3C) data which includes functions for:

  1. Load single-cell HiC datasets
  2. Smoothing the contact maps with linear convolution, random walk or network enhancing
  3. Calculating embeddings for single cell HiC datasets efficiently with reproducibility measures include InnerProduct, fastHiCRep and Selfish

Installation

Required Python Packages

  • Python (version >= 3.6)
  • numpy (version >= 1.15.4)
  • scipy (version >= 1.0)

Install from GitHub

You can install the package with following command:

  $ git clone https://github.com/liu-bioinfo-lab/scHiCTools.git
  $ cd scHiCTools
  $ python setup.py install

Usage

Supported Formats

  • Pre-processed Matrices: If the data is already processed into matrices for intra-chromosomal contacts, the chromosome from the same cell must be stored in the same folder with chromosome names as file names (e.g., scHiC/cell_1/chr1.txt). You only need to provide the folder name for a cell (e.g., scHiC/cell_1).

    • npy: numpy.array / numpy.matrix
    • npz: scipy.sparse.coo_matrix
    • matrix: matrix stored as pure text
    • matrix_txt: matrix stored as .txt file
    • HiCRep: the format required by HiCRep package
  • Edge List
    For all formats below:
      str - strand (forward / reverse)
      chr - chromosome
      pos - position
      score - contact reads
      frag - fragments (will be ignored)
      mapq - map quality

    • Shortest
    <chr1> <pos1> <chr2> <pos2>
    
    • Shortest_Score
    <chr1> <pos1> <chr2> <pos2> <score>
    
    • Short
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
    
    • Short_Score
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <score>
    
    • Medium
    <readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
    
    • Long
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <cigar1> <sequence1> <mapq2> <cigar2> <sequence2> <readname1> <readname2>
    
    • 4DN
    ## pairs format v1.0
    #columns: readID chr1 position1 chr2 position2 strand1 strand2
    
  • .hic format: we adapted "straw" from JuiceTools.

  • .mcool format: we adapted "dump" from cool.

  • Other formats: simply give the indices (start from 1) in the order of
    "chromosome1 - position1 - chromosome2 - position2 - score" or
    "chromosome1 - position1 - chromosome2 - position2" or
    "chromosome1 - position1 - chromosome2 - position2 - mapq1 - mapq2".
    For example, you can provide "2356" or [2, 3, 5, 6] if the file takes this format:

<name> <chromosome1> <position1> <frag1> <chromosome2> <position2> <frag2> <strand1> <strand2>
contact_1 chr1 3000000 1 chr1 3001000 1 + -

Import Package

>>>import scHiCTools

Load scHiC data

The scHiC data is stored in a series of files, with each of the files corresponding to one cell. You need to specify the list of scHiC file paths.

Only intra-chromosomal interactions are counted.

>>>from scHiCTools import scHiCs
>>>files = ['./cell_1', './cell_2', './cell_3']
>>>loaded_data = scHiCs(
... files, reference_genome='mm9',
... resolution=500000, keep_n_strata=10,
... format='customized', adjust_resolution=True,
... customized_format=12345, header=0, chromosomes='except Y',
... operations=['OE_norm', 'convolution']
... )
  • reference genome (dict or str): now supporting 'mm9', 'mm10', 'hg19', 'hg38'. If your reference genome is not in ['mm9', 'mm10', 'hg19', 'hg38'], you need to provide the lengths of chromosomes you are going to use with a Python dict. e.g. {'chr1': 150000000, 'chr2': 130000000, 'chr3': 200000000}
  • resolution (int): the resolution to separate genome into bins. If using .hic file format, the given resolution must match with the resolutions in .hic file.
  • keep_n_strata (None or int): only store contacts within n strata near the diagonal. Default: None. If 'None', it will not store strata
  • store_full_map (bool): whether store full contact maps in numpy matrices or scipy sparse matrices,If False, it will save memory.
  • sparse (bool): whether to use sparse matrices
  • format (str): file format, supported formats: 'shortest', 'shortest_score', 'short', 'short_score' , 'medium', 'long', '4DN', '.hic', '.mcool', 'npy', 'npz', 'matrix', 'HiCRep', 'matrix_txt' and 'customized'. Default: 'customized'.
  • customized_format (int or str or list): the column indices in the order of "chromosome 1 - position 1 - chromosome 2 - position 2 - contact reads" or "chromosome 1 - position 1 - chromosome 2 - position 2" or "chromosome 1 - position 1 - chromosome 2 - position 2 - map quality 1 - map quality 2". e.g. if the line is "chr1 5000000 chr2 3500000 2", the format should be '12345' or [1, 2, 3, 4, 5]; if there is no column indicating number of reads, you can just provide 4 numbers like '1234', and contact read will be set as 1. Default: '12345'.
  • adjust_resolution: whether to adjust resolution for the input file. Sometimes the input file is already in the proper resolution (e.g. position 3000000 has already been changed to 6 with 500kb resolution). For this situation you can set adjust_resolution=False. Default: True.
  • map_filter (float): keep all contacts with mapq higher than this threshold. Default: 0.0
  • header (int): how many header lines does the file have. Default: 0.
  • chromosomes (list or str): chromosomes to use, eg. ['chr1', 'chr2'], or just 'except Y', 'except XY', 'all'. Default: 'all', which means chr 1-19 + XY for mouse and chr 1-22 + XY for human.
  • operations (list or None): the operations use for pre-processing or smoothing the maps given in a list. The operations will happen in the given order. Supported operations: 'logarithm', 'power', 'convolution', 'random_walk', 'network_enhancing', 'OE_norm', 'VC_norm', 'VC_SQRT_norm', 'KR_norm'。 Default: None.
  • For preprocessing and smoothing operations, sometimes you need additional arguments (introduced in next sub-section).

You can also skip pre-processing and smoothing in loading step (operations=None), and do them in next lines:

>>>loaded_data.processing(['random_walk', 'network_enhancing'])

But if you didn't store full map (i.e. store_full_map=False), processing is not doable in a separate step.

Pre-processing and Smoothing Operations

  • logarithm: new_W_ij = log_(base) (W_ij + epsilon). Additional arguments:

    • log_base: default: e
    • epsilon: default: 1
  • power: new_W_ij = (W_ij)^pow. Additional argument:

    • pow: default: 0.5 (i.e., sqrt(W_ij))
  • VC_norm: VC normalization - each value divided by the sum of corresponding row then divided by the sum of corresponding column

  • VC_SQRT_norm: VC_SQRT normalization - each value divided by the sqrt of the sum of corresponding row then divided by the sqrt of the sum of corresponding column

  • KR_norm: KR normalization - iterating until the sum of each row / column is one Argument:

    • maximum_error_rate (float): iteration ends when max error is smaller than (maximum_error_rate). Default: 1e-4
  • OE_norm: OE normalization - each value divided by the average of its corresponding strata (diagonal line)

  • convolution: smoothing with a N by N convolution kernel, with each value equal to 1/N^2. Argument:

    • kernel_shape: an integer. e.g. kernel_shape=3 means a 3*3 matrix with each value = 1/9. Default: 3.
  • Random walk: multiply by a transition matrix (also calculated from contact map itself). Argument:

    • random_walk_ratio: a value between 0 and 1, e.g. if ratio=0.9, the result will be 0.9 * matrix_after_random_walk + 0.1 * original_matrix. Default: 1.0.
  • Network enhancing: transition matrix only comes from k-nearest neighbors of each line. Arguments:

    • kNN: value 'k' in kNN. Default: 20.
    • iterations: number of iterations for network enhancing. Default: 1
    • alpha: similar with random_walk_ratio. Default: 0.9

Learn Embeddings

>>>embs = loaded_data.learn embedding(
... dim=2, similarity_method='inner_product', embedding_method='MDS',
... n_strata=None, aggregation='median', return_distance=False
... )

This function will return the embeddings in the format of a numpy array with shape (#_of_cells, #_of_dimensions).

  • dim (int): the dimension for the embedding
  • similarity_method (str): reproducibility measure, 'InnerProduct', 'fastHiCRep' or 'Selfish'. Default: 'InnerProduct'
  • embedding_method (str): 'MDS', 'tSNE' or 'UMAP'
  • n_strata (int): only consider contacts within this genomic distance. Default: None. If it is None, it will use the all strata kept (the argument keep_n_strata from previous loading process). Thus n_strata and keep_n_strata (loading step) cannot be None at the same time.
  • aggregation (str): method to aggregate different chromosomes, 'mean' or 'median'. Default: 'median'.
  • return_distance (bool): if True, return (embeddings, distance_matrix); if False, only return embeddings. Default: False.
  • Some additional argument for Selfish:
    • n_windows (int): split contact map into n windows, default: 10
    • sigma (float): sigma in the Gaussian-like kernel: default: 1.6

Citation

Fan Feng, and Jie Liu. "scHiCTools: a computational toolbox for analyzing single cell Hi-C data." bioRxiv (2019): 769513.

References

A. R. Ardakany, F. Ay, and S. Lonardi. Selfish: Discovery of differential chromatininteractions via a self-similarity measure.bioRxiv, 2019.

N. C. Durand, J. T. Robinson, M. S. Shamim, I. Machol, J. P. Mesirov, E. S. Lander, and E. Lieberman Aiden. "Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom." Cell Systems 3(1), 2016.

J. Liu, D. Lin, G. Yardimci, and W. S. Noble. Unsupervised embedding of single-cellHi-C data.Bioinformatics, 34:96–104, 2018.

T. Nagano, Y. Lubling, C. Várnai, C. Dudley, W. Leung, Y. Baran, N. M. Cohen,S. Wingett, P. Fraser, and A. Tanay. Cell-cycle dynamics of chromosomal organization at single-cell resolution.Nature, 547:61–67, 2017.

B. Wang, A. Pourshafeie, M. Zitnik, J. Zhu, C. D. Bustamante, S. Batzoglou, andJ. Leskovec. Network enhancement as a general method to denoise weighted biological networks.Nature Communications, 9(1):3108, 2018.

T. Yang, F. Zhang, G. G. Y. mcı, F. Song, R. C. Hardison, W. S. Noble, F. Yue, andQ. Li. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient.Genome Research, 27(11):1939–1949, 2017.

G. G. Yardımcı, H. Ozadam, M. E. Sauria, O. Ursu, K. K. Yan, T. Yang,A. Chakraborty, A. Kaul, B. R. Lajoie, F. Song, et al. Measuring the reproducibilityand quality of hi-c data.Genome Biology, 20(1):57, 2019.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scHiCTools-0.0.1.tar.gz (53.1 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page