diffDomain tests whether TADs are significantly reorganized between two biological conditions.

Project description

diffDomain-py3

A short description

DiffDomain is a new computational method for identifying reorganized TADs using chromatin contact maps from two biological conditions.

A long description of diffDomain

The workflow of diffDomain is summarized below; the labels (A) to (D) refer to the panels of the workflow figure.

The goal is to test if a TAD identified in one biological condition has structural changes in another biological condition.

The core of diffDomain is to formulate the problem as a hypothesis test, where the null hypothesis is that the TAD does not undergo significant structural reorganization in the second condition. The inputs are the Hi-C contact matrices of the TAD region in the two biological conditions (A). The Hi-C contact matrices are log-transformed to adjust for the exponential decay of Hi-C contacts as the distance between chromosome bins increases.

Their entry-wise difference is calculated (B).
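
As a minimal sketch of steps (A) and (B), the following Python snippet log-transforms two toy contact matrices and takes their entry-wise difference. The function name `log_difference` and its `pseudocount` argument are hypothetical and not part of diffDomain; how the tool itself treats zero contacts is not stated here.

```python
import numpy as np

def log_difference(contact_a, contact_b, pseudocount=1.0):
    """Entry-wise difference of log-transformed contact matrices.

    pseudocount is a hypothetical choice that keeps log() finite at zero
    contacts; it is an assumption of this sketch, not diffDomain's rule.
    """
    a = np.asarray(contact_a, dtype=float)
    b = np.asarray(contact_b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape or a.shape[0] != a.shape[1]:
        raise ValueError("expected two square matrices of the same shape")
    return np.log(a + pseudocount) - np.log(b + pseudocount)

# Toy 3x3 symmetric contact matrices for the same TAD in two conditions.
cond_a = np.array([[90.0, 30.0, 10.0],
                   [30.0, 80.0, 25.0],
                   [10.0, 25.0, 85.0]])
cond_b = 2.0 * cond_a  # same structure, twice the sequencing depth

D = log_difference(cond_a, cond_b, pseudocount=0.0)
print(np.round(D, 3))
```

In this toy case the second matrix is the first one at twice the depth, so every entry of D equals -log 2. A pure depth difference therefore appears as a constant shift rather than as structure.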

The N x N difference matrix D is normalized by iteratively standardizing its k-th off-diagonals, -N+2 <= k <= N-2, which adjusts for absolute differences in contact frequencies caused by different sequencing depths in the two biological conditions (C).

Note that standardization is TAD-specific: each TAD has its own parameters, which are estimated only from its contact matrices in the pair of biological conditions.
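
The diagonal-wise standardization can be sketched as follows. This is an illustration under stated assumptions, not diffDomain's own implementation: the function name `standardize_diagonals` is hypothetical, each diagonal is centered and scaled with its own mean and standard deviation, and the two single-entry corner diagonals (k = -(N-1) and k = N-1), which cannot be standardized, are simply set to 0.

```python
import numpy as np

def standardize_diagonals(D):
    """Standardize each k-th diagonal of D to zero mean and unit variance.

    The corner diagonals k = -(N-1) and k = N-1 hold a single entry and
    are left at 0; that choice is an assumption of this sketch.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    out = np.zeros_like(D)
    for k in range(-n + 2, n - 1):            # k = -N+2, ..., N-2
        rows = np.arange(max(0, -k), min(n, n - k))
        cols = rows + k
        values = D[rows, cols]
        sd = values.std()
        if sd > 0:                            # constant diagonals stay at 0
            out[rows, cols] = (values - values.mean()) / sd
    return out

rng = np.random.default_rng(0)
noise = rng.normal(size=(6, 6))
D = (noise + noise.T) / 2 + 5.0   # symmetric, with a constant offset
Z = standardize_diagonals(D)
first_off_diagonal = np.diag(Z, k=1)
```

Because every diagonal is centered separately, a constant offset such as the 5.0 above disappears, and only the parameters of this one matrix are used.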

Intuitively, if a TAD is not significantly reorganized, the normalized D would resemble a random matrix with white-noise entries, enabling us to borrow theoretical results from random matrix theory. Indeed, the normalized D is a generalized Wigner matrix (D), a well-studied class of high-dimensional random matrices.
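
To make the random-matrix intuition concrete, the sketch below builds a symmetric white-noise matrix and computes its largest singular value. The Gaussian entries and the 1/sqrt(N) scaling are the textbook Wigner setup and are assumptions of this example; the text above does not specify the scaling diffDomain applies.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Symmetric white-noise matrix: off-diagonal entries have mean 0, variance 1.
G = rng.standard_normal((n, n))
W = (G + G.T) / np.sqrt(2.0)

# Standard Wigner scaling (an assumption of this sketch).
W_scaled = W / np.sqrt(n)

# For a 2-D array, ord=2 gives the spectral norm, i.e. the largest singular value.
largest_sv = float(np.linalg.norm(W_scaled, ord=2))
print(round(largest_sv, 3))
```

For large N the printed value lies close to 2, which is the behavior expected of a matrix that carries no structure beyond noise.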

The largest singular value of the normalized D is proved to fluctuate around 2 under the null hypothesis. Armed with this fact, diffDomain reformulates the reorganized TAD identification problem as a hypothesis testing problem:
1. H0: the largest singular value equals 2;
2. H1: the largest singular value is greater than 2.

For a user-given set of TADs, P values are adjusted for multiple comparisons using the BH method by default.
Once the subset of reorganized TADs is identified, we classify them into six subtypes to aid biological analysis and interpretation.
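
The BH (Benjamini-Hochberg) adjustment itself is short enough to sketch in plain Python. The function below is a hypothetical illustration, not diffDomain's code, and the P values are made up. The method names accepted by the adjustment command later in this document (fdr_bh, fdr_by, bonferroni, holm, hommel) match those of `statsmodels.stats.multitest.multipletests`, so the tool presumably delegates to statsmodels, but that is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]   # made-up values, one per TAD
adj = benjamini_hochberg(pvals)

# TADs called reorganized at an adjusted P value below 0.05.
reorganized = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, reorganized)
```

Here the third and fourth TADs have raw P values below 0.05 but adjusted values just above it, so only the first two are called reorganized.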

Installation instructions

diffDomain is tested on macOS and Linux (CentOS).

Dependencies

diffDomain_py3 depends on:

  • Python3
  • hic-straw 1.3.1
  • TracyWidom
  • pandas
  • numpy
  • docopt
  • matplotlib
  • statsmodels

Installation

Download the diffDomain source package by running the following command in a terminal:

git clone https://github.com/Tian-Dechao/diffDomain.git

or install it with pip:

pip install diffdomain_py3

Get started with example usage

We downloaded the data GEO:GSE63525 from Rao et al. (2014) for standalone example usage of diffDomain. The example data are saved in <data/>:
1. GM12878 TADs.
2. GM12878 combined Hi-C data on Chr1, extracted by Juicebox at 10 kb resolution with KR normalization. The resulting Hi-C data have 3 columns: columns 1 and 2 are chromosomal bins, and column 3 is the KR-normalized contact frequency between the two bins.
3. K562 combined Hi-C data on Chr1. Settings are the same as for GM12878.
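
To show how such 3-column data map onto a contact matrix, here is a minimal sketch that fills a symmetric dense matrix for one region. The function name `dense_from_triplets` is hypothetical, the sample rows are made up, and the sketch assumes whitespace-separated columns with bins given as start positions in bp; check these assumptions against the actual files before reusing it.

```python
import io
import numpy as np

def dense_from_triplets(handle, start, end, resolution):
    """Build a symmetric dense matrix from 3-column Hi-C text.

    Assumes whitespace-separated columns and bins given as genomic start
    positions in bp; both are assumptions of this sketch.
    """
    n = (end - start) // resolution
    mat = np.zeros((n, n))
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue                      # skip blank or malformed lines
        i = (int(fields[0]) - start) // resolution
        j = (int(fields[1]) - start) // resolution
        value = float(fields[2])
        if 0 <= i < n and 0 <= j < n and not np.isnan(value):
            mat[i, j] = mat[j, i] = value
    return mat

# Made-up rows in the 3-column layout described above.
example = io.StringIO(
    "163500000\t163500000\t120.5\n"
    "163500000\t163510000\t48.2\n"
    "163510000\t163510000\t110.0\n"
    "163510000\t163520000\tNaN\n"
)
m = dense_from_triplets(example, 163500000, 163530000, 10000)
print(m)
```

Entries that are missing or NaN in the input are left at 0 in this sketch; whether diffDomain treats them the same way is not stated here.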

Testing if one TAD is reorganized

In this example, we test a GM12878 TAD that is reorganized in K562 (Chr1:163500000-165000000, Ref). The data are saved in <data/single-TAD/>.

Running the command

  • Usage: scriptname dvsd one <chr> <start> <end> <hic0> <hic1> [options]

python diffdomain/diffdomains.py dvsd one 1 163500000 165000000 data/single-TAD/GM12878_chr1_163500000_165000000_res_10k.txt data/single-TAD/K562_chr1_163500000_165000000_res_10k.txt --reso 10000 --ofile res/chr1_163500000_165000000.txt

diffDomain also provides a visualization function to display Hi-C matrices side by side.

  • Usage: scriptname visualization <chr> <start> <end> <hic0> <hic1> [options]

Figures are saved in <res/images/>.

python diffdomain/diffdomains.py visualization 1 163500000 165000000 data/single-TAD/GM12878_chr1_163500000_165000000_res_10k.txt data/single-TAD/K562_chr1_163500000_165000000_res_10k.txt --reso 10000 --ofile res/images/side_by_side

Note: in this example, there is no need for multiple comparison adjustment. Multiple comparison adjustment by BH is demonstrated in the next example.

Identifying the reorganized TADs on a 50 Mb region (Chr1:1-50,000,000)

In this example, multiple comparison adjustment is required to adjust the P values. The chr1_50M_domainlist files are saved in <data/TADs_chr1/>.

  • Usage: scriptname dvsd multiple <hic0> <hic1> <bed> [options]

python diffdomain/diffdomains.py dvsd multiple data/TADs_chr1/chr1_50M_GM12878.h5 data/TADs_chr1/chr1_50M_K562.h5 data/TADs_chr1/GM12878_chr1_50M_domainlist.txt --reso 10000 --ofile res/temp/GM12878_vs_K562_chr1_50M_temp.txt

The function pydiff.diffdomain_multiple returns the results as a dataframe (result_mul).

  • Adjusting for multiple comparisons by the BH method (default; other options include fdr_by, bonferroni, holm, hommel, etc.) and keeping only the reorganized TADs with BH-adjusted P value < 0.05

  • Usage: scriptname adjustment <method> <input> <output>

python diffdomain/diffdomains.py adjustment fdr_bh res/temp/GM12878_vs_K562_chr1_50M_temp.txt res/GM12878_vs_K562_chr1_50M_adjusted_filter.tsv --filter true

For interactive integrative analysis, we recommend using the Nucleome Browser.

Identifying GM12878 TADs that are reorganized in K562, using all TADs

The Hi-C data are loaded directly from Amazon (hosted by the Aiden Lab).

  • Identify TADs in multiple chromosomes simultaneously.

python diffdomain/diffdomains.py dvsd multiple https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic https://hicfiles.s3.amazonaws.com/hiseq/k562/in-situ/combined.hic data/GSE63525_GM12878_primary+replicate_Arrowhead_domainlist.txt --ofile res/temp/temp.txt
  • Multiple comparison adjustment.

python diffdomain/diffdomains.py adjustment fdr_bh res/temp/GM12878_vs_K562_chr1_50M_temp.txt res/adjusted_TADs2.txt
  • Optional parameter [--filter]: keep only the reorganized TADs with BH-adjusted P value < 0.05.

python diffdomain/diffdomains.py adjustment fdr_bh res/temp/GM12878_vs_K562_chr1_50M_temp.txt res/reorganized_TADs_GM12878_K562.tsv --filter true

The final output is saved to <res/reorganized_TADs_GM12878_K562.tsv>.

  • Classification of TADs

Running the command:

python diffdomain/classificattion.py -d adjusted_TADs2.txt -t GSE63525_K562_Arrowhead_domainlist.txt

Note: You can set -l (--limit) to adjust the threshold for a ‘common boundary’. As stated in the paper, we use 3 bins as the threshold for a common boundary. That is, at 10 kb resolution we set -l to 30000, and at 25 kb resolution we set -l to 75000.

python diffdomain/classificattion.py -d adjusted_TADs2.txt\
 -t GSE63525_K562_Arrowhead_domainlist.txt\
 -l 30000
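
The arithmetic behind the -l value can be written as a small helper. Both function names below are hypothetical, and treating two boundaries exactly 3 bins apart as common (an inclusive comparison) is an assumption of this sketch rather than something stated above.

```python
def common_boundary_limit(resolution_bp, bins=3):
    """Value to pass to -l: the common-boundary threshold in base pairs."""
    return bins * resolution_bp

def is_common_boundary(pos_a, pos_b, limit_bp):
    """True if two boundary positions lie within the threshold (inclusive)."""
    return abs(pos_a - pos_b) <= limit_bp

limit_10kb = common_boundary_limit(10_000)   # 10 kb resolution
limit_25kb = common_boundary_limit(25_000)   # 25 kb resolution
print(limit_10kb, limit_25kb)

# Two boundaries 20 kb apart count as common at 10 kb resolution; 40 kb apart do not.
near = is_common_boundary(163_500_000, 163_520_000, limit_10kb)
far = is_common_boundary(163_500_000, 163_540_000, limit_10kb)
```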

Contact information

For more information, please contact Dunming Hua at huadm@mail2.sysu.edu.cn, Ming Gu at guming5@mail2.sysu.edu.cn or Dechao Tian at tiandch@mail.sysu.edu.cn.
