Compute similarity between genomic contact matrices with "Entropy 3C"

ENT3C is a method for quantifying the similarity of micro-C/Hi-C derived chromosomal contact matrices. It is based on the von Neumann entropy [1] and recent work on entropy quantification of Pearson correlation matrices [2]. For a contact matrix, ENT3C records the change in local pattern complexity of smaller Pearson-transformed submatrices along the matrix diagonal to generate a characteristic signal. Similarity is defined as the Pearson correlation between the respective entropy signals of two contact matrices.

https://github.com/X3N1A/ENT3C

Summary of ENT3C approach

  1. Loads cooler files and looks for shared empty bins.

  2. ENT3C first takes the logarithm of the input contact matrix $\mathbf{M}$.

  3. Next, smaller submatrices $\mathbf{a}$ of dimension $n\times n$ are extracted along the diagonal of $\mathbf{M}$.

  4. NaN values in $\mathbf{a}$ are set to the minimum value in $\mathbf{a}$.

  5. $\mathbf{a}$ is transformed into a Pearson correlation matrix $\mathbf{P}$.

  6. $\mathbf{P}$ is transformed into $\boldsymbol{\rho}=\mathbf{P}/n$ to fulfill the conditions for computing the von Neumann entropy.

  7. The von Neumann entropy of $\boldsymbol{\rho}$ is computed as

    $S(\boldsymbol{\rho})=-\sum_j \lambda_j \log \lambda_j$

    where $\lambda_j$ is the $j$th eigenvalue of $\boldsymbol{\rho}$.

  8. This is repeated for subsequent submatrices along the diagonal of the input matrix and stored in the entropy signal $\mathbf{S}_{M}$.

  9. Similarity $Q$ is defined as the Pearson correlation $r$ between the entropy signals of two matrices: $Q(\mathbf{M}_1,\mathbf{M}_2) = r(\mathbf{S}_{\mathbf{M}_1},\mathbf{S}_{\mathbf{M}_2})$.
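
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the ENT3C implementation: the function names `von_neumann_entropy`, `entropy_signal` and `similarity` are hypothetical, step 1 (loading cooler files and removing shared empty bins) is omitted, and treating `log(0)` as a missing value is an assumption.

```python
import numpy as np


def von_neumann_entropy(a):
    """Entropy of one n x n submatrix (steps 4 to 7)."""
    a = np.array(a, dtype=float)              # copy, so the caller's data is untouched
    a[np.isnan(a)] = np.nanmin(a)             # step 4: NaN -> minimum of the submatrix
    P = np.corrcoef(a)                        # step 5: row-wise Pearson correlation
    rho = P / P.shape[0]                      # step 6: rho has unit trace
    lam = np.linalg.eigvalsh(rho)             # rho is symmetric, eigenvalues are real
    lam = lam[lam > 1e-12]                    # convention: 0 * log(0) = 0
    return float(-np.sum(lam * np.log(lam)))  # step 7


def entropy_signal(M, n, phi=1):
    """Entropy signal from n x n windows shifted by phi bins (steps 2, 3, 8)."""
    M = np.asarray(M, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        logM = np.log(M)                      # step 2
    logM[~np.isfinite(logM)] = np.nan         # assumption: log(0) counts as missing
    starts = range(0, M.shape[0] - n + 1, phi)
    return np.array([von_neumann_entropy(logM[s:s + n, s:s + n]) for s in starts])


def similarity(S1, S2):
    """Step 9: Pearson correlation between two entropy signals."""
    return float(np.corrcoef(S1, S2)[0, 1])


# Toy usage on two random symmetric matrices (synthetic data, not Hi-C).
rng = np.random.default_rng(0)
M1 = rng.random((60, 60)) + 0.1
M1 = M1 + M1.T
M2 = M1 + 0.05 * rng.random((60, 60))
Q = similarity(entropy_signal(M1, n=12, phi=2), entropy_signal(M2, n=12, phi=2))
```

Because $\boldsymbol{\rho}$ has unit trace and non-negative eigenvalues, each entropy value lies between $0$ (all rows perfectly correlated) and $\log n$ (all rows uncorrelated). A submatrix containing a constant row has an undefined Pearson correlation; the sketch does not handle that case.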

Requirements

  • Python (>=3.12)

  • create and activate a Python virtual environment:

     python3.12 -m venv .ent3c_venv
    
     source .ent3c_venv/bin/activate
    
  • install ENT3C and requirements via pyproject.toml:

     pip install .
    
  • requirements are listed in requirements.txt

Running ENT3C

Command-Line Usage

  • run ENT3C directly from the terminal with:
ENT3C <get_entropy|get_similarity|run_all> --config-file=/path/to/config_file/<config.json>
  • <get_entropy> generates a data frame with entropy values according to <config.json>. Output: OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_OUT.csv

  • <get_similarity> generates a data frame with similarities according to <config.json> and OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_OUT.csv. Output: OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_similarity.csv

  • <run_all> generates both data frames, OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_OUT.csv and OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_similarity.csv.

or via the Python API:

import ENT3C
ENT3C.run_get_entropy("config/config.json")
ENT3C.run_get_similarity("config/config.json")
ENT3C.run_all("config/config.json")

Parameters and configuration files of ENT3C

  • The main ENT3C parameter affecting the final entropy signal $S$ is the dimension of the submatrices SUB_M_SIZE_FIX.

    • SUB_M_SIZE_FIX can either be set directly or, alternatively, one can specify CHRSPLIT; in this case the submatrix dimension is computed internally from the number of parts into which the contact matrix is to be partitioned.

      PHI=1+floor((N-SUB_M_SIZE)/phi)

      where N is the size of the input contact matrix, phi is the window shift, and PHI is the number of evaluated submatrices (and consequently the number of data points in $S$).

  • All implementations (ENT3C.py, ENT3C.jl and ENT3C.m) use a configuration file in JSON format.

    • example can be found in <config/config.json>
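
As a worked example of the window-count formula above, the following sketch computes PHI for given N, SUB_M_SIZE and phi. The helper names are hypothetical, N = 6000 is an illustrative value, and both the reading of CHRSPLIT as `n = N // CHRSPLIT` and the rule of raising phi one bin at a time until PHI no longer exceeds PHI_MAX are assumptions; the actual ENT3C logic may differ.

```python
def num_windows(N, n, phi):
    """PHI = 1 + floor((N - n) / phi) for an N x N matrix, window size n, shift phi."""
    if not 0 < n <= N or phi < 1:
        raise ValueError("need 0 < n <= N and phi >= 1")
    return 1 + (N - n) // phi


def shift_for_phi_max(N, n, phi, phi_max):
    """Hypothetical helper: raise phi one bin at a time until PHI <= phi_max."""
    if phi_max < 1:
        raise ValueError("phi_max must be at least 1")
    while num_windows(N, n, phi) > phi_max:
        phi += 1
    return phi


N = 6000      # hypothetical number of bins, chosen for illustration
n = N // 10   # assumption: CHRSPLIT = 10 is interpreted as n = N // CHRSPLIT
phi = shift_for_phi_max(N, n, phi=1, phi_max=1000)
print(n, phi, num_windows(N, n, phi))  # prints: 600 6 901
```

With phi = 1 the same matrix would yield 5401 submatrices, so a cap such as PHI_MAX mainly trades resolution of the entropy signal for run time.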

ENT3C parameters defined in config/config.json

  1. "DATA_PATH": "DATA" $\dots$ input data path.

  2. "FILES" $\dots$ input files in format: [<COOL_FILENAME>, <SHORT_NAME>]

"FILES": [
	"ENCSR079VIJ.BioRep1.40kb.cool",
	"G401_BR1",
	"ENCSR079VIJ.BioRep2.40kb.cool",
	"G401_BR2"]
  • ENT3C also takes mcool files as input. Please refer to biological replicates as "_BR%d" in the <SHORT_NAME>.

⚠ if comparing biological replicate samples, please ensure they are indicated as <_BR#> in the config file ⚠

  3. "OUT_DIR": "OUTPUT/" $\dots$ output directory. OUT_DIR will be concatenated with OUTPUT/JULIA/ or OUTPUT/MATLAB/.

  4. "OUT_PREFIX": "40kb" $\dots$ prefix for output files.

  5. "Resolution": "40e3,100e3" $\dots$ resolutions to be evaluated.

  6. "ChrNr": "15,16,17,18,19,20,21,22,X" $\dots$ chromosome numbers to be evaluated.

  7. "NormM": 0 $\dots$ whether input contact matrices are balanced. If set to 1, the balancing weights stored in the cooler are applied; ENT3C then expects the weights in dataset /resolutions/<resolution>/bins/<WEIGHTS_NAME>.

  8. "WEIGHTS_NAME": "weight" $\dots$ name of the dataset in the cooler containing the normalization weights.

  9. "SUB_M_SIZE_FIX": null $\dots$ fixed submatrix dimension.

  10. "CHRSPLIT": 10 $\dots$ number of submatrices into which the contact matrix is partitioned.

  11. "phi": 1 $\dots$ window shift, i.e. number of bins to the next submatrix.

  12. "PHI_MAX": 1000 $\dots$ number of submatrices, i.e. number of data points in the entropy signal $S$. If set, phi ($\varphi$) is increased until PHI ($\Phi$) $\approx \Phi_{\max}$.
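
The parameters listed above can be collected into a configuration with the Python standard library. The values are the example values from this section; the helper `file_pairs` and the parsing of "Resolution" into integers are illustrative assumptions about how the fields can be read, not ENT3C's own parsing code.

```python
import json

config = {
    "DATA_PATH": "DATA",
    "FILES": [
        "ENCSR079VIJ.BioRep1.40kb.cool", "G401_BR1",
        "ENCSR079VIJ.BioRep2.40kb.cool", "G401_BR2",
    ],
    "OUT_DIR": "OUTPUT/",
    "OUT_PREFIX": "40kb",
    "Resolution": "40e3,100e3",
    "ChrNr": "15,16,17,18,19,20,21,22,X",
    "NormM": 0,
    "WEIGHTS_NAME": "weight",
    "SUB_M_SIZE_FIX": None,  # serialised as JSON null
    "CHRSPLIT": 10,
    "phi": 1,
    "PHI_MAX": 1000,
}


def file_pairs(cfg):
    """Hypothetical helper: split the flat FILES list into (cool file, short name) pairs."""
    files = cfg["FILES"]
    if len(files) % 2:
        raise ValueError("FILES must alternate <COOL_FILENAME>, <SHORT_NAME>")
    return list(zip(files[0::2], files[1::2]))


text = json.dumps(config, indent=4)  # content for a config.json file
resolutions = [int(float(r)) for r in config["Resolution"].split(",")]
pairs = file_pairs(config)
```

Writing `text` to a file such as config/config.json gives a configuration that can be passed to `--config-file` or to `ENT3C.run_all`.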

Output files:

  • <OUT_DIR>/<OUT_PREFIX>_ENT3C_similarity.csv $\dots$ contains all pairwise comparisons. The columns Sample1 and Sample2 contain the short names specified in FILES, and the column Q contains the corresponding similarity score.
cat OUTPUT/PYTHON/EvenChromosomes_NoWeights_ENT3C_similarity.csv  | head
Resolution	ChrNr	Sample1	Sample2	Q
40000	2	HFFc6_BR2	A549_BR2	0.5584659814117208
40000	2	HFFc6_BR2	G401_BR2	0.6594518933893059
40000	2	HFFc6_BR2	HFFc6_BR1	0.8473530463515314
...
  • <OUT_DIR>/<OUT_PREFIX>_ENT3C_OUT.csv $\dots$ ENT3C output table.
cat OUTPUT/PYTHON/EvenChromosomes_NoWeights_ENT3C_OUT.csv  | head
Name	ChrNr	Resolution	n	PHI	phi	binNrStart	binNrEnd	START	END	S
G401_BR1	2	40000	600	901	6	0	599	0	24000000	4.067424893091131
G401_BR1	2	40000	600	901	6	6	605	240000	24240000	4.06198007393338
G401_BR1	2	40000	600	901	6	12	611	480000	24480000	4.055473536905049
G401_BR1	2	40000	600	901	6	18	617	720000	24720000	4.048004132456738
...

Each row corresponds to an evaluated submatrix with the fields Name (the short name specified in FILES), ChrNr, Resolution, the submatrix dimension n, the number of submatrices PHI=1+floor((N-SUB_M_SIZE)/phi), the window shift phi, binNrStart and binNrEnd (the start and end bin of the submatrix), START and END (the corresponding genomic coordinates), and S (the computed von Neumann entropy).
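
The similarity table described above can be inspected with the Python standard library. The sketch below parses the three example rows from this section, assuming the file is tab-separated as displayed; `cell_line` is a hypothetical helper that relies on the `_BR<number>` naming convention for biological replicates.

```python
import csv
import io

# The three example rows from the similarity table, assumed to be tab-separated.
table = (
    "Resolution\tChrNr\tSample1\tSample2\tQ\n"
    "40000\t2\tHFFc6_BR2\tA549_BR2\t0.5584659814117208\n"
    "40000\t2\tHFFc6_BR2\tG401_BR2\t0.6594518933893059\n"
    "40000\t2\tHFFc6_BR2\tHFFc6_BR1\t0.8473530463515314\n"
)

rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))
for row in rows:
    row["Q"] = float(row["Q"])  # similarity scores are stored as text in the file


def cell_line(short_name):
    """Hypothetical helper: strip the replicate suffix, e.g. 'G401_BR1' -> 'G401'."""
    return short_name.rsplit("_BR", 1)[0]


# Most similar pair, and all comparisons between replicates of the same sample.
best = max(rows, key=lambda r: r["Q"])
replicate_pairs = [r for r in rows if cell_line(r["Sample1"]) == cell_line(r["Sample2"])]
print(best["Sample1"], best["Sample2"])  # prints: HFFc6_BR2 HFFc6_BR1
```

For a real run, `io.StringIO(table)` would be replaced by an open handle on the similarity file in the output directory.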
