Compute similarity between genomic contact matrices with "Entropy 3C"
Project description
ENT3C is a method for quantifying the similarity of micro-C/Hi-C derived chromosomal contact matrices. It is based on the von Neumann entropy [1] and recent work on entropy quantification of Pearson correlation matrices [2]. For a contact matrix, ENT3C records the change in local pattern complexity of smaller Pearson-transformed submatrices along the matrix diagonal to generate a characteristic entropy signal. Similarity is defined as the Pearson correlation between the entropy signals of two contact matrices.
https://github.com/X3N1A/ENT3C
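The description above can be illustrated with a minimal NumPy sketch. It is not the package's implementation: each submatrix along the diagonal is Pearson-transformed and scaled to unit trace, and its von Neumann entropy is computed from the eigenvalues. The function names, the handling of constant rows, and the random test matrices are assumptions made for illustration; ENT3C itself may preprocess the contact matrix differently.

```python
import numpy as np


def von_neumann_entropy(sub: np.ndarray) -> float:
    """Von Neumann entropy of a Pearson-transformed square submatrix."""
    with np.errstate(invalid="ignore", divide="ignore"):
        P = np.corrcoef(sub)              # Pearson correlation between rows
    P = np.nan_to_num(P, nan=0.0)         # assumption: constant rows count as uncorrelated
    np.fill_diagonal(P, 1.0)
    rho = P / P.shape[0]                  # unit trace, so eigenvalues sum to 1
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]                # drop numerically zero eigenvalues
    return float(-np.sum(lam * np.log(lam)))


def entropy_signal(M: np.ndarray, n: int, phi: int) -> np.ndarray:
    """Entropy of every n x n submatrix along the diagonal, shifted by phi bins."""
    N = M.shape[0]
    starts = range(0, N - n + 1, phi)     # 1 + floor((N - n) / phi) windows
    return np.array([von_neumann_entropy(M[s:s + n, s:s + n]) for s in starts])


def similarity(S1: np.ndarray, S2: np.ndarray) -> float:
    """Pearson correlation between two entropy signals."""
    return float(np.corrcoef(S1, S2)[0, 1])


# Hypothetical symmetric "contact matrices": a random matrix and a noisy copy.
rng = np.random.default_rng(0)
A = rng.random((40, 40))
A = A + A.T
noise = rng.random((40, 40))
B = A + 0.05 * (noise + noise.T)

S_a = entropy_signal(A, n=10, phi=5)
S_b = entropy_signal(B, n=10, phi=5)
Q = similarity(S_a, S_b)
```

Each entropy value lies between 0 and log(n), and the signal has one data point per evaluated submatrix.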
Installation
- Generate and activate a Python environment:

      python3.11 -m venv .ent3c_venv
      source .ent3c_venv/bin/activate

- Install ENT3C:

      pip install ENT3C
Running ENT3C
Command-Line Usage
```
usage: ENT3C [-h] [--version] {get_entropy,get_similarity,run_all} --config-file=/path/to/config_file/<config.json>
```
* ```get_entropy``` generates a data frame with entropy values according to ```<config.json>```. Output: ```OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_OUT.csv```
* ```get_similarity``` generates a data frame with similarities according to ```<config.json>``` and ```OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_OUT.csv```. Output: ```OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_similarity.csv```
* ```run_all``` generates both the ```OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_OUT.csv``` and the ```OUTPUT/PYTHON/<OUT_PREFIX>_ENT3C_similarity.csv``` data frames.
Or via the Python API:

    import ENT3C
    ENT3C.run_get_entropy("config/config.json")
    ENT3C.run_get_similarity("config/config.json")
    ENT3C.run_all("config/config.json")
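Each of these calls needs a configuration file. The following sketch writes one with the standard library, using the keys and example values listed in the parameter section of this README. Treating the configuration as a single flat JSON object is an assumption, and `write_config` and `file_pairs` are hypothetical helpers, not part of ENT3C.

```python
import json
import tempfile
from pathlib import Path

# Keys and example values as documented in this README.
# Assumption: the configuration is one flat JSON object.
config = {
    "DATA_PATH": "DATA",
    "FILES": [
        "ENCSR079VIJ.BioRep1.40kb.cool", "G401_BR1",
        "ENCSR079VIJ.BioRep2.40kb.cool", "G401_BR2",
    ],
    "OUT_DIR": "OUTPUT/",
    "OUT_PREFIX": "40kb",
    "Resolution": "40e3,100e3",
    "ChrNr": "15,16,17,18,19,20,21,22,X",
    "NormM": 0,
    "WEIGHTS_NAME": "weight",
    "SUB_M_SIZE_FIX": None,   # serialized as JSON null
    "CHRSPLIT": 10,
    "phi": 1,
    "PHI_MAX": 1000,
}


def write_config(cfg: dict, path: Path) -> Path:
    """Hypothetical helper: serialize the configuration to a JSON file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cfg, indent=4))
    return path


def file_pairs(cfg: dict) -> list:
    """Hypothetical helper: split FILES into (<COOL_FILENAME>, <SHORT_NAME>) pairs."""
    files = cfg["FILES"]
    if len(files) % 2:
        raise ValueError("FILES must alternate <COOL_FILENAME>, <SHORT_NAME>")
    return list(zip(files[0::2], files[1::2]))


# Written to a temporary directory here; in practice e.g. config/config.json.
config_path = write_config(config, Path(tempfile.mkdtemp()) / "config.json")
pairs = file_pairs(config)

# With ENT3C installed and the .cool files present in DATA_PATH:
# import ENT3C
# ENT3C.run_all(str(config_path))
```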
Parameters and configuration files of ENT3C
- The main ENT3C parameter affecting the final entropy signal $S$ is the dimension of the submatrices, `SUB_M_SIZE_FIX`.
- `SUB_M_SIZE_FIX` can either be set directly or, alternatively, one can specify `CHRSPLIT`; in this case `SUB_M_SIZE_FIX` is computed internally to fit the desired number of times the contact matrix is to be partitioned.
- `PHI=1+floor((N-SUB_M_SIZE)./phi)`, where `N` is the size of the input contact matrix, `phi` is the window shift, and `PHI` is the number of evaluated submatrices (and consequently the number of data points in $S$).
- All implementations (`ENT3C.py`, `ENT3C.jl` and `ENT3C.m`) use a configuration file in JSON format. An example can be found in `config/config.json`.
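The relationship between `N`, `SUB_M_SIZE_FIX`, `CHRSPLIT`, `phi` and `PHI` can be worked through in a few lines. This is a sketch under stated assumptions: the submatrix size is taken as `floor(N / CHRSPLIT)` and `phi` is incremented by one until `PHI` no longer exceeds `PHI_MAX`. The package may implement either step differently, and the matrix size `N = 6000` is hypothetical.

```python
from typing import Optional, Tuple


def submatrix_size(N: int, sub_m_size_fix: Optional[int] = None,
                   chrsplit: Optional[int] = None) -> int:
    """Submatrix dimension n: the fixed value, or derived from CHRSPLIT."""
    if sub_m_size_fix is not None:
        return sub_m_size_fix
    if chrsplit is None:
        raise ValueError("either SUB_M_SIZE_FIX or CHRSPLIT must be given")
    return N // chrsplit                   # assumption: floor(N / CHRSPLIT)


def n_windows(N: int, n: int, phi: int) -> int:
    """PHI = 1 + floor((N - n) / phi)."""
    return 1 + (N - n) // phi


def fit_phi(N: int, n: int, phi: int = 1,
            phi_max: Optional[int] = None) -> Tuple[int, int]:
    """Assumption: phi grows by one until PHI <= PHI_MAX."""
    if phi_max is not None:
        while n_windows(N, n, phi) > phi_max:
            phi += 1
    return phi, n_windows(N, n, phi)


N = 6000                                   # hypothetical number of bins
n = submatrix_size(N, chrsplit=10)         # 600
phi, PHI = fit_phi(N, n, phi=1, phi_max=1000)
print(n, phi, PHI)                         # prints: 600 6 901
```

With `phi = 1` the same matrix would yield 5401 submatrices; raising `phi` to 6 brings the count down to 901, the first value not exceeding `PHI_MAX = 1000`.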
ENT3C parameters defined in config/config.json
- `"DATA_PATH": "DATA"` $\dots$ input data path.
- `"FILES"` $\dots$ input files in the format `[<COOL_FILENAME>, <SHORT_NAME>]`:

      "FILES": [
          "ENCSR079VIJ.BioRep1.40kb.cool",
          "G401_BR1",
          "ENCSR079VIJ.BioRep2.40kb.cool",
          "G401_BR2"]

  ENT3C also takes `mcool` files as input. Please refer to biological replicates as "_BR%d" in the `<SHORT_NAME>`.

  ⚠ If comparing biological replicate samples, please ensure they are indicated as `<_BR#>` in the config file. ⚠

- `"OUT_DIR": "OUTPUT/"` $\dots$ output directory. `OUT_DIR` will be concatenated with `OUTPUT/JULIA/` or `OUTPUT/MATLAB/`.
- `"OUT_PREFIX": "40kb"` $\dots$ prefix for output files.
- `"Resolution": "40e3,100e3"` $\dots$ resolutions to be evaluated.
- `"ChrNr": "15,16,17,18,19,20,21,22,X"` $\dots$ chromosome numbers to be evaluated.
- `"NormM": 0` $\dots$ input contact matrices can be balanced. If `NormM: 1`, the balancing weights stored in the cooler are applied; in this case ENT3C expects the weights to be in the dataset `/resolutions/<resolution>/bins/<WEIGHTS_NAME>`.
- `"WEIGHTS_NAME": "weight"` $\dots$ name of the dataset in the cooler containing the normalization weights.
- `"SUB_M_SIZE_FIX": null` $\dots$ fixed submatrix dimension.
- `"CHRSPLIT": 10` $\dots$ number of submatrices into which the contact matrix is partitioned.
- `"phi": 1` $\dots$ window shift, i.e. the number of bins to the next submatrix.
- `"PHI_MAX": 1000` $\dots$ number of submatrices, i.e. the number of data points in the entropy signal $S$. If set, $\varphi$ is increased until $\Phi \approx \Phi_{\max}$.
Output files:
`<OUT_DIR>/<OUT_PREFIX>_ENT3C_similarity.csv` $\dots$ will contain all combinations of comparisons. The columns `Sample1` and `Sample2` contain the short names specified in `FILES`, and the column `Q` contains the corresponding similarity score.

    cat OUTPUT/PYTHON/EvenChromosomes_NoWeights_ENT3C_similarity.csv | head
    Resolution ChrNr Sample1   Sample2   Q
    40000      2     HFFc6_BR2 A549_BR2  0.5584659814117208
    40000      2     HFFc6_BR2 G401_BR2  0.6594518933893059
    40000      2     HFFc6_BR2 HFFc6_BR1 0.8473530463515314
    ...
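As an illustration of how the similarity table might be consumed, the sketch below parses the three rows shown above and separates the biological replicate comparison from the others, relying on the `_BR<number>` suffix convention for short names. It assumes whitespace-separated columns as displayed here; the delimiter in the actual file may differ. `parse_table` and `cell_line` are hypothetical helpers.

```python
import re

# The rows shown in this README, assumed whitespace-separated as displayed.
sample = """\
Resolution ChrNr Sample1 Sample2 Q
40000 2 HFFc6_BR2 A549_BR2 0.5584659814117208
40000 2 HFFc6_BR2 G401_BR2 0.6594518933893059
40000 2 HFFc6_BR2 HFFc6_BR1 0.8473530463515314
"""


def parse_table(text: str) -> list:
    """Hypothetical helper: turn the table into a list of row dictionaries."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def cell_line(short_name: str) -> str:
    """Strip the _BR<number> replicate suffix from a short name."""
    return re.sub(r"_BR\d+$", "", short_name)


rows = parse_table(sample)
replicates = [r for r in rows
              if cell_line(r["Sample1"]) == cell_line(r["Sample2"])]
others = [r for r in rows if r not in replicates]

for r in replicates:
    print(r["Sample1"], r["Sample2"], float(r["Q"]))
# prints: HFFc6_BR2 HFFc6_BR1 0.8473530463515314
```

In this excerpt the replicate pair has a higher `Q` than either comparison between different cell lines.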
`<OUT_DIR>/<OUT_PREFIX>_ENT3C_OUT.csv` $\dots$ ENT3C output table.

    cat OUTPUT/PYTHON/EvenChromosomes_NoWeights_ENT3C_OUT.csv | head
    Name     ChrNr Resolution n   PHI phi binNrStart binNrEnd START  END      S
    G401_BR1 2     40000      600 901 6   0          599      0      24000000 4.067424893091131
    G401_BR1 2     40000      600 901 6   6          605      240000 24240000 4.06198007393338
    G401_BR1 2     40000      600 901 6   12         611      480000 24480000 4.055473536905049
    G401_BR1 2     40000      600 901 6   18         617      720000 24720000 4.048004132456738
    ...

Each row corresponds to an evaluated submatrix with the fields `Name` (the short name specified in `FILES`), `ChrNr`, `Resolution`, `n` (the submatrix dimension), `PHI=1+floor((N-SUB_M_SIZE)./phi)`, `phi` (the window shift), `binNrStart` and `binNrEnd` (the start and end bins of the submatrix), `START` and `END` (the corresponding genomic coordinates) and `S` (the computed von Neumann entropy).
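The bin and coordinate columns of the excerpt above can be reproduced from `n`, `phi` and `Resolution` alone. The sketch below assumes that consecutive windows start `phi` bins apart and that bins map linearly onto genomic coordinates, which holds for the four rows shown but is not confirmed for the whole file (for example if empty bins are excluded). `window_coordinates` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def window_coordinates(n: int, phi: int, resolution: int,
                       count: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (binNrStart, binNrEnd, START, END) for the first `count` windows.

    Assumption: bins are contiguous, so START = binNrStart * resolution
    and END = (binNrEnd + 1) * resolution.
    """
    for k in range(count):
        bin_start = k * phi
        bin_end = bin_start + n - 1
        yield bin_start, bin_end, bin_start * resolution, (bin_end + 1) * resolution


# Values taken from the excerpt: n = 600, phi = 6, Resolution = 40000.
windows = list(window_coordinates(n=600, phi=6, resolution=40000, count=4))
for w in windows:
    print(*w)
# prints:
# 0 599 0 24000000
# 6 605 240000 24240000
# 12 611 480000 24480000
# 18 617 720000 24720000
```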
Download files
Source Distribution
File details
Details for the file ent3c-2.1.2.tar.gz.
File metadata
- Download URL: ent3c-2.1.2.tar.gz
- Upload date:
- Size: 29.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | caaf7537eb7d4aab882b4c30d40b52a298bed5c865cbb4b8489aa365bcb80e55 |
| MD5 | 97c2bcbb3b7f7e8f6a63af714368d3d0 |
| BLAKE2b-256 | b69a432d815e3045c0c257cee39e59ec0cc8e3cced53db08b109f3589400bfc5 |