ENT3C is a method for quantifying the similarity of micro-C/Hi-C derived chromosomal contact matrices. It is based on the von Neumann entropy [1] and recent work on entropy quantification of Pearson correlation matrices [2]. For a contact matrix, ENT3C records the change in local pattern complexity of smaller Pearson-transformed submatrices along the matrix diagonal to generate a characteristic signal. Similarity is defined as the Pearson correlation between the respective entropy signals of two contact matrices.
https://doi.org/10.1093/nargab/lqae076
- ENT3C loads the cooler files and looks for shared empty bins.
- ENT3C first takes the logarithm of an input matrix $\mathbf{M}$.
- Next, smaller submatrices $\mathbf{a}$ of dimension $n\times n$ are extracted along the diagonal of the input contact matrix $\mathbf{M}$.
- `NaN` values in $\mathbf{a}$ are set to the minimum value in $\mathbf{a}$.
- $\mathbf{a}$ is transformed into a Pearson correlation matrix $\mathbf{P}$.
- $\mathbf{P}$ is transformed into $\boldsymbol{\rho}=\mathbf{P}/n$ (unit trace) to fulfill the conditions for computing the von Neumann entropy.
- The von Neumann entropy of $\boldsymbol{\rho}$ is computed as $S(\boldsymbol{\rho})=-\sum_j \lambda_j \log \lambda_j$, where $\lambda_j$ is the $j$th eigenvalue of $\boldsymbol{\rho}$.
- This is repeated for subsequent submatrices along the diagonal of the input matrix, and the values are stored in the entropy signal $\mathbf{S}_{\mathbf{M}}$.
- Similarity $Q$ is defined as the Pearson correlation $r$ between the entropy signals of two matrices: $Q(\mathbf{M}_1,\mathbf{M}_2) = r(\mathbf{S}_{\mathbf{M}_1},\mathbf{S}_{\mathbf{M}_2})$.
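
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ENT3C implementation: the function names (`von_neumann_entropy`, `entropy_signal`, `similarity`) are hypothetical, non-finite values after the logarithm (e.g. from zero counts) are assumed to be treated as `NaN`, the handling of constant rows is an assumption, and the two toy matrices are random.

```python
import numpy as np


def von_neumann_entropy(sub):
    """Entropy of one n x n submatrix, following the steps listed above."""
    a = np.array(sub, dtype=float)
    n = a.shape[0]
    # NaN values are set to the minimum value in the submatrix.
    a[np.isnan(a)] = np.nanmin(a)
    with np.errstate(divide="ignore", invalid="ignore"):
        P = np.corrcoef(a)  # Pearson correlation between the rows of a
    # Assumption: constant rows give NaN correlations; treat them as
    # uncorrelated and restore the unit diagonal so that trace(P) = n.
    P = np.nan_to_num(P, nan=0.0)
    np.fill_diagonal(P, 1.0)
    rho = P / n  # unit trace
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]  # 0 * log(0) is taken as 0
    return float(-np.sum(lam * np.log(lam)))


def entropy_signal(M, n, phi):
    """Slide an n x n window along the diagonal of M in steps of phi bins."""
    M = np.asarray(M, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        logM = np.log(M)
    # Assumption: log(0) = -inf (and any other non-finite value) is missing.
    logM[~np.isfinite(logM)] = np.nan
    N = logM.shape[0]
    starts = range(0, N - n + 1, phi)
    return np.array([von_neumann_entropy(logM[s:s + n, s:s + n]) for s in starts])


def similarity(S1, S2):
    """Q: Pearson correlation between two entropy signals."""
    return float(np.corrcoef(S1, S2)[0, 1])


# Toy data (random, symmetric); not real contact matrices.
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(60, 60)).astype(float)
M1 = counts + counts.T
noise = rng.poisson(1.0, size=(60, 60)).astype(float)
M2 = M1 + noise + noise.T

S1 = entropy_signal(M1, n=20, phi=5)
S2 = entropy_signal(M2, n=20, phi=5)
Q = similarity(S1, S2)
print(len(S1), Q)
```

Each entropy value lies between 0 and $\log n$, and the signal has one data point per window position.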
Exemplary depiction of the ENT3C derivation of the entropy signal
- Generate and activate a Python environment:

      python3.11 -m venv .ent3c_venv
      source .ent3c_venv/bin/activate

- Install ENT3C:

      pip install ENT3C

A pre-built Linux executable is available in the Releases section (v.2.2.0).

- Download the file and make it executable:

      chmod +x ./path/to/exe/ENT3C_exe

- For global use, add the path to your `~/.bashrc` file:

      export PATH="$PATH:/path/to/exe/"

💡 Note that the Python package or the executable is recommended.

- `compare_groups` is not currently available for the MATLAB and Julia implementations.
- MATLAB scripts are in the `matlab_version_ENT3C` directory.
- Julia scripts are in `julia_version_ENT3C`:
  - Packages: DataFrames, BenchmarkTools, JSON, Printf, Plots, ColorSchemes, SuiteSparse, HDF5, NaNStatistics, Statistics, Combinatorics, CSV
  - For the Julia implementation, Ubuntu's `hdf5-tools` is also required.
  - Initial Julia set-up:
    - Option for automatic global installation: `--install-deps=yes` (works with any Julia version).
    - Predefined Julia environments for Julia versions 1.10.4 or 1.11.2 are defined in `project_files/<v.v.v>/Manifest.toml` and `project_files/<v.v.v>/Project.toml`.
    - Option to load the environments: `--resolve-env=yes` and `--julia-version=<v.v.v>`.

- CLI (Python) usage:

      Usage:
        ENT3C <command> --config=<path/to/config.json> [options]

      Commands:
        get_entropy       Generates entropy output file <entropy_out_FN>.
        get_similarity    Generates similarity output file <similarity_out_FN> from <entropy_out_FN>.
        run_all           Generates <entropy_out_FN> and <similarity_out_FN>.
        compare_groups    Compare signal groups (requires --group1 and --group2 options)

      Global Options:
        --config=<path>   Path to config JSON file (required for all commands)

      <compare_groups> Options:
        --group1=<GROUP>  First group name, must correspond to what comes before _BR* in config file.
        --group2=<GROUP>  Second group name, must correspond to what comes before _BR* in config file.

      Examples:
        ENT3C run_all --config=configs/myconfig.json
        ENT3C get_entropy --config=configs/myconfig.json
        ENT3C get_similarity --config=configs/myconfig.json
        ENT3C compare_groups --config=configs/myconfig.json --group1=H1-hESC --group2=K562

- For the Linux executable use:

      ENT3C_exe <command> --config=<path/to/config.json> [options]

- Alternatively, run ENT3C in Python as:

      import ENT3C

      ENT3C_OUT = ENT3C.run_get_entropy("config/myconfig.json")
      Similarity = ENT3C.run_get_similarity("config/myconfig.json")
      ENT3C_OUT, Similarity = ENT3C.run_all("config/myconfig.json")

      # group1 and group2 are the group names as strings, e.g. "H1-hESC" and "K562"
      EUCLIDEAN = ENT3C.run_compare_groups("config/myconfig.json", group1, group2)

- Initial call for global package installation (see "Initial Julia set-up"):

      julia ENT3C.jl --config-file=config/config.test.json --install-deps=yes

- After initialization:

      julia ENT3C.jl --config-file=config/config.json

- Alternatively, load the predefined environments for Julia 1.10.4 or 1.11.2:

      julia ENT3C.jl --config-file=config/config.json --resolve-env=yes --julia-version=<v.v.v>

- Run the MATLAB implementation with:

      matlab -nodesktop -nosplash -nodisplay -r "ENT3C('config/config.json'); exit"

💡 Note that the MATLAB and Julia implementations always generate both the entropy and similarity dataframes.

- All ENT3C parameters are defined in `.json` files (e.g. `config/config.json`). Examples can be found in the `config` directory.

Parameters defined in `<config_file>`:

- The main ENT3C parameter affecting the final entropy signal $S$ is the dimension of the submatrices, `SUB_M_SIZE_FIX`.
  - `"SUB_M_SIZE_FIX": <integer>` $\dots$ fixed submatrix dimension.
  - `SUB_M_SIZE_FIX` can either be fixed directly or, alternatively, one can specify `CHRSPLIT`; in this case `SUB_M_SIZE_FIX` is computed internally to fit the desired number of partitions of the contact matrix.

    `PHI=1+floor((N-SUB_M_SIZE)./phi)`

    where `N` is the size of the input contact matrix, `phi` is the window shift, and `PHI` is the number of evaluated submatrices (and consequently the number of data points in $S$).
  - `"CHRSPLIT": <integer>` $\dots$ number of partitions into which an $N \times N$ contact matrix is divided, which defines `SUB_M_SIZE_FIX = floor(N/CHRSPLIT+0.5)`. If `CHRSPLIT` is specified, set `"SUB_M_SIZE_FIX": null`; otherwise set `"CHRSPLIT": null`.
- `"DATA_PATH": "</path/to/data>"` $\dots$ input data path.
- `"FILES"` $\dots$ input files in the format `[<COOL_FILENAME>, <SHORT_NAME>]`:

      "FILES": [
          "ENCSR079VIJ.BioRep1.40kb.cool", "G401_BR1",
          "ENCSR079VIJ.BioRep2.40kb.cool", "G401_BR2"]

  - Any biological replicates must be indicated in `<SHORT_NAME>` using the suffix `_BR%d`.
  - Note: ENT3C also takes `mcool` files as input.
- `"OUT_DIR": "<desired_output_directory_name>"` $\dots$ output directory. `OUT_DIR` will be concatenated with `OUTPUT/JULIA/` or `OUTPUT/MATLAB/`.
- `"OUT_PREFIX": "<desired_output_prefix_>"` $\dots$ prefix for output files.
- `"Resolution": "<integer,integer,...>"`, e.g. `"40e3,100e3"` $\dots$ resolutions to be evaluated.
- `"ChrNr": "<integer,integer,...>"`, e.g. `"15,16,17,18,19,20,21,22,X"` $\dots$ chromosomes to be evaluated.
- `"NormM": <0|1>` $\dots$ input contact matrices can be balanced. If `NormM: 1`, the balancing weights in the cooler are applied, and ENT3C expects the weights to be in the dataset `/resolutions/<resolution>/bins/<WEIGHTS_NAME>`.
- `"WEIGHTS_NAME": "<name_of_weights>"` $\dots$ name of the dataset in the cooler containing the normalization weights.
- `"phi": <integer>` $\dots$ window shift, i.e. the number of bins to the next submatrix.
- `"PHI_MAX": <integer>` $\dots$ number of submatrices, i.e. number of data points in the entropy signal $S$. If set, `phi` ($\varphi$) is increased until `PHI` ($\Phi$) satisfies $\Phi \approx \Phi_{\max}$.
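
The relationships between `SUB_M_SIZE_FIX`, `CHRSPLIT`, `phi`, `PHI` and `PHI_MAX` can be checked with a short calculation. The sketch below is illustrative only: the helper names are hypothetical, the matrix size `N = 6002` is a made-up value, and the rule used for `PHI_MAX` (increase `phi` by one bin until `PHI <= PHI_MAX`) is an assumption about how "increased until $\Phi \approx \Phi_{\max}$" could be realized.

```python
import math


def submatrix_size(N, sub_m_size_fix=None, chrsplit=None):
    """Submatrix dimension: fixed, or derived as floor(N/CHRSPLIT + 0.5)."""
    if sub_m_size_fix is not None:
        return int(sub_m_size_fix)
    if chrsplit is None:
        raise ValueError("set either SUB_M_SIZE_FIX or CHRSPLIT")
    return math.floor(N / chrsplit + 0.5)


def n_windows(N, n, phi):
    """PHI = 1 + floor((N - n) / phi): number of evaluated submatrices."""
    if n > N:
        raise ValueError("submatrix is larger than the contact matrix")
    return 1 + (N - n) // phi


def adjust_phi(N, n, phi, phi_max=None):
    """Assumed rule: grow phi bin by bin until PHI no longer exceeds PHI_MAX."""
    if phi_max is None:
        return phi
    if phi_max < 1:
        raise ValueError("PHI_MAX must be at least 1")
    while n_windows(N, n, phi) > phi_max:
        phi += 1
    return phi


N = 6002  # hypothetical number of bins in the contact matrix

n_fixed = submatrix_size(N, sub_m_size_fix=500)
n_split = submatrix_size(N, chrsplit=12)
PHI = n_windows(N, n_fixed, phi=6)
phi_adj = adjust_phi(N, n_fixed, phi=1, phi_max=1000)

print(n_fixed, n_split, PHI, phi_adj)
```

With these toy numbers, a fixed submatrix size of 500 and a window shift of 6 give 918 windows.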

- `<OUT_DIR>/<OUTPUT_PREFIX>_ENT3C_similarity.csv` $\dots$ contains all combinations of comparisons. The columns `Sample1` and `Sample2` contain the short names specified in `FILES`, and the column `Q` contains the corresponding similarity score.

  `OUTPUT/PYTHON/EvenChromosomes_NoWeights_ENT3C_similarity.csv`:

  | Resolution | ChrNr | Sample1 | Sample2 | Q |
  |---|---|---|---|---|
  | 40000 | 2 | G401_BR1 | G401_BR2 | 0.9978330002118974 |
  | 40000 | 2 | G401_BR1 | LNCap_BR1 | 0.4129094106283695 |
  | 40000 | 2 | G401_BR1 | LNCap_BR2 | 0.3049196919642929 |
  | ... | ... | ... | ... | ... |

- `<OUT_DIR>/<OUTPUT_PREFIX>_ENT3C_OUT.csv` $\dots$ ENT3C output table.

  `OUTPUT/PYTHON/EvenChromosomes_NoWeights_ENT3C_OUT.csv`:

  | Name | ChrNr | Resolution | n | PHI | phi | binNrStart | binNrEnd | START | END | S |
  |---|---|---|---|---|---|---|---|---|---|---|
  | G401_BR1 | 2 | 40000 | 500 | 918 | 6 | 0 | 499 | 0 | 20000000 | 3.7896426915562462 |
  | G401_BR1 | 2 | 40000 | 500 | 918 | 6 | 6 | 505 | 240000 | 20240000 | 3.789044181663418 |
  | G401_BR1 | 2 | 40000 | 500 | 918 | 6 | 12 | 511 | 480000 | 20480000 | 3.7918253959272032 |
  | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

  Each row corresponds to an evaluated submatrix with the fields `Name` (the short name specified in `FILES`), `ChrNr`, `Resolution`, the submatrix dimension (`sub_m_dim`; column `n` in the table), `PHI=1+floor((N-SUB_M_SIZE)./phi)`, and `phi`. `binNrStart` and `binNrEnd` are the start and end bins of the submatrix, `START` and `END` are the corresponding genomic coordinates, and `S` is the computed von Neumann entropy.

  - Example of output generated for `ENT3C get_entropy --config=config/myconfig.json`: `EvenChromosomes_NoWeights_40kb_ENT3C_signals.pdf` (unbalanced 40 kb contact matrices for even chromosomes across 5 cell lines; `SUB_MATRIX_SIZE` was 500).

- `<OUT_DIR>/<OUTPUT_PREFIX>_Eucl_<group1>vs<group2>.csv` $\dots$ Euclidean distance between average z-scores of `S` over `<group1>` and `<group2>` (here group1=HFFc6, group2=G401), arranged in descending order of `meanS_Euclidean`.

  | Resolution | ChrNr | START | END | meanS_Euclidean |
  |---|---|---|---|---|
  | 40000 | 6 | 62360000 | 82360000 | 3.3625023926723685 |
  | 40000 | 6 | 62120000 | 82120000 | 3.3546076641065095 |
  | 40000 | 6 | 61880000 | 81880000 | 3.3441925121710026 |
  | ... | ... | ... | ... | ... |

  - Example of the first page of output generated for `ENT3C compare_groups --config=config/myconfig.json --group1=HFFc6 --group2=G401`: `EvenChromosomes_NoWeights_Eucl_40kb_HFFc6vsG401.pdf`
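
To illustrate how the two tables relate, the sketch below derives a similarity table from records shaped like rows of the entropy output table. It is not ENT3C code: `pairwise_similarity`, the sample names and all `S` values are hypothetical, the records are built in memory rather than read from a file, and every sample is assumed to have the same window positions per chromosome and resolution.

```python
from itertools import combinations

import numpy as np


def pairwise_similarity(rows):
    """Pearson correlation of S between all sample pairs, per chromosome and resolution."""
    signals = {}
    for row in rows:
        key = (row["Resolution"], row["ChrNr"])
        by_name = signals.setdefault(key, {})
        by_name.setdefault(row["Name"], []).append((row["binNrStart"], row["S"]))

    table = []
    for (resolution, chrom), by_name in signals.items():
        for s1, s2 in combinations(sorted(by_name), 2):
            # Order each signal by window start before correlating.
            x = [s for _, s in sorted(by_name[s1])]
            y = [s for _, s in sorted(by_name[s2])]
            q = float(np.corrcoef(x, y)[0, 1])
            table.append({"Resolution": resolution, "ChrNr": chrom,
                          "Sample1": s1, "Sample2": s2, "Q": q})
    return table


# Made-up entropy values for three hypothetical samples (four windows each).
toy_S = {
    "A_BR1": [3.70, 3.75, 3.72, 3.80],
    "A_BR2": [3.60, 3.66, 3.61, 3.72],
    "B_BR1": [3.80, 3.70, 3.78, 3.65],
}
rows = [
    {"Name": name, "ChrNr": "2", "Resolution": 40000, "binNrStart": 6 * i, "S": s}
    for name, values in toy_S.items()
    for i, s in enumerate(values)
]

table = pairwise_similarity(rows)
for record in table:
    print(record)
```

In this toy example the two replicates of sample A correlate strongly, while A and B are anti-correlated.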
Both Julia and MATLAB implementations (ENT3C.jl and ENT3C.m) were tested on Hi-C and micro-C contact matrices binned at 40 kb in cool format.
micro-C
| Cell line | Biological Replicate (BR) | Accession (Experiment set) | Accession (pairs) |
|---|---|---|---|
| H1-hESC | 1 | 4DNES21D8SP8 | 4DNFING6ZFDF, 4DNFIBMG8YA3, 4DNFIMT4PHZ1, 4DNFI8GM4EL9 |
| H1-hESC | 2 | 4DNES21D8SP8 | 4DNFIIYUGYBU, 4DNFI89L17XY, 4DNFIXP9MVBU, 4DNFI2YHYAJO, 4DNFIULY29IQ |
| HFFc6 | 1 | 4DNESphiT3UBH | 4DNFIN7IIIY6, 4DNFIJZDEIZ3, 4DNFIYBTHGNA, 4DNFIK8UIB5B |
| HFFc6 | 2 | 4DNESphiT3UBH | 4DNFIF5F4HRG, 4DNFIK82YRNM, 4DNFIATCW955, 4DNFIZU6ADT1, 4DNFIKWV6BY2 |
| HFFc6 | 3 | 4DNESphiT3UBH | 4DNFIFJL4JIH, 4DNFIONHB78N, 4DNFIG1ZOVIM, 4DNFIPKVL9YI, 4DNFIJM966UR, 4DNFIV8JNJB8 |
Hi-C
| Cell line | Biological Replicate (BR) | Accession (Experiment set) | Accession (BAM) |
|---|---|---|---|
| G401 | 1 | ENCSR079VIJ | ENCFF649MAY |
| G401 | 2 | ENCSR079VIJ | ENCFF758WUD |
| LNCaP | 1 | ENCSR346DCU | ENCFF977XHB |
| LNCaP | 2 | ENCSR346DCU | ENCFF204XII |
| A549 | 1 | ENCSR444WCZ | ENCFF867DCM |
| A549 | 2 | ENCSR444WCZ | ENCFF532XBC |

- For the Hi-C data, `bam` files were downloaded from the ENCODE data portal and converted into `pairs` files using the `pairtools parse` function [3]:

      pairtools parse --chroms-path hg38.fa.sizes -o <OUT.pairs.gz> --assembly hg38 --no-flip --add-columns mapq --drop-sam --drop-seq --nproc-in 15 --nproc-out 15 <IN.bam>

- For the micro-C data, `pairs` files of technical replicates (TRs) were merged with `pairtools merge`. E.g. for H1-hESC, BR1 (4DNES21D8SP8):

      pairtools merge -o <hESC.BR1.pairs.gz> --nproc 10 4DNFING6ZFDF.pairs.gz 4DNFIBMG8YA3.pairs.gz 4DNFIMT4PHZ1.pairs.gz 4DNFI8GM4EL9.pairs.gz

- 40 kb coolers were generated from the Hi-C/micro-C pairs files with the `cooler cload pairs` function [4]:

      cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 --assembly hg38 <CHRSIZE_FILE:40000> <IN.pairs.gz> <OUT.cool>

1. von Neumann, J. Thermodynamik quantenmechanischer Gesamtheiten. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse. 1927. 273-291.
2. Felippe, H., et al. Threshold-free estimation of entropy from a Pearson matrix. EPL. 141(3):31003. 2023.
3. Open2C et al. Pairtools: from sequencing data to chromosome contacts. bioRxiv. 2023.
4. Abdennur, N., and Mirny, L.A. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. 2020.