Co-clustering of chromatin accessibility data across cell types
chromcocluster
chromcocluster is a Python package for co-clustering chromatin accessibility data across multiple cell types in a manner that reflects a cell type lineage tree.
chromcocluster takes as input a tree describing the lineage structure of a collection of cell types and an accessibility matrix describing genomic accessibility across the cell types and provides as output a clustering of the loci and a clustering of the cell types. Together the locus and cell type clusters decompose the accessibility matrix into a grid of submatrices, a co-clustering, representing subsets of loci with similar accessibility across subsets of cell types.
Locus clustering is achieved through the Louvain algorithm. Cell type clustering is achieved through an algorithm that selects as clusters coherent components of the cell type lineage tree, thereby associating accessibility patterns with lineage structure. Details regarding the algorithms implemented in chromcocluster can be found in
- George, Strawn, and Leviyang. Tree Based Co-Clustering Identifies Chromatin Accessibility Patterns Associated with the Lineage Structure of Hematopoiesis, bioRxiv.
chromcocluster was written by Thomas George and Sivan Leviyang. If you use chromcocluster, please cite the George et al. reference above. See the GitHub page, https://github.com/SLeviyang/chromcocluster, for example input files and for source files.
Please do not hesitate to post any questions or comments to https://github.com/SLeviyang/chromcocluster/issues or email Sivan.Leviyang@georgetown.edu.
Obtaining and Using chromcocluster
chromcocluster is available on PyPI for Python 3 (>= 3.6). Installation is easiest through a call to pip.
python3 -m pip install chromcocluster
After installation, import chromcocluster modules into python as described below.
The following packages and external tools are required by chromcocluster:
- Python packages: os, sys, pandas, numpy, random, igraph, matplotlib, seaborn, sknetwork.clustering, scipy.sparse, scipy.stats, multiprocessing, time
- The bedtools suite of utilities must be installed. Download bedtools from https://bedtools.readthedocs.io/en/latest/. Bedtools is written by the Quinlan laboratory at the University of Utah.
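Because the path to the bedtools executable is passed explicitly later on, it can help to locate it first. The following is a minimal sketch using only the Python standard library; the helper name find_bedtools is hypothetical and is not part of chromcocluster.

```python
import shutil


def find_bedtools(candidate="bedtools"):
    """Return the absolute path of the executable if it is on PATH, else None.

    Hypothetical helper for illustration; not part of chromcocluster.
    """
    return shutil.which(candidate)


bedtools_path = find_bedtools()
if bedtools_path is None:
    print("bedtools was not found on PATH; install it or note its full path")
else:
    print("bedtools found at", bedtools_path)
```

The resulting path is the kind of value expected by the bedtools_path argument described below.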
Input Files
chromcocluster co-clustering requires two csv files as input:
- A csv file containing the accessibility matrix. The accessibility matrix must be binary (i.e. all entries 0 or 1) with rows corresponding to genomic loci and columns to cell types. A 0 and 1 represent an inaccessible and accessible locus, respectively. The csv file must contain a header line giving the cell type (i.e. column) names.
- A csv file containing the edges in the tree representing the lineage structure of the cell types. The csv file should contain exactly two columns, with each row describing an edge. Edges are specified by providing the start and end cell type name. The edge list csv file should contain a header line providing column names, but the column names can be chosen by the user.
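As a rough illustration of the two input formats, the sketch below parses toy versions of both files with the standard library and checks that the matrix is binary. The cell type names (HSC, MPP, B, T), the edge list column names, the reading of "start" as the parent, and the helper functions are all assumptions made for this example; they are not taken from the package or its example files.

```python
import csv
import io

# Toy accessibility matrix: header of cell type names, then one row per locus.
matrix_csv = """HSC,MPP,B,T
1,1,0,0
0,1,1,1
1,1,1,1
"""

# Toy edge list: a header with user-chosen column names, then one edge per row.
edge_csv = """parent,child
HSC,MPP
MPP,B
MPP,T
"""


def read_matrix(text):
    """Return (cell type names, matrix rows), checking that all entries are 0 or 1."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    matrix = [[int(value) for value in row] for row in body]
    for row in matrix:
        if len(row) != len(header):
            raise ValueError("row length does not match the header")
        if any(value not in (0, 1) for value in row):
            raise ValueError("accessibility matrix must be binary")
    return header, matrix


def read_edges(text):
    """Return the edges as (start, end) pairs, skipping the header line."""
    rows = list(csv.reader(io.StringIO(text)))
    edges = [tuple(row) for row in rows[1:]]
    if any(len(edge) != 2 for edge in edges):
        raise ValueError("each edge needs exactly two columns")
    return edges


cell_types, matrix = read_matrix(matrix_csv)
edges = read_edges(edge_csv)
print(cell_types, len(matrix), "loci,", len(edges), "edges")
```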
Creating the Accessibility Matrix CSV File
The user can use any workflow to create the accessibility matrix. However, for the sake of convenience, chromcocluster includes a module for constructing an accessibility matrix from a collection of bed files. The bed files must be generated by the user, with one bed file for each cell type. Each bed file provides the accessible, genomic loci for the given cell type.
(A typical workflow to generate the bed files starts with ATACseq generated fastq files, uses an aligner such as bowtie to create bam files, and then uses a peak caller such as MACS2 to call peaks and output a bed file for each cell type. The idr tool (https://github.com/nboley/idr) is useful in merging multiple bed files for a single cell type into a single bed file. See the ENCODE consortium ATACseq workflow for a particular example.)
The bed files must not have a header line and must contain the following information for each locus as columns:
- chr: chromosome (e.g. chr10)
- chrStart: start position of the peak
- chrEnd: end position of the peak
- summit: location of the summit of the peak relative to the chrStart value, using 0-indexing
- q: quality score, assumed to be in -log10 form, so that a larger value is better
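To make the column conventions concrete, here is a small sketch that pulls the five fields out of one tab-separated bed line. The example line is made up and laid out like a MACS2 narrowPeak row, which is an assumption about the user's peak caller; parse_peak and the derived summit_position key are hypothetical names used only for illustration.

```python
def parse_peak(line, columns=(0, 1, 2, 9, 8)):
    """Extract chr, chrStart, chrEnd, summit, and q from one bed line.

    `columns` gives the 0-indexed positions of those five fields, in that order.
    Hypothetical helper; not part of chromcocluster.
    """
    fields = line.rstrip("\n").split("\t")
    chr_i, start_i, end_i, summit_i, q_i = columns
    chr_start = int(fields[start_i])
    summit_offset = int(fields[summit_i])
    return {
        "chr": fields[chr_i],
        "chrStart": chr_start,
        "chrEnd": int(fields[end_i]),
        "summit": summit_offset,  # offset relative to chrStart
        "q": float(fields[q_i]),  # -log10 scale, larger is better
        # Absolute summit coordinate, derived here for convenience.
        "summit_position": chr_start + summit_offset,
    }


# Made-up line in a narrowPeak-like layout (10 tab-separated columns).
line = "chr10\t1000\t1400\tpeak_1\t850\t.\t12.5\t9.1\t7.3\t180"
peak = parse_peak(line)
print(peak["chr"], peak["summit_position"], peak["q"])
```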
The accessibility_matrix module of chromcocluster can be used to generate the accessibility matrix through the following code.
import chromcocluster.accessibility_matrix as am
a = am.accessibility_matrix("peak_files", [0,1,2,9,8])
a.create_master_peak_list("master.bed", window_radius=250)
a.create_accessibility_matrix("master.bed", "matrix.csv", bedtools_path="/opt/local/bin/bedtools")
- The first line after the module import constructs an accessibility_matrix object. In this case, bed files are in the "peak_files" directory and columns 0, 1, 2, 9, and 8 (0-indexed) of each bed file give the chr, chrStart, chrEnd, summit, and q values, respectively. All files in the directory with a .bed suffix are identified and associated with a cell type. The cell type names are the bed file names with the .bed suffix dropped. For example, if a bed file is DC.bed then the corresponding cell type name will be DC.
- The second line calls the create_master_peak_list method, which creates a master list of non-intersecting windows containing all locus summits across all the bed files. All windows are centered at a peak summit and are of size 2*window_radius+1. In this example, every locus summit is enclosed in a 501 base pair window and the method constructs a master list of non-overlapping windows. The non-overlapping windows are then written to a bed file, in this case master.bed.
- The third line constructs the accessibility matrix. The method create_accessibility_matrix takes as arguments the path to the master list bed file (in this case master.bed), the file name to which the matrix should be written in csv format (in this case matrix.csv), and the path to the bedtools executable (see above for the bedtools download link). The output accessibility matrix, in this case matrix.csv, has a row for each non-intersecting window in the master peak list. A particular cell type has a 1 in the row corresponding to a window if one of the cell type's accessible loci intersects with the window.
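The window and intersection logic described above can be sketched in plain Python. This is only an illustration under stated assumptions: the rule used here to resolve overlapping windows (keep the window with the larger q first, then skip any window that overlaps one already kept) is one plausible greedy choice and is not taken from the package, intervals are treated as closed for simplicity, and all function names are hypothetical. The package itself delegates intersection to bedtools.

```python
def summit_window(summit_position, window_radius=250):
    """Closed interval of 2*window_radius+1 bases centered at the summit."""
    return (summit_position - window_radius, summit_position + window_radius)


def overlaps(a, b):
    """True if the closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def greedy_master_windows(peaks, window_radius=250):
    """peaks: list of (chr, summit_position, q). Returns non-overlapping windows.

    Assumed rule (not from the package): visit summits by decreasing q and keep
    a window only if it does not overlap a window that was already kept.
    """
    kept = []
    for chrom, summit, _q in sorted(peaks, key=lambda p: -p[2]):
        window = summit_window(summit, window_radius)
        if not any(c == chrom and overlaps(window, w) for c, w in kept):
            kept.append((chrom, window))
    return sorted(kept)


def accessibility_row(window, peaks_by_cell_type, cell_types):
    """1 for each cell type with at least one peak intersecting the window."""
    chrom, interval = window
    return [
        int(any(c == chrom and overlaps(interval, (start, end))
                for c, start, end in peaks_by_cell_type[cell_type]))
        for cell_type in cell_types
    ]


peaks = [("chr1", 1000, 9.0), ("chr1", 1200, 5.0),
         ("chr1", 2000, 7.0), ("chr2", 1000, 3.0)]
windows = greedy_master_windows(peaks, window_radius=250)

# Made-up per-cell-type peaks given as (chr, chrStart, chrEnd).
peaks_by_cell_type = {"B": [("chr1", 900, 1100)], "T": [("chr1", 3000, 3200)]}
row = accessibility_row(windows[0], peaks_by_cell_type, ["B", "T"])
print(windows)
print(row)
```

In this toy input the summit at position 1200 is dropped because its window overlaps the higher-scoring window around position 1000.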
For convenience, the accessibility_matrix object contains the following fields:
- peaks_list: a list containing each bed file as a pandas DataFrame. The DataFrame contains only the fields chr, chrStart, chrEnd, summit, and q.
- m: the accessibility matrix stored as a numpy array.
- master_peaks: the non-intersecting windows stored as a pandas DataFrame with columns chr, chrStart, chrEnd, and q and rows ordered to match the rows of m.
- cell_type_names: a list containing the cell type names, ordered to match the columns of m and the elements of peaks_list.
Creating the Edge List CSV File
The edge list csv file must be provided by the user. As noted above, cell type names should be used to specify edges. The graph specified by the edge list must be a tree, i.e. all nodes have a single parent except for a root node with no parent.
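Before running the co-clustering, it can be useful to check that an edge list really describes a tree in the sense above. The sketch below does this with the standard library only; it assumes each edge is given as (parent, child), and the function name find_root_if_tree is hypothetical.

```python
def find_root_if_tree(edges):
    """Return the root if the (parent, child) edges form a rooted tree.

    Raises ValueError otherwise. Hypothetical helper; not part of chromcocluster.
    """
    parents = {}
    nodes = set()
    for parent, child in edges:
        nodes.update((parent, child))
        if child in parents:
            raise ValueError(f"{child} has more than one parent")
        parents[child] = parent

    roots = nodes - set(parents)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    root = next(iter(roots))

    # Every node must reach the root by following parent links; a walk that
    # revisits a node means there is a cycle disconnected from the root.
    for start in nodes:
        node, seen = start, set()
        while node != root:
            if node in seen:
                raise ValueError("cycle detected")
            seen.add(node)
            node = parents[node]
    return root


print(find_root_if_tree([("HSC", "MPP"), ("MPP", "B"), ("MPP", "T")]))
```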
Example Input Files
Example input files based on data collected by Yoshida et al. are available for download at https://github.com/SLeviyang/chromcocluster.
- Yoshida et al. The cis-Regulatory Atlas of the Mouse Immune System. Cell. 2019.
The bed files formed from the ATACseq dataset of Yoshida et al. are provided in the peak_files folder. An edge list corresponding to the tree in Figure 1A of Yoshida et al. is provided in the tree_files folder. The results presented in George et al. use these files as input to chromcocluster.
Co-Clustering
The co-clustering algorithm is implemented in the cocluster module.
import chromcocluster.cocluster as cclust
cc = cclust.cocluster("edge_list.csv", "matrix.csv")
cc.locus_cluster(FDR=0.001, min_accessible_cell_types=3, max_accessible_cell_types=None, min_cluster_size=30, outfile="locus_clusters.csv")
cc.cell_type_cluster(k=8, nCPU=1, ntrails=10, outfile="cell_type_clusters.csv")
- The first line after the import constructs a cocluster object using the edge list csv and accessibility matrix csv file paths.
- The second line performs locus (i.e. row) clustering. The FDR argument modulates the degree to which two loci (i.e. rows) are included in the same cluster. A small FDR will lead to many small clusters, while a large FDR will lead to few large clusters. The default is 0.001. Rows with fewer than min_accessible_cell_types 1's in the accessibility matrix are thrown out, allowing the user to focus on loci that are accessible in some minimum number of cell types. Rows with more than max_accessible_cell_types 1's are all grouped in one cluster, allowing the user to group together loci that are accessible across a large range of cell types. Passing None to this parameter will result in the algorithm setting the value to the total number of cell types minus min_accessible_cell_types. min_cluster_size specifies the minimum number of rows that must be in a cluster. The default is 30. The locus clustering information can be saved to file through the outfile argument. If outfile=None then the locus clustering is not saved.
- The third line performs the cell type (i.e. column) clustering. k specifies the number of clusters, nCPU allows for parallelization, and ntrials specifies how many optimization trials are to be run. The more trials run, the more likely the algorithm is to find the optimal clustering, at the price of increased computation time. The default is 10 trials. The cell type clustering information can be saved to file through the outfile argument. If outfile=None then the cell type clustering is not saved.
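The row filtering controlled by min_accessible_cell_types and max_accessible_cell_types can be sketched as follows. This is an illustration of the rules as described in the text, not the package's code; partition_rows and the three returned group names are hypothetical.

```python
def partition_rows(matrix, min_accessible_cell_types=3, max_accessible_cell_types=None):
    """Split row indices into (dropped, clustered, broad) by number of 1's.

    dropped:   fewer than min_accessible_cell_types 1's (thrown out)
    broad:     more than max_accessible_cell_types 1's (grouped in one cluster)
    clustered: everything else (left for the clustering algorithm)
    Hypothetical helper; not part of chromcocluster.
    """
    n_cell_types = len(matrix[0])
    if max_accessible_cell_types is None:
        # Default described in the text: total cell types minus the minimum.
        max_accessible_cell_types = n_cell_types - min_accessible_cell_types

    dropped, clustered, broad = [], [], []
    for i, row in enumerate(matrix):
        n_accessible = sum(row)
        if n_accessible < min_accessible_cell_types:
            dropped.append(i)
        elif n_accessible > max_accessible_cell_types:
            broad.append(i)
        else:
            clustered.append(i)
    return dropped, clustered, broad


# Toy matrix with 6 cell types; with a minimum of 2 the default maximum is 4.
toy = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
]
dropped, clustered, broad = partition_rows(toy, min_accessible_cell_types=2)
print(dropped, clustered, broad)
```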
The cocluster object has the following fields, which are used to access the clustering results.
- locus_clusters: a pandas DataFrame with 2 columns: row and cluster. Each value in the row column corresponds to a row (i.e. locus) in the accessibility matrix, and the cluster column gives the cluster to which the row belongs. Clusters are numbered starting from 0.
- cell_type_clusters: a numpy array providing the cluster number of each cell type. Cell types are ordered according to the column order in the accessibility matrix csv file, or equivalently the cell_type_names field in the accessibility_matrix object.
The cocluster object also has the following fields, which provide more information regarding the clustering and are useful in cluster analysis.
- m_list: a list of numpy matrices partitioning the accessibility matrix according to the row clustering.
- locus_edges: the edges of the graph passed into the Louvain algorithm. These edges are constructed based on the FDR argument passed to the locus_cluster method. The loci (i.e. rows) of the accessibility matrix form the nodes connected by these edges.
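To show how the locus and cell type clusterings together decompose the matrix into a grid of submatrices, here is a sketch on plain Python lists. The inputs mimic the shape of the locus_clusters and cell_type_clusters fields ((row, cluster) pairs and one cluster label per column), but the function cocluster_blocks and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def cocluster_blocks(matrix, locus_clusters, cell_type_clusters):
    """Map (locus cluster, cell type cluster) to the corresponding submatrix.

    locus_clusters: iterable of (row index, cluster) pairs; rows that do not
    appear are simply left out, mirroring unclustered loci.
    cell_type_clusters: one cluster label per matrix column.
    Hypothetical helper; not part of chromcocluster.
    """
    columns = defaultdict(list)
    for j, cluster in enumerate(cell_type_clusters):
        columns[cluster].append(j)

    rows = defaultdict(list)
    for row, cluster in locus_clusters:
        rows[cluster].append(row)

    return {
        (row_cluster, col_cluster): [[matrix[i][j] for j in col_idx] for i in row_idx]
        for row_cluster, row_idx in rows.items()
        for col_cluster, col_idx in columns.items()
    }


toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 0],  # not assigned to any locus cluster below
]
blocks = cocluster_blocks(
    toy,
    locus_clusters=[(0, 0), (1, 0), (2, 1), (3, 1)],
    cell_type_clusters=[0, 0, 1, 1],
)
for key in sorted(blocks):
    print(key, blocks[key])
```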
If both the locus and cell type clustering have been saved to file, then the information can be loaded into a cocluster object to avoid rerunning the clustering:
cc2 = cclust.cocluster("edge_list.csv", "matrix.csv")
cc2.load_clustering(locus_clusters_file="locus_clusters.csv", cell_type_clusters_file="cell_type_clusters.csv")
The cc2 object will then have the clustering information created through the cc object above.
Several important attributes of the co-clustering, and of the locus (i.e. row) clustering in particular, must be kept in mind:
- The intersection of the set of cell type names found in the edge list csv file and in the header line of the accessibility matrix csv file is used to co-cluster. All cell type columns of the accessibility matrix not in the intersection are removed prior to co-clustering. Similarly, all nodes in the tree that are not in the intersection are removed prior to co-clustering. Co-clustering will not proceed if the remaining subset of nodes does not form a tree.
- The locus_clusters field will typically not contain every row (i.e. locus) of the accessibility matrix. If a row is an outlier in its accessibility pattern relative to all other rows, as modulated by the FDR, it will not be included in any cluster. Further, the min_accessible_cell_types parameter setting may throw out many rows (i.e. loci) with low accessibility.
- All rows/loci that are accessible in more cell types than specified by max_accessible_cell_types are grouped in a single cluster, cluster 0.
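The name-matching step in the first point can be sketched as below. The sketch keeps the matrix column order and simply drops edges that touch a removed node; how the package itself prunes the tree is not described here, so that detail is an assumption, as are the function name and toy data.

```python
def restrict_to_shared(matrix_header, matrix, edges):
    """Keep only cell types present in both the matrix header and the edge list.

    Returns (shared names in matrix column order, restricted matrix, restricted edges).
    Hypothetical helper; not part of chromcocluster.
    """
    tree_nodes = {name for edge in edges for name in edge}
    shared = [name for name in matrix_header if name in tree_nodes]
    keep = [matrix_header.index(name) for name in shared]
    sub_matrix = [[row[j] for j in keep] for row in matrix]
    # Assumption: an edge survives only if both of its endpoints survive.
    shared_set = set(shared)
    sub_edges = [(a, b) for a, b in edges if a in shared_set and b in shared_set]
    return shared, sub_matrix, sub_edges


header = ["HSC", "MPP", "B", "NK"]          # NK is missing from the tree
toy = [[1, 1, 0, 1], [0, 1, 1, 0]]
edges = [("HSC", "MPP"), ("MPP", "B"), ("MPP", "T")]  # T is missing from the matrix

shared, sub_matrix, sub_edges = restrict_to_shared(header, toy, edges)
print(shared, sub_matrix, sub_edges)
```

After such a restriction, the remaining edges would still need to form a tree for co-clustering to proceed.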
Visualization and Statistics
The cocluster module contains visualization and statistical methods. After creating a cocluster object and performing locus and cell type clustering, as described above, the following methods can be called:
cc.plot_tree()
cc.heatmap(cocluster_approximation=False, collapse_locus_clusters=False, outfile=None)
- The plot_tree method generates a plot of the cell type tree with the cell types colored according to the cell type clustering. Nodes are labeled according to cell type index and cluster. For example, a label 23-5 means that the node is cell type 23 and is in cluster 5.
- The heatmap method plots the clustered accessibility matrix with rows and columns in order of increasing cluster assignment, thereby allowing for visualization of the locus and cell type co-clustering. If cocluster_approximation is set to True, then all entries within a co-cluster are replaced by the co-cluster average. The co-clustering approximation is analogous to the centroid of each cluster in k-means. If collapse_locus_clusters is True, then the rows of each locus cluster are replaced by a single row with values given by the column means. The heatmap can be saved to a file by providing a file path in the outfile argument.
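The two heatmap transformations can be illustrated on a toy matrix. This is a minimal sketch of the arithmetic described above, assuming every row carries a locus cluster label; the function names mirror the heatmap arguments for readability, but they are hypothetical standalone functions, not the package's implementation.

```python
def cocluster_approximation(matrix, row_clusters, col_clusters):
    """Replace each entry by the mean of its co-cluster (row cluster x column cluster)."""
    sums, counts = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_clusters[i], col_clusters[j])
            sums[key] = sums.get(key, 0) + value
            counts[key] = counts.get(key, 0) + 1
    return [
        [sums[(row_clusters[i], col_clusters[j])] / counts[(row_clusters[i], col_clusters[j])]
         for j in range(len(row))]
        for i, row in enumerate(matrix)
    ]


def collapse_locus_clusters(matrix, row_clusters):
    """Map each locus cluster to one row holding the column means of its rows."""
    collapsed = {}
    for cluster in sorted(set(row_clusters)):
        rows = [row for row, c in zip(matrix, row_clusters) if c == cluster]
        collapsed[cluster] = [sum(column) / len(rows) for column in zip(*rows)]
    return collapsed


toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
row_clusters = [0, 0, 1, 1]
col_clusters = [0, 0, 1, 1]

approx = cocluster_approximation(toy, row_clusters, col_clusters)
collapsed = collapse_locus_clusters(toy, row_clusters)
print(approx)
print(collapsed)
```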
The fraction of total and cell type associated variation captured by the co-clustering can be computed through the method R2.
cc.R2()
The R2 method returns a DataFrame with columns locus_cluster, R2_total, and R2_cell_type. The R2_total and R2_cell_type columns give the fraction of total and cell type associated variance captured by the co-clustering for the given locus cluster. The fraction of the variation is an R-squared value computed through an ANOVA approach in which the accessibility matrix restricted to clustered rows is compared against a matrix in which all entries within each co-cluster are replaced by the mean value of entries in the co-cluster.
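As a rough guide to the kind of quantity involved, the sketch below computes a single generic R-squared, 1 - SS_res/SS_tot, for a whole toy matrix, where the fitted values are the co-cluster means. It does not reproduce the per-locus-cluster breakdown or the distinction between R2_total and R2_cell_type, whose exact definitions are given in George et al.; the function names are hypothetical.

```python
def block_means(matrix, row_clusters, col_clusters):
    """Mean of the entries in each co-cluster, keyed by (row cluster, column cluster)."""
    sums, counts = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_clusters[i], col_clusters[j])
            sums[key] = sums.get(key, 0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def r_squared(matrix, row_clusters, col_clusters):
    """Generic R-squared of the co-cluster-mean approximation (illustrative only)."""
    means = block_means(matrix, row_clusters, col_clusters)
    values = [value for row in matrix for value in row]
    grand_mean = sum(values) / len(values)
    ss_total = sum((value - grand_mean) ** 2 for value in values)
    if ss_total == 0:
        raise ValueError("matrix has no variation, so R-squared is undefined")
    ss_residual = sum(
        (value - means[(row_clusters[i], col_clusters[j])]) ** 2
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
    )
    return 1 - ss_residual / ss_total


toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
r2 = r_squared(toy, row_clusters=[0, 0, 1, 1], col_clusters=[0, 0, 1, 1])
print(round(r2, 3))
```

A value near 1 would indicate that the block means reproduce the matrix almost exactly, while a value near 0 would indicate that the blocks explain little of the variation.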