To pre-process a set of ChIP-seq samples
MAnorm2_utils is designed to coordinate with MAnorm2, an R package for
differential analysis of ChIP-seq signals between two or more groups of
samples. MAnorm2_utils is primarily used for processing a set
of ChIP-seq samples into a regular table recording the read abundances and
enrichment states of a list of genomic bins in each of these samples.
The primary utility of MAnorm2_utils comes from the two scripts bundled
with it, named profile_bins and sam2bed.
Profiling ChIP-seq signals in reference genomic regions
Given the peak regions and mapping positions of reads of each of a set of
ChIP-seq samples, profile_bins derives a list of reference
genomic bins (each being enriched for ChIP-seq signals in at least one of the
samples), and deduces the read count as well as enrichment status of each of
the bins in each sample. Refer to MACS for more information about the
technical terms mentioned above.
We recommend MACS 1.4 for identifying peaks for ChIP-seq samples associated
with narrow genomic regions of reads enrichment (e.g., samples for most
transcription factors and histone modifications like H3K4me3 and H3K27ac). In
fact, although generally applicable, profile_bins is
specifically suited to processing the output files generated by MACS 1.4. For
histone modifications constituting broad enriched domains (e.g., H3K9me3 and
H3K27me3), we recommend SICER as the peak caller.
The following is a sample usage of profile_bins in its simplest form:
    profile_bins --peaks=peak1.bed,peak2.bed \
                 --reads=read1.bed,read2.bed \
                 --labs=s1,s2 -n example
profile_bins only recognizes BED-formatted input files. For read
alignment results stored in SAM files, first use sam2bed to
transform them into BED files before calling profile_bins (BED files
generated by sam2bed have been specifically designed to suit
profile_bins; see also the section below). For BAM-formatted
files, refer to Samtools for converting them into SAM files.
If everything goes smoothly, the command above will generate two files, named
example_profile_bins_log.txt and example_profile_bins.xls,
respectively. The former records the full list of parameter settings for
profile_bins, as well as some summary statistics regarding each
of the supplied ChIP-seq samples. The latter gives the read count and
enrichment status for each deduced reference genomic bin in each sample, and
has a format like the following (data shown here is only for illustration):
To clarify, a genomic bin is “occupied” by a ChIP-seq sample if and only if its middle point is covered by some peak region of the sample.
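
The occupancy rule above can be sketched in a few lines of Python. This is only an illustration, not the code used by profile_bins: the function name `is_occupied`, the BED-style half-open intervals, and the integer midpoint `(start + end) // 2` are all assumptions of the sketch.

```python
def is_occupied(bin_start, bin_end, peaks):
    """Return True if the bin's middle point lies inside any peak.

    Assumes half-open [start, end) intervals on a single chromosome and
    takes the middle point as (start + end) // 2. Both conventions are
    assumptions of this sketch, not documented behavior of profile_bins.
    """
    mid = (bin_start + bin_end) // 2
    return any(start <= mid < end for start, end in peaks)


# Hypothetical peak regions of one sample on one chromosome.
peaks_s1 = [(1000, 1800), (5000, 5600)]

print(is_occupied(1000, 2000, peaks_s1))  # midpoint 1500 lies in (1000, 1800)
print(is_occupied(1500, 3500, peaks_s1))  # midpoint 2500 lies in no peak
```

Note that a bin can overlap a peak without being occupied by it, as in the second call: only the position of the middle point matters.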
profile_bins supports a number of parameters for customizing how reference
genomic bins are deduced and how the reads falling in them are counted.
Type profile_bins --help in the command line for a
complete list of these parameters and a brief description of each of them.
Among others, several parameters deserve specific attention:
profile_bins merges peaks from all the provided ChIP-seq samples into a consensus set of peak regions, and divides up each broad merged peak into consecutive genomic bins. Specify --typical-bin-size to control the size of such genomic bins. Note that merged peaks having a size comparable to this parameter are left untouched.
The default value of --typical-bin-size, which is 2000, suits ChIP-seq samples of histone modifications well. For ChIP-seq samples of transcription factors, setting the parameter to 1000 is recommended.
In cases where summit positions of the supplied peaks are available (e.g., when the peaks are called by using MACS 1.4), you may provide profile_bins with this information by specifying --summits. Summit positions will be used to determine an appropriate start point for dividing up a broad merged peak.
Alternatively, you can directly specify a set of genomic regions as the reference bins to profile, by setting --bins to a BED file. In this case, profile_bins focuses on these provided bins and suppresses the peak merging procedure. --summits is ignored when --bins is specified.
Before being assigned to reference bins, each read (or read pair) is converted into a genomic locus representing the middle point of the underlying DNA fragment. By default, profile_bins treats the supplied reads as single-end, and shifts the 5’ end of each of them downstream by --shiftsize to reach the putative middle point. --shiftsize defaults to 100, and may be set to half of the practical DNA fragment size selected in the library preparation process.
Specify --paired to indicate the reads are paired-end. In this case, the middle point of the underlying DNA fragment associated with each read pair can be accurately inferred. Note that two reads from the same ChIP-seq sample are considered as a read pair only if they have exactly the same name (i.e., the 4th column in a BED file). --shiftsize is ignored when --paired is set.
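
To make the read-to-locus conversion concrete, here is a minimal Python sketch of both cases. It is not taken from profile_bins: the function names are hypothetical, and the coordinate conventions (BED half-open intervals, with the 5’ end of a minus-strand read at `end - 1`) are assumptions.

```python
def single_end_midpoint(start, end, strand, shiftsize=100):
    """Shift the 5' end of a single-end read downstream by shiftsize.

    Assumes BED half-open coordinates: the 5' end of a '+' read is at
    `start`, that of a '-' read at `end - 1`. Downstream means toward
    higher coordinates on '+' and toward lower coordinates on '-'.
    """
    if strand == "+":
        return start + shiftsize
    return end - 1 - shiftsize


def paired_end_midpoint(mate1, mate2):
    """Middle point of the fragment spanned by two mates, each (start, end)."""
    frag_start = min(mate1[0], mate2[0])
    frag_end = max(mate1[1], mate2[1])
    return (frag_start + frag_end) // 2


print(single_end_midpoint(1000, 1050, "+"))             # 1000 + 100
print(single_end_midpoint(1000, 1050, "-"))             # 1049 - 100
print(paired_end_midpoint((1000, 1050), (1150, 1200)))  # (1000 + 1200) // 2
```

With a fragment size of about 200 bp, the single-end estimate on the plus strand and the paired-end midpoint coincide in this example, which is why half the fragment size is a natural choice for the shift.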
--keep-dup controls the program’s behavior regarding duplicate reads (or read pairs) potentially resulting from PCR amplification. For single-end reads, two reads are considered as duplicates if their 5’ ends are mapped to the same genomic locus; for paired-end reads, two read pairs are considered as duplicates if their implied DNA fragments occupy the same genomic interval.
By default, profile_bins preserves all the reads (or read pairs) for the counting procedure. For both paired-end reads and deep-sequencing single-end reads, we strongly recommend setting --keep-dup to 1 to enhance the specificity of downstream analyses. In that case, for each ChIP-seq sample only one read (or read pair) of a set of duplicates is retained for counting. Note also that the output log file records, for each sample, the ratio of reads (or read pairs) that are removed due to --keep-dup.
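
The duplicate definitions above can be illustrated with a short Python sketch that keeps one record per duplicate set and reports the removed ratio. The helper names are hypothetical, and treating strand as part of the single-end key is an assumption of this sketch; the text only states that the 5’ ends must map to the same locus.

```python
def drop_duplicates(records, key):
    """Keep the first record of each duplicate set; also return removed ratio."""
    seen = set()
    kept = []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            kept.append(rec)
    removed_ratio = 1 - len(kept) / len(records) if records else 0.0
    return kept, removed_ratio


def five_prime_key(read):
    """Single-end key: chromosome, 5' end position, and strand (assumed)."""
    chrom, start, end, strand = read
    five_prime = start if strand == "+" else end - 1
    return (chrom, five_prime, strand)


def fragment_key(fragment):
    """Paired-end key: the genomic interval occupied by the fragment."""
    chrom, frag_start, frag_end = fragment
    return (chrom, frag_start, frag_end)


reads = [
    ("chr1", 100, 150, "+"),
    ("chr1", 100, 136, "+"),  # same 5' end as the first read
    ("chr1", 101, 150, "-"),
    ("chr1", 120, 150, "-"),  # same 5' end (149) as the third read
]
kept, ratio = drop_duplicates(reads, five_prime_key)
print(len(kept), ratio)
```

Passing `fragment_key` instead, with records of the form `(chrom, frag_start, frag_end)`, gives the paired-end behavior under the same assumptions.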
profile_bins supports using a configuration file to deliver parameters, to avoid repeated typing in the command line. To do that, write a configuration file following the format demonstrated below, and pass it to --parameters.
    peaks=peak1.bed,peak2.bed
    reads=read1.bed,read2.bed
    labs=s1,s2
    n=example
    summits=summit1.bed,summit2.bed
    paired
    keep-dup=1
--parameters can be used together with the other command-line arguments.
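
When many samples are processed, such a configuration file may be easier to generate than to type. The following Python sketch renders one from a dictionary. It assumes one entry per line, `name=value` for valued parameters, and a bare name for switches such as `paired`, as inferred from the example above; the helper `render_config` is hypothetical and not part of MAnorm2_utils.

```python
def render_config(params):
    """Render parameters as configuration text, one entry per line (assumed).

    True becomes a bare switch name, False is skipped, lists and tuples are
    joined with commas, and any other value is written as name=value.
    """
    lines = []
    for name, value in params.items():
        if value is True:
            lines.append(name)
        elif value is False:
            continue
        elif isinstance(value, (list, tuple)):
            lines.append("{}={}".format(name, ",".join(str(v) for v in value)))
        else:
            lines.append("{}={}".format(name, value))
    return "\n".join(lines) + "\n"


config_text = render_config({
    "peaks": ["peak1.bed", "peak2.bed"],
    "reads": ["read1.bed", "read2.bed"],
    "labs": ["s1", "s2"],
    "n": "example",
    "paired": True,
    "keep-dup": 1,
})
print(config_text, end="")
```

The resulting text can be written to a file and that file passed to --parameters.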
Refer to the Manual of MAnorm2_utils for a full specification of the
parameters supported by profile_bins.
Transforming SAM into BED files
sam2bed is designed to coordinate with
profile_bins, since the
latter only accepts BED-formatted files. The simplest form of calling
sam2bed is as follows:
sam2bed -i File.sam -o File.bed
The program will read from the standard input stream if -i is not
specified.
In the vast majority of cases, the default settings of most of the parameters
of sam2bed should be used.
The only parameter you may want to customize is --min-qual, which controls
the program’s behavior regarding filtering out the SAM alignment records with
a low mapping quality.
Type sam2bed --help in the command line for a brief description of each
parameter supported by sam2bed.
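
For readers curious what such a conversion involves, below is a rough Python sketch of turning one SAM alignment record into a BED line with a mapping-quality filter. It follows the SAM field layout and CIGAR semantics but is not the implementation of sam2bed: the function name, the 6-column output layout, and the `min_qual=1` default are all assumptions of the sketch.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def sam_record_to_bed(line, min_qual=1):
    """Convert one SAM alignment line to a 6-column BED line, or return None.

    Unmapped records and records with MAPQ below min_qual are dropped.
    The default of 1 is a placeholder, not the documented sam2bed default.
    """
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = fields[:6]
    flag, pos, mapq = int(flag), int(pos), int(mapq)
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped or no alignment information
    if mapq < min_qual:
        return None  # filtered out for low mapping quality
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    start = pos - 1  # SAM POS is 1-based; BED start is 0-based
    end = start + ref_len
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(end), qname, str(mapq), strand])


record = "r1\t16\tchr1\t101\t30\t10M2D26M\t*\t0\t0\t" + "A" * 36 + "\t" + "I" * 36
print(sam_record_to_bed(record))
```

The read name is carried into the 4th BED column, which matters for paired-end data because profile_bins pairs reads by that column.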