F-Seq2
Improving the feature density based peak caller with dynamic statistics
Tag sequencing with high-throughput sequencing technologies is used to identify specific sequence features in assays such as DNase-seq, ATAC-seq, ChIP-seq, and FAIRE-seq. To intuitively summarize and display individual sequence data as an accurate and interpretable signal, we developed the original F-Seq (GitHub), a software package that generates a continuous tag sequence density estimation, allows identification of biologically meaningful sites, and produces output that can be displayed directly in the UCSC Genome Browser.
F-Seq2 is a complete rewrite of the original version in Python. We designed a new statistical framework and introduced new features to further improve performance. F-Seq2 implements a dynamic parameter to conduct local statistical analysis with an underlying “continuous” Poisson distribution. By combining the power of the local test and kernel density estimation (KDE), which model the read probability distribution with statistical rigor, we robustly account for local biases and resolve ties that occur when ranking candidate summits, making results suitable for irreproducible discovery rate (IDR) analysis.
Citation:
Zhao, N. & Boyle, A.P. F-Seq2: improving the feature density based peak caller with dynamic statistics. NAR Genom Bioinform. 2021 Feb 23;3(1):lqab012. https://doi.org/10.1093/nargab/lqab012
Installation
Prerequisite: BEDTools.
See here for more details on installing F-Seq2.
Usage
fseq2 [-h] [--version]
{callpeak, callpeak_idr, idr}
Available subcommands
Subcommand | Description |
---|---|
callpeak | F-Seq2 main function to call peaks from alignment results. |
callpeak_idr | Call peaks, then run the IDR framework with recommended settings. |
idr | A wrapper around the IDR package for customized IDR analysis. |
callpeak
Command line input:
treatment_file
REQUIRED argument for fseq2. Treatment file(s) in bam or bed format. If multiple files are specified (separated by spaces), they are treated as one treatment experiment. See here for more details about input format.
-control_file
Control file(s) corresponding to treatment file(s).
-pe
Paired-end mode. If this flag is on, treatment (and control) file(s) are treated as paired-end data, in either BAMPE or BEDPE format. Default is False, which treats all data as single-end. See here for more details about paired-end mode.
-chrom_size_file
A file specifying chrom sizes, where each line has one chrom and its size. This is required if the output signal format is bigwig. Note that if this file is specified, fseq2 only processes the chroms listed in it. Default is False, which processes all chroms but cannot output bigwig.
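As a rough illustration of the file layout described above, the following sketch reads a chrom sizes file into a dictionary. It assumes whitespace-separated columns (chrom name, then size), which is the common `chrom.sizes` convention; the helper name `read_chrom_sizes` and the two example rows are hypothetical and not part of fseq2.

```python
import io


def read_chrom_sizes(handle):
    """Parse lines of '<chrom> <size>' into a {chrom: size} dict."""
    sizes = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, size = line.split()[:2]
        sizes[chrom] = int(size)
    return sizes


# Hypothetical two-line file; a real file lists every chrom to process.
example = io.StringIO("chr1\t248956422\nchr2\t242193529\n")
chrom_sizes = read_chrom_sizes(example)
```

Only the chroms that appear as keys here would be processed when the file is passed with `-chrom_size_file`.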
-o
Output directory path. Default is current directory.
-name
Prefix for all output files. Existing files with the same prefix are overwritten. Default is `fseq2_result`.
-sig_format
Signal format for the reconstructed signal. Available formats are `wig`, `bigwig`, and `np_array`. Note that if `np_array` is chosen, the arrays for each chrom are stored in `NAME_sig.h5` with `chrom` as key, and no Gaussian smoothing is applied. Default is False, which produces no signal output.
-sort_by
Sort peaks and summits by `pValue` or `chromAndStart`. Default is `chromAndStart`.
-standard_narrowpeak
If this flag is on, `NAME_peaks.narrowPeak` is in standard `.narrowPeak` format, which contains the max pValue summit rather than all summits for each peak region.
This is compatible with visualization in the UCSC Genome Browser and convenient for other downstream software.
-v
Verbose output. Default is False.
-f
Fragment size of treatment data. Default is to estimate from data. This determines the shift size, where `offset = fragment_size/2`.
For DNase-seq and ATAC-seq data, set `-f 0`.
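The relation between fragment size and shift can be sketched as below. The direction of the shift (forward-strand reads moved downstream, reverse-strand reads moved upstream) and the rounding to an integer are assumptions made for illustration, not a description of fseq2 internals; the function name `shifted_position` is hypothetical.

```python
def shifted_position(read_start, strand, fragment_size):
    """Shift a read position toward the presumed fragment center."""
    # offset = fragment_size/2, rounded down here (assumption)
    offset = fragment_size // 2
    if strand == "+":
        return read_start + offset  # assumed: forward reads move downstream
    return read_start - offset  # assumed: reverse reads move upstream


# With -f 0 (DNase-seq, ATAC-seq) the offset is zero and positions are unchanged.
unshifted = shifted_position(1000, "+", 0)
```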
-l
Feature length for treatment data. Default is 600. Recommend 50 for TF ChIP-seq, 600 for DNase-seq and ATAC-seq, 1000 for histone ChIP-seq.
-fc
Fragment size of control data.
-t
Threshold (standard deviations) to call candidate summits. Default is 4.0. Recommend 4.0 for broad peaks, 8.0 for sharp peaks.
-p_thr
P value threshold. Default is 0.01. Consider relaxing it to 0.05 when there is no control data or when calling broad peaks.
To resemble F-Seq1 results, specify `-p_thr False`, then filter out peaks whose `signalValue` (7th column in `.narrowPeak`) is below the estimated threshold.
-q_thr
Q value (FDR) threshold. Default is not set, in which case `p_thr` is used. If set, only `q_thr` is used.
-cpus
Number of cores to use. Default is 1.
-tp
Threshold (standard deviations) to call peak regions. Default is 4.0.
-sparse_data
If this flag is on, the statistical test includes a 1k region for more accurate background estimation. This can be useful for single-cell data.
-nfr_upper_limit
Nucleosome free region upper limit. Default is 150. Used as window_size and min_distance when `-f 0`.
-pe_fragment_size_range
Effective only if `-pe` is on. Only PE fragments whose size falls within the range are kept for peak calling. Default is False, without any selection. Useful for ATAC-seq data:
(1) to call peaks on nucleosome free regions, specify: 0 150
(2) to call peaks on nucleosome centers, specify: 150 inf
(3) to call peaks on open chromatin regions, specify: auto
`auto` is a filter designed for ATAC-seq open chromatin peak calling, where we filter out fragments whose sizes correspond to mono-, di-, tri-, and multi-nucleosomes. Size information is taken from the original ATAC-seq paper (Buenrostro et al.). You can design your own auto filter based on specific experiment data by specifying the `-nucleosome_size` parameter.
-nucleosome_size
Effective only if `-pe` is on and `-pe_fragment_size_range auto` is specified. Default is 180, 247, 315, 473, 558, 615. These are the ATAC-seq PE fragment sizes related to mono-, di-, and tri-nucleosomes. Fragments whose size falls within the ranges or above the largest bound (i.e. 615) are filtered out when calling peaks. Change those numbers to design your own auto filter.
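The two kinds of fragment size selection described above can be sketched as follows. This is not fseq2's code: the function names are hypothetical, the six default numbers are read here as three consecutive (lower, upper) pairs, and inclusive bounds are assumed.

```python
import math

# Default -nucleosome_size values, read here as three (lower, upper) pairs (assumption).
DEFAULT_NUCLEOSOME_SIZE = (180, 247, 315, 473, 558, 615)


def keep_in_range(size, lower, upper):
    """Explicit range, e.g. (0, 150) or (150, math.inf); bounds assumed inclusive."""
    return lower <= size <= upper


def keep_auto(size, bounds=DEFAULT_NUCLEOSOME_SIZE):
    """Auto filter: drop nucleosome-sized fragments and anything above the largest bound."""
    if size > bounds[-1]:
        return False  # above the largest bound (615 by default)
    pairs = zip(bounds[0::2], bounds[1::2])
    return not any(low <= size <= high for low, high in pairs)


fragment_sizes = [100, 200, 280, 400, 700]
kept = [s for s in fragment_sizes if keep_auto(s)]
```

Passing a different `bounds` tuple plays the role of a custom `-nucleosome_size` setting in this sketch.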
-prior_pad_summit
Prior knowledge about peak length, which is only padded into `NAME_summits.narrowPeak`. Default is 0.
Useful for IDR analysis: in `callpeak_idr`, it is set to the minimum distance between summits.
-num_peaks
Maximum number of peaks called. Default is not set. If set, overrides `p_thr` and `q_thr`.
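The precedence among the three cut-offs described above (`-num_peaks`, `-q_thr`, `-p_thr`) can be summarized in a few lines. This is only a sketch of the documented behavior; the function `active_filter` is hypothetical, and the special case `-p_thr False` is not modelled.

```python
def active_filter(p_thr=0.01, q_thr=None, num_peaks=None):
    """Return which documented cut-off takes effect, and its value."""
    if num_peaks is not None:
        return ("num_peaks", num_peaks)  # overrides p_thr and q_thr
    if q_thr is not None:
        return ("q_thr", q_thr)  # if set, only q_thr is used
    return ("p_thr", p_thr)  # default: p value threshold of 0.01


default_choice = active_filter()
```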
callpeak_idr
Command line input:
Most arguments are shared between `callpeak` and `callpeak_idr`. Here are the unique ones.
Note whether an argument is prefixed with `-` or `--`: `--` arguments come from the IDR package, and `-` arguments come from fseq2.
treatment_file_1
Treatment file in bam or bed format as replicate 1.
treatment_file_2
Treatment file in bam or bed format as replicate 2.
-control_file_1
Control file in bam or bed format, paired with replicate 1 treatment file.
-control_file_2
Control file in bam or bed format, paired with replicate 2 treatment file.
-name_1
Prefix for output files for replicate 1 (default=`fseq2_result_1`).
-name_2
Prefix for output files for replicate 2 (default=`fseq2_result_2`).
-prior_pad_summit
Prior knowledge about peak length, which is only padded into `NAME_summits.narrowPeak`. Default is the minimum distance between summits.
--idr_threshold
Only return peaks with a global idr threshold below this value. Default: report all peaks.
--soft_idr_threshold
Report statistics for peaks with a global idr below this value, but return all peaks with an idr below `--idr_threshold`. Default: 0.05.
--plot
Plot IDR results. Specify False for no plot. Default is to plot to `NAME_1_NAME_2.png`. A different file name can be specified here.
Note that this differs from the original IDR package, where `--plot` is only a flag.
idr
Command line input and output:
See original IDR documentation.
Notice all single-letter arguments are removed to avoid conflict with fseq2, e.g. there is no `-s`; use `--samples` instead.
Output files and formats
NAME_summits.narrowPeak
BED6+4 format
- chrom
- chromStart
- chromEnd
- name - `NAME_summit_num`, where num is sorted by either `pValue` or `chromAndStart`.
- score - `int(10*-log10(pValue))`.
- strand - `.`
- signalValue - Average treatment signal value given window size.
- pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
- qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
- peak - 0 if `-prior_pad_summit` is not specified.
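The column layout above can be read with a small parser such as the one below. It is a sketch that assumes tab-separated fields (the usual narrowPeak convention); the function names and the example row are hypothetical and not taken from real fseq2 output.

```python
def summit_score(neg_log10_p):
    """Score column as documented, starting from the stored -log10(pValue)."""
    return int(10 * neg_log10_p)


def parse_summit_line(line):
    """Split one NAME_summits.narrowPeak row into named fields."""
    f = line.rstrip("\n").split("\t")
    return {
        "chrom": f[0],
        "chromStart": int(f[1]),
        "chromEnd": int(f[2]),
        "name": f[3],
        "score": int(f[4]),
        "strand": f[5],
        "signalValue": float(f[6]),
        "pValue": float(f[7]),  # stored as -log10(p); -1 if not assigned
        "qValue": float(f[8]),  # stored as -log10(q); -1 if not assigned
        "peak": int(f[9]),
    }


# Hypothetical row for illustration only.
row = "chr1\t1000\t1001\tfseq2_result_summit_1\t50\t.\t12.5\t5.0\t3.2\t0"
summit = parse_summit_line(row)
p_value = 10 ** -summit["pValue"]  # convert back from -log10 scale
```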
NAME_peaks.narrowPeak
Similar to the summit file, except that each row can contain information for multiple summits.
For columns 7-10, if a peak has multiple summits, each column holds a comma-separated list. This behavior can be turned off with `-standard_narrowpeak` to output single-value columns.
- chrom
- chromStart
- chromEnd
- name - `NAME_peak_num`, where num is sorted by either `pValue` or `chromAndStart`.
- score - Max `int(10*-log10(pValue))` of all summits.
- strand - `.`
- signalValue
- pValue
- qValue
- peak - Relative summit position(s) to peak start.
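When a peak row carries several summits, columns 7-10 need to be split before use. The sketch below picks the summit with the largest -log10 pValue, which mirrors the reduction that `-standard_narrowpeak` is described as performing, without claiming to be fseq2's implementation. Tab-separated fields are assumed, and the function name and example row are hypothetical.

```python
def best_summit(peak_line):
    """Pick the summit with the largest -log10 pValue in one peak row."""
    f = peak_line.rstrip("\n").split("\t")
    signal = [float(x) for x in f[6].split(",")]
    p_vals = [float(x) for x in f[7].split(",")]
    q_vals = [float(x) for x in f[8].split(",")]
    offsets = [int(x) for x in f[9].split(",")]
    i = max(range(len(p_vals)), key=p_vals.__getitem__)
    return {
        "chrom": f[0],
        "summit": int(f[1]) + offsets[i],  # peak column is relative to peak start
        "signalValue": signal[i],
        "pValue": p_vals[i],
        "qValue": q_vals[i],
    }


# Hypothetical row with two summits, for illustration only.
row = "chr1\t1000\t1600\tfseq2_result_peak_1\t80\t.\t10.0,14.0\t6.0,8.0\t4.0,6.5\t150,420"
best = best_summit(row)
```

The same function also handles single-value columns, since splitting a value without commas yields a one-element list.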
NAME.bw and NAME.wig
Reconstructed signal files which can be displayed directly in the UCSC Genome Browser.
`bw` is recommended for efficient indexing in the browser.
NAME_sig.h5
Reconstructed signal file without any smoothing. Signal is stored for each chrom as an `np.array` and accessed by the key `chrom`.
For example:
>>> import h5py
>>> with h5py.File('NAME_sig.h5', mode='r') as sig_file:
...     signal = sig_file['chr1'][:]  # read in all signal on chr1
NAME_1_NAME_2_conservative_IDR_thresholded_peaks.narrowPeak and NAME_1_NAME_2.png
Generated by `fseq2 callpeak_idr`. Detailed format information is here.
Examples
DNase-seq data
$ fseq2 callpeak treatment_file.bam -f 0 -l 600 -t 4.0 -v -cpus 10
ATAC-seq data
Paired-end ATAC-seq data, calling peaks on open chromatin regions without calling on nucleosomes
$ fseq2 callpeak treatment_file.bam -f 0 -l 600 -t 4.0 -pe -nfr_upper_limit 150 -pe_fragment_size_range auto
ChIP-seq data
TF ChIP-seq data
$ fseq2 callpeak treatment_file.bed -control_file control_file.bed -l 50 -t 8.0 -sig_format bigwig -chrom_size_file /path/to/hg19.chrom.sizes -v -cpus 5 -o /path/to/fseq2_output_dir -name CTCF_results
IDR pipeline for TF ChIP-seq data
$ fseq2 callpeak_idr treatment_file_rep1.bam treatment_file_rep2.bam -control_file_1 control_file_rep1.bam -control_file_2 control_file_rep2.bam -l 50 -t 8.0 -chrom_size_file /path/to/hg19.chrom.sizes -v -cpus 3 -o /path/to/fseq2_output_dir
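For scripted runs, the commands above can also be assembled from Python. The helper below is a hypothetical convenience wrapper, not part of fseq2; it only uses flags documented in the `callpeak` section and does not execute anything by itself.

```python
import shlex


def callpeak_command(treatment, control=None, feature_length=600,
                     threshold=4.0, cpus=1, verbose=False):
    """Assemble an `fseq2 callpeak` argument list from documented flags."""
    cmd = ["fseq2", "callpeak", treatment]
    if control is not None:
        cmd += ["-control_file", control]
    cmd += ["-l", str(feature_length), "-t", str(threshold), "-cpus", str(cpus)]
    if verbose:
        cmd.append("-v")
    return cmd


# Settings taken from the TF ChIP-seq example above.
cmd = callpeak_command("treatment_file.bed", control="control_file.bed",
                       feature_length=50, threshold=8.0, cpus=5, verbose=True)
command_line = shlex.join(cmd)  # printable form of the command
# To run it: subprocess.run(cmd, check=True), which requires fseq2 on the PATH.
```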
Troubleshooting
1. Install error on macOS Mojave:
fatal error: 'ios' file not found
#include "ios"
Solution:
add `CFLAGS='-stdlib=libc++'` in front of `pip install`
$ CFLAGS='-stdlib=libc++' pip install fseq2
2. Memory error
Solution:
try with fewer CPUs
3. NotImplementedError: "xx" does not appear to be installed or on the path, so this method is disabled. Please install a more recent version of BEDTools and re-import to use this method.
Solution:
update or install bedtools >= 2.29.0, or copy the binaries in `bedtools2/bin/` to either `/usr/local/bin/` or some other directory for commonly used UNIX tools in your environment.
4. Warnings when `-pe` is on
Most likely the bam file is not sorted by name.
Solution:
see here
5. Too few peaks after multi-test correction
This may indicate poor data quality.
Solution:
use `-p_thr` instead of `-q_thr`