Skip to main content

Correct Read Counts and Analysis of Differently Expressed Regions

Project description

CRADLE

CRADLE (Correcting Read counts and Analysis of DifferentiaLly Expressed regions) is a package that was developed to analyze STARR-seq data. CRADLE removes technical biases from sonication, PCR, mappability and G-quadruplex sturcture, and generates bigwig files with corrected read counts. CRADLE then uses those corrected read counts and detects both activated and repressed enhancers. CRADLE will help find enhancers with better accuracy and credibility.

Please see [CHANGELOG.md] to see what has changed in CRADLE.

DISCLAIMER

CRADLE callPeak subcommand is designed to call peaks using the read counts 'CORRECTED' by either correctBias or correctBias_stored subcommand in CRADLE. CRADLE callPeak subcommand assumes read counts follow a gaussian distribution, so it might not be ideal to use for uncorrected read counts.

Installation

You can install CRADLE either with using pip or git repository.

  1. Using pip
pip install cradle

Recommend to install the newest version.

  1. Using git repository
git clone https://github.com/ReddyLab/CRADLE.git
make install

or, alternatively

git clone https://github.com/ReddyLab/CRADLE.git
pip install build # If the 'build' package isn't already installed
python -m build # Build cradle
pip install dist/*.whl # Install cradle

Dependencies

CRADLE requires

- numpy (>= 1.14.3)
- argparse (>= 1.1)
- py2bit (>= 0.3.0)
- pyBigWig (>= 0.3.11)
- statsmodels (>= 0.8.0)
- scipy (>= 1.0.1)
- matplotlib (>= 1.5.3)
- h5py (>= 2.6.0)
- Cython (>= 0.24.1)

These dependencies will be automatically installed when you install CRADLE either with pip or git repository.

Commands

cradle <correctBias | correctBias_stored | callPeak | normalize | covariates> [options]

1) correctBias

This command corrects technical biases (shear, PCR, mappability, G-quadruplex) from read counts. This command takes bigwig files as input and outputs bigwig files with corrected read counts. Not recommended to run this command when you have large regions (see 'correctBias_stored' below).

Example of running correctBias:

cradle correctBias -ctrlbw ctrl1.bw ctrl2.bw ctrl3.bw
                   -expbw exp1.bw exp2.bw exp3.bw
                   -l 500
                   -r /data/YoungSook/target_region.bed
                   -biasType shear pcr map gquad
                   -genome /data/YoungSook/hg38.2bit
                   -kmer 50
                   -o /data/YoungSook/CRADLE_result
                   -bl /data/YoungSook/blacklist_regions.bed
  • Required Arguments
    • -ctrlbw
      Ctrl bigwig files. Un-normalized files are recommended. Each file name should be spaced
    • -expbw
      Experimental bigwig files. Un-normalized files are recommended. Each file name should be spaced.
    • -l
      Fragment length
    • -r
      Text file that shows regions of analysis. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. ex) chr1\t100\t3000
    • -biasType
      Type of biases you want to correct among 'shear', 'pcr', 'map', 'gquad'. If you want to correct 'shear' and 'pcr' bias, you should type -biasType shear pcr. If you type map, -mapFile and -kmer are required. If you type gquad, -gquadFile is required
    • -genome
      A genome sequence, in .2bit format. For information on downloading a genome go to https://hgdownload.soe.ucsc.edu/downloads.html and select the "Genome sequence files and select annotations (2bit, GTF, GC-content, etc)" link for the genome you want. Then see sections "Files" and "How to Download"

    • -faFile
      The same as -genome. This argument is deprecated and may be removed in a future release. Please use -genome instead.

  • Optional Arguments
    !! Warning !! Some optional arguments are required depending on what you put in required arguments.
    • -binSize
      The size of bin (bp) for correction. If you put '1', it means you want to correct read counts in single-bp resolution. (default=1)
    • -mi
      The minimum number of fragments. Positions that have less fragments than this value are filtered out. default=5 * the number of samples
    • -mapFile
      Mappability file in bigwig format. Required when 'map' is in '-biasType'. See 'Reference' if you want to download human mappability files (36mer, 50mer, 100mer for hg19 and hg38).
    • -kmer
      The length of sequencing reads. If you have paired-end sequencing with 50mer from each end, type '50'. Required when 'map' is in '-biasType'
    • -gquadFile
      Gquadruplex files in bigwig format. Multiple files are allowed. Required when 'gquad' is in '-biasType'. See 'Reference' if you want to download human Gquadruplex files (hg19 and hg38).
    • -o
      Output directory. All corrected bigwig files will be stored here. If the directory doesn't exist, cradle will make the directory. default=CRADLE_correctionResult.
    • -p
      The number of cpus. default=(available cpus)/2
    • -bl
      Text file that shows regions you want to filter out. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. ex) chr1\t1\t100
    • -norm
      Whether normalization is needed for input bigwig files. Choose either 'True' or 'False'. default=True
    • -generateNormBW
      If you want to generate normalized observed bigwig files, type 'True' (only works when '-norm True'). If you don't want, type 'False'. default=False
    • -rngSeed
      Set the seed value for the RNG. This enables repeatable runs. default=None

2) correctBias_stored

This command corrects technical biases (shear, PCR, mappability, G-quadruplex) from read counts when there are precomputed covariate files(.hdf). This command takes bigwig files as input and outputs bigwig files with corrected read counts. This command is much faster than running correctBias. Using 'correctBias_stored' is highly recommended when you have large regions, especially if you have whole genome data. You can download covariate files from synpase.

Example of running correctBias_stored:

cradle correctBias_stored -ctrlbw ctrl1.bw ctrl2.bw ctrl3.bw
                          -expbw exp1.bw exp2.bw exp3.bw
                          -r /data/YoungSook/target_region.bed
                          -biasType shear pcr map gquad
                          -covariDir /data/YoungSook/hg39_fragLen500_kmer50
                          -genome /data/YoungSook/hg38.2bit
                          -kmer 50
                          -o /data/YoungSook/CRADLE_result
                          -bl /data/YoungSook/blacklist_regions.bed
  • Required Arguments

    • -ctrlbw
      Ctrl bigwig files. Un-normalized files are recommended. Each file name should be spaced
    • -expbw
      Experimental bigwig files. Un-normalized files are recommended. Each file name should be spaced.
    • -r
      Text file that shows regions of analysis. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. ex) chr1\t100\t3000
    • -biasType
      Type of biases you want to correct among 'shear', 'pcr', 'map', 'gquad'. If you want to correct 'shear' and 'pcr' bias, you should type -biasType shear pcr.
    • -covariDir
      The directory of hdf files that have covariate values. The directory name of covariate files should be 'refGenome_fragLen(fragment length)_kmer(the length of sequenced reads)' ex) hg38_fragLen300_kmer36
    • -genome
      A genome sequence, in .2bit format. For information on downloading a genome go to https://hgdownload.soe.ucsc.edu/downloads.html and select the "Genome sequence files and select annotations (2bit, GTF, GC-content, etc)" link for the genome you want. Then see sections "Files" and "How to Download"

    • -faFile
      The same as -genome. This argument is deprecated and may be removed in a future release. Please use -genome instead.

  • Optional Arguments

    • -mi
      The minimum number of fragments. Positions that have less fragments than this value are filtered out. default=5 * the number of samples
    • -o
      Output directory. All corrected bigwig files will be stored here. If the directory doesn't exist, cradle will make the directory. default=CRADLE_correctionResult.
    • -p
      The number of cpus. default=(available cpus)/2
    • -bl
      Text file that shows regions you want to filter out. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. ex) chr1\t1\t100
    • -norm
      Whether normalization is needed for input bigwig files. Choose either 'True' or 'False'. default=True
    • -generateNormBW
      If you want to generate normalized observed bigwig files, type 'True' (only works when '-norm True'). If you don't want, type 'False'. default=False
    • -rngSeed
      Set the seed value for the RNG. This enables repeatable runs. default=None

3) callPeak

This command calls activated and repressed peaks with using corrected bigwig files as input.

Example of running callPeak:

cradle callPeak -ctrlbw ctrl1_corrected.bw ctrl2_corrected.bw ctrl3_corrected.bw
                -expbw exp1_corrected.bw exp2_corrected.bw exp3_corrected.bw
                -r /data/YoungSook/target_region.bed
                -fdr 0.05
                -o /data/YoungSook/CRADLE_peakCalling_result
  • Required Arguments

    • -ctrlbw
      Ctrl bigwig files. Corrected bigwig files are recommended. Each file name should be spaced
    • -expbw
      Experimental bigwig files Corrected bigwig files are recommended.. Each file name should be spaced.
    • -r
      Text file that shows regions of analysis. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. ex) chr1\t100\t3000
    • -fdr
      FDR control

  • Optional Arguments

    • -o
      Output directory. All corrected bigwig files will be stored here. If the directory doesn't exist, cradle will make the directory. default=CRADLE_peak_result.
    • -bl
      Text file that shows regions you want to filter out. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. ex) chr1\t1\t100
    • -rbin
      The size of a bin used for defining regions. rbin cannot be smaller than wbin. default = 300
    • -wbin
      The size of a bin used for testing differential activity. wbin cannot be larger than rbin. default = rbin/6
    • -p
      The number of cpus. default=(available cpus)/2
    • -d
      The minimum distance between peaks. Peaks distanced less than this value(bp) are merged. default=10
    • -pl
      Minimum peak length. Peaks with smaller size than this value are filtered out. default=wbin
    • -stat
      Choose a statistical testing: 't-test' for t-test and 'welch' for welch's t-test default=t-test
    • -normCtrlbw
      Normalized observed ctrl bigwig files. The bigwigs normalized from CRADLE (using -generateNormBW in either correctBias or correctBias_stored subcommand) are recommended. If you use this parameter along with -normExpbw, CRADLE will report pseudo log2 fold change in the output
    • -normExpbw
      Normalized observed experimental bigwig files. The bigwigs normalized from CRADLE (using -generateNormBW in either correctBias or correctBias_stored subcommand) are recommended. If you use this parameter along with -normCtrlbw, CRADLE will report pseudo log2 fold change in the output

4) Normalize

This command normalizes samples across different samples (accounting for sequencing depth) and different regions. This command should be used for data that has uneven coverage resulting from any other reasons than biases. For example, STARR-seq from BACs can have different coverage for each BAC region and overlapping BAC regions. This can cause correctBias or correctBias_stored to not efficiently model biases.

Example of running Normalize:

cradle normalize -ctrlbw ctrl1_corrected.bw ctrl2_corrected.bw ctrl3_corrected.bw
                -expbw exp1_corrected.bw exp2_corrected.bw exp3_corrected.bw
                -r /data/YoungSook/target_region.bed
                -o /data/YoungSook/CRADLE_normalize_result
  • Required Arguments

    • -ctrlbw
      Ctrl bigwig files. Corrected bigwig files are recommended. Each file name should be spaced
    • -expbw
      Experimental bigwig files Corrected bigwig files are recommended.. Each file name should be spaced.
    • -r
      Text file that shows regions of analysis. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. If you are suing BACs, please provide the coordinates of BACs without merging regions. ex) chr1\t100\t3000
  • Optional Arguments

    • -o
      Output directory. All corrected bigwig files will be stored here. If the directory doesn't exist, cradle will make the directory. default=CRADLE_normalization.
    • -p
      The number of cpus. default=(available cpus)/2

5) covariates

This command calculates covariate values used when correcting technical biases with the correctBias_stored command. You can use it to calculate covariates for only the biases you need and the regions you need.

Example of running correctBias:

cradle covariates -l 500
                  -r /data/YoungSook/target_region.bed
                  -bl /data/YoungSook/blacklist_regions.bed
                  -biasType shear pcr map gquad
                  -mapFile /data/YoungSook/uniq_hg38_mappability_50mer.bw
                  -kmer 50
                  -gquadFile /data/YoungSook/gquad/GSE63874_Na_K_PDS_plus_hits_intersect_hg38_uniq_K.bw /data/YoungSook/gquad/GSE63874_Na_K_PDS_minus_hits_intersect_hg38_uniq_K.bw
                  -genome /data/YoungSook/hg38.2bit
                  -o /data/YoungSook/hg38_fragLen1000_kmer100
  • Required Arguments
    • -l
      Fragment length
    • -r
      Text file that shows regions of analysis. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. ex) chr1\t100\t3000
    • -biasType
      Type of biases you want to correct among 'shear', 'pcr', 'map', 'gquad'. If you want to correct 'shear' and 'pcr' bias, you should type -biasType shear pcr. If you type map, -mapFile and -kmer are required. If you type gquad, -gquadFile is required
    • -genome
      A genome sequence, in .2bit format. For information on downloading a genome go to https://hgdownload.soe.ucsc.edu/downloads.html and select the "Genome sequence files and select annotations (2bit, GTF, GC-content, etc)" link for the genome you want. Then see sections "Files" and "How to Download"

    • -faFile
      The same as -genome. This argument is deprecated and may be removed in a future release. Please use -genome instead.

  • Optional Arguments
    !! Warning !! Some optional arguments are required depending on what you put in required arguments.
    • -mapFile
      Mappability file in bigwig format. Required when 'map' is in '-biasType'. See 'Reference' if you want to download human mappability files (36mer, 50mer, 100mer for hg19 and hg38).

    • -kmer
      The length of sequencing reads. If you have paired-end sequencing with 50mer from each end, type '50'. Required when 'map' is in '-biasType'

    • -gquadFile
      Gquadruplex files in bigwig format. Multiple files are allowed. Required when 'gquad' is in '-biasType'. See 'Reference' if you want to download human Gquadruplex files (hg19 and hg38).

    • -o
      Output directory. All calculated covariate files will be stored here. If the directory doesn't exist, cradle will make the directory. default=CRADLE_covariates. The output covariate files will be named {output_directory}_{chromosome}.hdf5 (e.g., CRADLE_covariates_chrX.hdf5)

      Note that, to make the files compatible with the CRADLE correctBias_stored step, the directory should be named {genome}_fragLen{fragment length}_kmer{sequencing read count}. For example, hg38_fragLen1000_kmer100.

    • -p
      The number of cpus. default=(available cpus)/2

    • -bl
      Text file that shows regions you want to filter out. Each line in the text file should have chromosome, start site, and end site that are tab-spaced. ex) chr1\t1\t100

Output files

1) correctBias and correctBias_stored.

  1. Corrected bigwigs files of which file name has '_corrected' in the suffix. The number of generated corrected bigwigs files will be the same as the total number of bigwigs files used as input (this includes both control and experimental bigwigs).
  2. PNG files that shows fitting of the model with a subset of traning data. The number on the right bottom is Pearson's coefficient.

2) callPeak

You will get 'CRADLE_peak' as a result file which has the following format:

chr	start	end	name	score	strand	effectSize	inputCount	outputCount	-log(pvalue)	-log(qvalue)	cohen's_d	peusdoLog2FC
chr10	68885501	68885751	chr10:68885501-68885751	.	.	4459	7225	11684	4.88	3.71	4.013389130730448	0.24524608901230838
chr10	68893301	68893426	chr10:68893301-68893426	.	.	6610	20809	27419	3.99	3.08	3.2087396562360446	0.21340296136659967
chr10	68893676	68894161	chr10:68893676-68894161	.	.	-4293	6600	2306	5.86	4.71	-4.852417753392507	-0.38532390381884846
chr10	68898131	68898381	chr10:68898131-68898381	.	.	-7828	21206	13377	5.41	4.51	-4.922941786184907	-0.35940926914056587
.
.
  • The 1st-3rd columns(chr, start, end): genomic coordiantes
  • The 4th,5th colum (score, strand): not applicable in CRADLE
  • The 6th colum (effectSize): effect size calculated by subtracting the mean of experimental read counts from the mean of control read counts.
  • The 7th colum (inputCount): the mean of control read counts.
  • The 8th colum (outputCount): the mean of experimental read counts.
  • The 9-10th colum (-log(pvalue), -log(qvalue)): -log10 of p value and q value. If a p value is zero, we used the maximum of -log(pvalue) values out of the total peaks. The same applies for q values.
  • The 11th column: Cohen's D, standarized effect size. This column will have 'nan' values in the case where there is only one replicate in either -ctrlbw or -expbw.
  • The 12th column: peusdo log2 fold change. This value is calculated by adding an constant to inputCount and outputCount to make the resulting values positive. The constant is determined by taking the max(normalizedInputCount-inputCount, normalizedOutputCount-outputCount). Here, normalizedInputCount and normalizedOutputCount are calculated by using normalized observed control bigwigs and experimental bigwigs—those bigwigs are from optional arguments called -normCtrlbw and -normExpbw, respectively.

3) Normalize

Normalized bigwigs files of which file name has '_normalized' in the suffix. The number of generated corrected bigwigs files will be the same as the total number of bigwigs files used as input (this includes both control and experimental bigwigs).

How to download covariate files

We uploaded pre-computed covariates files for human genome (hg19, hg38). Those files are required to run "correctBias_stored"

  1. Go to synapse
  2. Register with synapse. (You cannot download the files unless you register)
  3. Search covariate files with SynapseID, syn20369503.

How to download human mappability and gquadruplex files

We liftover mappability files[2] and G-quadruplex files[3] from hg19 to hg38. You can download both hg19 and hg38 mappability files and G-quadruplex files.

  1. Go to synapse
  2. Register with synapse. (You cannot download the files unless you register)
  3. Search covariate files with SynapseID, syn20369496.

Tips on running CRADLE

  • We strongly recommend using correctBias_stored when you have large regions because running correctBias might take a long time, especially when the fragment size is over 500. Small differences in fragment and sequenced lengths don't significantly affect correction power, so we recommend downloading covariate files from syanpse and runnning correctBias_stored if you can find fragment and sequenced lengths that are close to your data.

Building a Singularity Image

Singularity Is a container system created for use with scientific and research software. A singularity image definition file is included so users can build their own images.

Using the Singularity Image

Using a singularity image should be much like using standard cradle. The main difference is that you may need to bind the directories containing input data are in if they aren't in your home directory. For example:

singularity run --bind /data cradle.sif correctBias_stored \
    -ctrlbw /data/ctrl_unnormalized.bw \
    -expbw /data/unnormalized/exp_unnormalized.bw \
    -r /data/ref_genome/regionfile_correctionModel \
    -biasType shear pcr map gquad \
    -genome /data/ref_genome/hg38/hg38.2bit \
    -covariDir /data/covariateFiles/hg38_fragLen1000_kmer50 \
    -mi 125 \
    -p 10 \
    -o corrected_bigwigs_50mer/ \
    -bl /data/ref_genome/hg38_filter_out.bed

However, if all your data is in a directory singularity mounts by default you can run the container like an execumtable:

./cradle.sif correctBias_stored \
    -ctrlbw ctrl_unnormalized.bw \
    -expbw unnormalized/exp_unnormalized.bw \
    -r ref_genome/regionfile_correctionModel \
    -biasType shear pcr map gquad \
    -genome ref_genome/hg38/hg38.2bit \
    -covariDir covariateFiles/hg38_fragLen1000_kmer50 \
    -mi 125 \
    -p 10 \
    -o corrected_bigwigs_50mer/ \
    -bl ref_genome/hg38_filter_out.bed

References

  1. DNAShape
    Zhou T, Yang L, Lu Y, Dror I, Dantas Machado AC, Ghane T, Di Felice R, Rohs R.DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Res. 2013 Jul;41(Web Server issue):W56-62.
  2. Mappability
    Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, Ribeca P. Fast computation and applications of genome mappability. PLoS One. 2012;7(1):e30377.
  3. G-quadruplex sturcture
    Chambers VS, Marsico G, Boutell JM, Di Antonio M, Smith GP, Balasubramanian S. High-throughput sequencing of DNA G-quadruplex structures in the human genome.Nat Biotechnol. 2015 Aug;33(8):877-81.

Cite CRADLE

Kim YS, Johnson GD, Seo J, Barrera A, Cowart TN, Majoros WH, Ochoa A, Allen AS, Reddy TE. Correcting signal biases and detecting regulatory elements in STARR-seq data. Genome Res. 2021 May;31(5):877-889. doi: 10.1101/gr.269209.120. Epub 2021 Mar 15. PMID: 33722938; PMCID: PMC8092017.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

CRADLE-0.30.0.tar.gz (1.2 MB view hashes)

Uploaded Source

Built Distribution

CRADLE-0.30.0-cp39-cp39-macosx_10_9_x86_64.whl (851.7 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page