
Denoise sequencing data from DEL screens.


deldenoiser

Command line tool to remove effects of truncated side-products from read count data of a DNA-encoded library (DEL) screen.


Summary

Sequencing read counts from a DEL screen are used as input. The main output is the list of fitness coefficients for the compounds. For each compound, the fitness coefficient is proportional to the fraction that survives the binding assay. The deldenoiser command line tool carries out the following analysis steps:

  1. Estimate tag imbalance factors from pre-selection read counts. (Only if such data is available.)

  2. Estimate fitness of truncated compounds using post-selection read counts, yields and tag imbalance factors.

  3. Estimate fitness of full-cycle compounds using fitness of truncates.

  4. Estimate clean read counts, i.e. the reads originating from the full-cycle products.

It is assumed that the yields of the synthesis reactions are known and that the true fitness vector is sparse, i.e. only a small minority of the DEL compounds have significant binding strength.

Note: We use a microfluidics-inspired terminology and refer to the different reactions that are run in parallel in each synthesis cycle as "lanes".
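
To make the notion of truncated side-products concrete, here is a minimal sketch of which products can share one DNA tag. It assumes (our simplification, not a statement of the tool's internal model) that each cycle's reaction succeeds independently with probability equal to its yield, and it writes a failed cycle as lane 0, the convention used by the truncates output file described under Outputs. The function name truncation_mixture and the toy yields are hypothetical.

```python
from itertools import product


def truncation_mixture(tag, yields):
    """Enumerate the products that can carry a given DNA tag.

    tag    : tuple of lane indices, one per cycle, e.g. (2, 1)
    yields : dict mapping (cycle, lane) -> reaction yield in [0, 1]

    Assumption (illustrative only): cycles succeed independently with
    probability equal to their yield; a failed cycle is recorded as lane 0.
    Returns a dict mapping extended lane index combination -> fraction.
    """
    mixture = {}
    for outcome in product((True, False), repeat=len(tag)):
        fraction = 1.0
        extended = []
        for cycle, (lane, succeeded) in enumerate(zip(tag, outcome), start=1):
            y = yields[(cycle, lane)]
            fraction *= y if succeeded else (1.0 - y)
            extended.append(lane if succeeded else 0)
        mixture[tuple(extended)] = fraction
    return mixture


# Toy two-cycle example with made-up yields.
toy_yields = {(1, 2): 0.8, (2, 1): 0.5}
mix = truncation_mixture((2, 1), toy_yields)
for extended, fraction in sorted(mix.items(), reverse=True):
    print(extended, round(fraction, 3))
```

Under these assumptions only the entry without zeros, here (2, 1), is the full-cycle product; the remaining entries are truncates whose reads are attributed to the same tag.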

Installation

Option 1: Install to a local Python environment (requires Python 3.6 or higher) from PyPI by running

pip install deldenoiser

Option 2: Install to a local Python environment from GitHub by running

git clone https://github.com/totient-bio/deldenoiser.git
pip install -e ./deldenoiser

Option 3: Build a local docker image deldenoiser:<commit_hash> by running

git clone https://github.com/totient-bio/deldenoiser.git
cd deldenoiser
make docker_image

Usage

For a complete example, see example/run_deldenoiser_command_line_tool.bash, which reads input files from example/input/ and writes results to example/output/.

Generally, running the command

deldenoiser --design <DEL_design.tsv.gz>  \
            --postselection_readcounts <readcounts_post.tsv.gz>  \
            --output_prefix <prefix> \
            [--dispersion <dispersion>] \
            [--regularization_strength <regularization_strength>] \
            [--yields <yields.tsv.gz>]  \
            [--preselection_readcount <readcounts_pre.tsv.gz>] \
            [--maxiter <maxiter>] \
            [--inner_maxiter <inner_maxiter>] \
            [--tolerance <tol>] \
            [--parallel_processes <processes>] \
            [--minyield <minyield>] \
            [--maxyield <maxyield>] \
            [--F_init <F_init>] \
            [--max_downsteps <max_downsteps>]

produces three files:

  • <prefix>_fullcycleproducts.tsv.gz
  • <prefix>_truncates.tsv.gz
  • <prefix>_tag_imbalance_factors.tsv.gz
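
When the tool is driven from a Python script, the command line can be assembled programmatically. The sketch below only builds the argument list and the expected output file names from the flags and suffixes listed above; build_command and expected_outputs are hypothetical helper names, and the file names are placeholders.

```python
import shlex


def build_command(design, post_counts, prefix, **optional):
    """Assemble a deldenoiser call from the flags documented above.

    Keyword arguments are passed through as '--<name> <value>', so they
    should be spelled exactly like the optional flags in the usage text.
    """
    cmd = ["deldenoiser",
           "--design", design,
           "--postselection_readcounts", post_counts,
           "--output_prefix", prefix]
    for name, value in optional.items():
        cmd += ["--" + name, str(value)]
    return cmd


def expected_outputs(prefix):
    """Names of the three result files written for a given prefix."""
    suffixes = ("fullcycleproducts", "truncates", "tag_imbalance_factors")
    return ["{}_{}.tsv.gz".format(prefix, suffix) for suffix in suffixes]


prefix = "results/run1"  # placeholder output prefix
cmd = build_command("DEL_design.tsv.gz", "readcounts_post.tsv.gz", prefix,
                    yields="yields.tsv.gz", dispersion=3.0)
print(" ".join(shlex.quote(part) for part in cmd))
print(expected_outputs(prefix))

# To execute the command (needs deldenoiser installed and the inputs present):
#   import subprocess
#   subprocess.run(cmd, check=True)
```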

Inputs

  1. <DEL_design.tsv>, tab-separated values that encode the number of synthesis cycles and the number of lanes in each cycle, with two columns:

    • cycle: cycle index (1,2,... cmax)
    • lanes: number of lanes in the corresponding cycle (must be >= 1)
  2. <readcounts_post.tsv>, tab-separated values that encode the read counts obtained from sequencing done after the DEL selection steps, with cmax + 1 columns:

    • cycle_1_lane: lane index of cycle 1
    • cycle_2_lane: lane index of cycle 2
    • ...
    • cycle_<cmax>_lane: lane index of cycle cmax
    • readcount: number of reads of the DNA tag that identifies the corresponding lane index combination (non-negative integers)
  3. <prefix>, string (that can include the path) to name the output files.
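
As an illustration of the two required file layouts, the following sketch writes a toy design file and a toy post-selection read count file using only the Python standard library. The library size and the read counts are made up, and write_tsv_gz is a hypothetical helper. The text above does not say whether zero-count combinations must be listed; this sketch lists every combination.

```python
import csv
import gzip
import os
import tempfile
from itertools import product


def write_tsv_gz(path, header, rows):
    """Write rows as gzip-compressed, tab-separated text with a header line."""
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical toy library: 2 cycles with 2 and 3 lanes, respectively.
lanes_per_cycle = [2, 3]
outdir = tempfile.mkdtemp()

# Design file: one row per cycle (columns: cycle, lanes).
design_path = os.path.join(outdir, "DEL_design.tsv.gz")
design_rows = [(cycle, lanes)
               for cycle, lanes in enumerate(lanes_per_cycle, start=1)]
write_tsv_gz(design_path, ["cycle", "lanes"], design_rows)

# Made-up read counts keyed by lane index combination; others default to 0.
observed = {(1, 1): 12, (1, 3): 4, (2, 2): 57}

# Read count file: one lane column per cycle, followed by readcount.
header = ["cycle_{}_lane".format(c)
          for c in range(1, len(lanes_per_cycle) + 1)] + ["readcount"]
combos = product(*(range(1, n + 1) for n in lanes_per_cycle))
count_rows = [combo + (observed.get(combo, 0),) for combo in combos]
counts_path = os.path.join(outdir, "readcounts_post.tsv.gz")
write_tsv_gz(counts_path, header, count_rows)

print(design_path)
print(counts_path)
```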

Optional inputs:

  1. <dispersion>, dispersion parameter for the dispersed Poisson noise (optional, default: 3.0)

  2. <regularization_strength>, regularization strength parameter (optional, default: 1.0)

  3. <yields.tsv>, tab-separated values that encode the yields of the reactions during synthesis, with three columns (optional, default: all yields are set to 0.5):

    • cycle: cycle index (1,2,... cmax)
    • lane: lane index (1,2, ... [number of lanes in the corresponding cycle])
    • yield: yield of reaction in the corresponding lane (real number between 0.0 and 1.0)
  4. <readcounts_pre.tsv>, same structure as <readcounts_post.tsv>, but for reads obtained from sequencing done before the DEL selection step (optional, default: sequencing efficiency is assumed to be uniform across all sequences)

  5. <maxiter>: maximum number of coordinate descent iterations during fitting truncates (default = 20)

  6. <inner_maxiter>: maximum number of iterations for each coordinate descent step during fitting truncates (default = 10)

  7. <tol>: tolerance; if the intensity due to truncates changes by less than this between consecutive iterations of coordinate descent, the fitting is stopped before reaching maxiter iterations (default = 0.1)

  8. <processes>: max number of parallel processes to start during fitting truncates (default = number of system CPUs)

  9. <minyield>: lowest allowed input yield value; yields lower than this are censored to this level during preprocessing (default = 1e-10)

  10. <maxyield>: highest allowed input yield value; yields higher than this are censored to this level during preprocessing (default = 0.95)

  11. <F_init>: initial value for truncate fitness (default: internal guess is used)

  12. <max_downsteps>: maximum number of allowed iterations in which logL is decreasing. If it is reached, the optimization terminates. (default = 5)
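
The effect of the minyield and maxyield options on the input yields can be pictured as a simple clamp. This is a sketch of the behavior as described above, using the stated defaults; it is not taken from the package source, and clip_yield is a hypothetical name.

```python
def clip_yield(value, minyield=1e-10, maxyield=0.95):
    """Clamp a yield into [minyield, maxyield], as described for preprocessing."""
    return min(max(value, minyield), maxyield)


# Made-up raw yields keyed by (cycle, lane).
raw_yields = {(1, 1): 0.0, (1, 2): 0.7, (2, 1): 1.0}
clipped = {key: clip_yield(value) for key, value in raw_yields.items()}
for key in sorted(clipped):
    print(key, clipped[key])
```

Read this way, a reported yield of exactly 0.0 or 1.0 would be pulled to 1e-10 or 0.95 before fitting.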

Outputs

  1. <prefix>_fullcycleproducts.tsv.gz: tab-separated values containing the results about full-cycle products, each identified by their extended lane index combination. The cmax + 3 columns contain

    • cycle_<cid>_lane: lane index of cycle cid = 1,2,... cmax
    • fitness: fitness coefficients
    • clean_reads: posterior mode of clean reads

    Note: Only records corresponding to non-zero input read counts are printed in this file. Compounds with zero observed reads are implicitly assumed to have zero fitness and zero clean reads.
  2. <prefix>_truncates.tsv.gz: tab-separated values encoding the fitness coefficients of the truncates, each identified by its extended lane index combination. The cmax + 1 columns contain

    • cycle_<cid>_lane: extended lane index of cycle cid = 1,2,... cmax (an extended lane index can also take the value 0, indicating that the synthesis cycle failed)
    • fitness: fitness coefficients of the truncated compounds

    Note: Only records corresponding to truncates that are estimated to have non-zero fitness are printed in this file. Truncates missing from the file should be understood to have zero fitness.
  3. <prefix>_tag_imbalance_factors.tsv.gz: tab-separated values containing the estimated tag imbalance factors (bhat) for each cycle and lane. It has 3 columns (the same shape as the optional <yields.tsv[.gz]> input file):

    • cycle: cycle index (1,2,... cmax)
    • lane: lane index (1,2, ... lmax[c])
    • imbalance_factor: imbalance factor of the corresponding cycle and reaction lane
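
The result files are plain gzip-compressed TSV, so they can be inspected with the Python standard library. The sketch below ranks full-cycle products by fitness; because no real output is available here, it first writes a stand-in file with made-up numbers that uses the column names listed above. read_tsv_gz and top_by_fitness are hypothetical helper names.

```python
import csv
import gzip
import os
import tempfile


def read_tsv_gz(path):
    """Load a gzip-compressed TSV file as a list of dicts keyed by column name."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def top_by_fitness(rows, n=10):
    """Return the n rows with the largest fitness coefficient."""
    return sorted(rows, key=lambda row: float(row["fitness"]), reverse=True)[:n]


# Stand-in for <prefix>_fullcycleproducts.tsv.gz; all values are made up.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_fullcycleproducts.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    handle.write("cycle_1_lane\tcycle_2_lane\tfitness\tclean_reads\n"
                 "1\t1\t0.2\t3\n"
                 "1\t3\t1.5\t40\n"
                 "2\t2\t0.0\t0\n")

rows = read_tsv_gz(demo_path)
best = top_by_fitness(rows, n=2)
for row in best:
    print(row["cycle_1_lane"], row["cycle_2_lane"],
          row["fitness"], row["clean_reads"])
```

Any columns beyond those used here are kept in each row dict untouched, so the same reader can be pointed at the truncates and tag imbalance files as well.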

Documentation

  • The publication "Denoising DNA Encoded Library Screens with Sparse Learning" by Peter Komar and Marko Kalinic describes the assumptions behind the statistical model of deldenoiser and reports its performance on synthetic and experimental read count data.

    • Preprint on ChemRxiv
    • Peer-reviewed publication submitted to ACS Combinatorial Science
  • API documentation of the deldenoiser Python package can be built by cloning the repository and running the make docs command from the main directory, which contains the Makefile.

  • Developer's notes can be found at development-notes/deldenoiser-development-notes.pdf



