Denoise sequencing data from DEL screens.

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- POSIX
Programming Language
- Python :: 3

Project description

deldenoiser

Command line tool to remove effects of truncated side-products from read count data of a DNA-encoded library (DEL) screen.

Table of Contents

Summary
Installation
Usage
- Inputs
- Outputs
Documentation

Summary

Sequencing read counts from a DEL screen are used as input. The main output is the list of fitness coefficients for the compounds. For each compound, the fitness coefficient is proportional to the fraction that survives the binding assay. The deldenoiser command line tool carries out the following analysis steps:

1. Estimate tag imbalance factors from pre-selection read counts (only if such data is available).
2. Estimate the fitness of truncated compounds using post-selection read counts, yields and tag imbalance factors.
3. Estimate the fitness of full-cycle compounds using the fitness of truncates.
4. Estimate clean read counts, i.e. the reads originating from the full-cycle products.

It is assumed that the yields of the synthesis reactions are known and that the true fitness vector is sparse, i.e. only a small minority of the DEL compounds have significant binding strength.
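
To make the role of the yields concrete, the following minimal sketch computes what fraction of the material carrying a given DNA tag would be the full-cycle product and what fraction would be each truncate. It assumes that each cycle's reaction succeeds independently with probability equal to its yield; this independence assumption, the function name and the toy numbers are illustrative and are not taken from the deldenoiser code. A failed cycle is marked with 0, mirroring the extended lane index convention described under Outputs.

```python
from itertools import product

def truncation_fractions(lanes, yields):
    """Return {extended lane index combination: expected fraction}.

    lanes:  lane index of each cycle for one tagged compound (1-based)
    yields: yield of the reaction in each of those lanes (0.0 to 1.0)
    A 0 in the returned key marks a cycle whose reaction failed.
    Assumption: cycles succeed or fail independently of each other.
    """
    fractions = {}
    for outcome in product([True, False], repeat=len(lanes)):
        key = tuple(lane if ok else 0 for lane, ok in zip(lanes, outcome))
        fraction = 1.0
        for ok, y in zip(outcome, yields):
            fraction *= y if ok else (1.0 - y)
        fractions[key] = fraction
    return fractions

# Toy compound: lane 3 in cycle 1 (yield 0.5), lane 1 in cycle 2 (yield 0.8).
fractions = truncation_fractions(lanes=[3, 1], yields=[0.5, 0.8])
full_cycle_fraction = fractions[(3, 1)]
```

Under this reading, low yields mean that most reads carrying a tag originate from truncates rather than from the intended full-cycle product, which is why known yields matter for the analysis.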

Note: We use a microfluidics-inspired terminology and refer to the different reactions that are run in parallel in each synthesis cycle as "lanes".

Installation

Option 1: Install into a local Python environment (requires Python 3.6 or higher) from PyPI by running

pip install deldenoiser

Option 2: Install into a local Python environment from GitHub by running

git clone https://github.com/totient-bio/deldenoiser.git
pip install -e ./deldenoiser

Option 3: Build a local Docker image deldenoiser:<commit_hash> by running

git clone https://github.com/totient-bio/deldenoiser.git
cd deldenoiser
make docker_image

Usage

For a complete example, see example/run_deldenoiser_command_line_tool.bash, which reads input files from example/input/ and writes results to example/output/.

Generally, running the command

deldenoiser --design <DEL_design.tsv.gz>  \
            --postselection_readcounts <readcounts_post.tsv.gz>  \
            --output_prefix <prefix> \
            [--dispersion <dispersion>] \
            [--regularization_strength <regularization_strength>] \
            [--yields <yields.tsv.gz>]  \
            [--preselection_readcount <readcounts_pre.tsv.gz>] \
            [--maxiter <maxiter>] \
            [--inner_maxiter <inner_maxiter>] \
            [--tolerance <tol>] \
            [--parallel_processes <processes>] \
            [--minyield <minyield>] \
            [--maxyield <maxyield>] \
            [--F_init <F_init>] \
            [--max_downsteps <max_downsteps>]

produces three files:

<prefix>_fullcycleproducts.tsv.gz
<prefix>_truncates.tsv.gz
<prefix>_tag_imbalance_factors.tsv.gz
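
When the tool is driven from a script, the command can be assembled as an argument list. The sketch below only builds the list and the expected output file names from the synopsis above; the helper name, the input file names and the prefix are hypothetical, and the actual call is left as a comment because it requires deldenoiser to be installed.

```python
def build_command(design, postselection_readcounts, output_prefix, **optional):
    """Assemble a deldenoiser argument list from the documented flags.

    Keyword arguments are passed through as --<name> <value>, so they
    should be flag names taken from the usage synopsis (e.g. maxiter).
    """
    command = [
        "deldenoiser",
        "--design", design,
        "--postselection_readcounts", postselection_readcounts,
        "--output_prefix", output_prefix,
    ]
    for name, value in optional.items():
        command += ["--" + name, str(value)]
    return command

prefix = "results/run1"  # hypothetical output prefix
command = build_command(
    "DEL_design.tsv.gz",        # hypothetical input file names
    "readcounts_post.tsv.gz",
    prefix,
    dispersion=3.0,
    maxiter=20,
)

# The three files the tool is documented to produce for this prefix.
expected_outputs = [
    prefix + suffix
    for suffix in (
        "_fullcycleproducts.tsv.gz",
        "_truncates.tsv.gz",
        "_tag_imbalance_factors.tsv.gz",
    )
]

# To actually run the tool:
#   import subprocess
#   subprocess.run(command, check=True)
```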

Inputs

<DEL_design.tsv>, tab-separated values that encode the number of synthesis cycles and the number of lanes in each cycle, with two columns:
- cycle: cycle index (1,2,... cmax)
- lanes: number of lanes in the corresponding cycle (must be >= 1)
<readcounts_post.tsv>, tab-separated values that encode the read counts obtained from sequencing done after the DEL selection steps, with cmax + 1 columns:
- cycle_1_lane: lane index of cycle 1
- cycle_2_lane: lane index of cycle 2
- ...
- cycle_<cmax>_lane: lane index of cycle cmax
- readcount: number of reads of the DNA tag that identifies the corresponding lane index combination (non-negative integers)
<prefix>, string (that can include the path) to name the output files.
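
As an illustration of the two required input files, here is a minimal sketch that writes a toy design file and a matching read count file as gzip-compressed TSV. It assumes that each file starts with a header row of the column names listed above and that lane indices start at 1; the library size and the read counts are made up purely to give the files a valid shape.

```python
import csv
import gzip
import os
import tempfile
from itertools import product

# Hypothetical toy library: 2 lanes in cycle 1, 3 lanes in cycle 2.
lanes_per_cycle = [2, 3]
cmax = len(lanes_per_cycle)

def write_tsv_gz(path, header, rows):
    """Write a header row followed by data rows as gzip-compressed TSV."""
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)

workdir = tempfile.mkdtemp()
design_path = os.path.join(workdir, "DEL_design.tsv.gz")
post_path = os.path.join(workdir, "readcounts_post.tsv.gz")

# Design file: one row per cycle, giving the number of lanes in that cycle.
design_rows = list(enumerate(lanes_per_cycle, start=1))
write_tsv_gz(design_path, ["cycle", "lanes"], design_rows)

# Read count file: cmax lane columns plus one readcount column,
# one row per lane index combination.
lane_columns = ["cycle_{}_lane".format(c) for c in range(1, cmax + 1)]
combos = product(*(range(1, n + 1) for n in lanes_per_cycle))
readcount_rows = [combo + (10 * i,) for i, combo in enumerate(combos)]
write_tsv_gz(post_path, lane_columns + ["readcount"], readcount_rows)
```

The example files under example/input/ in the repository remain the authoritative reference for the exact layout.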

Optional inputs:

<dispersion>, dispersion parameter for the dispersed Poisson noise, (optional, default: 3.0)
<regularization_strength>, regularization strength parameter, (optional, default: 1.0)
<yields.tsv>, tab-separated values that encode the yields of the reactions during synthesis, with three columns (optional, default: all yields are set to 0.5):
- cycle: cycle index (1,2,... cmax)
- lane: lane index (1,2, ... [number of lanes in the corresponding cycle])
- yield: yield of reaction in the corresponding lane (real number between 0.0 and 1.0)
<readcounts_pre.tsv>, same structure as <readcounts_post.tsv>, but for reads obtained from sequencing done before the DEL selection step (optional, default: sequencing efficiency is assumed to be uniform across all sequences)
<maxiter>: maximum number of coordinate descent iterations during fitting truncates (default = 20)
<inner_maxiter>: maximum number of iterations for each coordinate descent step during fitting truncates (default = 10)
<tol>: tolerance; if the intensity due to truncates changes by less than this between consecutive iterations of coordinate descent, the fitting is stopped before reaching maxiter iterations (default = 0.1)
<processes>: maximum number of parallel processes to start during fitting truncates (default = number of system CPUs)
<minyield>: lowest allowed input yield value; yields lower than this are censored to this level during preprocessing (default = 1e-10)
<maxyield>: highest allowed input yield value; yields higher than this are censored to this level during preprocessing (default = 0.95)
<F_init>: initial value for truncate fitness (default: internal guess is used)
<max_downsteps>: maximum number of allowed iterations in which logL decreases; if it is reached, the optimization terminates (default = 5)
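
The censoring controlled by minyield and maxyield can be pictured as clamping each input yield into the allowed interval. The function below is a minimal sketch of that description using the documented defaults; it is not the tool's own preprocessing code, and the function name is hypothetical.

```python
def censor_yields(yields, minyield=1e-10, maxyield=0.95):
    """Clamp each yield into the closed interval [minyield, maxyield]."""
    return [min(max(y, minyield), maxyield) for y in yields]

# A yield of 0.0 is raised to minyield and a yield of 1.0 is lowered to maxyield;
# values already inside the interval are left unchanged.
censored = censor_yields([0.0, 0.5, 1.0])
```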

Outputs

<prefix>_fullcycleproducts.tsv.gz: tab-separated values containing the results about full-cycle products, each identified by their extended lane index combination. The cmax + 3 columns contain
- cycle_<cid>_lane: lane index of cycle cid = 1,2,... cmax
- fitness: fitness coefficients
- clean_reads: posterior mode of clean reads
Note: Only records corresponding to non-zero input read counts are printed in this file. Compounds with zero observed reads are implicitly assumed to have zero fitness and zero clean reads.
<prefix>_truncates.tsv.gz: tab-separated values encoding the fitness coefficients of the truncates, each identified by their extended lane index combination. The cmax + 1 columns contain
- cycle_<cid>_lane: extended lane index (which can also take the value 0, indicating that the synthesis cycle failed) of cycle cid = 1,2,... cmax
- fitness: fitness coefficients of the truncated compounds
Note: Only records corresponding to truncates that are estimated to have non-zero fitness are printed in this file. The truncates missing from here should be understood to have zero fitness.
<prefix>_tag_imbalance_factors.tsv.gz: tab-separated values containing the estimated tag imbalance factors (bhat) for each cycle and lane. It has 3 columns (the same shape as the optional <yields.tsv[.gz]> input file):
- cycle: cycle index (1,2,... cmax)
- lane: lane index (1,2, ... lmax[c])
- imbalance_factor: imbalance factor of the corresponding cycle and reaction lane
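
Because the full-cycle products file lists only compounds with non-zero input read counts, downstream code has to treat missing records as zeros. The sketch below reads such a file and applies that rule on lookup. It assumes a header row with the column names listed above and ignores any other columns; the function names and the toy file are illustrative only.

```python
import csv
import gzip
import os
import tempfile

def read_fullcycle_products(path, cmax):
    """Return {lane index combination: (fitness, clean_reads)} from a TSV file."""
    lane_columns = ["cycle_{}_lane".format(c) for c in range(1, cmax + 1)]
    results = {}
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            combo = tuple(int(row[name]) for name in lane_columns)
            results[combo] = (float(row["fitness"]), float(row["clean_reads"]))
    return results

def lookup(results, combo):
    """Compounds absent from the file have zero fitness and zero clean reads."""
    return results.get(tuple(combo), (0.0, 0.0))

# Toy stand-in for a real <prefix>_fullcycleproducts.tsv.gz file (made-up values).
toy_path = os.path.join(tempfile.mkdtemp(), "toy_fullcycleproducts.tsv.gz")
with gzip.open(toy_path, "wt", newline="") as handle:
    handle.write("cycle_1_lane\tcycle_2_lane\tfitness\tclean_reads\n")
    handle.write("1\t2\t0.25\t7\n")

results = read_fullcycle_products(toy_path, cmax=2)
present = lookup(results, (1, 2))
absent = lookup(results, (2, 1))
```

The same pattern applies to the truncates file, where records missing from the file are to be read as zero fitness.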

Documentation

The publication "Denoising DNA Encoded Library Screens with Sparse Learning" by Peter Komar and Marko Kalinic provides an exposition of the assumptions behind the statistical model of deldenoiser and reports its performance on synthetic and experimental read count data.
- Preprint on ChemRxiv
- Peer-reviewed publication submitted to ACS Combinatorial Science
API documentation of the deldenoiser Python package can be built by cloning the repository and running the make docs command from the main directory, which contains the Makefile.
Developer's notes can be found at development-notes/deldenoiser-development-notes.pdf

Release history

This version

2.0.0

May 4, 2020

1.0.0

Jan 10, 2020

Download files

Source Distribution

deldenoiser-2.0.0.tar.gz (22.5 kB)

Uploaded May 4, 2020 Source

Built Distribution

deldenoiser-2.0.0-py3-none-any.whl (35.5 kB)

Uploaded May 4, 2020 Python 3

File details

Details for the file deldenoiser-2.0.0.tar.gz.

File metadata

Download URL: deldenoiser-2.0.0.tar.gz
Upload date: May 4, 2020
Size: 22.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.37.0 CPython/3.6.5

File hashes

Hashes for deldenoiser-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`73b0b2878c62c6fd9218b1484fd34b4f74e08660fa942601af968188a0d18fa7`
MD5	`e2bb10f5034732db165759820e2be9af`
BLAKE2b-256	`04a0dd3a8cc8086b635699ed9a59243defdfa66d20348254feaefb99860ed558`

File details

Details for the file deldenoiser-2.0.0-py3-none-any.whl.

File metadata

Download URL: deldenoiser-2.0.0-py3-none-any.whl
Upload date: May 4, 2020
Size: 35.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.37.0 CPython/3.6.5

File hashes

Hashes for deldenoiser-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fbe5f5d78802634818fa1ccb52533b1230d3c7de8d1875e2f4bcc4aa85fca90b`
MD5	`38c08c40a89d24f900fad30f9504ac32`
BLAKE2b-256	`5a91a19749a1af3ea30916fc76e79eea2055a04783a501cb68acaeff4427ca92`
