Skip to main content

Estimating Long-Read Coverage for Whole-Genome Sequencing

Project description


Illustration


Simulation-guided estimation of local coverage for long-read DNA sequencing (esloco)

esloco is a flexible tool for estimating local coverage in long-read DNA sequencing experiments, designed to support both a priori experimental planning and post hoc interpretation of sequencing results.
Publication (not available yet)

Summary

Illustration

The illustration above demonstrates the workflow of esloco, which iteratively uses repeated simulations to estimate DNA sequencing coverage over target regions. The process involves two main steps:

  1. Generating target coordinates: Barcoded genomes are generated by converting the reference genome into a single coordinate space. Randomized insertions ([I] mode) or user-defined regions of interest ([ROI] mode) are assigned as target regions.

  2. Simulating reads: For each barcode, reads are drawn from a user-defined or empirical read length distribution and randomly placed until the desired whole-genome coverage is reached. Reads falling into blocked regions are excluded.

  3. Calculating target coverage: Local coverage for each target is calculated based on on-target bases of overlapping reads.

[!NOTE] The simulation assumes a uniform placement of long-reads during DNA sequencing. Referring to the Monte-Carlo method, esloco is based on the idea that random sampling can approximate deterministic outcomes, with target coverage primarily influenced by read length distribution and whole-genome coverage.


Installation

The simulation and all its dependencies can be installed from PyPI using pip via

pip install esloco

or by cloning this repository and manually installing the dependencies mentioned in the .toml file.

Usage

Run the simulation on the command line using

esloco --config <configfile>

After the simulation, some basic plots and an interactive overview can be created automatically from the simulation's output tables using

plot_esloco --config <configfile>

Configuration File

Below are the configuration options for the simulation, divided into three sections: [COMMON], [ROI], and [I]. Mandatory fields marked with (*).

[COMMON]

Option Description Default
mode Running mode; can be I or ROI. Corresponding section must be provided in the same configuration file. (*)
reference_genome_path Path to the reference genome. (*)
sequenced_data_path Custom FASTA/FASTQ file with reads. None
output_path Output path for simulation results. ./output/
experiment_name Name of the experiment. default_experiment
output_path_plots Output path for plots. output_path
min_overlap_for_detection Minimum overlap required between target region and read to be counted. 1
min_read_lengths Minimum length of reads drawn from the generated or provided read length dsitribution. 1
chr_restriction Chromosome restriction; unrestricted uses all chromosomes, otherwise excludes "M" or "_" chromosomes. None
barcode_weights Changes barcode ratios. None
n_barcodes Number of barcodes (i.e., genomes). 1
iterations Number of iterations. 1
scaling Scaling factor; set to 0.5 if insertion/ROI is expected on one allele in a diploid genome. 1.0
parallel_jobs Number of cores assigned for parallelization; each job performs one iteration individually. 1
coverages List of coverages used for the simulation. [1]
mean_read_lengths List of mean read lengths used for the simulation; each combination of coverages and lengths is performed per iteration. [1000]
blocked_regions_bedpath BED file with regions that will be blocked from read generation. None
no_cov_plots Prevents the drawing of coverage plots during the first iteration. True significantly speeds up the simulation for small iteration numbers or high (>25) coverages. False
seed If defined, previous runs can be exactly reproduced. random int

[ROI]

Option Description Default
roi_bedpath BED file with regions of interest. Gene IDs containing underscores (_) are not supported. (*)

[I]

Option Description Default
insertion_number_distribution Distribution of insertion numbers; if not poisson, numbers are fixed. poisson
insertion_length Length of the inserted sequence. 1000
bedpath BED file that limits the insertion placement. None
insertion_numbers Number of insertions; if insertion_number_distribution=poisson, this is the mean of the distribution, and the actual number is drawn randomly. 5

Simulation output

Data

The simulation has four main output files:

  1. {experiment_name}_log.log: Log files with key processes, resources, and debugging information.

  2. {experiment_name}_barcode_distribution_table.csv: Barcode distribution per condition and iteration.

  3. {experiment_name}_matches_table.csv: Detections per ROI/insertion, including overlap details.

    • [I]: Barcode_{Barcode number}_insertion_{insertion number}_{iteration number} (e.g., Barcode_0_insertion_1_0).

    • [ROI]: {roi}_{Barcode number}_{iteration number} (e.g., TRGVA_0_0).

  4. {experiment_name}_summary.csv: Brief summary statistics of detections per ROI/insertion.

  5. {experiment_name}_insertion_locations.bed: Insertion coordinates (Only in [I] mode).

Figures

During the simulation, basic coverage plots are generated for the first iteration of each parameter set to monitor expected coverage and target region locations. Also, if no read length distribution is provided, the generated distribution is also saved in .npy format.

For a more comprehensive analysis, plot_esloco --config {configfile} can generate additional overview plots. These are summarized in an interactive {experiment_name}_report.html.

Click to show exemplary overview of report figures

{experiment_name}_Barplot_Barcode_absolute_numbers:
Barplot Absolute Numbers

{experiment_name}_lineplot_otb_matches:
Lineplot

{mean_read_length}_{coverage}_coverage:
Coverage Plot


Advanced Usage

Custom read length distributions

If there is FASTA/FASTQ file provided in the sequenced_data_path option of the configurationfile.ini, the simulation will use it as its read length distribution, thus dicarding any options provided as mean_read_lengths.

[!TIP] If your read length input files are .fasta.gz instead of .fasta, you can decompress them beforehand to improve simulation speed.

Blocked regions [COMMON]

If there is a blocked_regions_bedpath provided, the simulation will use the locations, as defined by the first three columns (chr - start - end), as well as optional id and weight columns. The weight column ultimately determines how strict the blocking of the defined region will be performed. If no weight is assigned, the default of 1, i.e. 100% will be applied. If only 50% of the reads that fall into this region are supposed to be blocked, weight should be set to 0.5. This allows for more complex non-uniformly distributed setups or partial monosomies.

Barcode Weighting [COMMON]
Cellular proportions [COMMON]

One key feature of the simulation is the option to manually assign weights to individual barcodes, allowing to simulate different cellular compositions and over- or under-representations of specific subsets. These weights can be directly assigned in the configfile. Here, barcode_weights can be set to any directory structure with the barcode_numbers as keys and their assigned proportions as values (e.g. barcode_weights={"0": 10}). In this example, a read drawn at random is 10 times more likely to be originating from/assigned to barcode 0.

[!Note] Weights of individual barcodes are calculated as

$$S_{k} = \frac{W_{k} \cdot N}{D}$$

where Sk is the weighted share of barcode k, and D is the common denominator, given by:

$$D = \left( \sum_{i} W_{i} + N - \lvert W \rvert \right) \cdot N$$

Wi​ refers to the individual barcode weights specified in the user-defined weights dictionary, N is the total number of barcodes, and ∣W∣ represents the number of barcodes with assigned weights.

In summary, barcodes with assigned weights contribute proportionally, while unweighted barcodes are evenly distributed.

Selective blocking/masking [COMMON]

It is also possible to select barcodes that will be affected by the pre-defined Blocked regions. For this, an optional barcode column can be added to the bed file, containing a list of barcodes (e.g. ["0", "1"]).

[!TIP] If the goal is to assign different blocking values to different subsets of barcodes, this can be achieved by adding an id column and multiple rows for the same blocked region, which forces the simulation run each set of blocked coordinates individually.

Full chromosome mode [COMMON] [I] [ROI]

Generally, if any of the BED files provided contain entries with only a chromosome defined and start and stop set to 0, the simulation will use the full chromosome.

Fixed insertion numbers and/or locations [I]

It is also possible to fix the number and/or location of insertions. For this, the insertion_number_distribution in the configfile needs to be set to anything else than poisson, which is the default. By doing this, the insertion_numbers option will use the defined value as a fixed value, and not as the mean value of a poisson distribution. For a fixed number of insertions at specific locations, this can be achieved by providing a bedpath in the [I] section of the configfile, which has the exact same number of entries as insertion_numbers.


Scaling up

Sometimes, multiple back-to-back simulations are required. To handle this, esloco can be easily wrapped inside a bash loop that processes a directory containing multiple configuration files:

#!/bin/bash

for config in path/to/configs/*; do
    echo "Processing: $config"
    esloco --config "$config"
    plot_esloco --config "$config"
done

Limitations

  • Uniform read placement

    esloco currently places reads uniformly at random along the genome, implying that all regions are equally likely to be sequenced. In practice, this assumption is violated due to sequencing biases, such as GC content effects, which still exist in long-read technologies. While esloco does not explicitly account for such biases, its support for weighted blocking of regions allows users to incorporate empirical correction factors derived from prior sequencing data, offering a route for customization in bias-aware applications.

  • Short worker timeout / memory leak warning

    In some instances, the simulation will return the following warning: "A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak." to the CLI. Usually, this has happened before when using thousands of barcodes during Insertion [I] mode and only a limited amount of CPUs. Usually, neither the simulation nor its results should be affected by the warning.

Citation & Contribution

If you like this tool, please consider giving it a star. If you are using this tool for your research, please cite our upcoming publication.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

esloco-1.0.1.tar.gz (33.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

esloco-1.0.1-py3-none-any.whl (33.5 kB view details)

Uploaded Python 3

File details

Details for the file esloco-1.0.1.tar.gz.

File metadata

  • Download URL: esloco-1.0.1.tar.gz
  • Upload date:
  • Size: 33.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for esloco-1.0.1.tar.gz
Algorithm Hash digest
SHA256 c61c87a84beaa986a103df63ce330270f3a07f613298993b79901e1ec8879db0
MD5 a4d5287f1d7ddea0c532346b32bfb838
BLAKE2b-256 f634d3fdfe13c97451f6bae9e81d06bdf9e109d05090d7eaa4f277fa0087eae3

See more details on using hashes here.

File details

Details for the file esloco-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: esloco-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 33.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for esloco-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 02374b066821988d36419b46abeda3be81ebb2c14cdcedfecec6a71cbdcd7cdb
MD5 0ad8b8d79c7e63d90dc57471c0b0314d
BLAKE2b-256 a18e1c76a45140076e4de4156b6b52c73a3b8936c047a2df9e521cfc5314f103

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page