Skip to main content

Estimating Long-Read Coverage for Whole-Genome Sequencing

Project description

PyPI version


Illustration


Simulation-guided estimation of local coverage for long-read DNA sequencing (esloco)

esloco is a flexible tool for estimating local coverage in long-read DNA sequencing experiments, designed to support both a priori experimental planning and post hoc interpretation of sequencing results.
Publication (not available yet)

Summary

Illustration

The illustration above demonstrates the workflow of esloco, which iteratively uses repeated simulations to estimate DNA sequencing coverage over target regions. The process involves two main steps:

  1. Generating target coordinates: Barcoded genomes are generated by converting the reference genome into a single coordinate space. Randomized insertions ([I] mode) or user-defined regions of interest ([ROI] mode) are assigned as target regions.

  2. Simulating reads: For each barcode, reads are drawn from a user-defined or empirical read length distribution and randomly placed until the desired whole-genome coverage is reached. Reads falling into blocked regions are excluded.

  3. Calculating target coverage: Local coverage for each target is calculated based on on-target bases of overlapping reads.

[!NOTE] The simulation assumes a uniform placement of long-reads during DNA sequencing. Referring to the Monte-Carlo method, esloco is based on the idea that random sampling can approximate deterministic outcomes, with target coverage primarily influenced by read length distribution and whole-genome coverage.


Installation

The simulation and all its dependencies can be installed from PyPI using pip via

pip install esloco

or by cloning this repository and manually installing the dependencies mentioned in the .toml file.

Usage

Run the simulation on the command line using

esloco --config <configfile>

After the simulation, some basic plots and an interactive overview can be created automatically from the simulation's output tables using

plot_esloco --config <configfile>

Configuration File

Below are the configuration options for the simulation, divided into three sections: [COMMON], [ROI], and [I]. Mandatory fields marked with (*).

[COMMON]

Option Description Default
mode Running mode; can be I or ROI. Corresponding section must be provided in the same configuration file. (*)
reference_genome_path Path to the reference genome. (*)
sequenced_data_path Custom FASTA/FASTQ file with reads. None
output_path Output path for simulation results. ./output/
experiment_name Name of the experiment. default_experiment
output_path_plots Output path for plots. output_path
min_overlap_for_detection Minimum overlap required between target region and read to be counted. 1
min_read_length Minimum length of reads drawn from the generated or provided read length dsitribution. 1
chr_restriction Chromosome restriction; unrestricted uses all chromosomes, otherwise excludes "M" or "_" chromosomes. None
barcode_weights Changes barcode ratios. None
n_barcodes Number of barcodes (i.e., genomes). 1
iterations Number of iterations. 1
scaling Scaling factor; set to 0.5 if insertion/ROI is expected on one allele in a diploid genome. 1.0
parallel_jobs Number of cores assigned for parallelization; each job performs one iteration individually. 1
coverages List of coverages used for the simulation. [1]
mean_read_lengths List of mean read lengths used for the simulation; each combination of coverages and lengths is performed per iteration. [10000]
blocked_regions_bedpath BED file with regions that will be blocked from read generation. Optional partial masking using weight column. None
consuming If True, reads assigned to masked/blocked regions are still counted as sequencing-yield consuming. Biologically, this represents reads that are sequenced but cannot be used/mapped downstream. False
no_cov_plots Prevents the drawing of coverage plots during the first iteration. True significantly speeds up the simulation for small iteration numbers or high (>25) coverages. False
seed If defined, previous runs can be exactly reproduced. random int
sigma Custom sigma for artificial log-normal read distribution reads. 1

[ROI]

Option Description Default
roi_bedpath BED file with regions of interest. Gene IDs containing underscores (_) are not supported. (*)

[I]

Option Description Default
insertion_number_distribution Distribution of insertion numbers; if not poisson, numbers are fixed. None
insertion_length Length of the inserted sequence. 1000
bedpath BED file that limits the insertion placement. Optional biased placement using weight column. Without weight, insertions are placed relative to the size of the defined regions. None
insertion_numbers Number of insertions; if insertion_number_distribution=poisson, this is the mean of the distribution, and the actual number is drawn randomly. 5

Simulation output

Data

The simulation has four main output files:

  1. {experiment_name}_log.log: Log files with key processes, resources, and debugging information.

  2. {experiment_name}_barcode_distribution_table.csv: Barcode distribution per condition and iteration.

  3. {experiment_name}_matches_table.csv: Detections per ROI/insertion, including overlap details.

    • [I]: Barcode_{Barcode number}_insertion_{insertion number}_{iteration number} (e.g., Barcode_0_insertion_1_0).

    • [ROI]: {roi}_{Barcode number}_{iteration number} (e.g., TRGVA_0_0).

  4. {experiment_name}_summary.csv: Brief summary statistics of detections per ROI/insertion.

  5. {experiment_name}_insertion_locations.bed: Insertion coordinates (Only in [I] mode).

Figures

During the simulation, basic coverage plots are generated for the first iteration of each parameter set to monitor expected coverage and target region locations. Also, if no read length distribution is provided, the generated distribution is also saved in .npy format.

For a more comprehensive analysis, plot_esloco --config {configfile} can generate additional overview plots. These are summarized in an interactive {experiment_name}_report.html.

Show exemplary overview of report figures

{experiment_name}_Barplot_Barcode_absolute_numbers:
Barplot Absolute Numbers

{experiment_name}_lineplot_otb_matches:
Lineplot

{mean_read_length}_{coverage}_coverage:
Coverage Plot

A combination of plotly and kaleido are used for the export of images in .html and .svg. However, kaleido currently needs to be in a version <1.0.0, where the package comes as a self-contained solution and does not require Chrome.


Advanced Usage

Custom read length distributions

If there is FASTA/FASTQ file provided in the sequenced_data_path option of the configurationfile.ini, the simulation will use it as its read length distribution, thus dicarding any options provided as mean_read_lengths.

[!TIP] If your read length input files are .fasta.gz instead of .fasta, you can decompress them beforehand to improve simulation speed.

Blocked regions [COMMON]

If there is a blocked_regions_bedpath provided, the simulation will use the locations, as defined by the first three columns (chrom - start - end), as well as optional id and weight columns. The weight column ultimately determines how strict the blocking of the defined region will be performed. If no weight is assigned, the default of 1, i.e. 100% will be applied. If only 50% of the reads that fall into this region are supposed to be blocked, weight should be set to 0.5. This allows for more complex non-uniformly distributed setups or partial monosomies. When setting custom weights, the columns must have the titles chrom - start - end- weight, otherwise esloco assumes no weight column has been assigned and all regions will be fully blocked.

[!Note] When substantial portions of the genome are blocked, consuming = False introduces a major runtime cost. This is because many simulated reads are discarded before reaching the global coverage threshold, leading to substantially more read draws in total.

Barcode Weighting [COMMON]

Cellular proportions [COMMON]

One key feature of the simulation is the option to manually assign weights to individual barcodes, allowing to simulate different cellular compositions and over- or under-representations of specific subsets. These weights can be directly assigned in the configfile. Here, barcode_weights can be set to any directory structure with the barcode_numbers as keys and their assigned proportions as values (e.g. barcode_weights={"0": 10}). In this example, a read drawn at random is 10 times more likely to be originating from/assigned to barcode 0.

Show how weights of individual barcodes are calculated
Weights of individual barcodes are calculated as

$$S_{k} = \frac{W_{k} \cdot N}{D}$$

where Sk is the weighted share of barcode k, and D is the common denominator, given by:

$$D = \left( \sum_{i} W_{i} + N - \lvert W \rvert \right) \cdot N$$

Wi​ refers to the individual barcode weights specified in the user-defined weights dictionary, N is the total number of barcodes, and ∣W∣ represents the number of barcodes with assigned weights.

In summary, barcodes with assigned weights contribute proportionally, while unweighted barcodes are evenly distributed.

Selective blocking/masking [COMMON]

It is also possible to select barcodes that will be affected by the pre-defined Blocked regions. For this, an optional barcode column can be added to the bed file, containing a list of barcodes (e.g. ["0", "1"]).

[!TIP] If the goal is to assign different blocking values to different subsets of barcodes, this can be achieved by adding an id column and multiple rows for the same blocked region, which forces the simulation run each set of blocked coordinates individually.

Full chromosome mode [COMMON] [I] [ROI]

Generally, if any of the BED files provided contain entries with only a chromosome defined and start and stop set to 0, the simulation will use the full chromosome.

Fixed insertion numbers and/or locations [I]

It is also possible to fix the number and/or location of insertions. For this, the insertion_number_distribution in the configfile needs to be set to anything else than poisson, which is the default. By doing this, the insertion_numbers option will use the defined value as a fixed value, and not as the mean value of a poisson distribution. For a fixed number of insertions at specific locations, this can be achieved by providing a bedpath in the [I] section of the configfile, which has the exact same number of entries as insertion_numbers.

Weighted insertion locations [I]

If a bedpath containing a different number of regions than specified in insertion_numbers is provided, esloco will distribute insertions proportionally to the lengths of the regions in the BED by default. However, if the BED file includes an additional weight column (chrom - start - end- weight), esloco will instead place insertions according to the normalized weights, allowing explicit control over the insertion probabilities for each region.


Scaling up

Sometimes, multiple back-to-back simulations are required. To handle this, esloco can be easily wrapped inside a bash loop that processes a directory containing multiple configuration files:

#!/bin/bash

for config in path/to/configs/*; do
    echo "Processing: $config"
    esloco --config "$config"
    plot_esloco --config "$config"
done

Limitations

  • Uniform read placement

    esloco currently places reads uniformly at random along the genome, implying that all regions are equally likely to be sequenced. In practice, this assumption is violated due to sequencing biases, such as GC content effects, which still exist in long-read technologies. While esloco does not explicitly account for such biases, its support for weighted blocking of regions allows users to incorporate empirical correction factors derived from prior sequencing data, offering a route for customization in bias-aware applications.

  • Short worker timeout / memory leak warning

    In some instances, the simulation will return the following warning: "A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak." to the CLI. Usually, this has happened before when using thousands of barcodes during Insertion [I] mode and only a limited amount of CPUs. Usually, neither the simulation nor its results should be affected by the warning.

Citation & Contribution

If you like this tool, please consider giving it a star. If you are using this tool for your research, please cite our upcoming publication.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

esloco-1.0.6.tar.gz (36.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

esloco-1.0.6-py3-none-any.whl (35.9 kB view details)

Uploaded Python 3

File details

Details for the file esloco-1.0.6.tar.gz.

File metadata

  • Download URL: esloco-1.0.6.tar.gz
  • Upload date:
  • Size: 36.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for esloco-1.0.6.tar.gz
Algorithm Hash digest
SHA256 0744eb24c3a268dd8cf572251beb1ee6b6566a1e3611eb2d0c15ced4b9e84a41
MD5 3601c3952aa5a5cce0452c8282aff0b8
BLAKE2b-256 679b4bbcf86368b03e22968e2519416b82f812f9c2f2e0c8d5a1b5dd83fe365c

See more details on using hashes here.

File details

Details for the file esloco-1.0.6-py3-none-any.whl.

File metadata

  • Download URL: esloco-1.0.6-py3-none-any.whl
  • Upload date:
  • Size: 35.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for esloco-1.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 c9fa54fefdf3bcbb5539a17ced25e1a7bb480f0d34c0451e59a6405d1505ed44
MD5 f209b15b72df1af77a3618935dd54b6f
BLAKE2b-256 0295c043167865f36414f86aab78c0275227edbd1e1a2c84d0454cf938d8833c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page