Estimating Long-Read Coverage for Whole-Genome Sequencing
Project description
Simulation-guided estimation of local coverage for long-read DNA sequencing (esloco)
esloco is a flexible tool for estimating local coverage in long-read DNA sequencing experiments, designed to support both a priori experimental planning and post hoc interpretation of sequencing results.
Publication (not available yet)
Summary
The illustration above demonstrates the workflow of esloco, which iteratively uses repeated simulations to estimate DNA sequencing coverage over target regions. The process involves two main steps:
-
Generating target coordinates: Barcoded genomes are generated by converting the reference genome into a single coordinate space. Randomized insertions (
[I]mode) or user-defined regions of interest ([ROI]mode) are assigned as target regions. -
Simulating reads: For each barcode, reads are drawn from a user-defined or empirical read length distribution and randomly placed until the desired whole-genome coverage is reached. Reads falling into blocked regions are excluded.
-
Calculating target coverage: Local coverage for each target is calculated based on on-target bases of overlapping reads.
[!NOTE] The simulation assumes a uniform placement of long-reads during DNA sequencing. Referring to the Monte-Carlo method,
eslocois based on the idea that random sampling can approximate deterministic outcomes, with target coverage primarily influenced by read length distribution and whole-genome coverage.
Installation
The simulation and all its dependencies can be installed from PyPI using pip via
pip install esloco
or by cloning this repository and manually installing the dependencies mentioned in the .toml file.
Usage
Run the simulation on the command line using
esloco --config <configfile>
After the simulation, some basic plots and an interactive overview can be created automatically from the simulation's output tables using
plot_esloco --config <configfile>
Configuration File
Below are the configuration options for the simulation, divided into three sections: [COMMON], [ROI], and [I]. Mandatory fields marked with (*).
[COMMON]
| Option | Description | Default |
|---|---|---|
mode |
Running mode; can be I or ROI. Corresponding section must be provided in the same configuration file. |
(*) |
reference_genome_path |
Path to the reference genome. | (*) |
sequenced_data_path |
Custom FASTA/FASTQ file with reads. | None |
output_path |
Output path for simulation results. | ./output/ |
experiment_name |
Name of the experiment. | default_experiment |
output_path_plots |
Output path for plots. | output_path |
min_overlap_for_detection |
Minimum overlap required between target region and read to be counted. | 1 |
min_read_length |
Minimum length of reads drawn from the generated or provided read length dsitribution. | 1 |
chr_restriction |
Chromosome restriction; unrestricted uses all chromosomes, otherwise excludes "M" or "_" chromosomes. |
None |
barcode_weights |
Changes barcode ratios. | None |
n_barcodes |
Number of barcodes (i.e., genomes). | 1 |
iterations |
Number of iterations. | 1 |
scaling |
Scaling factor; set to 0.5 if insertion/ROI is expected on one allele in a diploid genome. |
1.0 |
parallel_jobs |
Number of cores assigned for parallelization; each job performs one iteration individually. | 1 |
coverages |
List of coverages used for the simulation. | [1] |
mean_read_lengths |
List of mean read lengths used for the simulation; each combination of coverages and lengths is performed per iteration. | [10000] |
blocked_regions_bedpath |
BED file with regions that will be blocked from read generation. Optional partial masking using weight column. |
None |
consuming |
If True, reads assigned to masked/blocked regions are still counted as sequencing-yield consuming. Biologically, this represents reads that are sequenced but cannot be used/mapped downstream. |
False |
no_cov_plots |
Prevents the drawing of coverage plots during the first iteration. True significantly speeds up the simulation for small iteration numbers or high (>25) coverages. |
False |
seed |
If defined, previous runs can be exactly reproduced. | random int |
sigma |
Custom sigma for artificial log-normal read distribution reads. | 1 |
[ROI]
| Option | Description | Default |
|---|---|---|
roi_bedpath |
BED file with regions of interest. Gene IDs containing underscores (_) are not supported. |
(*) |
[I]
| Option | Description | Default |
|---|---|---|
insertion_number_distribution |
Distribution of insertion numbers; if not poisson, numbers are fixed. |
None |
insertion_length |
Length of the inserted sequence. | 1000 |
bedpath |
BED file that limits the insertion placement. Optional biased placement using weight column. Without weight, insertions are placed relative to the size of the defined regions. |
None |
insertion_numbers |
Number of insertions; if insertion_number_distribution=poisson, this is the mean of the distribution, and the actual number is drawn randomly. |
5 |
Simulation output
Data
The simulation has four main output files:
-
{experiment_name}_log.log: Log files with key processes, resources, and debugging information. -
{experiment_name}_barcode_distribution_table.csv: Barcode distribution per condition and iteration. -
{experiment_name}_matches_table.csv: Detections per ROI/insertion, including overlap details.-
[I]:Barcode_{Barcode number}_insertion_{insertion number}_{iteration number}(e.g., Barcode_0_insertion_1_0). -
[ROI]:{roi}_{Barcode number}_{iteration number}(e.g., TRGVA_0_0).
-
-
{experiment_name}_summary.csv: Brief summary statistics of detections per ROI/insertion. -
{experiment_name}_insertion_locations.bed: Insertion coordinates (Only in[I]mode).
Figures
During the simulation, basic coverage plots are generated for the first iteration of each parameter set to monitor expected coverage and target region locations. Also, if no read length distribution is provided, the generated distribution is also saved in .npy format.
For a more comprehensive analysis, plot_esloco --config {configfile} can generate additional overview plots. These are summarized in an interactive {experiment_name}_report.html.
Show exemplary overview of report figures
A combination of plotly and kaleido are used for the export of images in
.htmland.svg. However,kaleidocurrently needs to be in a version<1.0.0, where the package comes as a self-contained solution and does not require Chrome.
Advanced Usage
Custom read length distributions
If there is FASTA/FASTQ file provided in the sequenced_data_path option of the configurationfile.ini, the simulation will use it as its read length distribution, thus dicarding any options provided as mean_read_lengths.
[!TIP] If your read length input files are
.fasta.gzinstead of.fasta, you can decompress them beforehand to improve simulation speed.
Blocked regions [COMMON]
If there is a blocked_regions_bedpath provided, the simulation will use the locations, as defined by the first three columns (chrom - start - end), as well as optional id and weight columns. The weight column ultimately determines how strict the blocking of the defined region will be performed. If no weight is assigned, the default of 1, i.e. 100% will be applied. If only 50% of the reads that fall into this region are supposed to be blocked, weight should be set to 0.5. This allows for more complex non-uniformly distributed setups or partial monosomies. When setting custom weights, the columns must have the titles chrom - start - end- weight, otherwise esloco assumes no weight column has been assigned and all regions will be fully blocked.
[!Note] When substantial portions of the genome are blocked,
consuming = Falseintroduces a major runtime cost. This is because many simulated reads are discarded before reaching the global coverage threshold, leading to substantially more read draws in total.
Barcode Weighting [COMMON]
Cellular proportions [COMMON]
One key feature of the simulation is the option to manually assign weights to individual barcodes, allowing to simulate different cellular compositions and over- or under-representations of specific subsets. These weights can be directly assigned in the configfile. Here, barcode_weights can be set to any directory structure with the barcode_numbers as keys and their assigned proportions as values (e.g. barcode_weights={"0": 10}). In this example, a read drawn at random is 10 times more likely to be originating from/assigned to barcode 0.
Show how weights of individual barcodes are calculated
Weights of individual barcodes are calculated as
$$S_{k} = \frac{W_{k} \cdot N}{D}$$
where Sk is the weighted share of barcode k, and D is the common denominator, given by:
$$D = \left( \sum_{i} W_{i} + N - \lvert W \rvert \right) \cdot N$$
Wi refers to the individual barcode weights specified in the user-defined weights dictionary, N is the total number of barcodes, and ∣W∣ represents the number of barcodes with assigned weights.
In summary, barcodes with assigned weights contribute proportionally, while unweighted barcodes are evenly distributed.
Selective blocking/masking [COMMON]
It is also possible to select barcodes that will be affected by the pre-defined Blocked regions. For this, an optional barcode column can be added to the bed file, containing a list of barcodes (e.g. ["0", "1"]).
[!TIP] If the goal is to assign different blocking values to different subsets of barcodes, this can be achieved by adding an
idcolumn and multiple rows for the same blocked region, which forces the simulation run each set of blocked coordinates individually.
Full chromosome mode [COMMON] [I] [ROI]
Generally, if any of the BED files provided contain entries with only a chromosome defined and start and stop set to 0, the simulation will use the full chromosome.
Fixed insertion numbers and/or locations [I]
It is also possible to fix the number and/or location of insertions. For this, the insertion_number_distribution in the configfile needs to be set to anything else than poisson, which is the default. By doing this, the insertion_numbers option will use the defined value as a fixed value, and not as the mean value of a poisson distribution. For a fixed number of insertions at specific locations, this can be achieved by providing a bedpath in the [I] section of the configfile, which has the exact same number of entries as insertion_numbers.
Weighted insertion locations [I]
If a bedpath containing a different number of regions than specified in insertion_numbers is provided, esloco will distribute insertions proportionally to the lengths of the regions in the BED by default. However, if the BED file includes an additional weight column (chrom - start - end- weight), esloco will instead place insertions according to the normalized weights, allowing explicit control over the insertion probabilities for each region.
Scaling up
Sometimes, multiple back-to-back simulations are required. To handle this, esloco can be easily wrapped inside a bash loop that processes a directory containing multiple configuration files:
#!/bin/bash
for config in path/to/configs/*; do
echo "Processing: $config"
esloco --config "$config"
plot_esloco --config "$config"
done
Limitations
-
Uniform read placement
eslococurrently places reads uniformly at random along the genome, implying that all regions are equally likely to be sequenced. In practice, this assumption is violated due to sequencing biases, such as GC content effects, which still exist in long-read technologies. Whileeslocodoes not explicitly account for such biases, its support for weighted blocking of regions allows users to incorporate empirical correction factors derived from prior sequencing data, offering a route for customization in bias-aware applications. -
Short worker timeout / memory leak warning
In some instances, the simulation will return the following warning: "A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak." to the CLI. Usually, this has happened before when using thousands of barcodes during
Insertion [I]mode and only a limited amount of CPUs. Usually, neither the simulation nor its results should be affected by the warning.
Citation & Contribution
If you like this tool, please consider giving it a star. If you are using this tool for your research, please cite our upcoming publication.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file esloco-1.0.6.tar.gz.
File metadata
- Download URL: esloco-1.0.6.tar.gz
- Upload date:
- Size: 36.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0744eb24c3a268dd8cf572251beb1ee6b6566a1e3611eb2d0c15ced4b9e84a41
|
|
| MD5 |
3601c3952aa5a5cce0452c8282aff0b8
|
|
| BLAKE2b-256 |
679b4bbcf86368b03e22968e2519416b82f812f9c2f2e0c8d5a1b5dd83fe365c
|
File details
Details for the file esloco-1.0.6-py3-none-any.whl.
File metadata
- Download URL: esloco-1.0.6-py3-none-any.whl
- Upload date:
- Size: 35.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c9fa54fefdf3bcbb5539a17ced25e1a7bb480f0d34c0451e59a6405d1505ed44
|
|
| MD5 |
f209b15b72df1af77a3618935dd54b6f
|
|
| BLAKE2b-256 |
0295c043167865f36414f86aab78c0275227edbd1e1a2c84d0454cf938d8833c
|