esloco

Estimating Long-Read Coverage for Whole-Genome Sequencing

These details have not been verified by PyPI

Project links

Repository

Project description

Simulation-guided estimation of local coverage for long-read DNA sequencing (esloco)

esloco is a flexible tool for estimating local coverage in long-read DNA sequencing experiments, designed to support both a priori experimental planning and post hoc interpretation of sequencing results.
Publication (not available yet)

Summary

The illustration above demonstrates the workflow of esloco, which iteratively uses repeated simulations to estimate DNA sequencing coverage over target regions. The process involves two main steps:

Generating target coordinates: Barcoded genomes are generated by converting the reference genome into a single coordinate space. Randomized insertions ([I] mode) or user-defined regions of interest ([ROI] mode) are assigned as target regions.
Simulating reads: For each barcode, reads are drawn from a user-defined or empirical read length distribution and randomly placed until the desired whole-genome coverage is reached. Reads falling into blocked regions are excluded.
Calculating target coverage: Local coverage for each target is calculated based on on-target bases of overlapping reads.

[!NOTE] The simulation assumes a uniform placement of long-reads during DNA sequencing. Referring to the Monte-Carlo method, esloco is based on the idea that random sampling can approximate deterministic outcomes, with target coverage primarily influenced by read length distribution and whole-genome coverage.

Installation

The simulation and all its dependencies can be installed from PyPI using pip via

pip install esloco

or by cloning this repository and manually installing the dependencies mentioned in the .toml file.

Usage

Run the simulation on the command line using

esloco --config <configfile>

After the simulation, some basic plots and an interactive overview can be created automatically from the simulation's output tables using

plot_esloco --config <configfile>

Configuration File

Below are the configuration options for the simulation, divided into three sections: [COMMON], [ROI], and [I]. Mandatory fields marked with (*).

`[COMMON]`

Option	Description	Default
`mode`	Running mode; can be `I` or `ROI`. Corresponding section must be provided in the same configuration file.	*()**
`reference_genome_path`	Path to the reference genome.	*()**
`sequenced_data_path`	Custom FASTA/FASTQ file with reads.	`None`
`output_path`	Output path for simulation results.	`./output/`
`experiment_name`	Name of the experiment.	`default_experiment`
`output_path_plots`	Output path for plots.	output_path
`min_overlap_for_detection`	Minimum overlap required between target region and read to be counted.	`1`
`min_read_lengths`	Minimum length of reads drawn from the generated or provided read length dsitribution.	`1`
`chr_restriction`	Chromosome restriction; `unrestricted` uses all chromosomes, otherwise excludes "M" or "_" chromosomes.	`None`
`barcode_weights`	Changes barcode ratios.	`None`
`n_barcodes`	Number of barcodes (i.e., genomes).	`1`
`iterations`	Number of iterations.	`1`
`scaling`	Scaling factor; set to `0.5` if insertion/ROI is expected on one allele in a diploid genome.	`1.0`
`parallel_jobs`	Number of cores assigned for parallelization; each job performs one iteration individually.	`1`
`coverages`	List of coverages used for the simulation.	`[1]`
`mean_read_lengths`	List of mean read lengths used for the simulation; each combination of coverages and lengths is performed per iteration.	`[1000]`
`blocked_regions_bedpath`	BED file with regions that will be blocked from read generation.	`None`
`no_cov_plots`	Prevents the drawing of coverage plots during the first iteration. `True` significantly speeds up the simulation for small iteration numbers or high (`>25`) coverages.	`False`
`seed`	If defined, previous runs can be exactly reproduced.	`random int`

`[ROI]`

Option	Description	Default
`roi_bedpath`	BED file with regions of interest. Gene IDs containing underscores (`_`) are not supported.	*()**

`[I]`

Option	Description	Default
`insertion_number_distribution`	Distribution of insertion numbers; if not `poisson`, numbers are fixed.	`poisson`
`insertion_length`	Length of the inserted sequence.	`1000`
`bedpath`	BED file that limits the insertion placement.	`None`
`insertion_numbers`	Number of insertions; if `insertion_number_distribution=poisson`, this is the mean of the distribution, and the actual number is drawn randomly.	`5`

Simulation output

Data

The simulation has four main output files:

{experiment_name}_log.log: Log files with key processes, resources, and debugging information.
{experiment_name}_barcode_distribution_table.csv: Barcode distribution per condition and iteration.
{experiment_name}_matches_table.csv: Detections per ROI/insertion, including overlap details.
- [I]: Barcode_{Barcode number}_insertion_{insertion number}_{iteration number} (e.g., Barcode_0_insertion_1_0).
- [ROI]: {roi}_{Barcode number}_{iteration number} (e.g., TRGVA_0_0).
{experiment_name}_summary.csv: Brief summary statistics of detections per ROI/insertion.
{experiment_name}_insertion_locations.bed: Insertion coordinates (Only in [I] mode).

Figures

During the simulation, basic coverage plots are generated for the first iteration of each parameter set to monitor expected coverage and target region locations. Also, if no read length distribution is provided, the generated distribution is also saved in .npy format.

For a more comprehensive analysis, plot_esloco --config {configfile} can generate additional overview plots. These are summarized in an interactive {experiment_name}_report.html.

Click to show exemplary overview of report figures

{experiment_name}_Barplot_Barcode_absolute_numbers:

{experiment_name}_lineplot_otb_matches:

{mean_read_length}_{coverage}_coverage:

Advanced Usage

Custom read length distributions

If there is FASTA/FASTQ file provided in the sequenced_data_path option of the configurationfile.ini, the simulation will use it as its read length distribution, thus dicarding any options provided as mean_read_lengths.

[!TIP] If your read length input files are .fasta.gz instead of .fasta, you can decompress them beforehand to improve simulation speed.

Blocked regions `[COMMON]`

If there is a blocked_regions_bedpath provided, the simulation will use the locations, as defined by the first three columns (chr - start - end), as well as optional id and weight columns. The weight column ultimately determines how strict the blocking of the defined region will be performed. If no weight is assigned, the default of 1, i.e. 100% will be applied. If only 50% of the reads that fall into this region are supposed to be blocked, weight should be set to 0.5. This allows for more complex non-uniformly distributed setups or partial monosomies.

Barcode Weighting `[COMMON]`

Cellular proportions `[COMMON]`

One key feature of the simulation is the option to manually assign weights to individual barcodes, allowing to simulate different cellular compositions and over- or under-representations of specific subsets. These weights can be directly assigned in the configfile. Here, barcode_weights can be set to any directory structure with the barcode_numbers as keys and their assigned proportions as values (e.g. barcode_weights={"0": 10}). In this example, a read drawn at random is 10 times more likely to be originating from/assigned to barcode 0.

[!Note] Weights of individual barcodes are calculated as

$$S_{k} = \frac{W_{k} \cdot N}{D}$$

where S_k is the weighted share of barcode k, and D is the common denominator, given by:

$$D = \left( \sum_{i} W_{i} + N - \lvert W \rvert \right) \cdot N$$

W_i refers to the individual barcode weights specified in the user-defined weights dictionary, N is the total number of barcodes, and ∣W∣ represents the number of barcodes with assigned weights.

In summary, barcodes with assigned weights contribute proportionally, while unweighted barcodes are evenly distributed.

Selective blocking/masking `[COMMON]`

It is also possible to select barcodes that will be affected by the pre-defined Blocked regions. For this, an optional barcode column can be added to the bed file, containing a list of barcodes (e.g. ["0", "1"]).

[!TIP] If the goal is to assign different blocking values to different subsets of barcodes, this can be achieved by adding an id column and multiple rows for the same blocked region, which forces the simulation run each set of blocked coordinates individually.

Full chromosome mode `[COMMON]` `[I]` `[ROI]`

Generally, if any of the BED files provided contain entries with only a chromosome defined and start and stop set to 0, the simulation will use the full chromosome.

Fixed insertion numbers and/or locations `[I]`

It is also possible to fix the number and/or location of insertions. For this, the insertion_number_distribution in the configfile needs to be set to anything else than poisson, which is the default. By doing this, the insertion_numbers option will use the defined value as a fixed value, and not as the mean value of a poisson distribution. For a fixed number of insertions at specific locations, this can be achieved by providing a bedpath in the [I] section of the configfile, which has the exact same number of entries as insertion_numbers.

Scaling up

Sometimes, multiple back-to-back simulations are required. To handle this, esloco can be easily wrapped inside a bash loop that processes a directory containing multiple configuration files:

#!/bin/bash

for config in path/to/configs/*; do
    echo "Processing: $config"
    esloco --config "$config"
    plot_esloco --config "$config"
done

Limitations

Uniform read placement

esloco currently places reads uniformly at random along the genome, implying that all regions are equally likely to be sequenced. In practice, this assumption is violated due to sequencing biases, such as GC content effects, which still exist in long-read technologies. While esloco does not explicitly account for such biases, its support for weighted blocking of regions allows users to incorporate empirical correction factors derived from prior sequencing data, offering a route for customization in bias-aware applications.
Short worker timeout / memory leak warning

In some instances, the simulation will return the following warning: "A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak." to the CLI. Usually, this has happened before when using thousands of barcodes during Insertion [I] mode and only a limited amount of CPUs. Usually, neither the simulation nor its results should be affected by the warning.

Citation & Contribution

If you like this tool, please consider giving it a star. If you are using this tool for your research, please cite our upcoming publication.

Project details

These details have not been verified by PyPI

Project links

Repository

Release history Release notifications | RSS feed

1.0.6

Dec 1, 2025

1.0.5

Sep 11, 2025

1.0.4

Aug 12, 2025

This version

1.0.1

Aug 4, 2025

1.0.0

Aug 4, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

esloco-1.0.1.tar.gz (33.9 kB view details)

Uploaded Aug 4, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

esloco-1.0.1-py3-none-any.whl (33.5 kB view details)

Uploaded Aug 4, 2025 Python 3

File details

Details for the file esloco-1.0.1.tar.gz.

File metadata

Download URL: esloco-1.0.1.tar.gz
Upload date: Aug 4, 2025
Size: 33.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for esloco-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`c61c87a84beaa986a103df63ce330270f3a07f613298993b79901e1ec8879db0`
MD5	`a4d5287f1d7ddea0c532346b32bfb838`
BLAKE2b-256	`f634d3fdfe13c97451f6bae9e81d06bdf9e109d05090d7eaa4f277fa0087eae3`

See more details on using hashes here.

File details

Details for the file esloco-1.0.1-py3-none-any.whl.

File metadata

Download URL: esloco-1.0.1-py3-none-any.whl
Upload date: Aug 4, 2025
Size: 33.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for esloco-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`02374b066821988d36419b46abeda3be81ebb2c14cdcedfecec6a71cbdcd7cdb`
MD5	`0ad8b8d79c7e63d90dc57471c0b0314d`
BLAKE2b-256	`a18e1c76a45140076e4de4156b6b52c73a3b8936c047a2df9e521cfc5314f103`

See more details on using hashes here.

esloco 1.0.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Simulation-guided estimation of local coverage for long-read DNA sequencing (esloco)

Summary

Installation

Usage

Configuration File

[COMMON]

[ROI]

[I]

Simulation output

Data

Figures

Advanced Usage

Custom read length distributions

Blocked regions [COMMON]

Barcode Weighting [COMMON]

Cellular proportions [COMMON]

Selective blocking/masking [COMMON]

Full chromosome mode [COMMON] [I] [ROI]

Fixed insertion numbers and/or locations [I]

Scaling up

Limitations

Citation & Contribution

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`[COMMON]`

`[ROI]`

`[I]`

Blocked regions `[COMMON]`

Barcode Weighting `[COMMON]`

Cellular proportions `[COMMON]`

Selective blocking/masking `[COMMON]`

Full chromosome mode `[COMMON]` `[I]` `[ROI]`

Fixed insertion numbers and/or locations `[I]`