🧬 methurator

Python Versions License: MIT Tested with pytest Install with BioConda BioContainer

Methurator is a Python package designed to estimate sequencing saturation for reduced-representation bisulfite sequencing (RRBS) data.

Although optimized for RRBS, methurator can also be used for whole-genome bisulfite sequencing (WGBS) or other genome-wide methylation data (e.g. EM-seq). For these data types, however, we advise using the Preseq package.


1. Dependencies and Notes

  • methurator calls SAMtools and MethylDackel internally (for example, for BAM subsampling), so both must be installed.
  • When --genome is provided, the corresponding FASTA file will be automatically fetched and cached.
  • Temporary intermediate files are deleted by default unless --keep-temporary-files is specified.
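
Before running methurator, it can help to confirm that the two external tools are reachable. The following is a minimal sketch, not part of methurator itself; the executable names `samtools` and `MethylDackel` are assumed to be the names under which the tools are installed on your system.

```python
import shutil

# Assumption: these are the executable names of the two external tools.
REQUIRED_TOOLS = ("samtools", "MethylDackel")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing external tools:", ", ".join(missing))
else:
    print("All external tools found.")
```

Installing through BioConda or the BioContainer would normally be expected to provide these tools, but checking is cheap when installing via pip.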

2. Installation

You can install methurator in several ways:

Option 1: Install via pip

pip install methurator

Option 2: Install via BioConda

conda create -n methurator_env bioconda::methurator
conda activate methurator_env

Option 3: Use the BioContainer

docker pull quay.io/biocontainers/methurator:0.1.5--pyhdfd78af_0
docker run quay.io/biocontainers/methurator:0.1.5--pyhdfd78af_0 methurator -h

3. Quick Start

Step 1 — Downsample BAM files

The downsample command performs BAM downsampling according to the specified percentages and coverage.

methurator downsample --genome hg19 --bam test_data/SRX1631721.markdup.sorted.csorted.bam

This command generates two summary files:

  • CpG summary — number of unique CpGs detected in each downsampled BAM
  • Reads summary — number of reads in each downsampled BAM

Example outputs can be found in tests/data.
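
To give an idea of what BAM downsampling with SAMtools looks like, the sketch below builds one `samtools view -s` command per percentage. It is an illustration only: how methurator invokes SAMtools internally is not shown here, and the helper name, the seed, and the output file naming are hypothetical. With `-s`, the integer part of the value is the random seed and the fractional part is the fraction of reads to keep.

```python
from pathlib import Path


def samtools_subsample_cmd(bam, fraction, outdir, seed=42):
    """Build a `samtools view` command keeping roughly `fraction` of the reads.

    Hypothetical helper: the seed and output naming are illustrative choices.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Hypothetical output name: <input stem>_<fraction>.bam inside outdir.
    out = Path(outdir) / f"{Path(bam).stem}_{fraction}.bam"
    # samtools -s takes SEED.FRACTION; this keeps 4 decimals of the fraction.
    seed_and_fraction = f"{seed + fraction:.4f}"
    return ["samtools", "view", "-b", "-s", seed_and_fraction,
            "-o", str(out), str(bam)]


# Print the commands for the default percentages without running them.
for fraction in (0.1, 0.25, 0.5, 0.75):
    print(" ".join(samtools_subsample_cmd("my_sample.bam", fraction, "output")))
```

The commands are only printed, so the sketch runs even where SAMtools is not installed; passing one of the lists to `subprocess.run` would execute it.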


Step 2 — Plot the sequencing saturation curve

Use the plot command to visualize sequencing saturation:

methurator plot \
  --cpgs_file tests/data/cpgs_summary.csv \
  --reads_file tests/data/reads_summary.csv

4. Command Reference

downsample command

| Argument | Description | Default |
|---|---|---|
| --bam | Path to a single .bam file or to multiple ones (e.g. files/*.bam). | |
| --outdir, -o | Output directory. | ./output |
| --fasta | Path to the reference genome FASTA file. If not provided, it will be automatically downloaded based on --genome. | |
| --genome | Genome used for alignment. Available: hg19, hg38, GRCh37, GRCh38, mm10, mm39. | |
| --downsampling-percentages, -ds | Comma-separated list of downsampling percentages between 0 and 1 (exclusive). | 0.1,0.25,0.5,0.75 |
| --minimum-coverage, -mc | Minimum CpG coverage to consider for saturation. Can be a single integer or a list (e.g. 1,3,5). | 3 |
| --rrbs | If set to True, MethylDackel extract accounts for the RRBS nature of the data by adding the --keepDupes flag. | True |
| --keep-temporary-files | If set, temporary files will be kept after analysis. | False |
| --verbose | Enable verbose logging. | False |
| --help, -h | Print the help message and exit. | |
| --version | Print the package version. | |
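
The list-valued options take comma-separated strings. The sketch below shows one way such values could be parsed and validated against the constraints described above; it is not methurator's own parsing code, and the function names are hypothetical. Requiring coverage values of at least 1 is an assumption.

```python
def parse_percentages(value):
    """Parse e.g. "0.1,0.25,0.5,0.75" into sorted floats strictly in (0, 1)."""
    percentages = [float(item) for item in value.split(",") if item.strip()]
    for p in percentages:
        if not 0 < p < 1:
            raise ValueError(f"percentage {p} is not strictly between 0 and 1")
    return sorted(percentages)


def parse_coverages(value):
    """Parse a single integer ("3") or a list ("1,3,5") into integers."""
    coverages = [int(item) for item in value.split(",") if item.strip()]
    for c in coverages:
        # Assumption: a minimum coverage below 1 is not meaningful.
        if c < 1:
            raise ValueError(f"coverage {c} must be at least 1")
    return coverages


print(parse_percentages("0.1,0.25,0.5,0.75"))
print(parse_coverages("1,3,5"))
```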

plot command

| Argument | Description | Default |
|---|---|---|
| --cpgs_file | Path to the CpG coverage summary file. | |
| --reads_file | Path to the reads coverage summary file. | |
| --outdir, -o | Output directory. | ./output |
| --verbose | Enable verbose logging. | False |
| --help, -h | Print the help message and exit. | |
| --version | Print the package version. | |

5. Example Workflow

# Step 1: Downsample BAM file
methurator downsample --genome hg19 --bam my_sample.bam

# Step 2: Plot saturation curve
methurator plot \
  --cpgs_file output/cpgs_summary.csv \
  --reads_file output/reads_summary.csv

Finally, the output/plots directory will contain an HTML file with the sequencing saturation plot, similar to the following example (also available as an interactive HTML file):

Plot preview

6. How do we compute the sequencing saturation?

To calculate the sequencing saturation of an RRBS sample, we adopt the following strategy. Each sample is downsampled at 4 different percentages (default: 0.1, 0.25, 0.5, 0.75). At each downsampling percentage, we then count the reads and the unique CpGs covered by at least 3 reads (the default minimum coverage).

We then fit the following curve using the scipy.optimize.curve_fit function:

$$ y = \beta_0 \cdot \arctan(\beta_1 \cdot x) $$

Here $x$ is the number of reads and $y$ is the number of unique CpGs covered by at least 3 reads. We chose the arctangent function because it grows toward a horizontal asymptote, much like a sequencing saturation curve. As $x \to \infty$, the curve approaches the theoretical maximum number of unique CpGs covered by at least 3 reads, which can be computed as:

$$ \text{asymptote} = \beta_0 \cdot \frac{\pi}{2} $$

Finally, the sequencing saturation value can be calculated as follows:

$$ \text{Saturation} = \frac{\text{Number of unique CpGs (≥3 counts)}}{\text{Asymptote}} $$

This approach allows estimation of the theoretical maximum number of CpGs that can be detected given an infinite sequencing depth, and quantifies how close the sample is to reaching sequencing saturation.
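
The calculation above can be sketched in a few lines with `scipy.optimize.curve_fit`. This is an illustration under stated assumptions, not methurator's actual implementation: the data are synthetic (generated from the model itself, with reads expressed in millions), the full sample is included as a fifth point, the initial guess `p0` is an arbitrary choice, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit


def arctan_model(x, beta0, beta1):
    """y = beta0 * arctan(beta1 * x)"""
    return beta0 * np.arctan(beta1 * x)


def estimate_saturation(reads, cpgs):
    """Fit the arctangent curve and return (asymptote, saturation)."""
    reads = np.asarray(reads, dtype=float)
    cpgs = np.asarray(cpgs, dtype=float)
    # Assumption: a rough starting point taken from the scale of the data.
    p0 = [cpgs.max(), 1.0 / np.median(reads)]
    (beta0, beta1), _ = curve_fit(arctan_model, reads, cpgs, p0=p0)
    asymptote = beta0 * np.pi / 2
    # Saturation: CpGs observed at the deepest point over the fitted asymptote.
    saturation = cpgs.max() / asymptote
    return asymptote, saturation


# Synthetic example: 10 million reads in the full sample, downsampled at the
# default percentages plus the full sample itself (an assumption).
fractions = np.array([0.1, 0.25, 0.5, 0.75, 1.0])
reads_millions = fractions * 10.0
cpgs = arctan_model(reads_millions, 1_000_000, 0.3)  # noiseless, made-up data

asymptote, saturation = estimate_saturation(reads_millions, cpgs)
print(f"asymptote: {asymptote:.0f} CpGs, saturation: {saturation:.1%}")
```

Expressing reads in millions keeps the two fitted parameters on comparable scales, which tends to make the fit better conditioned; with real, noisy data the quality of the estimate depends on how well the arctangent shape matches the sample.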
