T2T-Automated-Polishing


Fully automatic K-mer based polishing of genome assemblies.

The current version is unpublished. If any of the code shared in this repository is used, please cite the paper below, Arang Rhie's T2T-Polish Git repository, and this repository:

Mc Cartney AM, Shafin K, Alonge M, et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods (2022). https://doi.org/10.1038/s41592-022-01440-3

For further details on exact application to T2T-CHM13, read the corresponding section below.

Repository Contents

This repository includes everything needed to run the T2T polishing pipeline:

File Description
t2t_polish/ Modular Python package (CLI, constants, runner, polishing, evaluation, k-cov)
pyproject.toml PEP 517/518 project metadata and build configuration
Dockerfile Docker container definition for portable deployment
tests/ pytest test suite
CHANGELOG.md Release history
CONTRIBUTING.md Development and contribution guide
.github/workflows/ CI, release, PyPI publish, and GHCR Docker publish workflows
legacy/ Previous versions (v2, v3, v4 monolithic) and legacy conda/Singularity/SLURM files

What's New in Version 4

T2T-Polish v4 represents a major reimplementation of the T2T polishing pipeline:

  • Python-based: Complete rewrite in Python for improved speed, error handling, and maintainability
  • DeepVariant Integration: Replaced Racon with GPU-accelerated DeepVariant for more accurate variant calling
  • Automatic QV Assessment: Real-time quality evaluation at each iteration using both Merfin and Merqury
  • Optimized K-mer Coverage: Automatic computation of optimal k-mer coverage using Jellyfish and GenomeScope2
  • Resume Capability: Built-in checkpointing allows resuming from interrupted runs
  • Enhanced Logging: Comprehensive logging with timestamped entries and detailed diagnostics
  • Parallel Evaluation: Simultaneous Merfin and Merqury QV calculations for faster assessment

Description and Best Practices

Auto-Polisher launches an iterative process that allows for more precise k-mer based polishing than typical consensus-only methods. Meryl and Winnowmap2 identify unique k-mers throughout the assembly and map reads against it. The resulting alignments are extensively filtered with FalconC and Merfin to retain only the best-supported base-level corrections. DeepVariant then performs GPU-accelerated variant calling to identify corrections with high precision. Once corrections are applied, the process repeats, re-anchoring on k-mers newly introduced into the assembly by the previous round of correction. Base-level accuracy generally peaks after three iterations (the program default).

Genome assembly accuracy is automatically assessed at each iteration using both Merfin and Merqury, providing real-time QV (Quality Value) and completeness metrics. For the final assessment, it is highly recommended to use a hybrid k-mer database filtered for k-mers with counts greater than one to obtain the most accurate Merqury QV. The steps for this via Meryl and Merqury can be found here, as recommended by the developer, Arang Rhie. Using incomplete Meryl DBs to assess assemblies after auto-polishing can lead to inaccurate Merqury QV estimates.
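
To make the QV metric concrete: Merqury estimates the base-level error rate from the fraction of assembly k-mers that are absent from the read k-mer database, then converts it to a Phred scale. The numbers below are invented for illustration:

```python
import math

def merqury_qv(asm_only_kmers: int, total_kmers: int, k: int = 21) -> float:
    """Phred-scaled QV from the fraction of assembly k-mers absent
    from the read k-mer database (Merqury's published formula)."""
    # Probability that a single base is correct, inferred from k-mer survival
    p_correct = (1 - asm_only_kmers / total_kmers) ** (1 / k)
    error_rate = 1 - p_correct
    return -10 * math.log10(error_rate)

# 100 assembly-only ("error") k-mers out of 1,000,000 at k=21
qv = merqury_qv(100, 1_000_000)
print(round(qv, 1))  # ~53.2, i.e. roughly one error per 200 kb
```

A higher QV at each iteration indicates the polishing round removed more erroneous k-mers than it introduced.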

How to Run (Quick Start)

Version 4.1.2 can be installed from PyPI or directly from the GitHub repository. T2T-Polish is a Python package with the t2t-polish CLI entry point.

Basic Usage

# Install from the repository
pip install .

# Install from PyPI
pip install t2t-polish

# Run diagnostics to check dependencies
t2t-polish diagnostics

# Compute optimal k-mer coverage (recommended first step)
t2t-polish computekcov \
    -r <reads.fastq> \
    -o <output_prefix> \
    -k 21 \
    -t 32 \
    --ploidy haploid

# Run polishing with optimized parameters (automatic k-cov calculation)
t2t-polish polish \
    -d <draft.fasta> \
    -r <reads.fastq> \
    --singularity_sif <path/to/deepvariant.sif> \
    --deepseq_type PACBIO \
    -o AutoPolisher \
    -t 32 \
    -i 3 \
    --optimized

# Or run polishing with manual k-mer coverage parameters
t2t-polish polish \
    -d <draft.fasta> \
    -r <reads.fastq> \
    -m <readmers.meryl> \
    --singularity_sif <path/to/deepvariant.sif> \
    --deepseq_type PACBIO \
    --fitted_hist <genomescope_output/lookup_table.txt> \
    --ideal_dpeak 106.7 \
    -o AutoPolisher \
    -t 32 \
    -i 3

Subcommands

t2t-polish has three subcommands:

  • computekcov - Calculate optimal k-mer coverage and fitted histogram using Jellyfish + GenomeScope2
  • polish - Run iterative polishing with DeepVariant-based variant calling
  • diagnostics - Check tool availability, disk space, and Python packages
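
The three-subcommand layout can be pictured as an argparse subparser dispatch. This is an illustrative sketch only, not the package's actual implementation, and the option set is abbreviated:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Top-level parser with one subparser per pipeline mode
    parser = argparse.ArgumentParser(prog="t2t-polish")
    sub = parser.add_subparsers(dest="command", required=True)

    kcov = sub.add_parser("computekcov", help="optimal k-mer coverage")
    kcov.add_argument("-r", "--reads", required=True)
    kcov.add_argument("-k", type=int, default=21)

    polish = sub.add_parser("polish", help="iterative polishing")
    polish.add_argument("-d", "--draft", required=True)
    polish.add_argument("-i", "--iterations", type=int, default=3)

    sub.add_parser("diagnostics", help="check tool availability")
    return parser

args = build_parser().parse_args(["polish", "-d", "draft.fasta"])
print(args.command, args.iterations)  # polish 3
```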

For detailed help:

t2t-polish --help
t2t-polish computekcov --help
t2t-polish polish --help

Quick Reference: Common Use Cases

Local workstation (with GPU):

t2t-polish polish -d draft.fasta -r reads.fq \
    --singularity_sif deepvariant.sif --deepseq_type PACBIO \
    --optimized -t 32 -i 3

HPC cluster (SLURM):

# See legacy/T2T_Polish_v4/APv4_SLURM.sh for an example submission script
sbatch my_polish_job.sh

Docker container:

docker run -v $(pwd):/data ghcr.io/pgrady1322/t2t-polish:latest polish \
    -d /data/draft.fasta -r /data/reads.fq \
    --singularity_sif /data/deepvariant.sif --optimized

Resume interrupted run:

t2t-polish polish -d draft.fasta -r reads.fq \
    --singularity_sif deepvariant.sif --resume

Key Features

  • GPU Acceleration: DeepVariant utilizes GPU resources via Singularity for faster variant calling
  • Automatic Resume: Use --resume flag to continue from interrupted runs
  • Real-time QV Assessment: Both Merfin and Merqury evaluations run automatically after each iteration
  • Flexible Input: Accepts FASTA or FASTQ reads (FASTQ recommended for HiFi; FASTA inputs are auto-converted to FASTQ with assigned quality scores)
  • DeepVariant Models: Supports WGS, WES, PACBIO, ONT_R104, and HYBRID_PACBIO_ILLUMINA model types

System Requirements

  • GPU: Required for DeepVariant acceleration (NVIDIA GPU with CUDA support). Theoretically, this will work with CPU-enabled DeepVariant, but it has not been tested.
  • RAM: Allocate sufficient memory for read processing (e.g., ~400GB for a Revio flow cell on mammalian genomes)
  • Disk Space: Ensure adequate space for intermediate files and output

Dependencies

Core Dependencies

  • Winnowmap2 - Read mapping with repeat-aware k-mer seeding
  • FalconC - Alignment filtering (available in pbipa package)
  • DeepVariant - GPU-accelerated variant calling (Singularity image required)
  • Meryl v1.3+ - K-mer database operations
  • Merfin v1.0+ - K-mer based variant filtering and QV evaluation
  • Samtools - SAM/BAM manipulation
  • BCFtools - VCF processing and consensus generation

Dependencies for K-mer Coverage Calculation

  • Jellyfish - K-mer counting
  • GenomeScope2 - Genome profiling and fitted histogram generation

QV Assessment

  • Merqury - K-mer based assembly evaluation (merqury.sh must be on PATH)

Python Dependencies

  • Python 3.10+
  • pysam
  • tqdm (for progress bars)
  • Standard library: argparse, subprocess, concurrent.futures, logging, pathlib

Installation

PyPI (Python package only)

The simplest way to install the t2t-polish CLI and its Python dependencies:

pip install t2t-polish

This installs the Python package (pysam, tqdm) and the t2t-polish entry point. You are still responsible for installing the external bioinformatics tools listed in Dependencies (Winnowmap, Meryl, Merfin, DeepVariant, etc.) via Conda, modules, or from source.

Conda Environment (recommended for full stack)

A Conda/Mamba environment is the recommended way to get both the Python package and all external bioinformatics dependencies in one step:

# Create environment with external tools from bioconda
mamba create -n t2t-polish -c conda-forge -c bioconda \
    python=3.11 meryl winnowmap samtools bcftools \
    kmer-jellyfish genomescope2 pysam tqdm

conda activate t2t-polish

# Install t2t-polish CLI from PyPI (or from source with: pip install .)
pip install t2t-polish

# Verify everything is available
t2t-polish diagnostics

Tip: Use mamba instead of conda for much faster environment solves. Install it with conda install -n base -c conda-forge mamba.

A legacy YML file from the v4 monolithic release is also available at legacy/T2T_Polish_v4/APv4.yml for reference.

The environment should include:

  • Core tools: Winnowmap, Meryl, Merfin, Samtools, BCFtools, FalconC (pb-falconc)
  • K-mer analysis: Jellyfish, GenomeScope2
  • Python packages: pysam, tqdm
  • QV assessment: Merqury (merqury.sh on PATH)

Container-Based Installation

Docker

Pre-built images are published to the GitHub Container Registry on every release:

# Pull the latest release image from GHCR
docker pull ghcr.io/pgrady1322/t2t-polish:latest

# Or pull a specific version
docker pull ghcr.io/pgrady1322/t2t-polish:4.1.2

# Run with Docker
docker run -it --rm \
    -v $(pwd)/data:/data \
    ghcr.io/pgrady1322/t2t-polish:latest polish \
    -d /data/draft.fasta \
    -r /data/reads.fastq \
    --singularity_sif /data/deepvariant.sif \
    -o /data/AutoPolisher \
    -t 32

You can also build the image locally from the Dockerfile:

docker build -t t2t-polish .

Note: The Docker container includes the t2t-polish pipeline and all dependencies except DeepVariant, which must be provided as a Singularity image.

Singularity

A legacy Singularity definition file is available at legacy/T2T_Polish_v4/APv4_Singularity.def for reference. You can adapt it or build from the Dockerfile:

# Build a Singularity image from GHCR
singularity build t2t-polish.sif docker://ghcr.io/pgrady1322/t2t-polish:latest

# Or build from a local Docker image
docker build -t t2t-polish .
singularity build t2t-polish.sif docker-daemon://t2t-polish:latest

# Run with Singularity
singularity exec t2t-polish.sif t2t-polish polish \
    -d draft.fasta \
    -r reads.fastq \
    --singularity_sif deepvariant_gpu.sif \
    -o AutoPolisher \
    -t 32

The container includes:

  • Complete t2t-polish environment with all dependencies
  • Merfin v1.1 built from source
  • Merqury for QV assessment
  • Optimized for HPC cluster deployment

DeepVariant Singularity Image

DeepVariant must be run via a Singularity container with GPU support. Download the GPU-enabled image:

# Download DeepVariant GPU Singularity image (recommended)
BIN_VERSION=1.6.1  # set to the desired DeepVariant release
singularity pull docker://google/deepvariant:"${BIN_VERSION}-gpu"

# Or specify a specific version
singularity pull docker://google/deepvariant:1.6.1-gpu

# Or build from Docker
singularity build deepvariant_gpu.sif docker://google/deepvariant:1.6.1-gpu

Note: Ensure your system has NVIDIA GPU drivers and CUDA toolkit installed for GPU acceleration.

HPC/SLURM Example

A legacy SLURM script is available at legacy/T2T_Polish_v4/APv4_SLURM.sh for reference. Here is an updated example:

#!/bin/bash
#SBATCH --job-name=t2t_polish
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 32
#SBATCH --partition=gpu
#SBATCH --qos=general
#SBATCH --mem=200g
#SBATCH -o %x_%j.stdout
#SBATCH -e %x_%j.stderr

t2t-polish polish \
    -d draft.fasta \
    -r reads.fastq \
    --optimized \
    --singularity_sif deepvariant_1.6.1-gpu.sif \
    -t 32 \
    -o AutoPolisher

Submit with:

sbatch my_polish_job.sh

Key SLURM considerations:

  • Request GPU partition for DeepVariant acceleration
  • Allocate sufficient memory (200GB+ for large genomes)
  • Match thread count (-c) with t2t-polish threads (-t)
  • Use --resume flag for long-running jobs that may timeout
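
One way to keep the SLURM -c allocation and the t2t-polish -t value matched is to read the thread count from SLURM's environment rather than hard-coding it. A small sketch:

```python
import os

def slurm_threads(default: int = 32) -> int:
    """Use the CPU count SLURM granted this task, falling back to a
    default when running outside the scheduler."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(value) if value else default

# e.g. pass the result to `t2t-polish polish -t <threads>`
print(slurm_threads())
```

The same idea works in a submission script via `-t "${SLURM_CPUS_PER_TASK}"`.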

Manual Installation

If installing dependencies manually or using HPC modules, ensure all tools are available on PATH. You can verify dependencies by running:

t2t-polish diagnostics

This will check for all required tools and display their versions.
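
Under the hood, a diagnostics check of this kind amounts to probing PATH for each required executable. A minimal sketch (tool list abbreviated; this is not the package's actual code):

```python
import shutil

REQUIRED_TOOLS = ["winnowmap", "falconc", "meryl", "merfin",
                  "samtools", "bcftools", "merqury.sh"]

def check_tools(tools):
    """Map each tool name to its resolved path, or None if missing."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in check_tools(REQUIRED_TOOLS).items():
    print(f"{tool:12s} {path if path else 'NOT FOUND'}")
```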

Configuration

The pipeline automatically detects tools on PATH. If tools are installed in non-standard locations, you can modify the tool names in t2t_polish/constants.py:

# Tool names (modify if needed)
WINNOWMAP = "winnowmap"
FALCONC = "falconc"
MERYL = "meryl"
MERFIN = "merfin"
BCFTOOLS = "bcftools"
JELLYFISH = "jellyfish"
GENOMESCOPE = "genomescope2"
SAMTOOLS = "samtools"
MERQURY_SH = "merqury.sh"

These can be changed to full paths if necessary.

Advanced Features

Resume Capability

t2t-polish includes robust checkpoint/resume functionality. If a run is interrupted, you can resume from the last completed step:

# Resume from last checkpoint
t2t-polish polish \
    -d <draft.fasta> \
    -r <reads.fastq> \
    --singularity_sif <deepvariant.sif> \
    --resume

# Resume from a specific step (0=all, 1=Meryl, 2=Winnowmap, 3=FalconC, 4=DeepVariant, 5=Merfin, 6=Consensus)
t2t-polish polish \
    -d <draft.fasta> \
    -r <reads.fastq> \
    --singularity_sif <deepvariant.sif> \
    --resume \
    --resume-from 4
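
Checkpointing of this style can be sketched as a small JSON state file recording the last completed step; the file name and schema below are hypothetical, for illustration only:

```python
import json
from pathlib import Path

# Step names matching the --resume-from numbering (1=Meryl ... 6=Consensus)
STEPS = ["meryl", "winnowmap", "falconc", "deepvariant", "merfin", "consensus"]

def save_checkpoint(path: Path, iteration: int, step: int) -> None:
    # Record the last completed step so an interrupted run can resume
    path.write_text(json.dumps({"iteration": iteration, "step": step}))

def next_step(path: Path):
    """Return (iteration, step) to resume from, or (1, 1) for a fresh run."""
    if not path.exists():
        return 1, 1
    state = json.loads(path.read_text())
    return state["iteration"], state["step"] + 1

ckpt = Path("checkpoint.json")  # hypothetical file name
save_checkpoint(ckpt, iteration=2, step=4)  # DeepVariant finished
it, step = next_step(ckpt)
print(it, step, STEPS[step - 1])  # 2 5 merfin
```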

Optimized Mode with Automatic K-mer Coverage

When using --optimized, the pipeline automatically:

  1. Computes optimal k-mer coverage using Jellyfish and GenomeScope2
  2. Generates fitted histogram for Merfin probability calculations
  3. Applies optimal parameters to Merfin polishing
  4. Saves coverage parameters to JSON for reliable resume

t2t-polish polish \
    -d <draft.fasta> \
    -r <reads.fastq> \
    --singularity_sif <deepvariant.sif> \
    --optimized \
    --ploidy diploid \
    -t 64

Quality Assessment

t2t-polish automatically runs both Merfin and Merqury evaluations in parallel after each iteration, providing:

  • QV (Quality Value): Phred-scaled base accuracy
  • Completeness: Percentage of expected k-mers present
  • Per-iteration tracking: Monitor improvement across iterations

Results are written to:

  • <prefix>.QV_Completeness_summary.txt - Final summary of all iterations
  • Per-iteration Merfin outputs: <prefix>.iter_N.consensus.fasta.merfin_hist.txt
  • Per-iteration Merqury outputs: <prefix>.iter_N.consensus_merqury.qv
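
The parallel Merfin + Merqury evaluation can be pictured with concurrent.futures, launching both assessments at once and waiting for both to finish. The evaluator functions below are stand-ins for the external tools, not the package's real code:

```python
from concurrent.futures import ThreadPoolExecutor

def run_merfin_eval(consensus: str) -> dict:
    # Stand-in for the external Merfin QV evaluation
    return {"tool": "merfin", "qv": 53.1}

def run_merqury_eval(consensus: str) -> dict:
    # Stand-in for the external Merqury QV/completeness evaluation
    return {"tool": "merqury", "qv": 53.4, "completeness": 99.98}

def evaluate(consensus: str):
    """Run both evaluations simultaneously in a two-worker pool."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        merfin = pool.submit(run_merfin_eval, consensus)
        merqury = pool.submit(run_merqury_eval, consensus)
        return merfin.result(), merqury.result()

merfin_res, merqury_res = evaluate("iter_1.consensus.fasta")
print(merfin_res["qv"], merqury_res["qv"])
```

Because the real evaluations are long-running external processes, running them concurrently roughly halves the per-iteration assessment time.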

Logging

Comprehensive logging is built-in:

# Enable detailed logging to file
t2t-polish polish \
    -d <draft.fasta> \
    -r <reads.fastq> \
    --singularity_sif <deepvariant.sif> \
    --log-file polishing.log

# Quiet mode (warnings only to console)
t2t-polish polish \
    -d <draft.fasta> \
    -r <reads.fastq> \
    --singularity_sif <deepvariant.sif> \
    --quiet

Cleanup Options

Control intermediate file retention:

# Automatically clean up intermediate files after each iteration
t2t-polish polish \
    -d <draft.fasta> \
    -r <reads.fastq> \
    --singularity_sif <deepvariant.sif> \
    --cleanup

Output Files

t2t-polish generates organized output with clear naming:

<prefix>.iter_0.consensus.fasta           # Initial draft (iteration 0)
<prefix>.iter_1/                          # Iteration 1 working directory
    ├── iter_1.repet_k15.meryl/          # Repetitive k-mer database
    ├── iter_1.winnowmap.sorted.bam      # Aligned reads
    ├── iter_1.falconc.sorted.bam        # Filtered alignments
    ├── iter_1.deepvariant.vcf.gz        # DeepVariant variants
    ├── iter_1.meryl_db/                 # Draft k-mer database
    ├── iter_1.merfin.polish.vcf         # Merfin-filtered variants
    └── iter_1.consensus.fasta           # Polished consensus
<prefix>.iter_2/                          # Iteration 2 working directory
    └── ...
<prefix>.QV_Completeness_summary.txt      # Final QV/completeness summary
<prefix>.tool_versions.txt                # Tool versions used
<prefix>.kcov.json                        # K-mer coverage parameters (if --optimized)
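
Once a run finishes, the per-iteration QV history in the summary file can be tabulated to find the best iteration. The whitespace-delimited iteration/QV layout assumed below is a guess for illustration, not the documented format:

```python
import re

def parse_qv_summary(text: str):
    """Extract per-iteration QV values from summary text.
    The assumed line layout (e.g. 'iter_1  53.2') is hypothetical."""
    qvs = {}
    for line in text.splitlines():
        match = re.match(r"iter_(\d+)\s+([\d.]+)", line.strip())
        if match:
            qvs[int(match.group(1))] = float(match.group(2))
    return qvs

sample = "iter_1  52.8\niter_2  53.6\niter_3  53.9\n"
history = parse_qv_summary(sample)
print(max(history, key=history.get))  # iteration with the best QV -> 3
```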

Troubleshooting

GPU Issues

If DeepVariant fails with GPU errors:

  1. Verify NVIDIA drivers: nvidia-smi
  2. Check CUDA compatibility with DeepVariant version
  3. Ensure Singularity has --nv flag (automatically added by t2t-polish)
  4. Try CPU-only DeepVariant image (slower but more compatible)

Memory Issues

If jobs fail due to memory:

  1. Increase SLURM memory allocation (--mem=400g for large genomes)
  2. Use --meryl-memory to limit Meryl memory usage
  3. Clean up intermediate files with --cleanup

Tool Not Found Errors

Run diagnostics to check dependencies:

t2t-polish diagnostics

For missing tools:

  • Conda: Ensure environment is activated
  • Docker: Tools are pre-installed in container
  • Singularity: All tools except DeepVariant are included
  • Manual: Add tool paths to system PATH or modify t2t_polish/constants.py

Resume Issues

If resume fails:

  1. Check for corrupted BAM/VCF files (t2t-polish validates automatically)
  2. Verify <prefix>.kcov.json exists when using --optimized
  3. Use --resume-from N to skip specific steps
  4. Delete problematic iteration folder and restart

Container-Specific Issues

Docker:

  • Mount volumes correctly: -v $(pwd):/data
  • Provide absolute paths in container: /data/file.fasta

Singularity:

  • Bind mount paths: singularity exec -B /data:/data
  • Check file permissions on HPC shared filesystems

Deployment Recommendations

Choose the deployment method that best fits your environment:

Environment Recommended Method Notes
Local workstation with GPU Conda + pip install . Direct installation, easiest to customize
HPC cluster (SLURM/PBS) Singularity (from Dockerfile) Best for shared environments, reproducible
Cloud computing Docker (Dockerfile) Portable across cloud providers
Testing/Development Conda Fast iteration, easy debugging
Production pipelines Singularity or Docker Reproducible, version-controlled

Choosing Your Setup

Use Conda if:

  • You have admin/sudo access
  • You want to customize tool versions
  • You're developing or testing

Use Singularity if:

  • Running on HPC without root access
  • Need reproducible results across different systems
  • Want to isolate from system libraries

Use Docker if:

  • Running on cloud infrastructure
  • Have root/Docker access
  • Need maximum portability

Use SLURM script if:

  • Submitting to job scheduler
  • Need to queue long-running jobs
  • Running on shared HPC resources

Migrating from Version 3

If you're upgrading from the shell-based v3 pipeline:

Key Differences

Feature Version 3 Version 4
Implementation Bash script Python
Variant Caller Racon DeepVariant (GPU)
QV Assessment Manual post-processing Automatic (Merfin + Merqury)
Resume Limited Full checkpoint support
K-mer Coverage Manual calculation Automatic with --optimized
Input Format GZIP required GZIP not required
Logging Basic stdout Comprehensive with levels

Command Translation

v3 fullauto mode:

# Old (v3)
automated-polishing_v3.sh fullauto -d draft.fasta -r reads.fq.gz -s pb -t 32

# New (v4)
t2t-polish polish -d draft.fasta -r reads.fq \
    --singularity_sif deepvariant.sif --deepseq_type PACBIO \
    --optimized -t 32

v3 optimizedpolish mode:

# Old (v3)
automated-polishing_v3.sh optimizedpolish -d draft.fasta -r reads.fq.gz \
    -s pb --fitted_hist lookup_table.txt --peak 106.7

# New (v4)
t2t-polish polish -d draft.fasta -r reads.fq \
    --singularity_sif deepvariant.sif --deepseq_type PACBIO \
    --fitted_hist lookup_table.txt --ideal_dpeak 106.7

Notable Changes

  • No GZIP requirement: v4 accepts uncompressed FASTQ/FASTA
  • Automatic readmers: When using --optimized, readmers DB is computed automatically
  • Read corrector info: FASTA inputs can specify corrector type (hifiasm, herro, flye) for quality score assignment

Legacy Version 3

The bash-based version 3 is still available in the legacy/ directory for users who prefer the original implementation or don't have GPU access. See legacy/README.md for v3 documentation.

Future Roadmap

  • Multi-GPU support for DeepVariant
  • Integration with cloud computing platforms
  • Advanced QV visualization and reporting
  • Support for additional variant callers

T2T-CHM13 Original Resources

For exact command lines and workflows used to generate the T2T-CHM13v1.0 and T2T-CHM13v1.1 assemblies, please refer to the Methods section in the CHM13-Issues repo. Note that some of the tools have been updated since then; those updates are tracked in this repository.

This README contains details about applying the automated polishing on general genome assemblies using the latest version 4 implementation.

The original script used in Mc Cartney et al. (2022) has been substantially enhanced through version 3 (shell-based with Racon) and now version 4 (Python-based with DeepVariant). Each version represents significant improvements in speed, accuracy, and usability.
