T2T-Automated-Polishing
Fully automatic, iterative k-mer based polishing of genome assemblies.
The current version is unpublished. If any code shared in this repository is used, please cite the paper below, Arang Rhie's T2T-Polish Git repository, and this repository:
Mc Cartney AM, Shafin K, Alonge M et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods (2022) doi: https://doi.org/10.1038/s41592-022-01440-3
For further details on exact application to T2T-CHM13, read the corresponding section below.
Repository Contents
This repository includes everything needed to run the T2T polishing pipeline:
| File | Description |
|---|---|
| t2t_polish/ | Modular Python package (CLI, constants, runner, polishing, evaluation, k-cov) |
| pyproject.toml | PEP 517/518 project metadata and build configuration |
| Dockerfile | Docker container definition for portable deployment |
| tests/ | pytest test suite |
| CHANGELOG.md | Release history |
| CONTRIBUTING.md | Development and contribution guide |
| .github/workflows/ | CI, release, PyPI publish, and GHCR Docker publish workflows |
| legacy/ | Previous versions (v2, v3, v4 monolithic) and legacy conda/Singularity/SLURM files |
What's New in Version 4
T2T-Polish v4 represents a major reimplementation of the T2T polishing pipeline:
- Python-based: Complete rewrite in Python for improved speed, error handling, and maintainability
- DeepVariant Integration: Replaced Racon with GPU-accelerated DeepVariant for more accurate variant calling
- Automatic QV Assessment: Real-time quality evaluation at each iteration using both Merfin and Merqury
- Optimized K-mer Coverage: Automatic computation of optimal k-mer coverage using Jellyfish and GenomeScope2
- Resume Capability: Built-in checkpointing allows resuming from interrupted runs
- Enhanced Logging: Comprehensive logging with timestamped entries and detailed diagnostics
- Parallel Evaluation: Simultaneous Merfin and Merqury QV calculations for faster assessment
Description and Best Practices
Auto-Polisher launches an iterative process that allows for more precise k-mer based polishing than typical consensus-only methods. Meryl identifies unique k-mers throughout the assembly, and Winnowmap2 uses them to map reads. The resulting alignments are extensively filtered with FalconC, and candidate corrections are filtered with Merfin, to retain only the best base-level polishes. DeepVariant performs GPU-accelerated variant calling to identify corrections with high precision. Once corrections are applied, the process repeats, re-anchoring on the k-mers that the previous round of correction introduced into the assembly. Generally, base-level accuracy peaks at three iterations (the program default).
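The iteration described above can be sketched as an ordered step list. A hedged Python sketch (step numbering follows the CLI's --resume-from codes; names are illustrative, not the actual implementation):

```python
# Illustrative sketch of one polishing iteration's step order.
# Step numbers follow the CLI's --resume-from codes; the real
# implementation lives inside the t2t_polish package.
STEPS = {
    1: "meryl",        # build repetitive k-mer DB for Winnowmap2 seeding
    2: "winnowmap",    # map reads against the current draft
    3: "falconc",      # filter alignments
    4: "deepvariant",  # call candidate corrections
    5: "merfin",       # k-mer filter the variant calls
    6: "consensus",    # apply corrections via bcftools consensus
}

def iteration_plan(n_iterations: int = 3):
    """Yield (iteration, step_name) pairs for a whole run."""
    for it in range(1, n_iterations + 1):
        for step in sorted(STEPS):
            yield it, STEPS[step]

plan = list(iteration_plan(3))  # 3 iterations is the program default
```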
Genome assembly accuracy is automatically assessed at each iteration using both Merfin and Merqury, providing real-time QV (Quality Value) and completeness metrics. For final assessment, it is highly recommended to use a hybrid k-mer database filtered to k-mers with counts greater than one, to obtain the most accurate Merqury QV. The steps for doing this with Meryl and Merqury can be found here, as recommended by the developer, Arang Rhie. Using an incomplete Meryl DB to assess an auto-polished assembly can lead to inaccurate Merqury QV estimates.
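For reference, the QV reported by these tools is a Phred-scaled error rate, QV = -10 * log10(E). A small worked example:

```python
import math

def phred_qv(error_rate: float) -> float:
    """Phred-scaled quality value: QV = -10 * log10(error rate)."""
    return -10.0 * math.log10(error_rate)

# QV 40 corresponds to one error per 10,000 bases;
# QV 60 corresponds to one error per 1,000,000 bases.
qv40 = phred_qv(1e-4)
qv60 = phred_qv(1e-6)
```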
How to Run (Quick Start)
Version 4.1.2 can be installed from PyPI or directly from the GitHub repository. T2T-Polish is a Python package with the t2t-polish CLI entry point.
Basic Usage
# Install from the repository
pip install .
# Install from PyPI
pip install t2t-polish
# Run diagnostics to check dependencies
t2t-polish diagnostics
# Compute optimal k-mer coverage (recommended first step)
t2t-polish computekcov \
-r <reads.fastq> \
-o <output_prefix> \
-k 21 \
-t 32 \
--ploidy haploid
# Run polishing with optimized parameters (automatic k-cov calculation)
t2t-polish polish \
-d <draft.fasta> \
-r <reads.fastq> \
--singularity_sif <path/to/deepvariant.sif> \
--deepseq_type PACBIO \
-o AutoPolisher \
-t 32 \
-i 3 \
--optimized
# Or run polishing with manual k-mer coverage parameters
t2t-polish polish \
-d <draft.fasta> \
-r <reads.fastq> \
-m <readmers.meryl> \
--singularity_sif <path/to/deepvariant.sif> \
--deepseq_type PACBIO \
--fitted_hist <genomescope_output/lookup_table.txt> \
--ideal_dpeak 106.7 \
-o AutoPolisher \
-t 32 \
-i 3
Subcommands
t2t-polish has three subcommands:
- computekcov - Calculate optimal k-mer coverage and fitted histogram using Jellyfish + GenomeScope2
- polish - Run iterative polishing with DeepVariant-based variant calling
- diagnostics - Check tool availability, disk space, and Python packages
For detailed help:
t2t-polish --help
t2t-polish computekcov --help
t2t-polish polish --help
Quick Reference: Common Use Cases
Local workstation (with GPU):
t2t-polish polish -d draft.fasta -r reads.fq \
--singularity_sif deepvariant.sif --deepseq_type PACBIO \
--optimized -t 32 -i 3
HPC cluster (SLURM):
# See legacy/T2T_Polish_v4/APv4_SLURM.sh for an example submission script
sbatch my_polish_job.sh
Docker container:
docker run -v $(pwd):/data ghcr.io/pgrady1322/t2t-polish:latest polish \
-d /data/draft.fasta -r /data/reads.fq \
--singularity_sif /data/deepvariant.sif --optimized
Resume interrupted run:
t2t-polish polish -d draft.fasta -r reads.fq \
--singularity_sif deepvariant.sif --resume
Key Features
- GPU Acceleration: DeepVariant utilizes GPU resources via Singularity for faster variant calling
- Automatic Resume: Use the --resume flag to continue from interrupted runs
- Real-time QV Assessment: Both Merfin and Merqury evaluations run automatically after each iteration
- Flexible Input: Accepts FASTA or FASTQ reads (FASTQ recommended for HiFi; FASTA auto-converts with quality scores)
- DeepVariant Models: Supports WGS, WES, PACBIO, ONT_R104, and HYBRID_PACBIO_ILLUMINA model types
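The FASTA auto-conversion mentioned above assigns quality scores to corrected reads. A minimal sketch of the idea, assuming a uniform placeholder quality (the real per-corrector values are not specified here):

```python
def fasta_to_fastq(records, qual_char="I"):
    """Convert (name, sequence) pairs to FASTQ text, assigning a
    uniform per-base quality ('I' = Phred 40 in Sanger encoding).
    Illustrative only; the actual quality assigned depends on the
    read corrector (hifiasm, herro, flye)."""
    lines = []
    for name, seq in records:
        lines += [f"@{name}", seq, "+", qual_char * len(seq)]
    return "\n".join(lines) + "\n"

fq = fasta_to_fastq([("read1", "ACGT")])
```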
System Requirements
- GPU: Required for DeepVariant acceleration (NVIDIA GPU with CUDA support). Theoretically, this will work with CPU-enabled DeepVariant, but it has not been tested.
- RAM: Allocate sufficient memory for read processing (e.g., ~400GB for a Revio flow cell on mammalian genomes)
- Disk Space: Ensure adequate space for intermediate files and output
Dependencies
Core Dependencies
- Winnowmap2 - Read mapping with repeat-aware k-mer seeding
- FalconC - Alignment filtering (available in pbipa package)
- DeepVariant - GPU-accelerated variant calling (Singularity image required)
- Meryl v1.3+ - K-mer database operations
- Merfin v1.0+ - K-mer based variant filtering and QV evaluation
- Samtools - SAM/BAM manipulation
- BCFtools - VCF processing and consensus generation
Dependencies for K-mer Coverage Calculation
- Jellyfish - K-mer counting
- GenomeScope2 - K-mer histogram analysis and coverage estimation
QV Assessment
- Merqury - K-mer based assembly evaluation (merqury.sh must be on PATH)
Python Dependencies
- Python 3.10+
- pysam
- tqdm (for progress bars)
- Standard library: argparse, subprocess, concurrent.futures, logging, pathlib
Installation
PyPI (Python package only)
The simplest way to install the t2t-polish CLI and its Python dependencies:
pip install t2t-polish
This installs the Python package (pysam, tqdm) and the t2t-polish entry point. You are still responsible for installing the external bioinformatics tools listed in Dependencies (Winnowmap, Meryl, Merfin, DeepVariant, etc.) via Conda, modules, or from source.
Conda Environment (recommended for full stack)
A Conda/Mamba environment is the recommended way to get both the Python package and all external bioinformatics dependencies in one step:
# Create environment with external tools from bioconda
mamba create -n t2t-polish -c conda-forge -c bioconda \
python=3.11 meryl winnowmap samtools bcftools \
kmer-jellyfish genomescope2 pysam tqdm
conda activate t2t-polish
# Install t2t-polish CLI from PyPI (or from source with: pip install .)
pip install t2t-polish
# Verify everything is available
t2t-polish diagnostics
Tip: Use mamba instead of conda for much faster environment solves. Install it with conda install -n base -c conda-forge mamba.
A legacy YML file from the v4 monolithic release is also available at legacy/T2T_Polish_v4/APv4.yml for reference.
The environment should include:
- Core tools: Winnowmap, Meryl, Merfin, Samtools, BCFtools, FalconC (pb-falconc)
- K-mer analysis: Jellyfish, GenomeScope2
- Python packages: pysam, tqdm
- QV assessment: Merqury (merqury.sh on PATH)
Container-Based Installation
Docker
Pre-built images are published to the GitHub Container Registry on every release:
# Pull the latest release image from GHCR
docker pull ghcr.io/pgrady1322/t2t-polish:latest
# Or pull a specific version
docker pull ghcr.io/pgrady1322/t2t-polish:4.1.2
# Run with Docker
docker run -it --rm \
-v $(pwd)/data:/data \
ghcr.io/pgrady1322/t2t-polish:latest polish \
-d /data/draft.fasta \
-r /data/reads.fastq \
--singularity_sif /data/deepvariant.sif \
-o /data/AutoPolisher \
-t 32
You can also build the image locally from the Dockerfile:
docker build -t t2t-polish .
Note: The Docker container includes the t2t-polish pipeline and all dependencies except DeepVariant, which must be provided as a Singularity image.
Singularity
A legacy Singularity definition file is available at legacy/T2T_Polish_v4/APv4_Singularity.def for reference. You can adapt it or build from the Dockerfile:
# Build a Singularity image from GHCR
singularity build t2t-polish.sif docker://ghcr.io/pgrady1322/t2t-polish:latest
# Or build from a local Docker image
docker build -t t2t-polish .
singularity build t2t-polish.sif docker-daemon://t2t-polish:latest
# Run with Singularity
singularity exec t2t-polish.sif t2t-polish polish \
-d draft.fasta \
-r reads.fastq \
--singularity_sif deepvariant_gpu.sif \
-o AutoPolisher \
-t 32
The container includes:
- Complete t2t-polish environment with all dependencies
- Merfin v1.1 built from source
- Merqury for QV assessment
- Optimized for HPC cluster deployment
DeepVariant Singularity Image
DeepVariant must be run via a Singularity container with GPU support. Download the GPU-enabled image:
# Download the DeepVariant GPU Singularity image (set BIN_VERSION to the desired release)
singularity pull docker://google/deepvariant:"${BIN_VERSION}-gpu"
# Or specify a specific version
singularity pull docker://google/deepvariant:1.6.1-gpu
# Or build from Docker
singularity build deepvariant_gpu.sif docker://google/deepvariant:1.6.1-gpu
Note: Ensure your system has NVIDIA GPU drivers and CUDA toolkit installed for GPU acceleration.
HPC/SLURM Example
A legacy SLURM script is available at legacy/T2T_Polish_v4/APv4_SLURM.sh for reference. Here is an updated example:
#!/bin/bash
#SBATCH --job-name=t2t_polish
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 32
#SBATCH --partition=gpu
#SBATCH --qos=general
#SBATCH --mem=200g
#SBATCH -o %x_%j.stdout
#SBATCH -e %x_%j.stderr
t2t-polish polish \
-d draft.fasta \
-r reads.fastq \
--optimized \
--singularity_sif deepvariant_1.6.1-gpu.sif \
-t 32 \
-o AutoPolisher
Submit with:
sbatch my_polish_job.sh
Key SLURM considerations:
- Request GPU partition for DeepVariant acceleration
- Allocate sufficient memory (200GB+ for large genomes)
- Match the SLURM thread count (-c) with the t2t-polish thread count (-t)
- Use the --resume flag for long-running jobs that may time out
Manual Installation
If installing dependencies manually or using HPC modules, ensure all tools are available on PATH. You can verify dependencies by running:
t2t-polish diagnostics
This will check for all required tools and display their versions.
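A minimal sketch of what such a PATH check can look like (illustrative; the real diagnostics subcommand also reports versions and disk space):

```python
import shutil

# Tool names from the Dependencies section.
REQUIRED_TOOLS = [
    "winnowmap", "falconc", "meryl", "merfin",
    "samtools", "bcftools", "jellyfish", "genomescope2", "merqury.sh",
]

def missing_tools(tools):
    """Return the subset of tool names not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# A made-up name is always reported as missing.
missing = missing_tools(["no-such-tool-xyz"])
```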
Configuration
The pipeline automatically detects tools on PATH. If tools are installed in non-standard locations, you can modify the tool names in t2t_polish/constants.py:
# Tool names (modify if needed)
WINNOWMAP = "winnowmap"
FALCONC = "falconc"
MERYL = "meryl"
MERFIN = "merfin"
BCFTOOLS = "bcftools"
JELLYFISH = "jellyfish"
GENOMESCOPE = "genomescope2"
SAMTOOLS = "samtools"
MERQURY_SH = "merqury.sh"
These can be changed to full paths if necessary.
Advanced Features
Resume Capability
t2t-polish includes robust checkpoint/resume functionality. If a run is interrupted, you can resume from the last completed step:
# Resume from last checkpoint
t2t-polish polish \
-d <draft.fasta> \
-r <reads.fastq> \
--singularity_sif <deepvariant.sif> \
--resume
# Resume from a specific step (0=all, 1=Meryl, 2=Winnowmap, 3=FalconC, 4=DeepVariant, 5=Merfin, 6=Consensus)
t2t-polish polish \
-d <draft.fasta> \
-r <reads.fastq> \
--singularity_sif <deepvariant.sif> \
--resume \
--resume-from 4
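Conceptually, --resume-from N re-enters the iteration at step N and skips the earlier, completed steps. A toy sketch of that control flow (step names follow the mapping in the comment above; this is not the actual implementation):

```python
# Step codes as documented for --resume-from.
STEP_NAMES = {1: "meryl", 2: "winnowmap", 3: "falconc",
              4: "deepvariant", 5: "merfin", 6: "consensus"}

def steps_to_run(resume_from: int = 0):
    """Return step names still to execute; 0 means run everything."""
    start = resume_from if resume_from > 0 else 1
    return [STEP_NAMES[i] for i in sorted(STEP_NAMES) if i >= start]

remaining = steps_to_run(4)  # resume at the DeepVariant step
```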
Optimized Mode with Automatic K-mer Coverage
When using --optimized, the pipeline automatically:
- Computes optimal k-mer coverage using Jellyfish and GenomeScope2
- Generates fitted histogram for Merfin probability calculations
- Applies optimal parameters to Merfin polishing
- Saves coverage parameters to JSON for reliable resume
t2t-polish polish \
-d <draft.fasta> \
-r <reads.fastq> \
--singularity_sif <deepvariant.sif> \
--optimized \
--ploidy diploid \
-t 64
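Persisting the computed coverage parameters to JSON, as described, can be sketched like this; the field names are hypothetical, not the pipeline's actual schema:

```python
import json
import os
import tempfile

def save_kcov(path, kcov, fitted_hist):
    """Write coverage parameters to JSON (hypothetical field names)."""
    with open(path, "w") as fh:
        json.dump({"kcov": kcov, "fitted_hist": fitted_hist}, fh)

def load_kcov(path):
    """Read the parameters back on resume."""
    with open(path) as fh:
        return json.load(fh)

tmp = os.path.join(tempfile.mkdtemp(), "AutoPolisher.kcov.json")
save_kcov(tmp, 106.7, "genomescope_output/lookup_table.txt")
params = load_kcov(tmp)
```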
Quality Assessment
t2t-polish automatically runs both Merfin and Merqury evaluations in parallel after each iteration, providing:
- QV (Quality Value): Phred-scaled base accuracy
- Completeness: Percentage of expected k-mers present
- Per-iteration tracking: Monitor improvement across iterations
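Running the two evaluators at the same time is a natural fit for concurrent.futures; a hedged sketch with stand-in functions in place of the real Merfin and Merqury invocations:

```python
from concurrent.futures import ThreadPoolExecutor

def run_merfin(consensus):
    # Stand-in for the real Merfin evaluation call.
    return ("merfin", consensus)

def run_merqury(consensus):
    # Stand-in for the real Merqury evaluation call.
    return ("merqury", consensus)

def evaluate(consensus):
    """Launch both QV evaluations concurrently and collect results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(f, consensus) for f in (run_merfin, run_merqury)]
        return [f.result() for f in futures]

results = evaluate("iter_1.consensus.fasta")
```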
Results are written to:
- <prefix>.QV_Completeness_summary.txt - Final summary of all iterations
- Per-iteration Merfin outputs: <prefix>.iter_N.consensus.fasta.merfin_hist.txt
- Per-iteration Merqury outputs: <prefix>.iter_N.consensus_merqury.qv
Logging
Comprehensive logging is built-in:
# Enable detailed logging to file
t2t-polish polish \
-d <draft.fasta> \
-r <reads.fastq> \
--singularity_sif <deepvariant.sif> \
--log-file polishing.log
# Quiet mode (warnings only to console)
t2t-polish polish \
-d <draft.fasta> \
-r <reads.fastq> \
--singularity_sif <deepvariant.sif> \
--quiet
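Timestamped file logging combined with a quieter console can be assembled from the standard logging module; a sketch (the handler setup and format string are illustrative, not the pipeline's exact configuration):

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "polishing.log")

logger = logging.getLogger("t2t_polish_demo")
logger.setLevel(logging.DEBUG)

# Detailed, timestamped entries go to the log file...
fh = logging.FileHandler(log_path)
fh.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(fh)

# ...while the console only shows warnings (the --quiet behavior).
ch = logging.StreamHandler()
ch.setLevel(logging.WARNING)
logger.addHandler(ch)

logger.info("iteration 1: starting Winnowmap")   # file only
logger.warning("low disk space")                 # file and console
fh.flush()
log_text = open(log_path).read()
```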
Cleanup Options
Control intermediate file retention:
# Automatically clean up intermediate files after each iteration
t2t-polish polish \
-d <draft.fasta> \
-r <reads.fastq> \
--singularity_sif <deepvariant.sif> \
--cleanup
Output Files
t2t-polish generates organized output with clear naming:
<prefix>.iter_0.consensus.fasta # Initial draft (iteration 0)
<prefix>.iter_1/ # Iteration 1 working directory
├── iter_1.repet_k15.meryl/ # Repetitive k-mer database
├── iter_1.winnowmap.sorted.bam # Aligned reads
├── iter_1.falconc.sorted.bam # Filtered alignments
├── iter_1.deepvariant.vcf.gz # DeepVariant variants
├── iter_1.meryl_db/ # Draft k-mer database
├── iter_1.merfin.polish.vcf # Merfin-filtered variants
└── iter_1.consensus.fasta # Polished consensus
<prefix>.iter_2/ # Iteration 2 working directory
└── ...
<prefix>.QV_Completeness_summary.txt # Final QV/completeness summary
<prefix>.tool_versions.txt # Tool versions used
<prefix>.kcov.json # K-mer coverage parameters (if --optimized)
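Capturing tool versions, as written to <prefix>.tool_versions.txt, typically means invoking each tool with --version; a sketch using the Python interpreter as a stand-in tool:

```python
import subprocess
import sys

def tool_version(cmd):
    """Capture a tool's version string from its --version output.
    Some tools print the version to stderr instead of stdout."""
    out = subprocess.run(cmd, capture_output=True, text=True)
    return (out.stdout or out.stderr).strip()

# The interpreter stands in for a pipeline tool here.
version_line = tool_version([sys.executable, "--version"])
```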
Troubleshooting
GPU Issues
If DeepVariant fails with GPU errors:
- Verify NVIDIA drivers: nvidia-smi
- Check CUDA compatibility with the DeepVariant version
- Ensure Singularity has the --nv flag (added automatically by t2t-polish)
- Try a CPU-only DeepVariant image (slower but more compatible)
Memory Issues
If jobs fail due to memory:
- Increase the SLURM memory allocation (--mem=400g for large genomes)
- Use --meryl-memory to limit Meryl memory usage
- Clean up intermediate files with --cleanup
Tool Not Found Errors
Run diagnostics to check dependencies:
t2t-polish diagnostics
For missing tools:
- Conda: Ensure environment is activated
- Docker: Tools are pre-installed in container
- Singularity: All tools except DeepVariant are included
- Manual: Add tool paths to the system PATH or modify t2t_polish/constants.py
Resume Issues
If resume fails:
- Check for corrupted BAM/VCF files (t2t-polish validates automatically)
- Verify <prefix>.kcov.json exists when using --optimized
- Use --resume-from N to skip completed steps
- Delete the problematic iteration folder and restart
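A first-pass integrity check like the automatic one mentioned above can start with magic bytes: BAM and bgzipped VCF files are gzip-framed, so they begin with 0x1f 0x8b. A minimal sketch:

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"

def looks_gzip(path):
    """True if the file starts with the gzip magic bytes
    (BAM and bgzipped VCF both do)."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC

workdir = tempfile.mkdtemp()

# A real bgzipped/gzipped VCF passes the check...
vcf_gz = os.path.join(workdir, "calls.vcf.gz")
with gzip.open(vcf_gz, "wt") as fh:
    fh.write("##fileformat=VCFv4.2\n")
ok = looks_gzip(vcf_gz)

# ...while a plain-text file does not.
vcf_plain = os.path.join(workdir, "calls.vcf")
with open(vcf_plain, "w") as fh:
    fh.write("##fileformat=VCFv4.2\n")
bad = looks_gzip(vcf_plain)
```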
Container-Specific Issues
Docker:
- Mount volumes correctly: -v $(pwd):/data
- Provide absolute paths inside the container: /data/file.fasta
Singularity:
- Bind mount paths: singularity exec -B /data:/data
- Check file permissions on HPC shared filesystems
Deployment Recommendations
Choose the deployment method that best fits your environment:
| Environment | Recommended Method | Notes |
|---|---|---|
| Local workstation with GPU | Conda + pip install . | Direct installation, easiest to customize |
| HPC cluster (SLURM/PBS) | Singularity (from Dockerfile) | Best for shared environments, reproducible |
| Cloud computing | Docker (Dockerfile) | Portable across cloud providers |
| Testing/Development | Conda | Fast iteration, easy debugging |
| Production pipelines | Singularity or Docker | Reproducible, version-controlled |
Choosing Your Setup
Use Conda if:
- You have admin/sudo access
- You want to customize tool versions
- You're developing or testing
Use Singularity if:
- Running on HPC without root access
- Need reproducible results across different systems
- Want to isolate from system libraries
Use Docker if:
- Running on cloud infrastructure
- Have root/Docker access
- Need maximum portability
Use SLURM script if:
- Submitting to job scheduler
- Need to queue long-running jobs
- Running on shared HPC resources
Migrating from Version 3
If you're upgrading from the shell-based v3 pipeline:
Key Differences
| Feature | Version 3 | Version 4 |
|---|---|---|
| Implementation | Bash script | Python |
| Variant Caller | Racon | DeepVariant (GPU) |
| QV Assessment | Manual post-processing | Automatic (Merfin + Merqury) |
| Resume | Limited | Full checkpoint support |
| K-mer Coverage | Manual calculation | Automatic with --optimized |
| Input Format | GZIP required | GZIP not required |
| Logging | Basic stdout | Comprehensive with levels |
Command Translation
v3 fullauto mode:
# Old (v3)
automated-polishing_v3.sh fullauto -d draft.fasta -r reads.fq.gz -s pb -t 32
# New (v4)
t2t-polish polish -d draft.fasta -r reads.fq \
--singularity_sif deepvariant.sif --deepseq_type PACBIO \
--optimized -t 32
v3 optimizedpolish mode:
# Old (v3)
automated-polishing_v3.sh optimizedpolish -d draft.fasta -r reads.fq.gz \
-s pb --fitted_hist lookup_table.txt --peak 106.7
# New (v4)
t2t-polish polish -d draft.fasta -r reads.fq \
--singularity_sif deepvariant.sif --deepseq_type PACBIO \
--fitted_hist lookup_table.txt --ideal_dpeak 106.7
Notable Changes
- No GZIP requirement: v4 accepts uncompressed FASTQ/FASTA
- Automatic readmers: When using --optimized, the readmers DB is computed automatically
- Read corrector info: FASTA inputs can specify a corrector type (hifiasm, herro, flye) for quality-score assignment
Legacy Version 3
The bash-based version 3 is still available in the legacy/ directory for users who prefer the original implementation or don't have GPU access. See legacy/README.md for v3 documentation.
Future Roadmap
- Multi-GPU support for DeepVariant
- Integration with cloud computing platforms
- Advanced QV visualization and reporting
- Support for additional variant callers
T2T-CHM13 Original Resources
For exact command lines and workflows used to generate the T2T-CHM13v1.0 and T2T-CHM13v1.1 assemblies, please refer to the Methods section in the CHM13-Issues repo. Note that some of the tools have been updated since then, and are tracked on this repo.
This README contains details about applying the automated polishing on general genome assemblies using the latest version 4 implementation.
The original script used in Mc Cartney et al. (2022) has been substantially enhanced through version 3 (shell-based with Racon) and now version 4 (Python-based with DeepVariant). Each version represents significant improvements in speed, accuracy, and usability.