
Assemble contigs into a chromosome-scale pseudoassembly using alignments to a reference sequence. Download the GitHub repository for helper scripts that automate GPatch workflows, identify and correct misjoins in the contig assembly, produce dot-plots of patched pseudoassemblies against a reference assembly, and generate chrom.sizes files and liftover chains for patched pseudoassemblies.

Project description

GPatch

Assemble contigs into a chromosome-scale pseudoassembly using alignments to a reference sequence.

Starting with alignments of contigs to a reference genome, produce a chromosome-scale pseudoassembly by patching gaps between mapped contigs with sequences from the reference.
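As a rough illustration of the patching idea described above, the following toy sketch joins placed contigs and fills the gaps between them with reference sequence. It is not GPatch's implementation: the `patch_chromosome` function, the `(ref_start, ref_end, contig_sequence)` placement tuples, and the choice to lowercase patch sequence (purely so patches are visible in the output) are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical toy model: (ref_start, ref_end, contig_sequence),
# with 0-based, half-open reference coordinates.
Placement = Tuple[int, int, str]


def patch_chromosome(reference: str,
                     placements: List[Placement],
                     extend_ends: bool = True) -> str:
    """Join placed contigs, filling gaps between them with reference sequence.

    Patches are lowercased only to make them easy to spot in this example.
    """
    pieces = []
    cursor = 0  # next reference position not yet covered
    for i, (start, end, contig) in enumerate(sorted(placements)):
        if start < cursor:
            raise ValueError("placements must not overlap")
        # The patch before the first contig is only added when extending ends.
        if i > 0 or extend_ends:
            pieces.append(reference[cursor:start].lower())
        pieces.append(contig)
        cursor = end
    if extend_ends:
        # Patch from the last contig to the end of the reference chromosome.
        pieces.append(reference[cursor:].lower())
    return "".join(pieces)


if __name__ == "__main__":
    ref = "AAAACCCCGGGGTTTT"
    placed = [(4, 8, "CCTCC"), (10, 13, "GGG")]
    print(patch_chromosome(ref, placed))
    print(patch_chromosome(ref, placed, extend_ends=False))
```

Setting `extend_ends=False` in this sketch loosely mirrors what the --no_extend option is described as doing: the leading and trailing reference patches are left out.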

Dependencies

We recommend using minimap2 for alignment, using the -a option to generate SAM output.

Installation

We recommend installing with conda, into a new environment:

conda create -n GPatch -c conda-forge -c bioconda Bio pysam minimap2 samtools GPatch

Install with pip:

pip install GPatch

Installation from the GitHub repository is not recommended. However, if you must, follow the steps below:

  1. git clone https://github.com/adadiehl/GPatch
  2. cd GPatch/
  3. python3 -m pip install -e .

Usage

GPatch.py [-h] -q SAM/BAM -r FASTA [-x STR] [-b FILENAME] [-m N] [-w PATH] [-d] [-s] [-l N] [-e] [-k]

Starting with alignments of contigs to a reference genome, produce a chromosome-scale pseudoassembly by patching gaps between mapped contigs with sequences from the reference.

By default, reference chromosomes with no mapped contigs are printed to output unchanged. Use the --drop_missing option to disable this behavior.

By default, patches are applied to the 5' and 3' telomere ends of pseudochromosomes if the first and last contig alignments do not extend to the start/end of the reference chromosome. In some cases, this may cause spurious duplications. Use the --no_extend option if this is a concern.

Note that GPatch is designed to be run on single-haplotype or unphased genome assemblies. For phased assemblies, each haplotype should be separated into its own input FASTA file prior to alignment. GPatch can then be run separately on the BAM files for each haplotype to obtain phased pseudoassemblies. Running GPatch on alignments that mix haplotypes gives unpredictable and likely incorrect results.
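For phased assemblies, the haplotype separation step mentioned above could be sketched as follows. This is only an illustration: the `split_fasta_by_haplotype` helper and the `suffix_label` naming convention (record names ending in `_hap1`, `_hap2`, and so on) are assumptions, and real assemblies may label haplotypes differently.

```python
from typing import Callable, Dict, Iterable, List


def split_fasta_by_haplotype(
    fasta_lines: Iterable[str],
    haplotype_of: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group FASTA lines by the haplotype label of the record they belong to."""
    groups: Dict[str, List[str]] = {}
    current = None
    for line in fasta_lines:
        if line.startswith(">"):
            # Record name is the first whitespace-delimited token after ">".
            name = line[1:].split()[0]
            current = groups.setdefault(haplotype_of(name), [])
        if current is not None:
            current.append(line)
    return groups


def suffix_label(name: str) -> str:
    """Hypothetical naming convention: 'ctg12_hap1' -> 'hap1'."""
    return name.rsplit("_", 1)[-1]


if __name__ == "__main__":
    lines = [">ctg1_hap1\n", "ACGT\n", ">ctg1_hap2\n", "ACGA\n"]
    for label, records in split_fasta_by_haplotype(lines, suffix_label).items():
        # In practice each group would be written to its own FASTA file.
        print(label, "".join(records), sep="\n")
```

Each group would then be aligned and patched on its own, giving one pseudoassembly per haplotype.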

Required Arguments

| Argument | Description |
|---|---|
| -q SAM/BAM, --query_bam SAM/BAM | Path to SAM/BAM file containing non-overlapping contig mappings to the reference genome. |
| -r FASTA, --reference_fasta FASTA | Path to reference genome fasta. |
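A minimal end-to-end run pairs a minimap2 alignment with a GPatch call. The sketch below only builds and prints the two command lines, using flags taken from the usage line above; the helper function names and the file names (`reference.fa`, `contigs.fa`, `contigs.sam`, the `sample1.` prefix) are placeholders invented for this example.

```python
import shlex
from typing import List, Optional


def minimap2_command(reference_fasta: str, contigs_fasta: str) -> List[str]:
    """Align contigs to the reference; -a requests SAM output on stdout."""
    return ["minimap2", "-a", reference_fasta, contigs_fasta]


def gpatch_command(query_bam: str,
                   reference_fasta: str,
                   prefix: Optional[str] = None,
                   min_qual_score: Optional[int] = None) -> List[str]:
    """Build a GPatch.py call from the two required arguments plus -x and -m."""
    cmd = ["GPatch.py", "-q", query_bam, "-r", reference_fasta]
    if prefix is not None:
        cmd += ["-x", prefix]
    if min_qual_score is not None:
        cmd += ["-m", str(min_qual_score)]
    return cmd


if __name__ == "__main__":
    # Placeholder file names; substitute your own paths.
    align = minimap2_command("reference.fa", "contigs.fa")
    patch = gpatch_command("contigs.sam", "reference.fa", prefix="sample1.")
    print(shlex.join(align), "> contigs.sam")
    print(shlex.join(patch))
```

The printed lines can be pasted into a shell, or the argument lists can be passed to `subprocess.run` with the minimap2 output redirected to the SAM file.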

Optional Arguments

| Argument | Description |
|---|---|
| -h, --help | Show this help message and exit. |
| -x STR, --prefix STR | Prefix to add to output file names. Default=None |
| -b FILENAME, --store_final_bam FILENAME | Store the final set of primary contig alignments to the given file name. Default: Do not store the final BAM. |
| -m N, --min_qual_score N | Minimum mapping quality score to retain an alignment. Default=30 |
| -w PATH, --whitelist PATH | Path to BED file containing whitelist regions: i.e., the inverse of blacklist regions. Supplying this will have the effect of excluding alignments that fall entirely within blacklist regions. Default=None |
| -d, --drop_missing | Omit unpatched reference chromosome records from the output if no contigs map to them. Default: Unpatched chromosomes are printed to output unchanged. |
| -s, --scaffold_only | Pad gaps between placed contigs with strings of N characters instead of patching with sequence from the reference assembly. Effectively turns GPatch into a reference-guided scaffolding tool. Note that patches.bed will still be generated to document (inverse) mapped contig boundaries in reference frame. |
| -l N, --gap_length N | Length of "N" gaps separating placed contigs when using --scaffold_only. Has no effect when in default patching mode. Default=Estimate gap length from alignment. |
| -e, --no_extend | Do not patch telomere ends of pseudochromosomes with reference sequence upstream of the first mapped contig and downstream of the last mapped contig. Default is to include 5' and 3' patches to extend telomeres to the ends implied by the alignment. |
| -k, --keep_nested | Do not drop contigs with mapped positions nested entirely inside other mapped contigs. Instead, these will be bookended after the contig in which they are nested. Default is to drop contigs with mapped positions nested entirely within other mapped contigs. This option should be used with caution, as these mappings cannot be placed unambiguously relative to other mapped contigs, so including them is likely to lead to unpredictable and possibly incorrect results. Do not use this unless you are sure you know what you are doing! |
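Because --whitelist expects the inverse of a blacklist, a blacklist BED has to be complemented against the reference chromosome sizes first. The sketch below shows one way to do that in plain Python; the `complement_bed` function and its in-memory interval tuples are assumptions for illustration, with 0-based half-open BED coordinates.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def complement_bed(blacklist: List[Interval],
                   chrom_sizes: Dict[str, int]) -> List[Interval]:
    """Return the regions of each chromosome not covered by the blacklist."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))

    whitelist: List[Interval] = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # start of the next candidate whitelist region
        for start, end in sorted(by_chrom.get(chrom, [])):
            if start > cursor:
                whitelist.append((chrom, cursor, start))
            # max() handles overlapping or nested blacklist intervals.
            cursor = max(cursor, end)
        if cursor < size:
            whitelist.append((chrom, cursor, size))
    return whitelist


if __name__ == "__main__":
    blacklist = [("chr1", 10, 20), ("chr1", 15, 30), ("chr1", 50, 60)]
    for chrom, start, end in complement_bed(blacklist, {"chr1": 100, "chr2": 40}):
        print(f"{chrom}\t{start}\t{end}")
```

For real data, `bedtools complement` performs the same operation given a sorted BED file and a genome file.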

Output

GPatch produces three output files:

| File | Description |
|---|---|
| patched.fasta | The final patched genome. |
| contigs.bed | Location of contigs in the coordinate frame of the patched genome. |
| patches.bed | Location of patches in the coordinate frame of the reference genome. |
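One quick check on the output is to total how many bases of each pseudochromosome come from placed contigs. The sketch below assumes that contigs.bed follows the standard BED layout, with chromosome, start, and end in the first three whitespace-separated columns; the `placed_bases_per_chrom` function is a hypothetical helper, not part of GPatch.

```python
from collections import defaultdict
from typing import Dict, Iterable


def placed_bases_per_chrom(bed_lines: Iterable[str]) -> Dict[str, int]:
    """Sum interval lengths per chromosome from BED-formatted lines."""
    totals: Dict[str, int] = defaultdict(int)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        totals[chrom] += int(end) - int(start)
    return dict(totals)


if __name__ == "__main__":
    example = [
        "chr1\t0\t100\tctgA\n",
        "chr1\t150\t400\tctgB\n",
        "chr2\t10\t20\tctgC\n",
    ]
    for chrom, bases in placed_bases_per_chrom(example).items():
        print(chrom, bases)
```

To use it on real output, pass an open file handle, for example `placed_bases_per_chrom(open("contigs.bed"))`, adjusting the file name for any prefix given with -x.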

Helper Scripts

The scripts directory contains helper scripts for running GPatch and working with its output. A shell script automates a two-stage patching process: initial alignment and patching, misjoin breakpoint prediction and contig-breaking, then realignment and patching of the split contigs, with dot-plots against the reference assembly created after both patching stages. Additional scripts generate a chrom.sizes file for a patched pseudoassembly and a set of liftover chains that can be used to translate features mapped to the unpatched contigs to the patched pseudoassembly. Please see the README.md within the scripts directory for detailed usage information.
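For readers who only need a chrom.sizes table and do not have the repository at hand, the idea can be sketched in a few lines of Python. This is not the helper script from the scripts directory; the `chrom_sizes` and `format_chrom_sizes` functions are assumptions for illustration, and the sketch expects every FASTA header to carry a non-empty name.

```python
from typing import Dict, Iterable


def chrom_sizes(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return sequence lengths keyed by record name, in file order."""
    sizes: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        if line.startswith(">"):
            # Record name is the first whitespace-delimited token after ">".
            name = line[1:].split()[0]
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line.strip())
    return sizes


def format_chrom_sizes(sizes: Dict[str, int]) -> str:
    """Render sizes as tab-separated 'name<TAB>length' lines."""
    return "".join(f"{name}\t{length}\n" for name, length in sizes.items())


if __name__ == "__main__":
    example = [">chr1 patched\n", "ACGT\n", "AC\n", ">chr2\n", "NNNN\n"]
    print(format_chrom_sizes(chrom_sizes(example)), end="")
```

Reading line by line keeps memory use low even for chromosome-scale records, since only the running length is stored.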

Citing GPatch

Please use the following citation if you use this software in your work:

Adam Diehl, Alan Boyle. Fast and Accurate Draft Genome Patching with GPatch. bioRxiv 2025.05.22.655567; doi: https://doi.org/10.1101/2025.05.22.655567
