Standalone Python implementation of the POLCA polisher from MaSuRCA

pypolca

pypolca is a standalone Python reimplementation of the POLCA polisher from the MaSuRCA genome assembly and analysis toolkit.

Quick Start

# creates conda environment with pypolca 
conda create -n pypolca_env pypolca

# activates conda environment
conda activate pypolca_env

# runs pypolca
pypolca run -a <genome> -1 <R1 short reads file> -2 <R2 short reads file> -t <threads> -o <output directory> 

Description

pypolca is a Python reimplementation of the POLCA polisher from the MaSuRCA genome assembly and analysis toolkit. It was made for inclusion in the hybrid bacterial genome assembly tool hybracter.

It was written for a number of reasons:

  • MaSuRCA is only available on Linux, not on macOS.
  • The original polca.sh script from MaSuRCA was difficult to use because you could not specify an output directory. Additionally, due to its shell implementation, both FASTQ read files had to be passed together as a single string.
  • To use polca.sh, you need to install the entire MaSuRCA assembly toolkit.
  • POLCA is recommended for polishing long-read-only bacterial assemblies (see this paper), and I wanted to make it available on both macOS and Linux in my assembly tool hybracter.

Note: I neither guarantee nor desire that pypolca will give identical results to POLCA implemented in MaSuRCA. This is because of the different versions of freebayes that might be used as a dependency. I have decided to use the newest version of freebayes possible rather than the version installed with MaSuRCA. Testing is ongoing, but I doubt there will be many differences between pypolca and POLCA.

Note: if you really want to replicate POLCA, the latest versions of MaSuRCA use freebayes v1.3.1-dirty.

To enforce this freebayes version:

mamba create -n pypolca_env pypolca freebayes==1.3.1

Note of Caution for Large (e.g. Eukaryotic) Genomes

  • I have implemented pypolca predominantly for the use case of polishing long-read bacterial genome assemblies with short reads. Therefore, I decided not to implement the batched multiprocessing of freebayes included in POLCA, because it was a lot of work for no benefit for most bacterial genomes.
  • However, this is certainly not true for larger genomes such as those of eukaryotic organisms. pypolca will likely be much slower than POLCA for such genomes if you run both with more than one thread.
  • I do not intend to implement multiprocessing, but if someone wants to, feel free to make a PR.
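
For anyone considering such a PR, the general idea of batching can be sketched as follows. This is only an illustration, not how POLCA does it: the function names and the chunk size are hypothetical, and the region strings assume the 0-based, end-exclusive `chrom:start-end` form accepted by the freebayes `--region` option, which should be checked against the freebayes documentation for your version.

```python
def contig_lengths(fasta_lines):
    """Map contig name -> sequence length from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            header = line[1:].split()
            name = header[0] if header else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def make_regions(lengths, chunk_size=100_000):
    """Split every contig into consecutive regions of at most chunk_size bases.

    Regions are formatted as 'contig:start-end' (assumed 0-based, end-exclusive).
    """
    regions = []
    for name, length in lengths.items():
        for start in range(0, length, chunk_size):
            end = min(start + chunk_size, length)
            regions.append(f"{name}:{start}-{end}")
    return regions
```

Each region could then be handed to a separate freebayes process and the resulting VCF files merged, which is the part that takes real work to get right.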

Installation

Installation from conda is recommended as this will install all non-python dependencies.

Conda

pypolca will soon be available on bioconda.

conda install -c bioconda pypolca

Pip

You can also install the Python components of pypolca with pip.

pip install pypolca

Source

Alternatively, the development version of pypolca can be installed manually from GitHub.

git clone https://github.com/gbouras13/pypolca.git
cd pypolca
pip install -e .
pypolca -h

If you have installed pypolca with pip or from source, you will then need to install the external dependencies separately. These are listed in build/environment.yaml.
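
After a pip or source install, a quick check like the sketch below can confirm that the external tools are visible on your PATH. The tool list here is an assumption (freebayes and samtools are mentioned in this README; bwa is assumed from the original POLCA workflow), so treat build/environment.yaml as the authoritative list.

```python
import shutil

# Assumed tool names; the authoritative list is build/environment.yaml.
ASSUMED_TOOLS = ("bwa", "samtools", "freebayes")


def find_missing_tools(tools=ASSUMED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All assumed external tools were found on PATH.")
```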

Usage

Usage: pypolca [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  citation  Print the citations for polca
  run       Python implementation of the POLCA polisher from MaSuRCA

Usage: pypolca run [OPTIONS]

  Python implementation of the POLCA polisher from MaSuRCA

Options:
  -h, --help               Show this message and exit.
  -V, --version            Show the version and exit.
  -a, --assembly PATH      Path to assembly contigs or scaffolds.  [required]
  -1, --reads1 PATH        Path to polishing reads R1 FASTQ. Can be FASTQ or
                           FASTQ gzipped. Required.  [required]
  -2, --reads2 PATH        Path to polishing reads R2 FASTQ. Can be FASTQ or
                           FASTQ gzipped. Optional. Only use -1 if you have
                           single end reads.
  -t, --threads INTEGER    Number of threads.  [default: 1]
  -o, --output PATH        Output directory path  [default: output_polca]
  -f, --force              Force overwrites the output directory
  -n, --no_polish          do not polish, just create vcf file, evaluate the
                           assembly and exit
  -m, --memory_limit TEXT  Memory per thread to use in samtools sort, set to
                           2G or more for large genomes  [default: 2G]
  -p, --prefix TEXT        prefix for output files  [default: polca]

The polished output FASTA will be {prefix}_corrected.fasta in the specified output directory, and the POLCA report will be the text file {prefix}.report.
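
If you call pypolca from a script, a small wrapper along these lines may help. It is a sketch only: the helper names are hypothetical, it uses only the options shown in the help text above, and it assumes the report is written to the same output directory as the polished FASTA.

```python
from pathlib import Path


def build_pypolca_command(assembly, reads1, reads2=None, threads=1,
                          output="output_polca", prefix="polca", force=False):
    """Build the argument list for `pypolca run` from the documented options."""
    cmd = ["pypolca", "run", "-a", str(assembly), "-1", str(reads1)]
    if reads2 is not None:
        # -2 is optional; leave it out for single-end reads.
        cmd += ["-2", str(reads2)]
    cmd += ["-t", str(threads), "-o", str(output), "-p", prefix]
    if force:
        cmd.append("-f")
    return cmd


def expected_outputs(output="output_polca", prefix="polca"):
    """Return the expected (polished FASTA, report) paths for a run."""
    out = Path(output)
    return out / f"{prefix}_corrected.fasta", out / f"{prefix}.report"
```

The returned list can be passed to subprocess.run(cmd, check=True), after which expected_outputs gives the paths to look for.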

Citation

Please cite pypolca in your paper using:

Bouras G, Zimin AV (2023) pypolca: Standalone Python reimplementation of the genome polishing tool POLCA. https://github.com/gbouras13/pypolca.

Zimin AV, Salzberg SL (2020) The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS Comput Biol 16(6): e1007981. https://doi.org/10.1371/journal.pcbi.1007981.
