Generate target-enrichment probes from a set of genome assemblies.

proBait

proBait is a tool for designing baits (probes) for target enrichment (target capture) experiments on bacterial pathogens. proBait starts by shredding a set of reference genome assemblies (in FASTA format) to create an initial set of baits. It then follows an iterative mapping approach based on minimap2: the bait set is mapped against the input genome assemblies to determine the genomic regions not covered by any bait, and new baits are generated to cover those regions according to the parameters set by the user. proBait also includes options to cluster the generated bait set with MMseqs2 to reduce redundancy, and to remove baits that are similar to a host or contaminant by mapping the generated bait set against the host and contaminant sequences.
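The shredding step can be pictured as a sliding window over each reference sequence. The sketch below is only an illustration of that idea, not proBait's actual code: the function name `shred_sequence` is hypothetical, and dropping a trailing window that is shorter than the bait size is an assumption.

```python
def shred_sequence(sequence, bait_size, bait_offset):
    """Yield (start, bait) pairs for every full-length window.

    Hypothetical helper: bait_size and bait_offset mirror the meaning of the
    --bait-size and --bait-offset options. Windows that would run past the
    end of the sequence are skipped (an assumption, not confirmed behaviour).
    """
    if bait_size <= 0 or bait_offset <= 0:
        raise ValueError("bait_size and bait_offset must be positive")
    last_start = len(sequence) - bait_size
    for start in range(0, last_start + 1, bait_offset):
        yield start, sequence[start:start + bait_size]


# Toy example: a 10-base sequence shredded into 4-base baits every 3 bases.
baits = list(shred_sequence("ACGTACGTAC", bait_size=4, bait_offset=3))
```

With an offset smaller than the bait size, consecutive baits overlap; with an offset equal to the bait size, they tile the sequence end to end.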

Installation

We recommend creating a separate environment to install proBait and its dependencies.

Pip

pip3 install probait

Source

After cloning this repository, change into the proBait directory and run:

pip install .

Python dependencies

  • biopython>=1.83
  • plotly>=5.22.0
  • pandas>=1.5.3
  • datapane>=0.17.0

These dependencies are installed automatically by either of the installation methods above.

Other dependencies

proBait relies on the external tools mentioned above, minimap2 (mapping) and MMseqs2 (clustering). These dependencies are not installed automatically. Please install them in the environment you are working in.
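Before running proBait, it can help to confirm that the external tools are reachable from the active environment. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical, and the executable names `minimap2` and `mmseqs` are assumed to match how those tools are installed on your system.

```python
import shutil


def missing_tools(executables):
    """Return the executables that cannot be found on the current PATH."""
    return [name for name in executables if shutil.which(name) is None]


# Assumed executable names for minimap2 and MMseqs2.
not_found = missing_tools(["minimap2", "mmseqs"])
if not_found:
    print("Not found on PATH:", ", ".join(not_found))
```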

Usage

usage: proBait.py [-h] -i INPUT_FILES -o OUTPUT_DIRECTORY [-gb] [-b BAITS] [-bp BAIT_PROPORTION] [-rf REFS] [-msl MINIMUM_SEQUENCE_LENGTH]
                  [-bs BAIT_SIZE] [-bo BAIT_OFFSET] [-bi BAIT_IDENTITY] [-bc BAIT_COVERAGE] [-mr MINIMUM_REGION] [-me MINIMUM_EXACT_MATCH]
                  [-c] [-ci CLUSTER_IDENTITY] [-cc CLUSTER_COVERAGE] [-e EXCLUDE] [-ep EXCLUDE_PIDENT] [-ec EXCLUDE_COVERAGE] [-t THREADS]
                  [-r] [-ri REPORT_IDENTITIES [REPORT_IDENTITIES ...]] [-rc REPORT_COVERAGES [REPORT_COVERAGES ...]] [-tsv]

Purpose
-------

Generate baits for target capture experiments.

Code documentation
------------------

options:
  -h, --help            show this help message and exit
  -i INPUT_FILES, --input-files INPUT_FILES
                        Path to the directory that contains the input FASTA files.
  -o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY
                        Path to the output directory that will be created to store output files. The process will exit if the directory
                        already exists.
  -gb, --generate-baits
                        Pass this parameter to generate baits based on the sequences in the input files.
  -b BAITS, --baits BAITS
                        Path to a FASTA file with baits generated on a previous run.
  -bp BAIT_PROPORTION, --bait-proportion BAIT_PROPORTION
                        Path to a TSV file with data about bait proportion to use when evaluating bait performance and generating the
                        reports.
  -rf REFS, --refs REFS
                        Path to a file with the basenames of the input FASTA files, one basename per line, that will be used as references
                        to create the initial set of baits. The references are shredded into baits according to the --bait-size and
                        --bait-offset values.
  -msl MINIMUM_SEQUENCE_LENGTH, --minimum-sequence-length MINIMUM_SEQUENCE_LENGTH
                        Do not generate baits for sequences shorter than this value.
  -bs BAIT_SIZE, --bait-size BAIT_SIZE
                        The length of the baits in bases. All the baits that are generated during the process have a size equal to the
                        passed value.
  -bo BAIT_OFFSET, --bait-offset BAIT_OFFSET
                        Start position offset between consecutive baits.
  -bi BAIT_IDENTITY, --bait-identity BAIT_IDENTITY
                        Minimum percent identity to accept an alignment between a bait and a region of an input sequence.
  -bc BAIT_COVERAGE, --bait-coverage BAIT_COVERAGE
                        Minimum proportion of a bait that must align to accept an alignment.
  -mr MINIMUM_REGION, --minimum-region MINIMUM_REGION
                        The process will only generate new baits for uncovered regions with length greater than this value.
  -me MINIMUM_EXACT_MATCH, --minimum-exact-match MINIMUM_EXACT_MATCH
                        Minimum number of N sequential matching bases in an alignment to accept it.
  -c, --cluster         Cluster set of baits to remove similar baits and reduce redundancy.
  -ci CLUSTER_IDENTITY, --cluster-identity CLUSTER_IDENTITY
                        Exclude baits with an identity value to the cluster representative equal to or higher than this value.
  -cc CLUSTER_COVERAGE, --cluster-coverage CLUSTER_COVERAGE
                        Exclude baits with a coverage value to the cluster representative equal to or higher than this value.
  -e EXCLUDE, --exclude EXCLUDE
                        Path to a FASTA file containing sequences to which baits must not be specific.
  -ep EXCLUDE_PIDENT, --exclude-pident EXCLUDE_PIDENT
                        Exclude baits with an identity value to a region of a sequence to exclude equal or higher than this value.
  -ec EXCLUDE_COVERAGE, --exclude-coverage EXCLUDE_COVERAGE
                        Exclude baits with a coverage value to a region of a sequence to exclude equal or higher than this value.
  -t THREADS, --threads THREADS
                        Number of threads passed to minimap2 and MMseqs2.
  -r, --report          Evaluate bait performance against input sequences and generate an interactive report with results.
  -ri REPORT_IDENTITIES [REPORT_IDENTITIES ...], --report-identities REPORT_IDENTITIES [REPORT_IDENTITIES ...]
                        List of identity values used to evaluate bait performance. proBait will generate a report per identity value. An
                        equal number of coverage values must be provided to the --report-coverages parameter to pair with the identity
                        values.
  -rc REPORT_COVERAGES [REPORT_COVERAGES ...], --report-coverages REPORT_COVERAGES [REPORT_COVERAGES ...]
                        List of coverage values used to evaluate bait performance. proBait will generate a report per coverage value. An
                        equal number of identity values must be provided to the --report-identities parameter to pair with the coverage
                        values.
  -tsv, --tsv-output    Output bait set in TSV format (the first column includes the bait sequence identifier, and the second column includes
                        the bait DNA sequence).
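
To illustrate how the --minimum-region option interacts with the mapping results, the sketch below computes the uncovered regions of a sequence from a list of covered intervals and keeps only those longer than a threshold. It is a simplified illustration under stated assumptions, not proBait's implementation: the function name `uncovered_regions` is hypothetical, and intervals are assumed to be 0-based and half-open (start, end).

```python
def uncovered_regions(sequence_length, covered, minimum_region):
    """Return uncovered (start, end) intervals longer than minimum_region.

    covered is an iterable of 0-based, half-open (start, end) intervals,
    for example derived from accepted bait alignments (hypothetical input).
    """
    gaps = []
    position = 0  # end of the region covered so far
    for start, end in sorted(covered):
        if start > position:
            gaps.append((position, start))
        position = max(position, end)
    if position < sequence_length:
        gaps.append((position, sequence_length))
    # The help text says "greater than", so the comparison is strict.
    return [(s, e) for s, e in gaps if e - s > minimum_region]


# Toy example: a 100-base sequence with three accepted alignments.
regions = uncovered_regions(100, [(10, 30), (20, 50), (80, 90)], minimum_region=10)
```

In this toy example the gaps at both ends of the sequence are exactly 10 bases long, so they are discarded, and only the 30-base gap between positions 50 and 80 would receive new baits.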

Citation

Please cite this repository if you use proBait.

Mamede, R., & Ramirez, M. (2024). proBait (Version 0.1.0) [Computer software]. https://github.com/B-UMMI/proBait
