A pipeline for searching species-specific primer sets at the genome level

Project description

Primer designer version 0.1.7

Introduction

This Python package is designed for searching species-specific primer sets (including an internal oligo probe) that have both high amplification efficiency and high specificity to the target species at the genome level. The intended application is designing primers that can specifically detect certain pathogens when other genetically closely related pathogens may appear in the same sample.

Traditionally, the template gene is specified before primer design. In some cases, however, a gene that is both suitable for primer design and shows an adequate amount of genetic divergence may not be easy to find. Compared with other primer design programs, the main advantage of this pipeline is that it automates the process of finding a suitable gene.

Examples are provided in the "example" folder, including the directory tree of the raw dataset, the steps for constructing a database, and the usage of the pipeline itself.

For detailed information, you can also check the source code directly.

Installation

The Python dependencies and the program itself can be installed with pip:

pip install primerdesigner-0.1.7-py3-none-any.whl

The package is also available on the PyPI website: https://pypi.org/project/primer-set-designer/

This program also requires BLAST and mafft to be installed; please check their official releases:

BLAST: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

mafft: https://mafft.cbrc.jp/alignment/software/

Using different versions of mafft or BLAST may lead to subtle differences in results.

Example

An example usage is provided in the "example" folder. To run the script, install the package and change to the directory containing example.py. Type

python3 example.py

in the console, and the example script will start running.

An example output is also provided; you can extract it with the tar command.
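If you prefer staying in Python, the archive can also be extracted with the standard-library tarfile module. The snippet below is a self-contained sketch, not part of the package: it builds a tiny stand-in .tar.gz in memory (the real example archive's file name may differ) and then extracts it the same way you would extract the provided output.

```python
import io
import tarfile

# Build a tiny .tar.gz in memory standing in for the example output archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"demo report"
    info = tarfile.TarInfo(name="report.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Extraction works the same way for a real archive opened by path:
# tarfile.open("the_example_archive.tar.gz", "r:gz")
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
    tar.extractall(path="extracted_example")

print(names)  # ['report.txt']
```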

Dataset Directory Tree Structure and Database Construction

A dataset is a group of files downloaded from the NCBI Datasets website: https://www.ncbi.nlm.nih.gov/data-hub/genome/.

In the "Download Package" options, "Genome sequences (FASTA)", "Annotation features (GFF)", and "Assembly data report (JSONL)" can be selected and downloaded directly. Users can also create their own dataset from genomic sequence FASTA and GFF annotation files, but the dataset must have the same structure as the official one. A dataset must have the following structure:

root_dir
├── assembly_data_report.jsonl
├── dataset_catalog.json
├── `<first assembly folder named with assembly accession>`
│   ├── files
│   └── ...
├── `<second assembly folder named with assembly accession>`
│   ├── files
│   └── ...
├── `<other assembly folders named with assembly accession>`
└── ...

dataset_catalog.json contains the mappings of all files in the folder; this file is also used for detecting a dataset. assembly_data_report.jsonl contains the metadata for all assemblies. Neither file can be omitted. Make sure that customized datasets contain both of them and that their formats are correct.
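As a quick sanity check for a hand-built dataset, you can verify with the standard library that both required files are present and parseable. This helper is not part of the package; it only encodes the structural requirements stated above.

```python
import json
from pathlib import Path

def check_dataset(root_dir: str) -> bool:
    """Return True if root_dir contains a parseable dataset_catalog.json
    and assembly_data_report.jsonl, as a custom dataset must."""
    root = Path(root_dir)
    catalog = root / "dataset_catalog.json"
    report = root / "assembly_data_report.jsonl"
    if not (catalog.is_file() and report.is_file()):
        return False
    try:
        json.loads(catalog.read_text())       # catalog must be valid JSON
        for line in report.read_text().splitlines():
            if line.strip():
                json.loads(line)              # each JSONL line is a JSON object
    except json.JSONDecodeError:
        return False
    return True
```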

A database can be built from a raw dataset by

primerdesigner.database.Database.build(input_dir="your_input_dir", output_dir="your_output_dir")

The input and output directories need to be specified.

Once built, a database can be loaded again quickly by using

primerdesigner.database.Database(config_path="path_to_your_dataset_catalog")

You also need to specify the path to the dataset_catalog.json of your database.

If users want to build a database from multiple datasets, they should merge the datasets into one folder and modify dataset_catalog.json and assembly_data_report.jsonl. The static method Database.merge() is designed for this purpose.

Multiple datasets can be merged by using

primerdesigner.database.Database.merge(input_dir="your_input_dir", output_dir="your_output_dir")

This method recursively checks all child folders in the input folder, tries to find all valid datasets, and then merges them into a single dataset.

During the merging process, the program will prompt "Delete existing files? (Y/n):" for confirmation.
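The detection step described above can be pictured roughly as a recursive walk that treats any folder containing dataset_catalog.json as a dataset root. This is a simplified sketch of that idea, not the package's actual implementation:

```python
import os

def find_dataset_roots(input_dir: str) -> list:
    """Collect every folder under input_dir (recursively) that contains
    a dataset_catalog.json marker file, the signal used to detect a dataset."""
    roots = []
    for dirpath, _dirnames, filenames in os.walk(input_dir):
        if "dataset_catalog.json" in filenames:
            roots.append(dirpath)
    return sorted(roots)
```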

Pipeline Parameter Explanation

The main entry point for the pipeline is primerdesigner.primer_designer.find().

It has 6 parameters:

  • db: the database object constructed from the raw dataset.
  • include: a list of identifiers for genome assemblies that the final primer set output must be able to amplify. Genus names, species names, and organism IDs are allowed. The pipeline will find all assemblies matching these identifiers.
  • exclude: a list of identifiers for genome assemblies that the final primer set output must NOT be able to amplify. The lookup strategy is the same as for the "include" parameter. Note that if an assembly is marked as both include and exclude, it will be treated as excluded.
  • workers: the number of threads for BLAST and mafft.
  • pick_probe: whether picking an internal oligo probe is required. Note that the default parameters for primer3 differ depending on whether this flag is on or off, and the processing logic also differs slightly. Do not turn this option on when designing primers without a probe.
  • reference_id: in the pipeline, a reference genome is selected from the annotated genomes in the "include" group according to sequencing quality and coverage. The subsequent homologous gene search and primer design are mainly based on the sequence of this assembly. You can also specify it manually.
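The overlap rule for include and exclude (an assembly listed in both groups ends up excluded) amounts to a simple set difference. A minimal illustration of that rule, not the package's code:

```python
def resolve_targets(include, exclude):
    """Assemblies listed in both groups are treated as excluded."""
    include, exclude = set(include), set(exclude)
    return include - exclude, exclude

targets, off_targets = resolve_targets(
    ["Cryptococcus gattii", "Cryptococcus neoformans"],
    ["Cryptococcus neoformans"],
)
print(sorted(targets))      # ['Cryptococcus gattii']
print(sorted(off_targets))  # ['Cryptococcus neoformans']
```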

Users can run the pipeline like this:

from primerdesigner.database import Database
from primerdesigner.primer_designer import find

Database.build(input_dir="./example_raw_dataset/", output_dir="./example_database/")
db = Database(config_path="./example_database/config.json")
find(
    db=db,
    include=[
        "Cryptococcus gattii"
    ],
    exclude=[
        "Cryptococcus neoformans"
    ],
    pick_probe=True,
    reference_id="GCF_000185945.1",
    workers=2
)

The example above searches for primer and probe sets that are specific to Cryptococcus gattii while unable to amplify Cryptococcus neoformans; the ID of the reference genome assembly is "GCF_000185945.1", and two worker threads are used.

(WIP) Changing the parameters for filtering homologous groups and for the primer3 core will be supported before release.

Output

The output contains a brief table of the sequences and amplification properties of the primer sets that passed filtering, plus detailed reports for each homologous group found by BLAST and aligned by mafft; masked regions are shown below the sequences.

Also, if feasible primer sets are found in a homologous group, a report about all primers in that homologous group is provided separately.
