A pipeline for searching for species-specific primer sets at the genome level

Primer designer version 0.1.7

Introduction

This Python package searches for species-specific primer sets (optionally including an internal oligo probe) that combine high amplification efficiency with high specificity for the target species across the whole genome. The intended use case is designing primers that specifically detect a given pathogen in samples where genetically closely related pathogens may also be present.

Traditionally, the template gene is specified before primer design. In some cases, however, a gene that is both suitable for primer design and sufficiently divergent from related species is not easy to find. Compared with other primer design programs, the main advantage of this pipeline is that it automates the process of finding a suitable gene.

Examples are provided in the "example" folder, including the directory tree of a raw dataset, the steps for constructing a database, and the usage of the pipeline itself.

For detailed information, you can also check the source code directly.

Installation

Python dependencies and the program itself can be installed with pip:

pip install primerdesigner-0.1.7-py3-none-any.whl

The package is also available on PyPI: https://pypi.org/project/primer-set-designer/

This program also requires BLAST and mafft to be installed. Please check their official releases:

BLAST: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

mafft: https://mafft.cbrc.jp/alignment/software/

Using different versions of mafft or BLAST may lead to subtle differences in results.
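Before running the pipeline, it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of the package. The executable names `blastn`, `makeblastdb`, and `mafft` are assumptions, since this README only names "BLAST" and "mafft" and does not say which BLAST programs the pipeline calls.

```python
import shutil

# Assumed executable names; adjust to the BLAST+ programs actually needed.
REQUIRED_TOOLS = ("blastn", "makeblastdb", "mafft")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```

An empty list means every listed executable was found on PATH. It does not check tool versions.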

Example

An example script is provided in the "example" folder. To run it, install the package, change to the directory containing example.py, and type

python3 example.py

in the console. The example script will then start running.

An example output is also provided as an archive; you can extract it with the tar command.

Dataset Directory Tree Structure and Database Construction

A dataset is a group of files downloaded from the NCBI Datasets website: https://www.ncbi.nlm.nih.gov/data-hub/genome/.

In the "Download Package" dialog, the options "Genome sequences (FASTA)", "Annotation features (GFF)", and "Assembly data report (JSONL)" can be selected and downloaded directly. Users can also create their own dataset from genomic sequence FASTA files and GFF annotations, but a custom dataset must have the same structure as the official one. A dataset must have the following structure:

root_dir
├── assembly_data_report.jsonl
├── dataset_catalog.json
├── `<first assembly folder named with assembly accession>`
│   ├── files
│   └── ...
├── `<second assembly folder named with assembly accession>`
│   ├── files
│   └── ...
├── `<other assembly folders named with assembly accession>`
└── ...

dataset_catalog.json contains the mappings of all files in the folder and is also used to detect a dataset. assembly_data_report.jsonl contains the metadata for all assemblies. Neither file can be omitted, so make sure that custom datasets contain both and that their formats are correct.
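The layout requirements above can be checked before building a database. The following is a minimal sketch, not the package's own validation. The helper name `check_dataset` is hypothetical, and the check only looks at the folder layout, not at the contents of the two files.

```python
from pathlib import Path

# The two files that the README says every dataset must contain.
REQUIRED_FILES = ("assembly_data_report.jsonl", "dataset_catalog.json")


def check_dataset(root_dir):
    """Return (missing_files, assembly_folders) for a dataset directory."""
    root = Path(root_dir)
    # Required files must sit directly under the dataset root.
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Every subfolder is assumed to be an assembly folder named by accession.
    assemblies = sorted(p.name for p in root.iterdir() if p.is_dir())
    return missing, assemblies
```

A dataset with the expected layout yields an empty `missing` list and one entry in `assemblies` per assembly folder.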

A database can be built from a raw dataset with

primerdesigner.database.Database.build(input_dir = "your_input_dir", output_dir = "your_output_dir")

Both the input and output directories need to be specified.

Once built, a database can be reloaded quickly with

primerdesigner.database.Database(config_path = "path_to_your_dataset_catalog")

You need to specify the path to the dataset_catalog.json of your database.

To build a database from multiple datasets, the datasets must be merged into one folder, with dataset_catalog.json and assembly_data_report.jsonl updated accordingly. The static method Database.merge() is designed for this purpose.

Multiple datasets can be merged with

primerdesigner.database.Database.merge(input_dir = "your_input_dir", output_dir = "your_output_dir")

This method recursively checks all subfolders of the input folder, collects every valid dataset it finds, and merges them into a single dataset.

During merging, the program prompts "Delete existing files? (Y/n):" for confirmation.
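To preview which folders a recursive search would treat as datasets, a sketch like the following can be used. It is an illustration only, not the implementation of Database.merge(). It assumes that a folder counts as a dataset when it contains both dataset_catalog.json and assembly_data_report.jsonl, and the helper name `find_datasets` is hypothetical.

```python
import os

# Assumption: both marker files must be present for a folder to count.
MARKERS = {"dataset_catalog.json", "assembly_data_report.jsonl"}


def find_datasets(input_dir):
    """Return the sorted paths of folders that contain both marker files."""
    found = []
    for dirpath, dirnames, filenames in os.walk(input_dir):
        if MARKERS.issubset(filenames):
            found.append(dirpath)
            # Do not descend into the assembly folders of a dataset.
            dirnames.clear()
    return sorted(found)
```

Calling `find_datasets("your_input_dir")` before merging shows which folders would be picked up under this assumption.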

Pipeline Parameter Explanation

The main entry point for the pipeline is primerdesigner.primer_designer.find().

It has 6 parameters:

db: a database object built from a raw dataset and loaded.
include: a list of identifiers for the genome assemblies that the final primer sets must be able to amplify. Genus names, species names, and organism IDs are allowed. The pipeline finds all assemblies that match these identifiers.
exclude: a list of identifiers for the genome assemblies that the final primer sets must NOT be able to amplify. The lookup strategy is the same as for "include". Note that if an assembly is marked as both include and exclude, it is treated as exclude.
workers: the number of threads for BLAST and mafft.
pick_probe: whether to pick an internal oligo probe. Note that the default primer3 parameters differ depending on whether this flag is on, and the processing logic also differs slightly. Do not turn this option on when designing primers without a probe.
reference_id: the ID of the reference genome assembly. By default, the pipeline selects a reference genome from the annotated genomes in the "include" group according to sequencing quality and coverage. The subsequent homologous gene search and primer design are based mainly on the sequence of this assembly. You can also specify the reference manually with this parameter.
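The precedence rule for "include" and "exclude" can be illustrated with plain sets. This is a sketch of the rule as described above, not the package's lookup code. The `assemblies` mapping, the helper name `split_groups`, and the placeholder accessions are all hypothetical.

```python
def split_groups(assemblies, include, exclude):
    """Assign assemblies to include/exclude groups; exclude wins on overlap.

    `assemblies` maps an accession to the set of identifiers (genus name,
    species name, organism ID) that describe it. This layout is hypothetical.
    """
    def matches(identifiers, wanted):
        return any(term in identifiers for term in wanted)

    excluded = {acc for acc, ids in assemblies.items() if matches(ids, exclude)}
    # An assembly matched by both lists ends up only in the exclude group.
    included = {acc for acc, ids in assemblies.items() if matches(ids, include)}
    return included - excluded, excluded


# Placeholder accessions for illustration only.
example = {
    "assembly_A": {"Cryptococcus", "Cryptococcus gattii"},
    "assembly_B": {"Cryptococcus", "Cryptococcus neoformans"},
}
included, excluded = split_groups(
    example, include=["Cryptococcus"], exclude=["Cryptococcus neoformans"]
)
```

Here the genus name matches both assemblies, but the one that also matches the exclude list is removed from the include group.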

Users can run the pipeline like this

from primerdesigner.database import Database
from primerdesigner.primer_designer import find

# Build the database from the raw dataset, then load it from its config file.
Database.build(input_dir="./example_raw_dataset/", output_dir="./example_database/")
db = Database(config_path="./example_database/config.json")
find(
    db=db,
    include=[
        "Cryptococcus gattii"
    ],
    exclude=[
        "Cryptococcus neoformans"
    ],
    pick_probe=True,
    reference_id="GCF_000185945.1",
    workers=2
)

The example above searches for primer and probe sets that are specific to Cryptococcus gattii and unable to amplify Cryptococcus neoformans. The ID of the reference genome assembly is "GCF_000185945.1" and the number of worker threads is 2.

(WIP) Changing the parameters for homologous group filtering and for the primer3 core will be supported before release.

Output

The output contains a brief table of the sequences and amplification properties of the primer sets that passed filtering, plus a detailed report for each homologous group found by BLAST and aligned by mafft. In these reports, masked regions are shown below the sequences.

In addition, if feasible primer sets are found in a homologous group, a report on all primers in that group is provided separately.

Project details

This version

0.1.7

Feb 18, 2024

Download files

Source Distribution

primer_set_designer-0.1.7.tar.gz (29.9 kB)

Uploaded Feb 18, 2024 Source

Built Distribution

primer_set_designer-0.1.7-py3-none-any.whl (31.9 kB)

Uploaded Feb 18, 2024 Python 3

Hashes for primer_set_designer-0.1.7.tar.gz

Algorithm	Hash digest
SHA256	`9d1feaedb1cc5e87ced933f50dc19626f2a2f8da2f5ffe722a96e61c44c568f9`
MD5	`16cc06513ab21ddff556cbe3008cd27a`
BLAKE2b-256	`fb12c5add4bd506234e722014ec12d56c6a3699bbe7367a7d0b6dff9920e309b`

Hashes for primer_set_designer-0.1.7-py3-none-any.whl

Algorithm	Hash digest
SHA256	`1fc7b3b4f266848dd47fd99464f0b9dd7ae81b815e065342d486b9460acf09cc`
MD5	`4d0e72117f313307eb3c41c32350f904`
BLAKE2b-256	`bf686967e5e8738acc91ede9cbed0f94b5e9f8b7389bc763ae8d000af9449352`
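A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using the standard library. The helper name `sha256_of` is hypothetical, and the file is assumed to be in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, the result of `sha256_of("primer_set_designer-0.1.7-py3-none-any.whl")` should equal the SHA256 value listed for that file. A mismatch suggests a corrupted or altered download.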