This package contains programs that design primer sets for microbiological analysis and perform some accessory analyses.

Development Status
- 4 - Beta
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
- Python :: 3.8
Topic
- Scientific/Engineering :: Bio-Informatics

shrs

Primer design method based on short length homologous region exhaustive search algorithm.

shrs
- Table of Contents
- Overview
- Environment
- Installation
  - pip
  - Installing from the source code
    - Download source code
    - Installation from source
- How to run shrs
- Functions contained in library of shrs
- Citation
- Update information
- Reference
- License

Overview

This package contains the programs that design a primer set for the analysis of bacteria and perform some accessory analyses. This package contains the following six subcommands.

AA (for additional analysis)
DIP (for designing identification primer sets)
DUP (for designing universal primer sets)
ISP (for input sequence preprocessing)
iPCR (for in silico PCR)
DP (for designing probes)

To get a summary of the available subcommands, type shrs. To get a summary of the command-line options of a subcommand, type shrs {subcommand} -h. Every program accepts sequence files in FASTA or GenBank format. The input file must be encoded in UTF-8.

Environment

This package has been verified to work on Windows 10, Windows 11, macOS Sequoia, Ubuntu 18.04, Ubuntu 20.04 and Ubuntu 22.04. When the 'cupy' package is available, some calculations are performed on the GPU. The 'cupy' package is NOT installed by the plain pip install shrs command. Either install the version of 'cupy' that matches the CUDA version on your computer yourself, or type pip install shrs[GPU] to have 'cupy' installed automatically. Running analyses on a PC with a GPU is highly recommended, because processing can take less time on a GPU than on a CPU.
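
Before starting a long analysis, it can be useful to check whether 'cupy' is importable in the current Python environment. The following is a minimal sketch using only the standard library; the function names are hypothetical and are not part of shrs, and an importable 'cupy' does not guarantee that a working CUDA device is present.

```python
import importlib.util


def cupy_available():
    """Return True if the 'cupy' package can be found in this environment."""
    return importlib.util.find_spec("cupy") is not None


def expected_backend():
    """Describe which processing unit is likely to be used (an assumption)."""
    return "GPU (cupy found)" if cupy_available() else "CPU (cupy not found)"


if __name__ == "__main__":
    print(expected_backend())
```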

Installation

`pip`

Type one of the following commands. The first installs shrs only; the second also installs the 'cupy' package for GPU calculations (see the Environment section above).

 $ pip install shrs

 $ pip install shrs[GPU]

Installing from the source code

Download source code

Download the source from the following URL. https://pypi.org/project/shrs/#files

Installation from source

Place the source file on the current working directory, and then type the following commands.

 $ tar -xvzf shrs-0.13.2.tar.gz
 $ cd shrs-0.13.2
 $ python setup.py build
 $ python setup.py install

How to run `shrs`

To list the available subcommands, type shrs. The following message will be shown.

usage: shrs [-h] [-v] {AA,DIP,DUP,ISP,iPCR,DP} ...

--- HELP message ---

positional arguments:
  {AA,DIP,DUP,ISP,iPCR,DP}
    AA                  see 'AA -h'. Make a fragment size matrix from a new template sequence and the result containing primer sets that you have already generated.
    DIP                 see 'DIP -h'. Primer design algorithm for the identification of bacteria
    DUP                 see 'DUP -h'. Primer design algorithm for universal primer
    ISP                 see 'ISP -h'. Input sequences preprocessing program for DUP or DIP
    iPCR                see 'iPCR -h'. In silico PCR amplification algorithm
    DP                  see 'DP -h'. Probe design algorithm

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Sample data

Sample data can be downloaded from here

Algorithm	Hash value
MD5	9b40802495ce31bec85b5f541bd6a9d0
SHA1	ac84bc66e35c126f929d73c444846bec75d75920
SHA256	da69337717cc4bc101e3474e71b1884d6ba42b8f06772d113f30ff3664cb857e
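
The hash values above can be compared with the digests of the downloaded file. The sketch below uses Python's hashlib; the function names are illustrative, and the archive file name is not stated on this page, so the path in the usage comment is a placeholder.

```python
import hashlib

# Hash values copied from the table above.
EXPECTED = {
    "md5": "9b40802495ce31bec85b5f541bd6a9d0",
    "sha1": "ac84bc66e35c126f929d73c444846bec75d75920",
    "sha256": "da69337717cc4bc101e3474e71b1884d6ba42b8f06772d113f30ff3664cb857e",
}


def file_digests(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Compute several digests of a file in a single pass."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def verify(path, expected=EXPECTED):
    """Return a dict mapping each algorithm to True if the digest matches."""
    actual = file_digests(path, tuple(expected))
    return {name: actual[name] == value.lower() for name, value in expected.items()}


# Usage (placeholder path; replace with the file you downloaded):
# print(verify("path/to/downloaded_sample_data"))
```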

Sample data contain a multi-FASTA file of complete genome DNAs of three strains of Mycoplasma genitalium, a FASTA file of the complete genome DNA of one strain of Mycoplasma pneumoniae, a multi-FASTA file of 25 contig sequences of Mycoplasma genitalium G37 and GenBank files of the complete genome DNAs of five strains of Mycoplasma genitalium. Place SampleData in the current working directory.

Workflow for primer design

1. Design primer sets for identification without input data preprocessing.

Specify the multi-FASTA file or the folder containing target sequences (FASTA or GenBank format) as input file(s) after the '-i' option.

 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode sparse

It takes approx. 20 minutes to 2 hours, depending on the PC's performance. After the analysis, the results will be output in a SampleData/Result/(Timestamp)identify_primer_set/ folder. The CSV file contains a results table as shown below with the analysis parameters.

	Arguments
Input_file_name	sample.fasta
Target	Strain1, Strain2, ... , Strain N
Exclude_file_name	None
Exclude
Probe_size_range_description	23-25
allowance	0.15
cut_off_lower	50
cut_off_upper	1000
Interval_distance	10000
Match_rate	0.8
Result_output	10000
Search_mode	sparse
Window_size	950
Maximum_annealing_site_number	5
Score_calculation_mode	Fragment
Search_area	Whole genome sequence
Reference_tree	Not provided

Primer1	Primer2	No.	Input sequence1	Input sequence2	...	Input sequenceN	Score	Fragment number	Forward Tm_value	Reverse Tm_value	Tm_value difference
ATG..AA	AGC..TA	1	[(655, 1.0)]	[(528, 1.0)]	...	[(411, 1.0)]	1200	N	60.0	61.5	1.5
TAG..TC	GCC..TG	2	[(344, 1.0)]	[(257, 1.0)]	...	[(499, 1.0)]	1200	N	58.0	60.5	2.5

The Input sequence columns list each amplicon length produced by Primer1 and Primer2, together with its ratio to the total number of amplicons, in the form [(Amplicon length, Ratio)]. For example, if a primer set produces three amplicons (355 bp, 355 bp, and 652 bp) from whole genome DNA, the result is [(355, 0.67), (652, 0.33)]. Each primer set can be used for identification on its own; however, combining primer sets makes the identification easier and more accurate. When the analysis is run with default settings, three primer sets are selected, and a dendrogram based on the fragments amplified by each primer set is generated in the 'Dendrogram' folder. The '--Combination_number' option (default: 3) adjusts the number of primer sets used for identification. The DIP algorithm designs primer sets that produce fragments from all input sequences and maximize the differences in amplicon size or amplicon sequence among the input sequences. To confirm the command-line options, type shrs DIP -h.
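
The [(Amplicon length, Ratio)] notation can be reproduced with a few lines of Python. This is only an illustration of the notation, not the code used by shrs; the function names, the rounding to two decimals, and the ordering by amplicon length are assumptions.

```python
import ast
from collections import Counter


def summarize_amplicons(lengths):
    """Collapse a list of amplicon lengths into [(length, ratio), ...]."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return [(size, round(n / total, 2)) for size, n in sorted(counts.items())]


def parse_cell(cell):
    """Turn a cell such as '[(655, 1.0)]' from the result CSV into Python objects."""
    return ast.literal_eval(cell)


print(summarize_amplicons([355, 355, 652]))  # [(355, 0.67), (652, 0.33)]
print(parse_cell("[(655, 1.0)]"))            # [(655, 1.0)]
```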

2. Design universal primer sets without input data preprocessing.

Specify the multi-FASTA file or the folder containing the target sequences (FASTA or GenBank format) as input file(s) after the '-i' option. In the following case, the designed primer sets amplify the genome DNAs of three strains of Mycoplasma genitalium and do NOT produce any amplicons from the genome DNA of one strain of Mycoplasma pneumoniae.

 $ shrs DUP -i SampleData/Mgenitalium_3strains.fasta -e SampleData/Mpneumoniae.fasta -a 0.25 -o SampleData/ --Search_mode sparse

It takes approx. 5–10 minutes, depending on the PC's performance. After the analysis, the results will be output in a SampleData/Result/(Timestamp)universal_primer_set/ folder. The format of the output CSV file is almost the same as that of a DIP analysis (see Procedure 1). The DUP algorithm designs primer sets that produce fragments from all input sequences and minimize the differences in amplicon size among the input sequences. Note that, when shrs DUP is used, the primer sets are designed to produce only one amplicon from each input sequence.

To confirm command-line options, type shrs DUP -h.

3. Design primer sets for identification with input data preprocessing.

Sequences that are input into the DIP subcommand together are processed so that they can be differentiated from each other. Therefore, if a multi-FASTA input file contains a plasmid or multiple contig sequences from the same strain, the primer sets obtained will produce amplicons that correspond to the individual input contigs or plasmid. To avoid this problem, if the genome DNA of a single strain has been divided into multiple contigs, all contigs and the plasmid should be preprocessed (concatenated) before the DIP analysis, using the following subcommand. Each multi-FASTA file for preprocessing must contain the contigs and plasmid derived from a single strain. When there are several multi-FASTA files, place them all in one folder and specify the folder path as the input file path.

 $ shrs ISP -i SampleData/contig/ -o SampleData/Preprocessed_data1/ --Single_file

A preprocessed multi-FASTA file will be generated, and then the preprocessed multi-FASTA file is analyzed by shrs DIP.

 $ shrs DIP -i SampleData/Preprocessed_data1/ -o SampleData/ --Search_mode sparse

The DIP algorithm designs primer sets that produce fragments from all input sequences and maximize the differences in amplicon size or amplicon sequence among the input sequences.

To confirm command-line options, type shrs DIP -h or shrs ISP -h.

4. Design universal primer sets with input data preprocessing.

As mentioned above, if the genome DNA of a single strain has been divided into multiple contigs, all contigs and the plasmid should be preprocessed (concatenated) before a DUP analysis, using the following subcommand. Each multi-FASTA file for preprocessing must contain the contigs and plasmid derived from a single strain. When there are several multi-FASTA files, place them all in one folder and specify the folder path as the input file path.

 $ shrs ISP -i SampleData/contig/ -o SampleData/Preprocessed_data2/ --Single_file

A preprocessed multi-FASTA file will be generated, and then the preprocessed multi-FASTA file is analyzed by shrs DUP.

 $ shrs DUP -i SampleData/Preprocessed_data2 -e SampleData/Mpneumoniae.fasta -a 0.20 -o SampleData/ --Search_mode sparse

The DUP algorithm designs primer sets that produce fragments from all input sequences and minimize the differences in amplicon size among the input sequences. Note that, when shrs DUP is used, the primer sets are designed to produce only one amplicon from each input sequence.

To confirm command-line options, type shrs DUP -h or shrs ISP -h.

Re-analysis/Additional analysis

New sequences can be analyzed based on primer sets in a CSV file obtained from shrs DIP or shrs DUP. Delete the rows containing unnecessary primer sets, since an analysis of them would take a long time. Do not include blank rows between the rows containing primer sets.

 $ shrs AA -i SampleData/Mgenitalium_M6282.fasta -f SampleData/Previous_DIP_Result.csv -o SampleData/New_result/

Design probes

Specify the multi-FASTA file or the folder containing target sequences (FASTA or GenBank format) as input file(s) after the '-i' option. The DP algorithm designs probes that hybridize to all input sequences. The Input sequence columns in the result table indicate the annealing positions on each input sequence.

in silico PCR

You can get amplicon information by using in silico PCR.

 $ shrs iPCR -i SampleData/GenBank/ -fwd CTACACCCATTTACCCAACAGTATC -rev TCTCATTGGWAACATTCGGTACATC -o SampleData/insilicoPCR_Result/

By default, a CSV file is output. If you prefer FASTA over CSV, use the '--fasta' option. To obtain the primer annealing site positions on a template sequence, use the '--Position_index' option. When a GenBank file is analyzed with the '--Annotation' option, the annotation information of the amplicons is also output.

 $ shrs iPCR -i SampleData/GenBank/ -fwd CTACACCCATTTACCCAACAGTATC -rev TCTCATTGGWAACATTCGGTACATC -o SampleData/insilicoPCR_annotation_Result/ --Annotation
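
To illustrate what an in silico PCR reports, the following minimal sketch searches a linear template for a forward primer and for the reverse complement of a reverse primer, and returns the region between them. It is not the shrs implementation: it assumes exact matching (apart from IUPAC degenerate bases in the primers, such as the 'W' used above), a linear template, and a single strand orientation, and all function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a sequence that may contain IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_pattern(primer):
    """Compile a lookahead pattern so that overlapping sites are all found."""
    body = "".join(IUPAC[base] for base in primer.upper())
    return re.compile("(?=(" + body + "))")


def in_silico_pcr(template, fwd, rev, size_limit=10000):
    """Return (start, end, sequence) for every amplicon within size_limit."""
    template = template.upper()
    fwd_starts = [m.start() for m in primer_pattern(fwd).finditer(template)]
    rev_pattern = primer_pattern(reverse_complement(rev))
    rev_ends = [m.start() + len(rev) for m in rev_pattern.finditer(template)]
    amplicons = []
    for start in fwd_starts:
        for end in rev_ends:
            size = end - start
            # Primers must not overlap, and the product must not be too long.
            if len(fwd) + len(rev) <= size <= size_limit:
                amplicons.append((start, end, template[start:end]))
    return amplicons
```

For example, in_silico_pcr(sequence, "CTACACCCATTTACCCAACAGTATC", "TCTCATTGGWAACATTCGGTACATC") would list the exact-match amplicons for the primer pair used in the commands above, where sequence is a template string that you have loaded yourself.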

If you have several primer sets, you can give the file path of a text file that contains the primer sets after the -f option instead of typing the primer sequences after the -fwd and -rev options. The text file contains one primer set per row and must be encoded in UTF-8. Commas, spaces, or tab characters can be used as the separator in the text file.

        Text file example:
            Forward_Primer_sequence1,Reverse_Primer_sequence1
            Forward_Primer_sequence2,Reverse_Primer_sequence2
            Forward_Primer_sequence3,Reverse_Primer_sequence3
                       ... 
            Forward_Primer_sequenceN,Reverse_Primer_sequenceN
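
Before passing such a file to the -f option, it may help to check that every row really holds exactly two sequences. The sketch below is a hypothetical helper, not part of shrs; it assumes that commas, spaces, and tabs are all acceptable separators and that blank rows can be skipped.

```python
import re


def read_primer_sets(path):
    """Read (forward, reverse) primer pairs from a UTF-8 text file."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            # Split on commas, spaces, or tabs and drop empty fields.
            fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
            if not fields:
                continue  # skip blank rows
            if len(fields) != 2:
                raise ValueError(f"row {line_number}: expected 2 sequences, got {len(fields)}")
            pairs.append((fields[0].upper(), fields[1].upper()))
    return pairs
```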

Tips: Some command line options

'--circularDNA'/'--exclude_circularDNA' option {AA, DIP, DUP, ISP, iPCR}

Use this option when the input sequence file(s) contain one or more circular DNA sequences. A subargument specifies which sequences are circular DNA. By default ('n/a'), all input sequences are processed as linear DNA. When all input sequences are circular DNA, type 'all' after the --circularDNA/--exclude_circularDNA option. To specify for each sequence individually whether it is circular DNA, type 'individually'. You can also specify which sequences are circular DNA by giving the file path of a text file, as shown below.
```
      Text file example:
          Sequence_name1 circularDNA
          Sequence_name2 linearDNA
          Sequence_name3 linearDNA
                     ... 
          Sequence_nameN circularDNA
```

 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode sparse --circularDNA SampleData/circularDNA.txt

'--Search_mode' option {DIP, DUP}

The '--Search_mode' option contains four choices: 'exhaustive', 'moderate', 'sparse', and 'manual'. Primer candidates will be generated by cutting every N-base from the reference sequence. The N-value is different depending on each Search_mode.

Subargument	N-bases
exhaustive	1 base. Same as N-gram method (N: primer size)
moderate	One-third of primer size (Maximum: 10 bases)
sparse	primer size
manual	primer size * ratio input by the user

 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode manual 2
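
The relationship between Search_mode and the number of primer candidates can be sketched as follows. This is an illustration of the table above under stated assumptions, not the shrs source code: the integer rounding (floor, with a minimum step of 1 base) and the function names are assumptions.

```python
def step_size(mode, primer_size, ratio=None):
    """Number of bases between the start positions of successive candidates."""
    if mode == "exhaustive":
        return 1
    if mode == "moderate":
        return max(1, min(primer_size // 3, 10))  # one-third, at most 10 bases
    if mode == "sparse":
        return primer_size
    if mode == "manual":
        if ratio is None:
            raise ValueError("manual mode needs a ratio")
        return max(1, int(primer_size * ratio))
    raise ValueError(f"unknown Search_mode: {mode}")


def candidates(reference, primer_size, step):
    """Cut primer-sized candidates from the reference every `step` bases."""
    last_start = len(reference) - primer_size
    return [reference[i:i + primer_size] for i in range(0, last_start + 1, step)]


reference = "ACGTACGTAC"
print(len(candidates(reference, 4, step_size("exhaustive", 4))))  # 7
print(len(candidates(reference, 4, step_size("sparse", 4))))      # 2
```

A smaller step yields more candidates and therefore a more thorough but slower search, which is why 'sparse' is used in the examples on this page.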

'-a' option {DIP, DUP}

When the '-e' and '--Exclude_mode fast' options are used, a larger allowance value results in fewer candidates.

'--Score_calculation' option {DIP}

When 'Sequence' is selected for this option, the cumulative score of sequence homology among amplicons is used to evaluate primer set candidates, instead of the score calculated from fragment length. Because calculating sequence homology takes a long time, only primer sets that produce a single amplicon from the template sequence are obtained when this option is used. Additionally, a multiple alignment tool (the MAFFT program) is required to generate a dendrogram when 'Sequence' is specified for the 'Score_calculation' option. Before running this script, please install MAFFT and add the MAFFT binary to your PATH. The dendrogram is constructed with the UPGMA method from a distance matrix calculated with the identity matrix. Gaps in the alignments are counted when calculating distances (NOT ignored). Note that the dendrogram obtained indicates the similarity of the aligned sequences including gaps, and it does not always indicate evolutionary relationships.

 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode sparse --Score_calculation Sequence
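
The effect of counting gaps can be seen in a small distance calculation. The sketch below is an assumption about what an identity-based distance looks like (1 minus the fraction of identical alignment columns, with a gap against a base counted as a mismatch); it is not taken from the shrs source, and the function names are hypothetical.

```python
def identity_distance(aligned_a, aligned_b):
    """Fraction of alignment columns that differ; gaps count as differences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    mismatches = sum(1 for x, y in zip(aligned_a.upper(), aligned_b.upper()) if x != y)
    return mismatches / len(aligned_a)


def distance_matrix(alignment):
    """Pairwise distance matrix for a list of aligned sequences."""
    return [[identity_distance(a, b) for b in alignment] for a in alignment]


alignment = ["ACGT-A", "ACGTTA", "ACCTTA"]
for row in distance_matrix(alignment):
    print([round(value, 3) for value in row])
```

A matrix of this kind can then be clustered by average linkage, which is the UPGMA method; for example, scipy.cluster.hierarchy.linkage with method='average' performs this clustering if SciPy is installed.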

'-g' or '--Group_id' option {DIP}

When this option is used, the algorithm tries to minimize the score between sequences within a group and maximize the score between sequences in different groups. You can specify which sequences belong to the same group by giving the file path of a text file, as shown below. Commas, spaces, or tab characters can be used as the separator in the text file. Avoid combinations in which a sequence name is contained in an ID, such as the sequence name 'D12' and the ID 'SeqID12'.
```
      Text file example:
          Sequence_name1 1
          Sequence_name2 2
          Sequence_name3 3
                     ... 
          Sequence_nameN 2
```

 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode sparse --Score_calculation Sequence --Group_id SampleData/Group_id.txt

'--Fragment_size_pattern_matrix' and/or '--Fragment_start_position_matrix' option {DIP}

When you would like to switch the 'Score_calculation' mode (or generate new dendrograms), you can reanalyze the same dataset in less time by specifying a previously obtained Fragment_size_pattern_matrix.csv and/or Fragment_start_position_matrix.csv after the --Fragment_size_pattern_matrix and/or --Fragment_start_position_matrix option. For 'Fragment' mode, specify only 'Fragment_size_pattern_matrix.csv'. For 'Sequence' mode, both 'Fragment_size_pattern_matrix.csv' and 'Fragment_start_position_matrix.csv' must be specified.

'--Reference_tree' option {DIP}

With the '--Reference_tree' option, this tool calculates the Robinson-Foulds distance between each generated dendrogram and the reference tree, then sorts the results by similarity. When you have a reliable phylogenetic tree, this option makes it easy to select a primer set. After the '--Reference_tree' option, provide the reference tree either as a character string in Newick format or as the path of a file containing the tree in Newick format. Please note that all leaf names must correspond to the titles of the input sequences. To make this tutorial easier to follow, the titles of the sequences in the sample GenBank files have been changed from the original titles.

 $ shrs DIP -i SampleData/GenBank/ -o SampleData/ --Search_mode sparse --Reference_tree "(((G37,M6320),M2321),(M2288,M6282));"
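
For readers unfamiliar with the Robinson-Foulds distance, the sketch below computes it for two small Newick strings by counting the clades that appear in only one of the two trees. It is a simplified illustration, not the shrs implementation: it treats the trees as rooted, ignores branch lengths and internal labels, and assumes leaf names contain no parentheses, commas, or colons. The function names are hypothetical.

```python
def newick_clades(newick):
    """Return the set of non-trivial clades (frozensets of leaf names)."""
    text = "".join(newick.split()).rstrip(";")
    clades = set()
    pos = 0

    def skip_label():
        # Consume an optional label and/or branch length after a node.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1
            leaves = set(parse_node())
            while text[pos] == ",":
                pos += 1
                leaves |= parse_node()
            if text[pos] != ")":
                raise ValueError("malformed Newick string")
            pos += 1
            skip_label()
            clades.add(frozenset(leaves))
            return leaves
        return {skip_label().split(":")[0]}

    all_leaves = parse_node()
    clades.discard(frozenset(all_leaves))  # the root clade carries no information
    return clades


def robinson_foulds(newick_a, newick_b):
    """Number of clades found in exactly one of the two rooted trees."""
    return len(newick_clades(newick_a) ^ newick_clades(newick_b))


reference = "(((G37,M6320),M2321),(M2288,M6282));"
candidate = "(((G37,M2321),M6320),(M2288,M6282));"
print(robinson_foulds(reference, reference))  # 0
print(robinson_foulds(reference, candidate))  # 2
```

A distance of 0 means that the two trees have the same topology; larger values mean that more groupings differ.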

'--Only_sequence_with_feature_key' option {DIP}

With the '--Only_sequence_with_feature_key' option, this tool generates primer sets based solely on regions that are annotated with a feature key. At least one GenBank-format file should be used as an input file, because the feature key information is required to extract the target regions. When this option is specified, all GenBank-format input files are trimmed based on the feature key information. However, FASTA-format files are not trimmed. Note that GenBank-format and FASTA-format files may be mixed in the input sequences, because this tool generates primer sets that are able to amplify all input sequences.

 $ shrs DIP -i SampleData/GenBank/ -o SampleData/ --Search_mode sparse --Only_sequence_with_feature_key

Command-line options

Design identification primer sets {DIP}

shrs DIP -i <input file> [options]

Below is the full list of supported options for the DIP command line.

Option	Description
-i, --input_file	Input file path (required, format: FASTA or GenBank)
-e, --exclude_file	File path for exclusion sequence(s). If there are bacteria that you do not want to amplify, specify the file path of their sequence(s). (format: FASTA or GenBank)
-s, --primer_size	Primer size (default: 25)
-a, --allowance	Mismatch allowance ratio (default: 0.15. The value means that a 4-base [25 * 0.15] mismatch is accepted). Note that setting this parameter too high might increase the run time and cause excessive memory consumption.
-r, --range	Search range from primer size (default: 0. If the value is 1, the primer sets that have 25–26 base length are explored)
-d, --distance	The minimum distance between annealing sites that are hybridized with a primer (default: 10,000)
-o, --output	Output directory path (Make a new directory if the directory does not exist)
-P, --process	The number of processes (sometimes the number of CPU core) used for analysis
-g, --Group_id	After the '--Group_id' option, type the file path of the text file that specifies which sequences belong to the same group. Avoid combinations in which a sequence name is contained in an ID, such as the sequence name 'D12' and the ID 'SeqID12'. (default: None.)
--Exclude_mode	Choose the method for excluding the sequence(s) from 'fast' or 'standard' when you specify some sequence(s) file path of the bacteria that you would like to exclude using '-e' option (default: fast.)
--Result_output	The upper limit of result output (default: 10,000)
--Cut_off_lower	The lower limit of amplicon size (default: 50)
--Cut_off_upper	The upper limit of amplicon size (default: 1,000)
--Match_rate	The ratio of trials for which the primer candidate fulfilled all criteria when the allowance value is decreased. The higher the number is, the more specific the primer obtained is (default: 0.8)
--Chunks	The chunk size for calculation. If the memory usage for calculation using the chunk size exceeds the GPU memory size, the processing unit for calculation will be switched to the CPU automatically (default: Auto)
--Maximum_annealing_site_number	The maximum acceptable value of the number of annealing site of the candidate of the primer in the input sequence (default: 5)
--Window_size	The duplicated candidates containing this window will be removed (default: 950)
--Search_mode	There are four options: exhaustive/moderate/sparse/manual. If you choose the 'manual' option, type the ratio to primer length after 'manual'. (e.g. --Search_mode manual 0.2.) (default: moderate when the average input sequence length is >5000, and exhaustive when the average input sequence length is ≤5000)
--withinMemory	All analyses are performed within memory (default: False)
--Without_allowance_adjustment	Use this option if you do not want to modify the allowance value for every homology calculation (default: False)
--circularDNA	Use this option if there are circular DNAs in the input sequences (default: n/a, which means that all input sequences are linear DNA). After the '--circularDNA' option, type 'all', 'individually', 'n/a', or the file path of the text file that specifies which sequences are circular DNA.
--exclude_circularDNA	Use this option if there are circular DNAs in the input sequence(s) that you do not want to amplify (default: n/a, which means that all input sequences are linear DNA). After the '--exclude_circularDNA' option, type 'all', 'individually', 'n/a', or the file path of the text file that specifies which sequences are circular DNA.
--Score_calculation	The calculation method of the score for identifying microorganisms. Fragment length or sequence. When the 'Sequence' is specified, the primer set that produces only a single amplicon will be obtained in order to reduce computational complexity.
--Combination_number	The number of primer sets to be used for identification (default: 3).
--Correlation_threshold	Primer sets with a correlation coefficient greater than this value are grouped, and two or more primer sets from the same group are never chosen together (default: 0.9).
--Dendrogram_output	The number of dendrograms to construct. With the default parameters, 10 dendrograms are generated (default: 10, max: 100).
--Reference_tree	Use this option when evaluating a primer set based on Robinson-Foulds distance between the output and reference trees. After the '--Reference_tree' option, specify either a phylogenetic tree in Newick format or the file path to one. (default: None).
--Only_sequence_with_feature_key	This option should be used when designing primer sets based solely on the sequences with a feature key, such as a gene. (default: False).
--Fragment_size_pattern_matrix	When you have a csv file of fragment size pattern matrix, you can reanalyse from the csv file. Specify the file path (default: None.)
--Fragment_start_position_matrix	When you reanalyse from fragment size pattern matrix by 'Sequence' mode, specify the csv file path of fragment start position matrix (default: None.)
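
The '-a' descriptions in the option tables on this page quote the accepted mismatch count for each default (4 bases for 25 * 0.15, 4 bases for 20 * 0.20, and 7 bases for 25 * 0.25). These three examples are consistent with rounding the product up to the next whole base. The sketch below reproduces that arithmetic; the rounding rule is inferred from the documented examples and is an assumption, and the function name is hypothetical.

```python
import math


def accepted_mismatches(size, allowance):
    """Mismatch count implied by a primer or probe size and an allowance ratio."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(size * allowance, 9))


print(accepted_mismatches(25, 0.15))  # 4
print(accepted_mismatches(20, 0.20))  # 4
print(accepted_mismatches(25, 0.25))  # 7
```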

Design universal primer sets {DUP}

shrs DUP -i <input file> [options]

Below is the full list of supported options for the DUP command line.

Option	Description
-i, --input_file	Input file path (required, format: FASTA or GenBank)
-o, --output	Output directory path (Make a new directory if the directory does not exist)
-e, --exclude_file	File path for exclusion sequence(s). If there are bacteria that you do not want to amplify, specify the file path of their sequence(s). (format: FASTA or GenBank)
-s, --primer_size	Primer size (default: 20)
-a, --allowance	Mismatch allowance ratio (default: 0.20. The value means that a 4-base [20 * 0.20] mismatch is accepted). Note that setting this parameter too high might increase the run time and cause excessive memory consumption.
-r, --range	Search range from primer size (default: 0. If the value is 1, the primer sets that have 20–21 base length are explored)
-d, --distance	The minimum distance between annealing sites that are hybridized with a primer (default: 5,000)
-P, --process	The number of processes (sometimes the number of CPU core) used for analysis
--Exclude_mode	Choose the method for excluding the sequence(s) from 'fast' or 'standard' when you specify some sequence(s) file path of the bacteria that you would like to exclude using '-e' option (default: fast.)
--Cut_off_lower	The lower limit of amplicon size (default: 50)
--Cut_off_upper	The upper limit of amplicon size (default: 3,000)
--Match_rate	The ratio of trials for which the primer candidate fulfilled all criteria when the allowance value is decreased. The higher the number is, the more specific the primer obtained is (default: 0.0)
--Result_output	The upper limit of result output (default: 10,000)
--Omit_similar_fragment_size_pair	Use this option if you want to omit primer sets that amplify similar fragment lengths
--Window_size	The duplicated candidates containing this window will be removed (default: 50)
--Maximum_annealing_site_number	The maximum acceptable value of the number of annealing site of the candidate of the primer in the input sequence (default: unlimited)
--Chunks	The chunk size for calculation. If the memory usage for calculation using the chunk size exceeds the GPU memory size, the processing unit for calculation will be switched to the CPU automatically (default: Auto)
--Search_mode	There are four options: exhaustive/moderate/sparse/manual. If you choose the 'manual' option, type the ratio to primer length after 'manual'. (e.g. --Search_mode manual 0.2.) (default: moderate when the average input sequence length is >5000, and exhaustive when the average input sequence length is ≤5000)
--withinMemory	All analyses are performed within memory (default: False)
--Without_allowance_adjustment	Use this option if you do not want to modify the allowance value for every homology calculation (default: False)
--circularDNA	Use this option if there are circular DNAs in the input sequences (default: n/a, which means that all input sequences are linear DNA). After the '--circularDNA' option, type 'all', 'individually', 'n/a', or the file path of the text file that specifies which sequences are circular DNA.
--exclude_circularDNA	Use this option if there are circular DNAs in the input sequence(s) that you do not want to amplify (default: n/a, which means that all input sequences are linear DNA). After the '--exclude_circularDNA' option, type 'all', 'individually', 'n/a', or the file path of the text file that specifies which sequences are circular DNA.
--Fragment_size_pattern_matrix	When you have a csv file of fragment size pattern matrix, you can reanalyse from the csv file. Specify the file path (default: None.)

Input Sequence Preprocessing {ISP}

shrs ISP -i <input file> [options]

Below is the full list of supported options for the ISP command line.

Option	Description
-i, --input_file	Input file path (required, format: FASTA or GenBank)
-o, --output	Output directory path (Make a new directory if the directory does not exist)
--circularDNA	Use this option if there are circular DNAs in the input sequences (default: n/a, which means that all input sequences are linear DNA). After the '--circularDNA' option, type 'all', 'individually', 'n/a', or the file path of the text file that specifies which sequences are circular DNA.
--circularDNAoverlap	Maximum size of the overlapped region in circular DNA (default: 10,000). To reduce computational complexity, you can lower this value to the larger of the upper limit of amplicon size and the interval distance that will be set with '--Cut_off_upper' and '--distance' in the subsequent DIP or DUP analysis.
--Single_target	All input files will be concatenated and generated as one sequence if you use this option, even if separated multi-FASTA files are inputted.
--Multiple_targets	All input files are recognized as an individual file, and every file is preprocessed separately.
--Single_file	When you use the '--Multiple_targets' option and this option, a preprocessed sequence file will be outputted as one multi-FASTA file.

Additional Analysis {AA}

shrs AA -i <input file> -f <csv file> [options]

Below is the full list of supported options for the AA command line.

Option	Description
-i, --input_file	Input file path (required, format: FASTA or GenBank)
-f, --csv_file	File path of the CSV data obtained from other analysis (DIP or DUP) (required)
-o, --output	Output directory path (Make a new directory if the directory does not exist)
-s, --size_limit	The upper limit of amplicon size (default: 3,000)
-P, --process	The number of processes (sometimes the number of CPU core) used for analysis
-fwd, --forward	Forward primer sequence (required if you don't provide a CSV file with the '-f' option)
-rev, --reverse	Reverse primer sequence (required if you don't provide a CSV file with the '-f' option)
--circularDNA	Use this option if there are circular DNAs in the input sequences (default: n/a, which means that all input sequences are linear DNA). After the '--circularDNA' option, type 'all', 'individually', 'n/a', or the file path of the text file that specifies which sequences are circular DNA.
--warning	Shows all warnings when you use this option

in silico PCR {iPCR}

shrs iPCR -i <input file> -fwd <forward primer sequence> -rev <reverse primer sequence> [options]

Below is the full list of supported options for the iPCR command line.

Option	Description
-i, --input_file	Input file path (required, format: FASTA or GenBank)
-o, --output	Output directory path (Make a new directory if the directory does not exist)
-s, --size_limit	The upper limit of amplicon size (default: 10,000)
-P, --process	The number of processes (sometimes the number of CPU core) used for analysis
-fwd, --forward	The forward primer sequence used for amplification (required)
-rev, --reverse	The reverse primer sequence used for amplification (required)
-f, --primerset_filepath	The filepath of the text file containing forward and reverse primer sequence (required if forward and reverse primer set with -fwd and -rev option are not provided)
--fasta	Output format. A FASTA file will be generated if you use this option.
--Single_file	Output format. One single FASTA-format file will be generated even if you input some separate FASTA files, when using this option with the '--fasta' option.
--Mismatch_allowance	The acceptable mismatch number (default: 0)
--Only_one_amplicon	Only one amplicon is outputted, even if multiple amplicons are obtained by PCR when you use this option.
--Position_index	The result has the information of the amplification position when this option is enabled.
--circularDNA	Use this option if there are circular DNAs in the input sequences (default: n/a, which means that all input sequences are linear DNA). After the '--circularDNA' option, type 'all', 'individually', 'n/a', or the file path of the text file that specifies which sequences are circular DNA.
--gene_annotation_search_range	The gene annotation search range in the GenBank-format file. (default: 100)
--LowQualitySequences	How to handle sequences containing regions with a high proportion of 'N' bases in the in silico PCR analysis. Specify 'remove' to omit all such sequences, which reduces computational effort and calculation time. Specify 'individually' to select the sequences to omit one by one. Specify 'ignore' to keep the sequences but not amplify regions spanning 'N' bases. Specify 'remain' to use all input sequences as in silico PCR templates. (default: 'remain')
--Annotation	If the input sequence file is in GenBank format, the amplicon(s) are annotated automatically.
--warning	Shows all warnings when you use this option
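
The idea behind several of the options above (`-fwd`, `-rev`, `--size_limit`, `--Mismatch_allowance`) can be illustrated with a minimal Python sketch. This is not the code used by shrs; it is a toy model that assumes a single linear template, unambiguous A/C/G/T bases, and only one primer orientation (forward primer on the given strand). All function names are hypothetical.

```python
# Toy in silico PCR on one linear template. Illustrative only; not shrs code.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(template, primer, max_mismatch=0):
    """Return start positions where primer matches with at most max_mismatch mismatches."""
    hits = []
    for start in range(len(template) - len(primer) + 1):
        window = template[start:start + len(primer)]
        mismatches = sum(a != b for a, b in zip(window, primer))
        if mismatches <= max_mismatch:
            hits.append(start)
    return hits


def in_silico_pcr(template, fwd, rev, size_limit=10_000, max_mismatch=0):
    """Return (start, end, amplicon) for each primer-site pair within size_limit."""
    template = template.upper()
    fwd = fwd.upper()
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement on the given strand.
    rev_site = reverse_complement(rev)
    fwd_hits = find_sites(template, fwd, max_mismatch)
    rev_hits = find_sites(template, rev_site, max_mismatch)
    amplicons = []
    for f_start in fwd_hits:
        for r_start in rev_hits:
            end = r_start + len(rev_site)
            # Keep pairs where the reverse site lies downstream of the forward
            # site and the product does not exceed the size limit.
            if r_start >= f_start + len(fwd) and end - f_start <= size_limit:
                amplicons.append((f_start, end, template[f_start:end]))
    return amplicons


template = "TTTACGTACGGGGGATTGACAAA"
print(in_silico_pcr(template, "ACGTAC", "GTCAAT"))
# [(3, 20, 'ACGTACGGGGGATTGAC')]
```

Lowering `size_limit` below the amplicon length (17 bases here) returns an empty list, which mirrors the role of the upper limit of amplicon size.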

Design probes {DP}

shrs DP -i <input file> [options]

Below is the full list of supported options for the DP command line.

Option	Description
-i, --input_file	Input file path (required, format: FASTA or GenBank)
-e, --exclude_file	File path of the exclusion sequence(s). Specify the sequence file(s) of any bacteria that the probe should not hybridize with. (format: FASTA or GenBank)
-s, --probe_size	Probe size (default: 25)
-a, --allowance	Mismatch allowance ratio (default: 0.25, which means that a 7-base [25 * 0.25] mismatch is accepted). Note that setting this parameter too high might increase run time and cause excessive memory consumption.
-r, --range	Search range from probe size (default: 0. If the value is 1, probes 25–26 bases long are explored)
-d, --distance	The minimum distance between annealing sites that are hybridized with a probe (default: 100)
-o, --output	Output directory path (Make a new directory if the directory does not exist)
-P, --process	The number of processes (often equal to the number of CPU cores) used for analysis
--Exclude_mode	The method used to exclude the sequence(s) specified with the '-e' option: 'fast' or 'standard' (default: fast)
--Result_output	The upper limit of result output (default: 10,000)
--Match_rate	The ratio of trials for which the probe candidate fulfilled all criteria when the allowance value is decreased. The higher the value, the more specific the resulting probe (default: 0.8)
--Chunks	The chunk size for calculation. If the memory usage for calculation using the chunk size exceeds the GPU memory size, the processing unit for calculation will be switched to the CPU automatically (default: Auto)
--Maximum_annealing_site_number	The maximum acceptable number of annealing sites of a probe candidate in the input sequence (default: 5)
--Search_mode	There are four options: exhaustive/moderate/sparse/manual. If you choose the 'manual' option, type the ratio to probe length after 'manual'. (e.g. --Search_mode manual 0.2.) (default: moderate when the average input sequence length is >5000, and exhaustive when the average input sequence length is ≤5000)
--withinMemory	All analyses are performed within memory (default: False)
--Without_allowance_adjustment	Use this option if you do not want to modify the allowance value for every homology calculation (default: False)
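
The arithmetic behind the `-s`, `-a`, and `-r` descriptions above can be checked with a short Python sketch. It is not taken from shrs. In particular, rounding the product up is an assumption: it is inferred only from the quoted example, where 25 * 0.25 = 6.25 is described as a 7-base mismatch. The function names are hypothetical.

```python
import math


def max_mismatches(probe_size, allowance):
    """Number of mismatched bases tolerated for a probe of the given size.

    Assumption: the product is rounded up, which reproduces the 7-base
    figure quoted for a probe size of 25 and an allowance of 0.25.
    """
    return math.ceil(probe_size * allowance)


def candidate_lengths(probe_size, search_range):
    """Probe lengths explored for a given search range (inclusive)."""
    return list(range(probe_size, probe_size + search_range + 1))


def count_mismatches(probe, site):
    """Count mismatched positions between a probe and an equally long site."""
    if len(probe) != len(site):
        raise ValueError("probe and site must have the same length")
    return sum(a != b for a, b in zip(probe.upper(), site.upper()))


print(max_mismatches(25, 0.25))          # 7
print(candidate_lengths(25, 0))          # [25]
print(candidate_lengths(25, 1))          # [25, 26]
print(count_mismatches("ACGT", "ACCT"))  # 1
```

Under this assumption, a site would count as an annealing site when `count_mismatches` does not exceed `max_mismatches`, which is why a larger allowance widens the search and raises run time and memory use.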

Functions contained in library of `shrs`

shrslib.basicfunc
- class nucleotide_sequence
- complementary_sequence
- calculate_Tm_value
- read_sequence_file
shrslib.explore
- search_position
- PCR_amplicon
shrslib.scores
- calculate_flexibility
- calculate_score
- calculate_diff_length_score
- fragment_size_distance
- array_diff
- sequence_duplicated

See documentation for more detailed information.

Citing SHRS

Please cite the following article:

Takahashi, M., Morikawa, K., Akao, T., 2022. Short-length Homologous Region exhaustive Search algorithm (SHRS): A primer design algorithm for differentiating bacteria at the species, subspecies, or strain level based on a whole genome sequence. J. Microbiol. Methods 203, 106605. DOI: 10.1016/j.mimet.2022.106605

Update information

Version 0.10.0: The argument '--Group_id' has been added to DIP mode.

Version 0.11.0: Probe design functionality has been added.

Version 0.12.0: The argument '-f, --primerset_filepath' has been added to iPCR mode.

Version 0.13.0: The '--Reference_tree' and '--Only_sequence_with_feature_key' arguments have been added to the DIP mode. The '--Reference_tree' option sorts the results by Robinson-Foulds distance in comparison to the provided reference tree. The '--Only_sequence_with_feature_key' option generates primer sets based on regions annotated with a feature key (For more information, see Tips section above). The default value of the number of processors used in in silico PCR analysis has been changed to '1'.
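
For readers unfamiliar with the Robinson-Foulds distance mentioned above, the following Python sketch shows one common way to compute it. It is not the shrs implementation. It assumes rooted trees written as nested tuples of leaf names and counts the clades found in one tree but not the other; whether shrs uses a rooted or unrooted variant is not stated in this document.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a rooted nested-tuple tree."""
    if isinstance(tree, str):
        # A leaf contributes its own name but no internal clade.
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found


def robinson_foulds(tree_a, tree_b):
    """Count the clades present in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(clades_a ^ clades_b)


reference = ((("A", "B"), "C"), "D")
candidate = ((("A", "C"), "B"), "D")
print(robinson_foulds(reference, reference))  # 0
print(robinson_foulds(reference, candidate))  # 2
```

A distance of 0 means the two trees have identical topology, so sorting results by this value in ascending order would place the trees most similar to the reference first.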

Reference

The sequences used in this tutorial are listed below. All sequences were downloaded from INSDC member databases and the RefSeq database.

Strain name	Accession	Format
Mycoplasma genitalium G37	L43967	GenBank
Mycoplasma genitalium G37	AAGX01000001 - AAGX01000025	FASTA
Mycoplasma genitalium M2288	CP003773	GenBank
Mycoplasma genitalium M2321	CP003770	GenBank & FASTA
Mycoplasma genitalium M6282	NC_018496	GenBank & FASTA
Mycoplasma genitalium M6320	CP003772	GenBank & FASTA
Mycoplasma pneumoniae NCTC10119	NZ_LR214945	FASTA

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Release history

Version	Release date
0.13.2 (this version)	Nov 4, 2025
0.13.1	Oct 30, 2025
0.13.0	Aug 26, 2025
0.12.0	Feb 26, 2024
0.11.1	Aug 15, 2023
0.11.0	Aug 15, 2023
0.10.4	Dec 23, 2022
0.10.3	Nov 4, 2022
0.10.2	Nov 2, 2022
0.10.1	Nov 2, 2022
0.10.0	Oct 25, 2022
0.9.7	Oct 13, 2022
0.9.6	Sep 14, 2022
0.9.5	Aug 16, 2022
0.9.4	Jul 21, 2022
0.9.3	Jul 15, 2022
0.9.2	Jun 17, 2022
0.9.1	May 17, 2022
0.8.4	Mar 16, 2022
0.8.3	Mar 16, 2022
0.8.2	Mar 13, 2022
0.8.1	Mar 11, 2022
0.0.1	Mar 11, 2022

Download files


Source Distribution

shrs-0.13.2.tar.gz (112.6 kB)

Uploaded Nov 4, 2025 Source

Built Distribution


shrs-0.13.2-py3-none-any.whl (109.4 kB)

Uploaded Nov 4, 2025 Python 3

File details

Details for the file shrs-0.13.2.tar.gz.

File metadata

Download URL: shrs-0.13.2.tar.gz
Upload date: Nov 4, 2025
Size: 112.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for shrs-0.13.2.tar.gz
Algorithm	Hash digest
SHA256	`9bde663c76bb986365a75533920e8268dea7f8e80578fa2d8bab878c2646c567`
MD5	`bf28dbec529ed1f53106dbf2d77c60ca`
BLAKE2b-256	`ea062f5f51eb0c20a806c3bb615cfd943396a008f15b187e5dee0ecf4478af27`


File details

Details for the file shrs-0.13.2-py3-none-any.whl.

File metadata

Download URL: shrs-0.13.2-py3-none-any.whl
Upload date: Nov 4, 2025
Size: 109.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for shrs-0.13.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b8afbc3d31389d3a32d015a0c5d28018dcfa9a41595042b537f229d4b7ca3a83`
MD5	`84b6c87914b3858a42635bfa764d60fc`
BLAKE2b-256	`86c529e1c41cbac9218db09d6665503ecc123cc9c983f2d77d4adad27e643baf`

