Skip to main content

This package contains the programs that design primer set for microbiological analysis and perform some accessory analysis.

Project description

Project description

shrs

Primer design method based on short length homologous region exhaustive search algorithm.


Table of Contents


Overview

This package contains the programs that design a primer set for the analysis of bacteria and perform some accessory analyses. This package contains the following six subcommands.

  • AA (For Additional analysis)
  • DIP (For design identification primer sets)
  • DUP (For design universal primer sets)
  • ISP (For input sequence preprocessing)
  • iPCR (For insilico PCR)
  • DP (For design probes)

You can get a summary of the available command-line options of each command by using the following command: shrs, shrs {subcommand} -h. Every program accepts FASTA or GenBank format sequence files. The input file must be encoded in UTF-8 format.


Environment

This package passed the operation verification under Windows 10, Windows 11, macOS Sequoia, Ubuntu 18.04, Ubuntu 20.04 and Ubuntu 22.04. When the 'cupy' package is available, some calculations will be performed using the GPU. The 'cupy' package is NOT installed automatically, even if you use the pip install shrs command. Please install the version of the 'cupy' package that matches the version of CUDA on your computer by yourself. Type pip install shrs[GPU] if you would like to install 'cupy' packages automatically. Analyses by a PC with a GPU are highly recommended. The processing time when a GPU is used could be less than that when a CPU is used.


Installation

pip

Type the following command.

 $ pip install shrs

or

 $ pip install shrs[GPU]

Please check above. (Environment section)

Installing from the source code

Download source code

Download the source from the following URL. https://pypi.org/project/shrs/#files

Installation from source

Place the source file on the current working directory, and then type the following commands.

 $ tar -xvzf shrs-0.13.2.tar.gz
 $ cd shrs-0.13.2
 $ python setup.py build
 $ python setup.py install

How to run shrs

To confirm a subcommand, type shrs. The following message will be shown.

usage: shrs [-h] [-v] {AA,DIP,DUP,ISP,iPCR,DP} ...

--- HELP message ---

positional arguments:
  {AA,DIP,DUP,ISP,iPCR,DP}
    AA                  see 'AA -h'. Make a fragment size matrix from a new template sequence and the result containing primer sets that you have already generated.
    DIP                 see 'DIP -h'. Primer design algorithm for the identification of bacteria
    DUP                 see 'DUP -h'. Primer design algorithm for universal primer
    ISP                 see 'ISP -h'. Input sequences preprocessing program for DUP or DIP
    iPCR                see 'iPCR -h'. In silico PCR amplification algorithm
    DP                  see 'DP -h'. Probe design algorithm

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Sample data

Sample data can be downloaded from here

Algorithm Hash value
MD5 9b40802495ce31bec85b5f541bd6a9d0
SHA1 ac84bc66e35c126f929d73c444846bec75d75920
SHA256 da69337717cc4bc101e3474e71b1884d6ba42b8f06772d113f30ff3664cb857e

Sample data contain a multi-FASTA file of complete genome DNAs of three strains of Mycoplasma genitalium, a FASTA file of the complete genome DNA of one strain of Mycoplasma pneumoniae, a multi-FASTA file of 25 contig sequences of Mycoplasma genitalium G37 and GenBank files of the complete genome DNAs of five strains of Mycoplasma genitalium. Place SampleData in the current working directory.

Workflow for design primer

Workflow

1. Design primer sets for identification without input data preprocessing.

Specify the multi-FASTA file or the folder containing target sequences (FASTA or GenBank format) as input file(s) after the '-i' option.

 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode sparse

It takes approx. 20 minutes to 2 hours, depending on the PC's performance. After the analysis, the results will be output in a SampleData/Result/(Timestamp)identify_primer_set/ folder. The CSV file contains a results table as shown below with the analysis parameters.

Arguments
Input_file_name sample.fasta
Target Strain1, Strain2, ... , Strain N
Exclude_file_name None
Exclude
Probe_size_range_description 23-25
allowance 0.15
cut_off_lower 50
cut_off_upper 1000
Interval_distance 10000
Match_rate 0.8
Result_output 10000
Search_mode sparse
Window_size 950
Maximum_annealing_site_number 5
Score_calculation_mode Fragment
Search_area Whole genome sequence
Reference_tree Not provided
Primer1 Primer2 No. Input sequence1 Input sequence2 ... Input sequenceN Score Fragment number Forward Tm_value Reverse Tm_value Tm_value difference
ATG..AA AGC..TA 1 [(655, 1.0)] [(528, 1.0)] ... [(411, 1.0)] 1200 N 60.0 61.5 1.5
TAG..TC GCC..TG 2 [(344, 1.0)] [(257, 1.0)] ... [(499, 1.0)] 1200 N 58.0 60.5 2.5

Input sequence columns indicate the amplicon length amplified by Primers 1 and 2 and its ratio to the total amplicon number [(Amplicon length, Ratio)]. Therefore, if a primer set produces three amplicons (355 bp, 355 bp, and 652 bp) from whole genome DNA, the results would be [(355, 0.67), (652, 0.33)]. Each primer set can be used for identification; however, combining primer sets will make the identification easier and more accurate. When the analysis is run with default settings, three primer sets are selected, and a dendrogram based on the information of fragments amplified by each primer set will be generated in the 'Dendrogram' folder. The 'Combination_number' option (default: 3) can be used to adjust the number of primer sets used for identification. This algorithm DIP designs primer sets that produce fragments from all input sequences and maximizes the differences in the amplicon size or amplicon sequence among input sequences. To confirm command-line options, type shrs DIP -h.

2. Design universal primer sets without input data preprocessing.

Specify the Multi-FASTA or the folder containing the target sequences (FASTA or GenBank format) as input file(s) after the '-i' option. In the following case, the design primer set can amplify genome DNAs of three strains of Mycoplasma genitalium and NOT produce any amplicons from genome DNA of one strain of Mycoplasma pneumoniae.

 $ shrs DUP -i SampleData/Mgenitalium_3strains.fasta -e SampleData/Mpneumoniae.fasta -a 0.25 -o SampleData/ --Search_mode sparse

It takes approx. 5–10 minutes, depending on the PC's performance. After the analysis, the results will be output in a SampleData/Result/(Timestamp)universal_primer_set/ folder. The format of the outputted CSV file is almost the same as that obtained for the results of a DIP analysis as mentioned above (see Procedure 1). This algorithm DUP designs primer sets that produce fragments from all input sequences and minimizes the differences in the amplicon size among input sequences. Note that the primer sets are designed to produce only one amplicon from each input sequence when shrs DUP is used.

To confirm command-line options, type shrs DUP -h.

3. Design primer sets for identification with input data preprocessing.

Each sequence that is input into the DIP subcommand at the same time is processed for differentiation from each other. Therefore, if an inputted multi-FASTA file has plasmid or multiple contig sequences of an identical strain, the primer set obtained will produce some amplicons that correspond to inputted contigs or plasmid. To avoid this problem, if genome DNA of an identical strain has been divided into multiple contigs, all contigs and the plasmid should be preprocessed (concatenated) prior to the DIP analysis, using the following subcommand. One multi-FASTA file for preprocessing must contain the contigs and plasmid derived from a single strain. Organize all multi-FASTA files in one folder when there are some multi-FASTA files, and then specify the folder path as the input file path.

 $ shrs ISP -i SampleData/contig/ -o SampleData/Preprocessed_data1/ --Single_file

A preprocessed multi-FASTA file will be generated, and then the preprocessed multi-FASTA file is analyzed by shrs DIP.

 $ shrs DIP -i SampleData/Preprocessed_data1/ -o SampleData/ --Search_mode sparse

This algorithm DIP designs primer sets that produce fragments from all input sequences and maximizes the differences in the amplicon size or amplicon sequence among input sequences.

To confirm command-line options, type shrs DIP -h or shrs ISP -h.

4. Design universal primer sets with input data preprocessing.

As mentioned above, if genome DNA of an identical strain has been divided into multiple contigs, all contigs and the plasmid should be preprocessed (concatenated) prior to a DUP analysis by using following subcommand. One multi-FASTA file for preprocessing must contain contigs and plasmid derived from a single strain. Organize all multi-FASTA files in one folder when there are some multi-FASTA files, and then specify the folder path as the input file path.

 $ shrs ISP -i SampleData/contig/ -o SampleData/Preprocessed_data2/ --Single_file

A preprocessed multi-FASTA file will be generated, and then the preprocessed multi-FASTA file is analyzed by shrs DUP.

 $ shrs DUP -i SampleData/Preprocessed_data2 -e SampleData/Mpneumoniae.fasta -a 0.20 -o SampleData/ --Search_mode sparse

This algorithm DUP designs primer sets that produce fragments from all input sequences and minimizes the differences in the amplicon size among input sequences. Note that the primer sets are designed to produce only one amplicon from each input sequence when shrs DUP is used.

To confirm command-line options, type shrs DUP -h or shrs ISP -h.

Re-analysis/Additional analysis

New sequences can be analyzed based on primer sets in a CSV file obtained from shrs DIP or shrs DUP. Delete the rows containing unnecessary primer sets, since an analysis of them would take a long time. Do not include blank rows between the rows containing primer sets.

 $ shrs AA -i SampleData/Mgenitalium_M6282.fasta -f SampleData/Previous_DIP_Result.csv -o SampleData/New_result/

Design probes

Specify the multi-FASTA file or the folder containing target sequences (FASTA or GenBank format) as input file(s) after the '-i' option. This algorithm DP designs probes that hybridize to all input sequences. Input sequence columns in result table indicate the annealing positions on each input sequence.

in silico PCR

You can get amplicon information by using in silico PCR.

 $ shrs iPCR -i SampleData/GenBank/ -fwd CTACACCCATTTACCCAACAGTATC -rev TCTCATTGGWAACATTCGGTACATC -o SampleData/insilicoPCR_Result/

A CSV file will be output with default settings. If you prefer FASTA over CSV, use the '--fasta' option. When you need to know the primer annealing site positions on a template sequence, use the '--Position_index' option. You can obtain the annotation information of amplicons, when a GenBank file is analyzed by in silico PCR with the '--Annotation' option.

 $ shrs iPCR -i SampleData/GenBank/ -fwd CTACACCCATTTACCCAACAGTATC -rev TCTCATTGGWAACATTCGGTACATC -o SampleData/insilicoPCR_annotation_Result/ --Annotation

If there are some primer sets, you can input a filepath of a text file that contains primer sets after -f option instead of typing primer sequences after -fwd and -rev option. The text file contains one primer set in each row and must be written in with utf-8 encoding. You can specify commas, spaces or tab character as a separator in the text file.

        Text file example:
            Forward_Primer_sequence1,Reverse_Primer_sequence1
            Forward_Primer_sequence2,Reverse_Primer_sequence2
            Forward_Primer_sequence3,Reverse_Primer_sequence3
                       ... 
            Forward_Primer_sequenceN,Reverse_Primer_sequenceN

Tips: Some command line options

  • '--circularDNA'/'--exclude_circularDNA' option {AA, DIP, DUP, ISP, iPCR}

    When input sequence file(s) contains one or more circular DNA sequence(s), use this option. Specify which sequence is circular DNA by using a subargument. All input sequences are processed as linear DNA in default settings ('n/a'). When all input sequences are circular DNA, use 'all' after the --circularDNA/--exclude_circularDNA option. For each individual sequence, you can specify whether the sequence is circular DNA or not by using 'individually'. You can also specify whether or not a sequence is circular DNA by inputting the file path of text file, as shown below.

          Text file example:
              Sequence_name1 circularDNA
              Sequence_name2 linearDNA
              Sequence_name3 linearDNA
                         ... 
              Sequence_nameN circularDNA
    
 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode sparse --circularDNA SampleData/circularDNA.txt
  • '--Search_mode' option {DIP, DUP}

    The '--Search_mode' option contains four choices: 'exhaustive', 'moderate', 'sparse', and 'manual'. Primer candidates will be generated by cutting every N-base from the reference sequence. The N-value is different depending on each Search_mode.

    Subargument N-bases
    exhaustive 1 base. Same as N-gram method (N: primer size)
    moderate One-third of primer size (Maximum: 10 bases)
    sparse primer size
    manual primer size * Ratio inputed by user
 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode manual 2
  • '-a' option {DIP, DUP}

    When the '-e' and '--Exclude_mode fast' options are used, the larger the value allowance is, the fewer in number the candidates will be.

  • '--Score_calculation' option {DIP}

    When 'Sequence' is selected in this option, the cumulative score of sequence homology among amplicons is used as an indicator to evaluate primer set candidates instead of the score calculated from fragment length. When this option is used, only primer sets that produce a single amplicon from the template sequence will be obtained because calculation of a sequence homology takes a long time. Additionally, a multiple alignment tool (MAFFT program) is required to generate a dendrogram when 'Sequence' is specified in 'Score_calculation' option. Before running this script, please install MAFFT and add the binary of MAFFT to your PATH. The dendrogram is constructed based on the UPGMA method, and the distance matrix calculated by the identity matrix. The gap(s) in the alignment(s) are used to calculate distance (NOT ignored). Note that the dendrogram obtained indicates the similarity of alignment sequences containing gap(s), and it does not always indicate the evolutionary relationships.

 $ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode sparse --Score_calculation Sequence
  • '-g' or '--Group_id' option {DIP}

    When this option is used, the algorithm tries to minimize the score between the sequences in intra-group and maximize the score between the sequences in inter-group. You can specify which sequences are same group by inputting the file path of text file, as shown below. You can specify commas, spaces or tab character as a separator in the text file. Please avoid to contain a sequence name in ID like as the combination of sequence name 'D12' and ID 'SeqID12'.

          Text file example:
              Sequence_name1 1
              Sequence_name2 2
              Sequence_name3 3
                         ... 
              Sequence_nameN 2
    
$ shrs DIP -i SampleData/Mgenitalium_3strains.fasta -o SampleData/ --Search_mode sparse --Score_calculation Sequence --Group_id SampleData/Group_id.txt
  • '--Fragment_size_pattern_matrix' and/or '--Fragment_start_position_matrix' option {DIP}

    When you would like to switch 'Score_calculation mode' to another one (or sometimes generate new dendrograms), it is possible to reanalyze same dataset in less time than before by specifying Fragment_size_pattern_matrix.csv and/or Fragment_start_position_matrix that has already obtained after --Fragment_size_pattern_matrix and/or --Fragment_start_position_matrix option. When 'Fragment' mode will be used, you specify only 'Fragment_size_pattern_matrix.csv'. On the other hand, both 'Fragment_size_pattern_matrix.csv' and 'Fragment_start_position_matrix.csv' have to be specified, when 'Sequence' mode will be used.

  • '--Reference_tree' option {DIP}

    Using '--Reference_tree' option, this tool calculates the Robinson-Foulds distance between each generated dendrogram and the reference tree, then sorts the results based on the similarity. When you have a reliable phylogenetic tree, this option makes it easy to select the primer set. The reference tree should be provided as either a character or a file path containing the phylogenetic tree in Newick format following the '--Reference_tree' option. Please note that all leaf names must correspond to the titles of the input sequences. To make things easier to understand, the titles of the sequences in the GenBank file have been changed from original titles in this tutorial.

$ shrs DIP -i SampleData/GenBank/ -o SampleData/ --Search_mode sparse --Reference_tree "(((G37,M6320),M2321),(M2288,M6282));"
  • '--Only_sequence_with_feature_key' option {DIP}

    Using '--Only_sequence_with_feature_key' option, this tool generates the primer set based on sorely regions that are annotated with a feature key. At least one GenBank-format file should be used as an input file, as the information about the feature key is required in order to extract the target regions. When this option is specified, all GenBank-format input files are trimmed based on the feature key information. However, Fasta-format files are not trimmed. Note that it does not matter if GenBank and Fasta format files are intermingled in the input sequences, because this tool generates primer sets that are able to amplify all input sequences.

$ shrs DIP -i SampleData/GenBank/ -o SampleData/ --Search_mode sparse --Only_sequence_with_feature_key

Command-line options

Design identification primer sets {DIP}
shrs DIP -i <input file> [options]

Below is the full list of supported options for the DIP command line.

Option Description
-i, --input_file Input file path (required, format: FASTA or Genbank)
-e, --exclude_file File path for exclusion sequence(s). Specify the sequence(s) file path of the bacteria if there are some bacteria that you would like to not amplify. (format: FASTA or Genbank)
-s, --primer_size Primer size (default: 25)
-a, --allowance Mismatch allowance ratio (default: 0.15. The value means that a 4-base [25 * 0.15] mismatch is accepted). Note that setting this parameter too large might causes the increased run time and excessive memory consumption.
-r, --range Search range from primer size (default: 0. If the value is 1, the primer sets that have 25–26 base length are explored)
-d, --distance The minimum distance between annealing sites that are hybridized with a primer (default: 10,000)
-o, --output Output directory path (Make a new directory if the directory does not exist)
-P, --process The number of processes (sometimes the number of CPU core) used for analysis
-g, --Group_id Type the file path of the text that specifies which sequences are same group, after the '--Group_id' option. Please avoid to contain a sequence name in ID like as the combination of sequence name 'D12' and ID 'SeqID12'. (default: None.)
--Exclude_mode Choose the method for excluding the sequence(s) from 'fast' or 'standard' when you specify some sequence(s) file path of the bacteria that you would like to exclude using '-e' option (default: fast.)
--Result_output The upper limit of result output (default: 10,000)
--Cut_off_lower The lower limit of amplicon size (default: 50)
--Cut_off_upper The upper limit of amplicon size (default: 1,000)
--Match_rate The ratio of trials for which the primer candidate fulfilled all criteria when the allowance value is decreased. The higher the number is, the more specific the primer obtained is (default: 0.8)
--Chunks The chunk size for calculation. If the memory usage for calculation using the chunk size exceeds the GPU memory size, the processing unit for calculation will be switched to the CPU automatically (default: Auto)
--Maximum_annealing_site_number The maximum acceptable value of the number of annealing site of the candidate of the primer in the input sequence (default: 5)
--Window_size The duplicated candidates containing this window will be removed (default: 950)
--Search_mode There are four options: exhaustive/moderate/sparse/manual. If you choose the 'manual' option, type the ratio to primer length after 'manual'. (e.g. --Search_mode manual 0.2.) (default: moderate when the average input sequence length is >5000, and exhaustive when the average input sequence length is ≤5000)
--withinMemory All analyses are performed within memory (default: False)
--Without_allowance_adjustment Use this option if you do not want to modify the allowance value for every homology calculation (default: False)
--circularDNA If there are some circular DNAs in the input sequences, use this option (default: n/a. It means all input sequences are linear DNA. When there are some circular DNA input sequences, type 'all', 'individually', 'n/a', or the file path of the text that specify which sequence is circularDNA, after the '--circularDNA' option.)
--exclude_circularDNA If there are some circular DNAs in the input sequence(s) that you do not want to amplify, use this option (default: n/a. It means all input sequences are linear DNA. When there is some circular DNA in the sequence file for exclusion, type 'all', 'individually', 'n/a', or the file path of the text that specifies which sequence is circularDNA, after the '--exclude_circularDNA' option.)
--Score_calculation The calculation method of the score for identifying microorganisms. Fragment length or sequence. When the 'Sequence' is specified, the primer set that produces only a single amplicon will be obtained in order to reduce computational complexity.
--Combination_number The number of primer sets to be used for identification (default: 3).
--Correlation_threshold The primer sets with a correlation coefficient greater than this are grouped, and two or more primer sets from the same group are never chosen sets (default: 0.9).
--Dendrogram_output The number supplied in this parameter will be used to construct dendrograms. As a result, the default parameters yield 10 dendrograms (default: 10, max: 100).
--Reference_tree Use this option when evaluating a primer set based on Robinson-Foulds distance between the output and reference trees. After the '--Reference_tree' option, specify either a phylogenetic tree in Newick format or the file path to one. (default: None).
--Only_sequence_with_feature_key This option should be used when designing primer sets based solely on the sequences with a feature key, such as a gene. (default: False).
--Fragment_size_pattern_matrix When you have a csv file of fragment size pattern matrix, you can reanalyse from the csv file. Specify the file path (default: None.)
--Fragment_start_position_matrix When you reanalyse from fragment size pattern matrix by 'Sequence' mode, specify the csv file path of fragment start position matrix (default: None.)
Design universal primer sets {DUP}
shrs DUP -i <input file> [options]

Below is the full list of supported options for the DUP command line.

Option Description
-i, --input_file Input file path (required, format: FASTA or Genbank)
-o, --output Output directory path (Make a new directory if the directory does not exist)
-e, --exclude_file File path for exclusion sequence(s). Specify the sequence(s) file path of the bacteria if there are some bacteria that you would like to not amplify. (format: FASTA or Genbank)
-s, --primer_size Primer size (default: 20)
-a, --allowance Mismatch allowance ratio (default: 0.20. The value means that a 4-base [20 * 0.20] mismatch is accepted). Note that setting this parameter too large might causes the increased run time and excessive memory consumption.
-r, --range Search range from primer size (default: 0. If the value is 1, the primer sets that have 20–21 base length are explored)
-d, --distance The minimum distance between annealing sites that are hybridized with a primer (default: 5,000)
-P, --process The number of processes (sometimes the number of CPU core) used for analysis
--Exclude_mode Choose the method for excluding the sequence(s) from 'fast' or 'standard' when you specify some sequence(s) file path of the bacteria that you would like to exclude using '-e' option (default: fast.)
--Cut_off_lower The lower limit of amplicon size (default: 50)
--Cut_off_upper The upper limit of amplicon size (default: 3,000)
--Match_rate The ratio of trials for which the primer candidate fulfilled all criteria when the allowance value is decreased. The higher the number is, the more specific the primer obtained is (default: 0.0)
--Result_output The upper limit of result output (default: 10,000)
--Omit_similar_fragment_size_pair Use this option if you want to omit primer sets that amplify similar fragment lengths
--Window_size The duplicated candidates containing this window will be removed (default: 50)
--Maximum_annealing_site_number The maximum acceptable value of the number of annealing site of the candidate of the primer in the input sequence (default: unlimited)
--Chunks The chunk size for calculation. If the memory usage for calculation using the chunk size exceeds the GPU memory size, the processing unit for calculation will be switched to the CPU automatically (default: Auto)
--Search_mode There are four options: exhaustive/moderate/sparse/manual. If you choose the 'manual' option, type the ratio to primer length after 'manual'. (e.g. --Search_mode manual 0.2.) (default: moderate when the average input sequence length is >5000, and exhaustive when the average input sequence length is ≤5000)
--withinMemory All analyses are performed within memory (default: False)
--Without_allowance_adjustment Use this option if you do not want to modify the allowance value for every homology calculation (default: False)
--circularDNA If there are some circular DNAs in the input sequences, use this option (default: n/a. It means all input sequences are linear DNA. When there are some circular DNA input sequences, type 'all', 'individually', 'n/a', or the file path of the text that specify which sequence is circularDNA, after the '--circularDNA' option.)
--exclude_circularDNA If there are some circular DNAs in the input sequence(s) that you do not want to amplify, use this option (default: n/a. It means all input sequences are linear DNA. When there is some circular DNA in the sequence file for exclusion, type 'all', 'individually', 'n/a', or the file path of the text that specifies which sequence is circularDNA, after the '--exclude_circularDNA' option.)
--Fragment_size_pattern_matrix When you have a csv file of fragment size pattern matrix, you can reanalyse from the csv file. Specify the file path (default: None.)
Input Sequence Preprocessing {ISP}
shrs ISP -i <input file> [options]

Below is the full list of supported options for the ISP command line.

Option Description
-i, --input_file Input file path (required, format: FASTA or Genbank)
-o, --output Output directory path (Make a new directory if the directory does not exist)
--circularDNA If there are some circular DNAs in the input sequences, use this option (default: n/a. It means all input sequences are linear DNA. When there are some circular DNA input sequences, type 'all', 'individually', 'n/a', or the file path of the text that specify which sequence is circularDNA, after the '--circularDNA' option.)
--circularDNAoverlap Maximum value of overlapped region in circular DNA (default: 10,000). For reducing computational complexity, you can reduce this value to the larger one of the upper limit of amplicon size and interval distance that will be set by '--Cut_off_upper' and '--distance' in following DIP or DUP.
--Single_target All input files will be concatenated and generated as one sequence if you use this option, even if separated multi-FASTA files are inputted.
--Multiple_targets All input files are recognized as an individual file, and every file is preprocessed separately.
--Single_file When you use the '--Multiple_targets' option and this option, a preprocessed sequence file will be outputted as one multi-FASTA file.
Additional Analysis {AA}
shrs AA -i <input file> -f <csv file> [options]

Below is the full list of supported options for the AA command line.

Option Description
-i, --input_file Input file path (required, format: FASTA or Genbank)
-f, --csv_file File path of the CSV data obtained from other analysis (DIP or DUP) (required)
-o, --output Output directory path (Make a new directory if the directory does not exist)
-s, --size_limit The upper limit of amplicon size (default: 3,000)
-P, --process The number of processes (sometimes the number of CPU core) used for analysis
-fwd, --forward Forward primer sequence (required if you don't provide a CSV file with the '-f' option)
-rev, --reverse Reverse primer sequence (required if you don't provide a CSV file with the '-f' option)
--circularDNA If there are some circular DNAs in the input sequences, use this option (default: n/a. It means all input sequences are linear DNA. When there are some circular DNA input sequences, type 'all', 'individually', 'n/a', or the file path of the text that specify which sequence is circularDNA, after the '--circularDNA' option.)
--warning Shows all warnings when you use this option
in silico PCR {iPCR}
shrs iPCR -i <input file> -fwd <forward primer sequence> -rev <reverse primer sequence> [options]

Below is the full list of supported options for the iPCR command line.

Option Description
-i, --input_file Input file path (required, format: FASTA or Genbank)
-o, --output Output directory path (Make a new directory if the directory does not exist)
-s, --size_limit The upper limit of amplicon size (default: 10,000)
-P, --process The number of processes (sometimes the number of CPU core) used for analysis
-fwd, --forward The forward primer sequence used for amplification (required)
-rev, --reverse The reverse primer sequence used for amplification (required)
-f, --primerset_filepath The filepath of the text file containing forward and reverse primer sequence (required if forward and reverse primer set with -fwd and -rev option are not provided)
--fasta Output format. A FASTA file will be generated if you use this option.
--Single_file Output format. One single FASTA-format file will be generated even if you input some separate FASTA files, when using this option with the '--fasta' option.
--Mismatch_allowance The acceptable mismatch number (default: 0)
--Only_one_amplicon Only one amplicon is outputted, even if multiple amplicons are obtained by PCR when you use this option.
--Position_index The result has the information of the amplification position when this option is enabled.
--circularDNA If there are some circular DNAs in the input sequences, use this option (default: n/a. It means all input sequences are linear DNA. When there are some circular DNA input sequences, type 'all', 'individually', 'n/a', or the file path of the text that specify which sequence is circularDNA, after the '--circularDNA' option.)
--gene_annotation_search_range The gene annotation search range in the GenBank-format file. (default: 100)
--LowQualitySequences In in silico PCR analysis, all sequences containing the regions with a high proportion of 'N' bases will be omitted when you specify 'remove' option. This option helps to reduce a computational effort and calculation time. If you select the omitted sequences individually, specify 'individually'. When the 'ignore' option is selected, regions spanning 'N' bases will not be amplified. To use all input sequences for the in silico PCR template, specify the 'remain' option. (Default: 'remain')
--Annotation If the input sequence file is in GenBank format, the amplicon(s) is annotated automatically.
--warning Shows all warnings when you use this option
Design probes {DP}
shrs DP -i <input file> [options]

Below is the full list of supported options for the DP command line.

Option Description
-i, --input_file Input file path (required, format: FASTA or Genbank)
-e, --exclude_file File path for exclusion sequence(s). Specify the sequence(s) file path of the bacteria if there are some bacteria that you would like to not hybridize. (format: FASTA or Genbank)
-s, --probe_size Probe size (default: 25)
-a, --allowance Mismatch allowance ratio (default: 0.25. The value means that a 7-base [25 * 0.25] mismatch is accepted). Note that setting this parameter too large might causes the increased run time and excessive memory consumption.
-r, --range Search range from probe size (default: 0. If the value is 1, the probe that have 25–26 base length is explored)
-d, --distance The minimum distance between annealing sites that are hybridized with a probe (default: 100)
-o, --output Output directory path (Make a new directory if the directory does not exist)
-P, --process The number of processes (sometimes the number of CPU core) used for analysis
--Exclude_mode Choose the method for excluding the sequence(s) from 'fast' or 'standard' when you specify some sequence(s) file path of the bacteria that you would like to exclude using '-e' option (default: fast.)
--Result_output The upper limit of result output (default: 10,000)
--Match_rate The ratio of trials for which the probe candidate fulfilled all criteria when the allowance value is decreased. The higher the number is, the more specific the probe obtained is (default: 0.8)
--Chunks The chunk size for calculation. If the memory usage for calculation using the chunk size exceeds the GPU memory size, the processing unit for calculation will be switched to the CPU automatically (default: Auto)
--Maximum_annealing_site_number The maximum acceptable value of the number of annealing site of the candidate of the probe in the input sequence (default: 5)
--Search_mode There are four options: exhaustive/moderate/sparse/manual. If you choose the 'manual' option, type the ratio to probe length after 'manual'. (e.g. --Search_mode manual 0.2.) (default: moderate when the average input sequence length is >5000, and exhaustive when the average input sequence length is ≤5000)
--withinMemory All analyses are performed within memory (default: False)
--Without_allowance_adjustment Use this option if you do not want to modify the allowance value for every homology calculation (default: False)

Functions contained in library of shrs

  • shrslib.basicfunc
    • class nucleotide_sequence
    • complementary_sequence
    • calculate_Tm_value
    • read_sequence_file
  • shrslib.explore
    • search_position
    • PCR_amplicon
  • shrslib.scores
    • calculate_flexibility
    • calculate_score
    • calculate_diff_length_score
    • fragment_size_distance
    • array_diff
    • sequence_duplicated

See documentation for more detailed information.


Citing SHRS

Please cite the following article:

Takahashi, M., Morikawa, K., Akao, T., 2022. Short-length Homologous Region exhaustive Search algorithm (SHRS): A primer design algorithm for differentiating bacteria at the species, subspecies, or strain level based on a whole genome sequence. J. Microbiol. Methods 203, 106605. DOI: 10.1016/j.mimet.2022.106605


Update information

Version 0.10.0: The argument '--Group_id' has been added to DIP mode.

Version 0.11.0: New ability for designing a probe has been added.

Version 0.12.0: The argument '-f, --primerset_filepath' has been added to iPCR mode.

Version 0.13.0: The '--Reference_tree' and '--Only_sequence_with_feature_key' arguments have been added to the DIP mode. The '--Reference_tree' option sorts the results by Robinson-Foulds distance in comparison to the provided reference tree. The '--Only_sequence_with_feature_key' option generates primer sets based on regions annotated with a feature key (For more information, see Tips section above). The default value of the number of processors used in in silico PCR analysis has been changed to '1'.


Reference

The sequences used in this tutorial are below. All sequences have been downloaded from INSDC member database and RefSeq database.

Strain name Accession Format
Mycoplasma genitalium G37 L43967 GenBank
Mycoplasma genitalium G37 AAGX01000001 - AAGX01000025 Fasta
Mycoplasma genitalium M2288 CP003773 GenBank
Mycoplasma genitalium M2321 CP003770 GenBank & Fasta
Mycoplasma genitalium M6282 NC_018496 GenBank & Fasta
Mycoplasma genitalium M6320 CP003772 GenBank & Fasta
Mycoplasma pneumoniae NCTC10119 NZ_LR214945 Fasta

License

This software is released under the MIT License. Copyright (c) 2022 Masayuki TAKAHASHI

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

shrs-0.13.2.tar.gz (112.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

shrs-0.13.2-py3-none-any.whl (109.4 kB view details)

Uploaded Python 3

File details

Details for the file shrs-0.13.2.tar.gz.

File metadata

  • Download URL: shrs-0.13.2.tar.gz
  • Upload date:
  • Size: 112.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for shrs-0.13.2.tar.gz
Algorithm Hash digest
SHA256 9bde663c76bb986365a75533920e8268dea7f8e80578fa2d8bab878c2646c567
MD5 bf28dbec529ed1f53106dbf2d77c60ca
BLAKE2b-256 ea062f5f51eb0c20a806c3bb615cfd943396a008f15b187e5dee0ecf4478af27

See more details on using hashes here.

File details

Details for the file shrs-0.13.2-py3-none-any.whl.

File metadata

  • Download URL: shrs-0.13.2-py3-none-any.whl
  • Upload date:
  • Size: 109.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for shrs-0.13.2-py3-none-any.whl
Algorithm Hash digest
SHA256 b8afbc3d31389d3a32d015a0c5d28018dcfa9a41595042b537f229d4b7ca3a83
MD5 84b6c87914b3858a42635bfa764d60fc
BLAKE2b-256 86c529e1c41cbac9218db09d6665503ecc123cc9c983f2d77d4adad27e643baf

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page