Single Amino Acids Polymorphism Statistics
Project description
SAAPS
Single Amino Acids Polylimorphism Statistics (SAAPS) is a Python3-based character interface program that can be used and deployed on multiple systems, including Windows, Linux, and MacOS. The software is mainly used to quickly calculate single amino acid polymorphism (SAP), Shannon information entropy at each site of the sequence, and to analyze sequence features and cluster analysis by using Onehot encoding and dimensionality reduction algorithms.
The raw input file of the SAAPS software is very simple, as long as the sequence file after alignment (nucleotide or amino acid), it can be used for processing and analysis. The main workflow of SAAPS is as follows: Firstly, by analyzing the characteristics of each sequence, the SAPs and Shannon information entropy of each point are calculated, and the corresponding result figures can be drawn at the same time. Then SAAPS screens the key sites according to the specified information entropy range, and transforms them into quantificable data by using Onehot encoding. Finally, the certain dimensionality reduction algorithm (PCA or t-SNE) is used to calculate the similarity level between sequences and draw the clustering scatter plot.
1. Download and install
SAAPS is developed via Python 3
, and you can get and install in following ways.
1.1 Pip method
SAAPS has been distributed to the standard library pf PYPI, and can be easily installed by the tool pip
.
pip install saaps
saaps -h
1.2 Local installation
In addition to the pip method, you can download and install it using the file setup.py
.
You need to download the repository, and then run:
python setup.py install
saaps -h
2. Getting help
You can use "-h" or "--help" to get the help document. The following is a brief introduction to the each parameter of SAAPS:
Input Options
Parameter | Description |
---|---|
-i, --input | the path of input file |
-sr, --shannon_result | the path of shannon result |
Basic Options
Parameter | Description |
---|---|
-dg, --delete_gap | delete the columns containing gaps in sequences |
-t, --translate | translate nucleotides to amino acids |
-cs, --cut_start | the start fragment/site of target sequence |
-ce, --cut_end | the end fragment/site of target sequence |
-osn, --output_seq_name | output sequence names |
-csn, --change_seq_name | change the sequence names |
Polymorphism Options
Parameter | Description |
---|---|
-cp, --compute_ppm | compute polymorphism |
-pp, --plot_ppm | plot polymorphism figure |
Plotting Options
Parameter | Description |
---|---|
-fw, --figure_width | the width of figure |
-fh, --figure_height | the height of figure |
-fname, --figure_name | the name of figure |
-c, --color | the palette of figure |
-pac, --print_all_colors | print all sipported palettes |
-dpi, --DPI | the dpi of the figure |
-fm, --figure_format | the format of figure |
-tp, --transparent | make the background of the figure transparent |
-sl, --scatter_label | display scatter plot labels |
Shannon Entropy Options
Parameter | Description |
---|---|
-sha, --shannon | compute shannon entropy |
-smax, --shannon_max | the maximum of the shannon value (IC) |
-smin, --shannon_min | the minimum of the shannon value (IC) |
-pic, --plot_ic | plot IC scatter figure |
Dimension Reduction Options
Parameter | Description |
---|---|
-oh, --onehot | OneHot Encoding |
-pca, --PCA | dimensionality reduction by PCA |
-tsne, --TSNE | dimensionality reduction by t-SNE |
-tpp, --tSNE_perplexity | the perplexity in t_SNE, default is 3 |
-tlr, --tSNE_learn_rate | the learning rate in t_SNE, default is 200 |
-tni, --tSNE_n_iter | the n_iter in t_SNE, default is 5000 |
-trs, --tSNE_random_state | the random state in t_SNE, default is 1 |
-pc, --plot_cluster | plot clustering scatter plot |
Output Options
Parameter | Description |
---|---|
-o, --output_dir | the directory path of output file, the default is the current user document folder |
-pre, --prefix | the prefix of the output file name |
3. Examples
This chapter uses the test data ('testSeq.txt' in the test folder) provided by the software as an example to describe the main functions of SAAPS. Note that all sample files in this manual are generated by the software using test data (" testSeq.txt "), and users can follow the tutorial to learn it before exploring other features on their own.
3.1 Preparation for raw materials
Users needs to prepare an alignment sequence file (nucleotide or amino acid file), the alignment process can be completed by MAFFT and other software, and the format of the sequence file after alignment should be "sequence name + (line/multiple lines) sequence body" structure. The aligned sequence file should look like this:
>TestSeqNo.1 Sequence1
GPQSGAIYVG……LWLDEEAMEQ
>TestSeqNo.2 Sequence2
GPLGQQSGAV……LWLDDEVMEQ
...
3.2 Sequence preprocessing
Before the subsequent analysis, it is usually necessary to format the sequence, such as whether the nucleotide sequence needs to be translated into amino acids, whether the sequence name needs to be changed, and so on. This allows us to get the best results in the subsequent analysis and drawing. Here we use the example of modifying the sequence name:
3.2.1 Output sequence names
saaps -i your/seq/path -osn -o your/output/path
Code explaination: -i your/seq/path
indicates that the matched sequence file is passed. -osn
indicates the sequence name in the output sequence file. By default, the output is in a .csv
table. -o your/output/path
indicates the output path. If this parameter is not specified, the result will be output in the default path.After the program is executed, a file named SeqName.csv
will be generated, which can be used for users to modify the sequence names individually. It is recommended to save the modified new names in the second column, so that subsequent programs can modify the sequence names in batches. The format of the file is as follows:
Old_Name | |
---|---|
>TestSeqNo.1 Sequence1 | |
>TestSeqNo.1 Sequence2 | |
... |
3.2.2 Change sequence names
saaps -i your/seq/path -csn your/names/file/path -o your/output/path
Code explaination:In this step, -csn your/names/file/path
represents the matching table of the old and new names in the sequence, which is the table output in the previous step. The pairing table format for the sequence names is as follows:
Old_Name | New_Name |
---|---|
>TestSeqNo.1 Sequence1 | >Seq1 |
>TestSeqNo.1 Sequence2 | >Seq2 |
... | ... |
The modified sequence file is named "NewNameSeq.txt" by default. The comparison between the original sequence file and the modified sequence file is as follows:
Old Seq File | New Seq File |
---|---|
>TestSeqNo.1 Sequence1 | >Seq1 |
GPQSGAIYVG……LWLDEEAMEQ | GPQSGAIYVG……LWLDEEAMEQ |
... | ... |
3.3 Polymorphism analysis
Although SAAPS was developed only for the analysis of SAP, the basis of polymorphism analysis is statistical analysis of the symbols and frequencies of each site in the sequence, so SAAPS can also be applied to sequences such as nucleotides.
saaps -i your/seq/path -cp -pp -o your/output/path
Code explaination:-csn your/names/file/path
represents the matching table of the old and new names in the sequence, which is the table output in the previous step. -cp
represents to activate the computational polymorphism function. -pp
represents that the polymorphism result is plotted.
The statistical result of polymorphism analysis will generate three files named Concise_Result.txt
, Detail_Result.txt
, and Seq_Matrix.csv
. If the drawing parameter (-pp
) is set, A resulting graph (default) named SAPs.pdf
is generated accordingly. The main result files are explained as follows:
Concise_Result.txt: a concise result of polymorphism analysis, which mainly records the frequency of amino acids at each site, and is also an input file for the subsequent calculation of information entropy at each point. The format is shown as follows (part) :
Site 1 (G, 15, 100.00%)
Site 2 (K, 7, 46.67%) (R, 1, 6.67%) (C, 1, 6.67%) …
… … … … …
Detail_Result.txt: detailed result of polymorphism analysis, which records the sequence name of each amino acid at each site. Its file content is shown below (part) :
col1 | col2 | col3 | coln |
---|---|---|---|
Site1 | (G: All Sequences) | ||
Site2 | (P: >Seq2,>Seq1) | (K: >Seq10,>Seq15,>Seq7,>Seq13,>Seq9,>Seq12,>Seq14) | ... |
Site3 | (F: Most Sequences) | (L: >Seq2) | ... |
... | ... | ... | ... |
Seq_Matrix.csv: indicates the sequence amino acid matrix. The main contents of the table are as follows (part) :
Seq1 | Seq2 | Seq3 | Seq4 | Seq5 | Seq6 | Seq7 | ... |
---|---|---|---|---|---|---|---|
G | G | G | G | G | G | G | ... |
P | P | A | A | A | A | K | ... |
... | ... | ... | ... | ... | ... | ... | ... |
SAPs.pdf: Polymorphism distribution map. The example figure is shown below, and users can customize the resulting graph according to the drawing parameters provided by the software:
3.4 Shannon entropy calculation
In the SAAPS calculation results, information content (IC value) is used to replace information entropy, and the higher the IC value, the more conservative the amino acid at the site. During SAAPS information entropy analysis, two result files, Shannon_log.txt
and Shannon_IC_Result.csv
, are generated by default.
saaps -i your/Concise/result/path -sha -pic -o your/output/path
Code explaination: When calculating information entropy, we need to pass the Concise_Result.txt
file in the previous step of polymorphism analysis, -sha
means to activate the information entropy calculation function, -pic
means to draw the information entropy result, and a picture file named IC-Plot.pdf
is generated in the output path by default.
Shannon_log.txt: indicates the simple statistical result of IC analysis. The contents of the document are as follows (part) :
Basic Statistic Information of the Shannon Entropy (IC)
IC
Count 150
mean 3.8647165333333335
... ...
Shannon_IC_Result.csv: IC values of all sites in sequence. The table contains all site IC values, which are convenient for the subsequent drawing of IC figure. The contents of the table are as follows (part) :
SITE | IC |
---|---|
Site1 | 4.45943 |
Site2 | 2.52919 |
Site3 | 3.75947 |
Site4 | 3.75947 |
... | ... |
IC-Plot.pdf: Sequence IC scatter figure. The figure is as follows:
3.5 Polymorphism site screening and Onehot encoding
saaps -i your/Seq/Matrix/path -sr your/IC/Result/path -smin ICmin -smax ICmax -o your/output/path
Code explaination:During the Onehot encoding process, the Seq_Matrix.csv
file from the polymorphism analysis in the previous step and the Shannon_IC_Result.csv
file from the information entropy calculation need to be passed in. If only the sites within the threshold range need to be converted, You can use -smin
and -smax
to set the IC threshold range. In the Onehot encoding step, the software automatically generates 4 files: OneHot-Original.csv
, One-Hot-ConciseMatrix.csv
, OneHot-Transform.csv
, and One-Hot-ForIntersect.csv
.
OneHot-Original.csv: a table of polymorphic loci selected according to the IC threshold range and used for onehot encoding. The file content is shown as follows (part) :
3 | 4 | 5 | 6 | 7 | 8 | ... | |
---|---|---|---|---|---|---|---|
>Seq1 | Q | S | G | A | I | Y | ... |
>Seq2 | L | G | Q | Q | S | G | --- |
... | ... | ... | ... | ... | ... | ... | ... |
One-Hot-ConciseMatrix.csv: the process file after the Onehot encoding of OneHot-Original.csv
, where each amino acid is represented by the corresponding binary number. The file content is shown as follows (part) :
3 | ... | |
---|---|---|
Seq1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] | ... |
Seq2 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | ... |
... | ... | ... |
OneHot-Transform.csv: the result of upgrading the dimensions according to the One-Hot-ConciseMatrix.csv
file, in which the binary list of each point is expanded into the corresponding number dimensions, and each dimension only contains the corresponding values in the binary list. The main contents of the file are shown as follows (part) :
3_1 | 3_2 | 3_3 | .. | |
---|---|---|---|---|
Seq1 | 0 | 0 | 0 | ... |
Seq2 | 0 | 0 | 0 | ... |
... | ... | ... | ... | ... |
One-Hot-ForIntersect.csv: indicates the record list of sequence polymorphism loci. The table records the amino acids of each sequence at the polymorphic site, which can be used for differential complement analysis of subsequent sets, etc. The contents of the table are as follows (part) :
3 | 4 | 5 | 6 | 7 | ... | |
---|---|---|---|---|---|---|
>Seq1 | 3Q | 4S | 5G | 6A | 7I | ... |
>Seq2 | 3L | 4G | 5Q | 6Q | 7S | ... |
>Seq3 | 3F | 4G | 5Q | 6Q | 7S | ... |
... | ... | ... | ... | ... | ... | ... |
3.6 Dimensionality reduction and clustering
The function of Onehot encoding is to convert classified data into numerical data that can be quantitatively analyzed. After the onehot encoding conversion, the information of each location of each sequence is converted into high-dimensional binary data. Therefore, in order to better analyze the sequence features, we usually adopt a certain dimensionality reduction algorithm to analyze it and observe the relationship between each sequence. Here we take the most commonly used PCA algorithm as an example.
saaps -i your/Transform/result/path -pca -pc -o your/output/path
Code explaination:In the process of dimensionality reduction analysis, the OneHot-Transform.csv
file in the previous step needs to be passed in; -pca
indicates that the PCA dimensionality reduction function is activated. -pc
means to draw a scatter plot based on the dimensionality reduction results. This process produces two result files called PCA.csv
and PCA-Plot.pdf
.
PCA.csv: PCA calculation result of sequence file. The contents of the document are as follows (part) :
0 | 1 | 2 | 3 | ... | |
---|---|---|---|---|---|
Seq1 | 6.254129 | -2.01068 | -0.27758 | -0.33611 | ... |
Seq2 | -0.74108 | -0.47749 | -0.1452 | -1.75072 | ... |
... | ... | ... | ... | ... | ... |
PCA-plot.pdf: Based on the PCA analysis results, the scatter Plot is generated according to the default parameters, and the example figure is as follows:
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file saaps-1.4.1.tar.gz
.
File metadata
- Download URL: saaps-1.4.1.tar.gz
- Upload date:
- Size: 892.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6e3f480644331a01d5c5a50480077a43e9cf5ca81fc8e571c1112b2bc48f1f67 |
|
MD5 | ffe2e4e20ceac73bce4c202f9071c929 |
|
BLAKE2b-256 | 954fd07a86cf364687ba06263039e734f61478a54ed75bc47f1d151eca5a3c81 |
File details
Details for the file saaps-1.4.1-py3-none-any.whl
.
File metadata
- Download URL: saaps-1.4.1-py3-none-any.whl
- Upload date:
- Size: 888.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.7.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 18f538600b1454fc957b3df9cbe1d235ccab05fa7185cddbda4f0f52738c2be9 |
|
MD5 | af62646e46eac5afc4eab3e06f8e5028 |
|
BLAKE2b-256 | 0228d2e2e1050a5c3cba6067536d5973d5210ba0596693ad12c7e3e1dfca6d84 |