A set of scripts to ease the process of manipulating fasta files
Project description
fastaUtils
A set of scripts to ease the process of manipulating fasta files and building Multiple Sequence Alignments
Most functionalities have now been tested
Installation
pip install fastaUtils
Alternatively, clone the repository and simply add the following lines to your ~/.bashrc file
export PATH=$PATH:'<path_to_fastaUtils>/bin'
export PYTHONPATH=$PYTHONPATH:'<path_to_fastaUtils>'
Additional requirements
-
fst-download can be used to download the current version of uniprot (sprot+trembl) and metaclust databases. Even if folder where to save the databases can be passed to the program as an optional field, it is recommended to define the following environment variables:
UNIPROTFOLDER
andMETACLUSTFOLDER
, e.g. by addingexport UNIPROTFOLDER='path fo folder'
andexport METACLUSTFOLDER='path fo folder'
to your /.bashrc file. -
fst-grep requires the regex python package for matches with substitutions (imperfect matches)
Usage
Most commands try to copy the functionality of their equivalent unix function. In all cases run the command with -h to list a detailed help page
For all commands that requires an input fasta file, please note that is this input is not provided the program will try to read from stdin. In this way multiple commands can be concatenated to perform complex operations on fasta files.
fst-awk : run a command on every sequence of the input fasta file.
The language in which the command must be provided is different from the one employed by awk. Here are the main features:
-
BEGIN and END blocks become \BEGIN and \END in fst-awk
-
curly backets {...} always indentify a code block. Code blocks can be nested together
-
\BEGIN{statement1} if statement2 {do something} \END{statement3}
therefore is composed of a BEGIN code block, a main block that is executed only if statement2 is True and an END block. All blocks are optionals: by default the BEGIN and END blocks are empty, statement2 is True and the main block simply print the sequence in fasta format -
all statements should be valid python code, with few exceptions: dictionaries cannot be used (as they would be interpreted as code blocks), curly brackets cannot appear in strings (as they would be interpreted as code blocks), instead of indenting code blocks curly brackets must be used, do not use colons before a code block, use semicolons to separate lines.
-
new variables are used: db,uid,name,beg,end,descr,pe,sv,gn,ox (for each field in the fasta header), NR (current sequence index), seq (header and sequence), seq.seq (sequence) and seq.header (fasta header). If information is not provided in the fasta header, pe,sv and ox will default to -1, and all other variables to -
Examples:
fst-awk '{print(len(seq.seq))}' fastafile
will print the length of each sequence
fst-awk 'len(seq.seq)>100 {print(seq)}' fastafile
will print a fasta file containing all sequences longer than 100 amino acids
fst-awk 'seq.seq.count("-")<10 {print(seq)}' fastafile
will print a fasta file containing all sequences with no more than 10 gaps
fst-grep: search for substrings in a fasta file and return matching sequences.
Similarly to grep it accept the following options:
-
-c : return the number of matching sequences
-
-o : return a fasta file with all the matching parts of all sequences. Multiple matches for protein are possible. The fasta header is modified in order to add the information about the position of the match in the original sequence
-
-v : invert the match, aka return non matching sequences only. Non compatible with -o
fst-grep also accept -s option, specifying how many mismatching characters are acceptable. By default this number is zero.
Some rules/shortcut are available.
-
'_' matches all aminoacids, but not gaps
-
'.' matches gaps only
-
'+' and '-' match positive and negative amino acids
-
'@' matches hydrophobic amino acids
-
'#' matches polar amino acids
Example:
fst-grep 'HPD' fastafile
return all sequences from fastafile with an HPD motif
fst-grep '-@@@@-' -c
count the number of sequences with at least one negative-hydrofobic*4-negative motif
fst-grep 'HPD' -o fastafile -s 1
return all HPD motifs in fastafile, but also motifs in which at most one substitution is present (e.g. NPD)
fst-grep 'HPD' -v fastafile
return all sequences from fastafile without an HPD motif
fst-grep 'HPD|NPD' fastafile
return all sequences from fastafile with an HPD or NPD motif
fst-paste: append multiple sequences from different fasta files
The number of sequeces in the final fasta file is equal to the minimum number of sequences in the input files
Example:
fst-paste fasta1 fasta2 ...
fst-cut: remove columns from a fasta file
Similarly to cut, you can remove selected columns by passing -c beg-end col1 .. colN. Also accept -c beg- : in this case all columns from beg to the very last
in the sequence are removed. The counting starts at 1, so fst-cut -c 1 fastafile
will remove the first residue. If a range of column is used, the extrema are both removed.
To remove all columns except for selected ones, use the -r flag.
fst-cut accept a rule to dynamically decide which column to remove. The rule is written in python and passed with -R flag. A profile/logo is required in order to have access to aminoacid frequencies.
Example:
fst-cut -c 1-3 5 fastafile
print a fasta file in which columns 1,2,3 and 5 have been omitted
fst-cut -c 1-3 5 -r fastafile
print a fasta file with only columns 1,2,3 and 5
fst-cut -R 'freq["-"]>0.5' -p profile fastafile
print a fasta file where all columns with a frequency of gaps greater than 0.5 have been removed
fst-sort: reorder sequences
Reorder sequences (possibly in reverse order) according to a field in fasta header. Optionally the sequences can be shuffled (random order) or be ordered according to a predefined order (provided through an input file)
Example:
fst-sort -k uid fastafile
reorder sequences in alphabetical order of uid
fst-sort -k uid -F orderfile fastafile
reorder sequences to match orderfile. Orderfile must contain a list of uids (one per line). The output fasta will not contain sequences with an uid not in orderfile
fst-sort -R fastafile
shuffle input fasta file
fst-sort -k uid -r fastafile
reorder sequences in reverse alphabetical order of uid
fst-shuf: shuffle columns independently
Example:
fst-shuf -c 1 3 5 fastafile
will return a fasta file in which columns 1,3 and 5 have been shuffled independently from the others
fst-shuf -r 1:3 fastafile
will return a fasta file in which column 1,2 and 3 have been shuffled coherently with each others but independently from the rest of the sequence
fst-random: generate random sequences that match a given profile
Generated sequences can be completely random or build by modifying real ones.
Example:
fst-random profile 100
generate 100 random sequences with the single site frequencies defined in profile
fst-random profile 100 -t fastafile -d 0.9
generate 100 sequences by randomly mutate a fraction equal to 0.1 (1.-0.9) of amino acids. Mutations are such that
the single site frequencies are the same as in profile. The mutated residues are randomly chosen for each sequence. The real sequences are chosen randomly among those provided in fastafile,
possibly more than once
fst-profile: generate the profile/logo of a multiple sequence alignment
A profile is simply a list of all possible symbols that appear for each column together with its frequency
Example:
fst-profile fastafile
will print the profile to screen
fst-encode:
Encode a Multiple Sequence Alignment in binary vectors by concatenating the one-hot encoding of each amino acid.
Example:
fst-encode fastafile profile encoding.npy
fst-fromstockholm: convert from stockholm format to fasta format
Example:
fst-fromstockholm stockholmfile
will print the corresponding fasta file to screen
fst-download: download sequences and metadata from uniprot and/or metaclust
This program defines three commands:
-
fst-download uniprot
will download and extract uniprot_trembl and uniprot_sprot to $UNIPROTFOLDER/date/ and create a symlink to it in $UNIPROTFOLDER/current_release. Currently the download is from https://ftp.expasy.org -
fst-download metaclust
will download and extract the latest metaclust release in $METACLUSTFOLDER/date/ and create a symlink to it in $METACLUSTFOLDER/current_release. -
fst-download uniprot-metadata
will take as input a file with a list of uids, and will download all metadata for the corresponding sequences from the current uniprot release.
fst-split:
Split an input fasta file in multiple fasta files containing no more than a specified number of entries.
Example:
fst-split fastafile records -l 100
will create new fasta files named records_*.fasta
each containing no more than 100 entries from fastafile
.
fst-distance:
Return the pairwise hamming distance (or sequence identity) between sequences in a MSA or between sequences in two MSAs. Optionally the distances/identities can be averaged.
Example:
fst-distance fastafile1 --aggregate -to fastafile2 --seqid
fst-logo:
Generates a logo from a profile thanks to logomaker, a package for generating publication quality sequence logos. There are options to control aesthetics.
Example:
fst-logo profile output.png
generates something like this:
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file fastaUtils-1.0.0.tar.gz
.
File metadata
- Download URL: fastaUtils-1.0.0.tar.gz
- Upload date:
- Size: 21.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/51.1.1 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.9.1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | adf44e4d8bdbeff9c3c0b452446316c0519695b0fc63a3ae8fd2cf92d53b5241 |
|
MD5 | 0c40bd6bb82b22178391e59523dade80 |
|
BLAKE2b-256 | 6572ab1b8fbf46e142aea3f01b5b42be38a6c0aeb174b7ce390182ad56a2dfa2 |
File details
Details for the file fastaUtils-1.0.0-py3-none-any.whl
.
File metadata
- Download URL: fastaUtils-1.0.0-py3-none-any.whl
- Upload date:
- Size: 25.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/51.1.1 requests-toolbelt/0.9.1 tqdm/4.55.1 CPython/3.9.1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a2fa8d6536fc1bb7e3db86cd60ee30debe77ff9618c2f3cb7bcce54bd0d2e6bf |
|
MD5 | 8a87d5477c54b5f105b3d179f141837a |
|
BLAKE2b-256 | 424e03900a96b8496a79150f1488153b384fc4352d672cdb04502c44307b45c6 |