A tool to get a protein sequences from Uniprot using a CLI or in Python.
Project description
getSequence
A tool to get a protein sequences from Uniprot using a CLI or in Python.
V2 was a major update and was released on February 16, 2024. This massively improved the functionality of getSequence from Python and the CLI as well as improving the in code and README.MD documentation.
What is getSequence?
getSequence is a Python package and command-line interface for getting protein sequences from Uniprot. If you have a list of protein names or Uniprot IDs, you can now fetch them from Uniprot either directly from the command-line or in Python.
How does it work?
getSequence will take in your protein name including organism name if you'd like or a Uniprot ID and query Uniprot. It then takes the top hit from Uniprot and gets the sequence information. You can specify multiple things from the command-line or from Python, exactly how you would if you were to use the search box on the Uniprot website. It's pretty nifty.
Installation
getSequence in availbale through PyPi - to install just run...
pip install getSequence
Alternatively, you can get getSequence directly from GitHub.
To clone the GitHub repository and gain the ability to modify a local copy of the code, run
git clone https://github.com/ryanemenecker/getSequence.git
cd getSequence
pip install .
This will install getSequence locally.
Usage:
There are two ways you can use getSequence:
- Directly from the command-line
- From within Python
Using getSequence from the command-line:
To use getSequence from the command-line, simply use the getseq
command followed by the name of your protein. The name of your protein can be just the protein name, the name and organism, or the Uniprot ID. If you use the Uniprot ID, it is recommended that you use the -u
flag to let getseq know that you're inputting a Uniprot ID.
Getting single sequences
By default, getseq will just print your sequence to the terminal. However you can also save the output (more on that later).
Example
getseq p53 mouse
>sp|p02340|p53 mouse cellular tumor antigen p53 os=mus musculus ox=10090 gn=tp53 pe=1 sv=4
MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD
Getting multiple sequences
You can also get multiple sequences using a single command by separating your queries by commas. Multiple entries must be comma separated.
Example
getseq mouse p53, human p53, dog p53
>sp|p02340|p53 mouse cellular tumor antigen p53 os=mus musculus ox=10090 gn=tp53 pe=1 sv=4
MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD
>sp|p04637|p53 human cellular tumor antigen p53 os=homo sapiens ox=9606 gn=tp53 pe=1 sv=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
>sp|q29537|p53 canlf cellular tumor antigen p53 os=canis lupus familiaris ox=9615 gn=tp53 pe=2 sv=2
MEESQSELNIDPPLSQETFSELWNLLPENNVLSSELCPAVDELLLPESVVNWLDEDSDDAPRMPATSAPTAPGPAPSWPLSSSVPSPKTYPGTYGFRLGFLHSGTAKSVTWTYSPLLNKLFCQLAKTCPVQLWVSSPPPPNTCVRAMAIYKKSEFVTEVVRRCPHHERCSDSSDGLAPPQHLIRVEGNLRAKYLDDRNTFRHSVVVPYEPPEVGSDYTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNVLGRNSFEVRVCACPGRDRRTEEENFHKKGEPCPEPPPGSTKRALPPSTSSSPPQKKKPLDGEYFTLQIRGRERYEMFRNLNEALELKDAQSGKEPGGSRAHSSHLKAKKGQSTSRHKKLMFKREGLDSD
Disabling the separation of multiple entries by commas
You can disable the separation of multiple entries by commas using the -i
or -ignore-commas
flag. This will let you input queries of protein names that contain commmas without having the query separated into multiple queries!
Example
getseq human Beta-1,4 N-acetylgalactosaminyltransferase -i
>sp|q6l9w6|b4gn3 human beta-1,4-n-acetylgalactosaminyltransferase 3 os=homo sapiens ox=9606 gn=b4galnt3 pe=1 sv=2
MGSPRAARPPLLLRPVKLLRRRFRLLLALAVVSVGLWTLYLELVASAQVGGNPLNRRYGSWRELAKALASRNIPAVDPHLQFYHPQRLSLEDHDIDQGVSSNSSYLKWNKPVPWLSEFRGRANLHVFEDWCGSSIQQLRRNLHFPLYPHIRTTLRKLAVSPKWTNYGLRIFGYLHPFTDGKIQFAIAADDNAEFWLSLDDQVSGLQLLASVGKTGKEWTAPGEFGKFRSQISKPVSLSASHRYYFEVLHKQNEEGTDHVEVAWRRNDPGAKFTIIDSLSLSLFTNETFLQMDEVGHIPQTAASHVDSSNALPRDEQPPADMLRPDPRDTLYRVPLIPKSHLRHVLPDCPYKPSYLVDGLPLQRYQGLRFVHLSFVYPNDYTRLSHMETHNKCFYQENAYYQDRFSFQEYIKIDQPEKQGLEQPGFEENLLEESQYGEVAEETPASNNQNARMLEGRQTPASTLEQDATDYRLRSLRKLLAQPREGLLAPFSKRNSTASFPGRTSHIPVQQPEKRKQKPSPEPSQDSPHSDKWPPGHPVKNLPQMRGPRPRPAGDSPRKTQWLNQVESYIAEQRRGDRMRPQAPGRGWHGEEEVVAAAGQEGQVEGEEEGEEEEEEEDMSEVFEYVPVFDPVVNWDQTFSARNLDFQALRTDWIDLSCNTSGNLLLPEQEALEVTRVFLKKLNQRSRGRYQLQRIVNVEKRQDQLRGGRYLLELELLEQGQRVVRLSEYVSARGWQGIDPAGGEEVEARNLQGLVWDPHNRRRQVLNTRAQEPKLCWPQGFSWSHRAVVHFVVPVKNQARWVQQFIKDMENLFQVTGDPHFNIVITDYSSEDMDVEMALKRSKLRSYQYVKLSGNFERSAGLQAGIDLVKDPHSIIFLCDLHIHFPAGVIDAIRKHCVEGKMAFAPMVMRLHCGATPQWPEGYWEVNGFGLLGIYKSDLDRIGGMNTKEFRDRWGGEDWELLDRILQAGLDVERLSLRNFFHHFHSKRGMWSRRQMKTL
Additional usage from the command-line
Saving the output of a query
If you want to save the output, you can use the -o
or --output
flag. This will save the output to a file to your specified destination.
Example - multiple sequence
getseq mouse p53, human p53, dog p53 -o /Users/thisCoolUser/Desktop/my_sequences/p53_seqs.fasta
Sequences saved to /Users/thisCoolUser/Desktop/my_sequences/p53_seqs.fasta
NOTE - by default, if you have any queries that fail and you're saving the output, getseq will save a second file called 'failed_queries.txt' that contains all queries that failed. getseq will let you know if this happens!
Example -
getseq mouse p53, human p53, doggggg p53 -o /Users/thisCoolUser/Desktop/my_sequences/p53_seqs.fasta
Sequences saved to /Users/thisCoolUser/Desktop/my_sequences/p53_seqs.fasta
Failed queries detected!
Failed queries saved to /Users/thisCoolUser/Desktop/my_sequences/failed_queries.txt
If you want to disable the save file output, you can use the -p
or -print_failed_queries
flag. This will just print the failed queries to the terminal and not save it to a file.
Example -
getseq mouse p53, human p53, doggggg p53 -o /Users/thisCoolUser/Desktop/my_sequences/p53_seqs.fasta
Sequences saved to /Users/thisCoolUser/Desktop/my_sequences/p53_seqs.fasta
Failed queries detected!
Failed queries:
Unable to retrieve sequence for the query 'doggggg p53'
Query by Uniprot ID
If you want to search by Uniprot ID, use the -u
or --uniprot
flag. The getseq command doesn't require you to use the -u
flag when using a Uniprot ID as the query, but it does increase the accuracy of your query if you do. If you use the -u
flag, you can also search for specific isoforms. If you do not use the -u flag, you will not be able to retrieve specific isoforms.
Example isoform 2
getseq P49021-2, P49021-3 -u
>>sp|P49021-2|TIM_DROME Isoform R of Protein timeless OS=Drosophila melanogaster OX=7227 GN=tim
MDWLLATPQLYSAFSSLGCLEGDTYVVNPNALAILEEINYKLTYEDQTLRTFRRAIGFGQNVRSDLIPLLENAKDDAVLESVIRILVNLTVPVECLFSVDVMYRTDVGRHTIFELNKLLYTSKEAFTEARSTKSVVEYMKHILESDPKLSPHKCDQINNCLLLLRNILHIPETHAHCVMPMMQSMPHGISMQNTILWNLFIQSIDKLLLYLMTCPQRAFWGVTMVQLIALIYKDQHGSGDSSPMLTSDPTSDSSDNGSNGRGMGGGMREGTAATLQEVSRKGQEYQNAMARVPADKPDGSEEASDMTGNDSEQPGSPEQSQPAGESMDDGDYEDQRHRQLNEHGEEDEDEDEVEEEEYLQLGPASEPLNLTQQPADKVNNTTNPTSSAPQGCLGNEPFKPPPPLPVRASTSAHAQMQKFNESSYASHVSAVKLGQKSPHAGQLQLTKGKCCPQKRECPSSQSELSDCGYGTQVENQESISTSSNDDDGPQGKPQHQKPPCNTKPRNKPRTIMSPMDKKELRRKKLVKRSKSSLINMKGLVQHTPTDDDISNLLKEFTVDFLLKGYSYLVEELHMQLLSNAKVPIDTSHFFWLVTYFLKFAAQLELDMEHIDTILTYDVLSYLTYEGVSLCEQLELNARQEGSDLKPYLRRMHLVVTAIREFLQAIDTYNKVTHLNEDDKAHLRQLQLQISEMSDLRCLFVLLLRRFNPSIHSKQYLQDLVVTNHILLLILDSSAKLGGCQTIRLSEHITQFATLEVMHYYGILLEDFNNNGEFVNDCIFTMMHHIGGDLGQIGVLFQPIILKTYSRIWEADYELCDDWSDLIEYVIHKFMNTPPKSPLTIPTTSLTEMTKEHNQEHTVCSWSQEEMDTLYWYYVQSKKNNDIVGKIVKLFSNNGNKLKTRISIIQQLLQQDIITLLEYDDLMKFEDAEYQRTLLTTPTSATTESGIEIKECAYGKPSDDVQILLDLIIKENKAQHLLWLQRILIECCFVKLTLRSGLKVPEGDHIMEPVAYHCICKQKSIPVVQWNNEQSTTMLYQPFVLLLHKLGIQLPADAGSIFARIPDYWTPETMYGLAKKLGPLDKLNLKFDASELEDATASSPSRYHHTGPRNSLSSVSSLDVDLGDTEELALIPEVDAAVEKAHAMASTPSPSEIFAVPKTKHCNSIIRYTPDPTPPVPNWLQLVMRSKCNHRTGPSGDPSDCIGSSSTTVDDEGFGKSISAATSQAASTSMSTVNPTTTLSLNMLNTFMGSHNENSSSSGCGGTVSSLSMVALMSTGAAGGGGNTSGLEMDVDASMKSSFERLEVNGSHFSRANNLDQEYSAMVASVYEKEKELNSDNVSLASDLTRMYVSDEDDRLERTEIRVPHYH
>>sp|P49021-3|TIM_DROME Isoform O of Protein timeless OS=Drosophila melanogaster OX=7227 GN=tim
MDWLLATPQLYSAFSSLGCLEGDTYVVNPNALAILEEINYKLTYEDQTLRTFRRAIGFGQNVRSDLIPLLENAKDDAVLESVIRILVNLTVPVECLFSVDVMYRTDVGRHTIFELNKLLYTSKEAFTEARSTKSVVEYMKHILESDPKLSPHKCDQINNCLLLLRNILHIPETHAHCVMPMMQSMPHGISMQNTILWNLFIQSIDKLLLYLMTCPQRAFWGVTMVQLIALIYKDQHVSTLQKLLSLWFEASLSESSEDNESNTSPPKQGSGDSSPMLTSDPTSDSSDNGSNGRGMGGGMREGTAATLQEVSRKGQEYQNAMARVPADKPDGSEEASDMTGNDSEQPGSPEQSQPAGESMDDGDYEDQRHRQLNEHGEEDEDEDEVEEEEYLQLGPASEPLNLTQQPADKVNNTTNPTSSAPQGCLGNEPFKPPPPLPVRASTSAHAQMQKFNESSYASHVSAVKLGQKSPHAGQLQLTKGKCCPQKRECPSSQSELSDCGYGTQVENQESISTSSNDDDGPQGKPQHQKPPCNTKPRNKPRTIMSPMDKKELRRKKLVKRSKSSLINMKGLVQHTPTDDDISNLLKEFTVDFLLKGYSYLVEELHMQLLSNAKVPIDTSHFFWLVTYFLKFAAQLELDMEHIDTILTYDVLSYLTYEGVSLCEQLELNARQEGSDLKPYLRRMHLVVTAIREFLQAIDTYNKVTHLNEDDKAHLRQLQLQISEMSDLRCLFVLLLRRFNPSIHSKQYLQDLVVTNHILLLILDSSAKLGGCQTIRLSEHITQFATLEVMHYYGILLEDFNNNGEFVNDCIFTMMHHIGGDLGQIGVLFQPIILKTYSRIWEADYELCDDWSDLIEYVIHKFMNTPPKSPLTIPTTSLTEMTKEHNQEHTVW
Just print sequence
If you just want to print the sequence, use the -j
or --just_sequence
flag. WARNING: If you do this, you will not know if you got the exact sequence that you want! The reason the Uniprot ID, organism, and protein name are printed back to you is to help you check that you got the protien that you want!
Example
getseq p53 mouse -j
MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD
Using getSequence from the Python:
getSequence has one function in Python called getseq
. To use it, Firt you need to import getseq.
from getSequence import getseq
Now you can use getseq.
The getseq
function can take either a string that is a single query or a list of strings for multiple queries. If it is a single query, by default it returns a list with the first entry in the list being the Uniprot header and the second one being the sequence. For multiple queries, a list of lists is returned.
Example - single query
getseq('p53')
['sp|p04637|p53 human cellular tumor antigen p53 os=homo sapiens ox=9606 gn=tp53 pe=1 sv=4', 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD']
Example - multiple queries
getseq(['human p53', ' dog p53', ' mouse p53'])
[['sp|p04637|p53 human cellular tumor antigen p53 os=homo sapiens ox=9606 gn=tp53 pe=1 sv=4', 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD'], ['sp|q29537|p53 canlf cellular tumor antigen p53 os=canis lupus familiaris ox=9615 gn=tp53 pe=2 sv=2', 'MEESQSELNIDPPLSQETFSELWNLLPENNVLSSELCPAVDELLLPESVVNWLDEDSDDAPRMPATSAPTAPGPAPSWPLSSSVPSPKTYPGTYGFRLGFLHSGTAKSVTWTYSPLLNKLFCQLAKTCPVQLWVSSPPPPNTCVRAMAIYKKSEFVTEVVRRCPHHERCSDSSDGLAPPQHLIRVEGNLRAKYLDDRNTFRHSVVVPYEPPEVGSDYTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNVLGRNSFEVRVCACPGRDRRTEEENFHKKGEPCPEPPPGSTKRALPPSTSSSPPQKKKPLDGEYFTLQIRGRERYEMFRNLNEALELKDAQSGKEPGGSRAHSSHLKAKKGQSTSRHKKLMFKREGLDSD'], ['sp|p02340|p53 mouse cellular tumor antigen p53 os=mus musculus ox=10090 gn=tp53 pe=1 sv=4', 'MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD']]
Additional usage in Python
Searching by Uniprot ID
When searching by Uniport ID, it is recommended that you explicitly set uniprot_id=True
when using the getseq
function. This will make the query explicitly call that specific Uniprot ID instead of searching Uniprot for the ID. This is the only way that you can get specific splicing isoforms when using getseq because searching specific isoforms using the Uniprot search functionality doesn't gaurantee the correct isoform.
Example
getseq('P49021-3', uniprot_id=True)
['>sp|P49021-3|TIM_DROME Isoform O of Protein timeless OS=Drosophila melanogaster OX=7227 GN=tim', 'MDWLLATPQLYSAFSSLGCLEGDTYVVNPNALAILEEINYKLTYEDQTLRTFRRAIGFGQNVRSDLIPLLENAKDDAVLESVIRILVNLTVPVECLFSVDVMYRTDVGRHTIFELNKLLYTSKEAFTEARSTKSVVEYMKHILESDPKLSPHKCDQINNCLLLLRNILHIPETHAHCVMPMMQSMPHGISMQNTILWNLFIQSIDKLLLYLMTCPQRAFWGVTMVQLIALIYKDQHVSTLQKLLSLWFEASLSESSEDNESNTSPPKQGSGDSSPMLTSDPTSDSSDNGSNGRGMGGGMREGTAATLQEVSRKGQEYQNAMARVPADKPDGSEEASDMTGNDSEQPGSPEQSQPAGESMDDGDYEDQRHRQLNEHGEEDEDEDEVEEEEYLQLGPASEPLNLTQQPADKVNNTTNPTSSAPQGCLGNEPFKPPPPLPVRASTSAHAQMQKFNESSYASHVSAVKLGQKSPHAGQLQLTKGKCCPQKRECPSSQSELSDCGYGTQVENQESISTSSNDDDGPQGKPQHQKPPCNTKPRNKPRTIMSPMDKKELRRKKLVKRSKSSLINMKGLVQHTPTDDDISNLLKEFTVDFLLKGYSYLVEELHMQLLSNAKVPIDTSHFFWLVTYFLKFAAQLELDMEHIDTILTYDVLSYLTYEGVSLCEQLELNARQEGSDLKPYLRRMHLVVTAIREFLQAIDTYNKVTHLNEDDKAHLRQLQLQISEMSDLRCLFVLLLRRFNPSIHSKQYLQDLVVTNHILLLILDSSAKLGGCQTIRLSEHITQFATLEVMHYYGILLEDFNNNGEFVNDCIFTMMHHIGGDLGQIGVLFQPIILKTYSRIWEADYELCDDWSDLIEYVIHKFMNTPPKSPLTIPTTSLTEMTKEHNQEHTVW']
Just return the sequence
To just return the protein sequence as a string, set just_sequence=True
Example
getseq('p53', just_sequence=True)
'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD'
Return a dictionary instead of a list
If you want to return a dictionary instead of a list, set return_dict=True
. The dictionary will have the Uniprot ID as the key and the sequence as the value.
Example
getseq(['human p53', ' dog p53', ' mouse p53'], return_dict=True)
{'sp|p04637|p53 human cellular tumor antigen p53 os=homo sapiens ox=9606 gn=tp53 pe=1 sv=4': 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD', 'sp|q29537|p53 canlf cellular tumor antigen p53 os=canis lupus familiaris ox=9615 gn=tp53 pe=2 sv=2': 'MEESQSELNIDPPLSQETFSELWNLLPENNVLSSELCPAVDELLLPESVVNWLDEDSDDAPRMPATSAPTAPGPAPSWPLSSSVPSPKTYPGTYGFRLGFLHSGTAKSVTWTYSPLLNKLFCQLAKTCPVQLWVSSPPPPNTCVRAMAIYKKSEFVTEVVRRCPHHERCSDSSDGLAPPQHLIRVEGNLRAKYLDDRNTFRHSVVVPYEPPEVGSDYTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNVLGRNSFEVRVCACPGRDRRTEEENFHKKGEPCPEPPPGSTKRALPPSTSSSPPQKKKPLDGEYFTLQIRGRERYEMFRNLNEALELKDAQSGKEPGGSRAHSSHLKAKKGQSTSRHKKLMFKREGLDSD', 'sp|p02340|p53 mouse cellular tumor antigen p53 os=mus musculus ox=10090 gn=tp53 pe=1 sv=4': 'MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD'}
Ignore individual queries when doing multiple queries
By default, getseq will raise an Exception if it is unable to return a sequence using your query. You can disable this by setting ignore_failures
to True. If you do this, getseq will print a out your failed queries instead of raising an Exception.
Example
getseq(['human p53', ' doggggggg p53', ' mouse p53'], ignore_failures=True)
Failed queries:
Unable to retrieve sequence for the query ' doggggggg p53'
[['sp|p04637|p53 human cellular tumor antigen p53 os=homo sapiens ox=9606 gn=tp53 pe=1 sv=4', 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD'], ['sp|p02340|p53 mouse cellular tumor antigen p53 os=mus musculus ox=10090 gn=tp53 pe=1 sv=4', 'MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPSPHCMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD']]
Changes
V2.2 (October 2024)
- Made pep517 compliant, now using tomls. h/t alex
V2.1 (March 2024)
- Added parallelization for queries h/t jeff
V2.0 (February 2024)
- Major update to command-line and Python functionality
- Added ability to do multiple queries from a single command from CLI
- Added ability to do multiple queries by inputting a list in Python
- Added ability to save output to a file from CLI
- Added ability to save output to a file from Python
- Added ability to get different isoforms of sequences from Python and CLI
Copyright
Copyright (c) 2022, Ryan Emenecker
Acknowledgements
•Project based on the Computational Molecular Science Python Cookiecutter version 1.6. •Project motivated by my never ending need to make doing science more enjoyable.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file getsequence-2.2.1.tar.gz
.
File metadata
- Download URL: getsequence-2.2.1.tar.gz
- Upload date:
- Size: 157.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.11
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 71baa37c89a8dd1fa90c0022caef4b0212f99efddbd4a4eb4a9a84b5a6903123 |
|
MD5 | 61dc2ef7aa212a1980bcb519cc9f35b9 |
|
BLAKE2b-256 | 38058a7fbe8f79b00418136f52323b35174adb113dad336db11572634260188e |
File details
Details for the file getSequence-2.2.1-py3-none-any.whl
.
File metadata
- Download URL: getSequence-2.2.1-py3-none-any.whl
- Upload date:
- Size: 140.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.11
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b99e189405ea8e12656a96a838af07640fa7958d2d9721699a8a478e2f87679a |
|
MD5 | 7cdb2e4d5325ea5f0177a1e6115ddb7f |
|
BLAKE2b-256 | 24296540ae742bf4a8b2cab2a47427b0b7ec41dde351108780aa3a476dcf7453 |