Comprehensive Python client for the Uniprot REST API
Unipressed
Please visit the project website for more comprehensive documentation.
Introduction
Unipressed (Uniprot REST) is an API client for the protein database Uniprot. It provides thoroughly typed and documented code to ensure your use of the library is easy, fast, and correct!
Example
Let's say we're interested in very long proteins that are encoded within a chloroplast, in any organism:
import json
from unipressed import UniprotkbSearch

for record in UniprotkbSearch(
    query={
        "and_": [
            {"organelle": "chloroplast"},
            {"length": (5000, "*")}
        ]
    },
    fields=["length", "gene_names"]
).each_record():
    print(json.dumps(record, indent=4))
This will print:
{
    "primaryAccession": "A0A088CK67",
    "genes": [
        {
            "geneName": {
                "evidences": [
                    {
                        "evidenceCode": "ECO:0000313",
                        "source": "EMBL",
                        "id": "AID67672.1"
                    }
                ],
                "value": "ftsH"
            }
        }
    ],
    "sequence": {
        "length": 5242
    }
}
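Since each record is a plain dictionary, pulling out the values you care about needs no special tooling. The sketch below works on a trimmed copy of the record shown above (the evidence list is left out for brevity); `summarise` is a hypothetical helper written for this illustration and is not part of Unipressed.

```python
import json

# A trimmed copy of the record printed above (evidences omitted).
record_json = """
{
    "primaryAccession": "A0A088CK67",
    "genes": [{"geneName": {"value": "ftsH"}}],
    "sequence": {"length": 5242}
}
"""


def summarise(record: dict) -> tuple:
    """Hypothetical helper: return (accession, gene names, sequence length)."""
    # Be defensive: a record may have no "genes" key, and a gene entry
    # may not carry a "geneName".
    names = [
        gene["geneName"]["value"]
        for gene in record.get("genes", [])
        if "geneName" in gene
    ]
    return record["primaryAccession"], names, record["sequence"]["length"]


summary = summarise(json.loads(record_json))
print(summary)  # ('A0A088CK67', ['ftsH'], 5242)
```

In real use you would call such a helper on each item yielded by `each_record()` instead of on a hard-coded string.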
Advantages
- Detailed type hints for autocompleting queries as you type
- Autocompletion for return fields
- Documentation for each field
- Automatic results parsing, for json, tsv, list, and xml
- Built-in pagination, so you don't have to handle any of that yourself!
- Most of the API is automatically generated, ensuring very rapid updates whenever the API changes
- Thoroughly tested, with 41 unit tests and counting!
Usage
Installation
If you're using poetry:
poetry add unipressed
Otherwise:
pip install unipressed
Query Syntax
You can't go wrong by following the type hints.
I strongly recommend using something like Pylance for Visual Studio Code, which will provide automatic completions and warn you when you have used the wrong syntax.
If you already know how to use the Uniprot query language, you can always just input your queries as strings:
UniprotkbSearch(query="(gene:BRCA*) AND (organism_id:10090)")
However, if you want some built-in query validation and code completion using Python's type system, then you can instead use a dictionary. The simplest query is a dictionary with a single key:
UniprotkbSearch(query={"family": "kinase"})
For brevity, for the rest of this section we will omit everything but the value of the query argument.
You can compile more complex queries using the and_, or_ and not_ keys.
The first two of these operators (and_ and or_) take a list of query dictionaries:
{
    "and_": [
        {"family": "kinase"},
        {"organism_id": "9606"},
    ]
}
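To build some intuition for how the operator keys nest, here is a minimal sketch that renders a query dictionary as a query string in the style of the string example shown earlier. It is not Unipressed's implementation: `to_query_string` is a hypothetical name, the exact bracketing is a guess, and it assumes that each dictionary has exactly one key and that not_ takes a single query rather than a list.

```python
from typing import Union

# A query is either a raw string or a single-key dictionary.
Query = Union[str, dict]


def to_query_string(query: Query) -> str:
    """Hypothetical sketch: render a nested query dictionary as a string."""
    if isinstance(query, str):
        return query
    if len(query) != 1:
        raise ValueError("expected a dictionary with exactly one key")
    key, value = next(iter(query.items()))
    if key == "and_":
        return "(" + " AND ".join(to_query_string(q) for q in value) + ")"
    if key == "or_":
        return "(" + " OR ".join(to_query_string(q) for q in value) + ")"
    if key == "not_":
        # Assumption: not_ wraps one query, not a list of queries.
        return "(NOT " + to_query_string(value) + ")"
    # Anything else is treated as a leaf: a field name and its value.
    return f"({key}:{value})"


human_kinases = {"and_": [{"family": "kinase"}, {"organism_id": "9606"}]}
print(to_query_string(human_kinases))
# ((family:kinase) AND (organism_id:9606))

non_human_kinases = {
    "and_": [{"family": "kinase"}, {"not_": {"organism_id": "9606"}}]
}
print(to_query_string(non_human_kinases))
# ((family:kinase) AND (NOT (organism_id:9606)))
```

The point of the dictionary form is that the nesting is explicit, so an editor can check each level against the type hints before anything is sent to the server.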
Most "leaf" nodes of the query tree (i.e. those that aren't operators like and_) are strings, integers or floats, which you input as normal Python literals as you can see above.
For string fields, you also have access to wildcards, namely the * character.
For example, if you want every human protein belonging to a gene whose name starts with PRO, you could use:
{
    "and_": [
        {"gene": "PRO*"},
        {"organism_id": "9606"},
    ]
}
A few query fields are ranges, which you input using a tuple with two elements, indicating the start and end of the range.
If you use the literal "*" then you can leave the range open at one end.
For example, this query returns any protein whose length is in the range $(5000, \infty)$:
{"length": (5000, "*")}
Finally, a few query fields take dates.
These you input as a Python datetime.date object.
For example, to find proteins added to UniProt since July 2022, we would do:
from datetime import date
UniprotkbSearch(query={"date_created": (date(2022, 7, 1), "*")})
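Range and date tuples like the ones above correspond to the bracketed `[start TO end]` range syntax of the UniProt query language. The following sketch shows one way such a tuple could be rendered; `format_bound` and `format_range` are hypothetical helpers for illustration, and whether Unipressed produces exactly this string internally is an assumption.

```python
from datetime import date


def format_bound(bound) -> str:
    """Render one end of a range: dates as YYYY-MM-DD, anything else as-is."""
    if isinstance(bound, date):
        return bound.isoformat()
    # Numbers and the open-ended marker "*" are passed through unchanged.
    return str(bound)


def format_range(field: str, bounds: tuple) -> str:
    """Hypothetical sketch: render a (start, end) tuple as field:[start TO end]."""
    start, end = bounds
    return f"{field}:[{format_bound(start)} TO {format_bound(end)}]"


print(format_range("length", (5000, "*")))
# length:[5000 TO *]
print(format_range("date_created", (date(2022, 7, 1), "*")))
# date_created:[2022-07-01 TO *]
```

Passing `date` objects rather than hand-written strings means the date format is handled for you and a malformed date fails early in Python.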
Use with Visual Studio Code
To get VS Code to offer suggestions, press the Trigger Suggest shortcut, which is usually bound to Ctrl + Space.
In particular, code completion generally won't work until you open a string literal using a quotation mark.
Secondly, to get live access to the documentation, you can either use the Show Hover shortcut, which is usually bound to Ctrl + K, Ctrl + I, or you can install the docs-view extension, which lets you view the docstrings in the sidebar without interfering with your code.
Changelog
0.2.0
Note: if you are using Visual Studio Code, please update Pylance to at least version 2022.8.20.
A bug in earlier versions will give you false errors with this new release of unipressed.
Added
- Also allow strings within the query dictionary, so that e.g. this is now allowed:
  { "and_": [ "foo*", "*bar" ] }
  This will search for all proteins that have any field that starts with foo and any field that ends with bar.
- Auto generated docstrings for all fields
- Examples to the documentation of each field
- Certain missing query fields for the arba dataset: cc_scl_term
- Certain missing query fields for the proteomes dataset: organism_id, taxonomy_id
- Certain missing query fields for the unirule dataset: cc_scl_term
- Certain missing query fields for the uniparc dataset: taxonomy_id
- Certain missing query fields for the uniprotkb dataset: organism_id, taxonomy_id, virus_host_id
Removed
- Uniprot seem to have removed certain uniprotkb query fields, so these are now not part of the accepted query type: ft_metal, ftlen_metal, ft_ca_bind, ftlen_ca_bind, ft_np_bind, ftlen_np_bind
- Likewise, some uniprotkb return fields have been removed: ft_ca_bind, ft_metal, ft_np_bind
Internal
- Move from pyhumps to inflection for code generation
- Add a test for the date field
- Added tests for all datasets
- Add types for code generation API
Hashes for unipressed-0.2.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9f6db58762f65e305476412fc5ea77cdae94e844d1c61e7956170956da73610e
MD5 | 081f85844535ada3a5b844b271d2076f
BLAKE2b-256 | 4fd3ad0e986bcc4e93c3867f6ed3d87649f3926c40c7afbed31ab1beb50e9ad2