Python package interface for RCSB.org API services
Project description
rcsb-api
Python interface for RCSB PDB API services at RCSB.org.
This package requires Python 3.7 or later.
Installation
Get it from PyPI:
pip install rcsb-api
Or, download from GitHub
To import this package, use:
from rcsbapi.data import Schema, Query
Jupyter Notebooks
A notebook briefly summarizing the README is available in notebooks/quickstart.ipynb, or can be run online using binder:
Another notebook using both Search and Data API packages in a COVID-19 related example is available in notebooks/search_data_workflow.ipynb, or can be run online using binder:
Introduction
The RCSB PDB Data API supports requests using GraphQL, a language for API queries. This package simplifies generating queries in GraphQL syntax.
GraphQL is built on "types" and their associated "fields". All types and their fields are defined in a "schema". An example of a type in our schema is "CoreEntry" and a field under CoreEntry is "exptl" (experimental). Upon initialization, the Data API package fetches the schema from the RCSB PDB website (See Implementation Details for more).
In GraphQL, you must begin your query at specific fields. These are fields like entry, polymer_entity, and polymer_entity_instance (see full list here). Each field can return a scalar (e.g. string, integer) or a type. Every query must ultimately request scalar value(s), which can be seen in the example query below. As shown in the example, only fields are explicitly included in queries while types are implicit. Types are named in CamelCase (CoreEntry) while fields are in snake case (exptl or audit_author).
This is a query in GraphQL syntax requesting the experimental method of a structure with PDB ID 4HHB (Hemoglobin).
{
entry(entry_id: "4HHB") { # returns type "CoreEntry"
exptl { # returns type "Exptl"
method # returns a scalar (string)
}
}
}
Data is returned in JSON format
{
"data": {
"entry": {
"exptl": [
{
"method": "X-RAY DIFFRACTION"
}
]
}
}
}
To generate the same query in this package, you would create a Query object. The Query object must be executed using the .exec()
method, which will return the JSON response as well as store the response as an attribute of the Query object. From the object, you can access the Data API response, get an interactive editor link, or access the arguments used to create the query.
from rcsbapi.data import Query
query = Query(input_ids={"entry_id":"4HHB"},input_type="entry", return_data_list=["Exptl.method"])
query.exec()
One way this package simplifies making requests is by adding fields that return scalars into the generated query if you request a field that returns a type.
from rcsbapi.data import Query
query = Query(input_ids={"entry_id":"4HHB"},input_type="entry", return_data_list=["exptl"])
query.exec()
This creates a valid query even though "exptl" doesn't return a scalar. However, the resulting query will be more verbose, requesting all scalar fields under "exptl" (see return_data_list).
Query Objects
Constructing a query object requires three inputs. The JSON response to a query is stored in the response
attribute of a Query object and can be accessed using the get_response()
method.
from rcsbapi.data import Query
# constructing the Query object
query = Query(input_ids={"entry_id":"4HHB"},input_type="entry", return_data_list=["Exptl.method"])
# executing the query
query.exec()
# accessing the response
print(query.get_response())
input_ids
Specifies which entry, entity, etc you would like to request data for.
This can be a dictionary or a list. Dictionaries must be passed with specific keys corresponding to the input_type. You can find the key names by using the get_input_id_dict(input_type)
method (see Helpful Methods) or by looking in the GraphiQL editor Docs menu. Lists must be passed in PDB identifier format.
Type | PDB ID Format | Example |
---|---|---|
polymer, branched, or non-polymer entities | [entry_id]_[entity_id] | 4HHB_1 |
polymer, branched, or non-polymer entity instances | [entry_id].[asym_id] | 4HHB.A |
biological assemblies | [entry_id]-[assembly_id] | 4HHB-1 |
interface | [entry_id]-[assembly_id].[interface_id] | 4HHB-1.1 |
Dictionaries and Lists will be treated equivalently for the input_ids argument. For example, these input_ids arguments are equivalent.
# input_type is polymer_entity_instance
input_ids=["4HHB.A"]
input_ids={"entry_id":"4HHB", "asym_id":"A"}
# input_type is polymer_entity_instances (plural)
input_ids=["4HHB.A","4HHB.B"]
input_ids={"instance_ids":["4HHB.A","4HHB.B"]}
input_type
Specifies which field you are starting your query from.
input_types, also called "root fields", are designated points where you can begin querying. This includes entry, polymer_entity, polymer_entity_instance, etc. For the full list see below:
Full list of input_types
- entry
- entries
- polymer_entity
- polymer_entities
- branched_entity
- branched_entities
- nonpolymer_entity
- nonpolymer_entities
- polymer_entity_instance
- polymer_entity_instances
- nonpolymer_entity_instance
- nonpolymer_entity_instances
- branched_entity_instance
- branched_entity_instances
- assembly
- assemblies
- interface
- interfaces
- uniprot
- pubmed
- chem_comp
- chem_comps
- entry_group
- entry_groups
- polymer_entity_group
- polymer_entity_groups
- group_provenance
return_data_list
These are the data that you are requesting (or "fields").
In GraphQL syntax, the final requested data must be a "scalar" type (string, integer, boolean). However, if you request non-scalar data, the package will auto-populate the query to include all fields under the specified data until scalars are reached. Once you receive the query response and understand what specific data you would like to request, you can refine your query by requesting more specific fields.
from rcsbapi.data import Query
query = Query(input_ids={"entry_id":"4HHB"}, input_type="entry", return_data_list=["exptl"])
query.exec()
{
"data": {
"entry": {
"exptl": [
{
"details": null,
"crystals_number": null,
"method_details": null,
"method": "X-RAY DIFFRACTION"
}
]
}
}
}
This query can be made more concise by specifying a field, like "method". In this case, the field name "method" is redundant because it appears under other types and must be further specified using dot notation. For more details see ValueError: Not a unique field
from rcsbapi.data import Query
query = Query(input_ids={"entry_id":"4HHB"},input_type="entry", return_data_list=["Exptl.method"])
query.exec()
{
"data": {
"entry": {
"exptl": [
{
"method": "X-RAY DIFFRACTION"
}
]
}
}
}
Helpful Methods
There are several methods included to make working with query objects easier. These methods can help you refine your queries to request exactly and only what you want and further understand the GraphQL syntax.
get_editor_link()
This method returns the link to a GraphiQL window with the query. From the window, you can use the user interface to explore other fields and refine your query. Method of Query class.
from rcsbapi.data import Query
query = Query(input_ids={"entry_id":"4HHB"},input_type="entry", return_data_list=["exptl"])
print(query.get_editor_link())
get_unique_fields()
Given a redundant field, this method returns a list of matching fields in dot notation. You can look through the list to identify your intended field. Method of Schema class.
from rcsbapi.data import Schema
schema = Schema()
schema.get_unique_fields("id")
find_field_names()
Given a string, this method will return all fields containing that string, along with a description of each field.
from rcsbapi.data import Schema
schema = Schema()
schema.find_field_names("exptl")
get_input_id_dict()
Given an input_type, returns a dictionary with the corresponding keys and descriptions of each key. Method of Schema class.
from rcsbapi.data import Schema
schema = Schema()
schema.get_input_id_dict("polymer_entity_instance")
Trouble-shooting
ValueError: Not a unique field
Some fields are redundant within our GraphQL Data API schema. For example, "id" appears over 50 times. To allow for specific querying, redundant fields are identified by the syntax <type>.<field name>
. If you request a redundant field without this syntax, a ValueError
will be returned stating that the field exists, but is redundant. You can then use get_unique_fields("<field name>")
to find notation that would specify a unique field for the given name.
from rcsbapi.data import Query
# querying a redundant field
query = Query(input_ids={"entry_id":"4HHB"},input_type="entry", return_data_list=["id"])
query.exec()
> ValueError: "id" exists, but is not a unique field, must specify further. To find valid fields with this name, run: get_unique_fields("id")
from rcsbapi.data import Schema
# Run get_unique_field("<field name>")
schema = Schema()
print(schema.get_unique_fields("id"))
['PdbxStructSpecialSymmetry.id',
'RcsbBirdCitation.id',
'ChemComp.id',
'Entry.id',
...
'RcsbUniprotKeyword.id',
'RcsbPolymerInstanceAnnotationAnnotationLineage.id',
'RcsbPolymerStructConn.id']
from rcsbapi.data import Query
# valid Query
query = Query(input_ids={"entry_id":"4HHB"},input_type="entry", return_data_list=["Entry.id"])
query.exec()
Implementation Details
Parsing Schema
Upon initialization of the package, the GraphQL schema is fetched from the RCSB PDB website. After fetching the file, the Python package parses the schema and creates a graph object to represent it within the package. This graph representation of how fields and types connect is key to how queries are automatically constructed using a shortest path algoritm. By default the graph is constructed as a directed graph in rustworkx, but if an ImportError
is encountered, a NetworkX directed graph is created instead.
Constructing queries
Queries are constructed by finding the shortest path from an input_type
to each item in the return_data_list
. The name of each field in the path is found and used to construct a GraphQL query. Currently, constructing queries is not implemented using Networkx and only rustworkx is supported.
Error Handling
In GraphQL, all requests return HTTP status code 200 and instead errors appear in the JSON that is returned. The package will parse these errors, throwing a ValueError and displaying the corresponding error message or messages. To access the full query and return JSON in an interactive editor, you can use the get_editor_link()
method on the Query object. (see Helpful Methods)
Additional examples
Examples come from RCSB PDB Data API documentation
Entries
Fetch information about structure title and experimental method for PDB entries:
{
entries(entry_ids: ["1STP", "2JEF", "1CDG"]) {
rcsb_id
struct {
title
}
exptl {
method
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"entry_ids": ["1STP","2JEF","1CDG"]},input_type="entries", return_data_list=["CoreEntry.rcsb_id", "Struct.title", "Exptl.method"])
query.exec()
To find more about the return_data_list dot notation, see ValueError: Not a unique field
Primary Citation
Fetch primary citation information (structure authors, PubMed ID, DOI) and release date for PDB entries:
{
entries(entry_ids: ["1STP", "2JEF", "1CDG"]) {
rcsb_id
rcsb_accession_info {
initial_release_date
}
audit_author {
name
}
rcsb_primary_citation {
pdbx_database_id_PubMed
pdbx_database_id_DOI
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"entry_ids": ["1STP","2JEF","1CDG"]},input_type="entries", return_data_list=["CoreEntry.rcsb_id", "RcsbAccessionInfo.initial_release_date", "AuditAuthor.name", "RcsbPrimaryCitation.pdbx_database_id_PubMed", "RcsbPrimaryCitation.pdbx_database_id_DOI"])
query.exec()
Polymer Entities
Fetch taxonomy information and information about membership in the sequence clusters for polymer entities:
{
polymer_entities(entity_ids:["2CPK_1","3WHM_1","2D5Z_1"]) {
rcsb_id
rcsb_entity_source_organism {
ncbi_taxonomy_id
ncbi_scientific_name
}
rcsb_cluster_membership {
cluster_id
identity
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"entity_ids":["2CPK_1","3WHM_1","2D5Z_1"]},input_type="polymer_entities", return_data_list=["CorePolymerEntity.rcsb_id", "RcsbEntitySourceOrganism.ncbi_taxonomy_id", "RcsbEntitySourceOrganism.ncbi_scientific_name", "cluster_id", "identity"])
query.exec()
Polymer Instances
Fetch information about the domain assignments for polymer entity instances:
{
polymer_entity_instances(instance_ids: ["4HHB.A", "12CA.A", "3PQR.A"]) {
rcsb_id
rcsb_polymer_instance_annotation {
annotation_id
name
type
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"instance_ids":["4HHB.A", "12CA.A", "3PQR.A"]},input_type="polymer_entity_instances", return_data_list=["CorePolymerEntityInstance.rcsb_id", "RcsbPolymerInstanceAnnotation.annotation_id", "RcsbPolymerInstanceAnnotation.name", "RcsbPolymerInstanceAnnotation.type"])
query.exec()
Carbohydrates
Query branched entities (sugars or oligosaccharides) for commonly used linear descriptors:
{
branched_entities(entity_ids:["5FMB_2", "6L63_3"]) {
pdbx_entity_branch {
type
}
pdbx_entity_branch_descriptor {
type
descriptor
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"entity_ids":["5FMB_2", "6L63_3"]},input_type="branched_entities", return_data_list=["PdbxEntityBranch.type","PdbxEntityBranchDescriptor.type","PdbxEntityBranchDescriptor.descriptor"])
query.exec()
Sequence Positional Features
Sequence positional features describe regions or sites of interest in the PDB sequences, such as binding sites, active sites, linear motifs, local secondary structure, structural and functional domains, etc. Positional annotations include depositor-provided information available in the PDB archive as well as annotations integrated from external resources (e.g. UniProtKB).
This example queries 'polymer_entity_instances' positional features. The query returns features of different type: for example, CATH and SCOP classifications assignments integrated from UniProtKB data, or the secondary structure annotations from the PDB archive data calculated by the data-processing program called MAXIT (Macromolecular Exchange and Input Tool) that is based on an earlier ProMotif implementation.
{
polymer_entity_instances(instance_ids: ["1NDO.A"]) {
rcsb_id
rcsb_polymer_instance_feature {
type
feature_positions {
beg_seq_id
end_seq_id
}
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"instance_ids":["1NDO.A"]},input_type="polymer_entity_instances", return_data_list=["CorePolymerEntityInstance.rcsb_id", "RcsbPolymerInstanceFeature.type", "RcsbPolymerInstanceFeatureFeaturePositions.beg_seq_id", "RcsbPolymerInstanceFeatureFeaturePositions.end_seq_id"])
query.exec()
Reference Sequence Identifiers
This example shows how to access identifiers related to entries (cross-references) and found in data collections other than PDB. Each cross-reference is described by the database name and the database accession. A single entry can have cross-references to several databases, e.g. UniProt and GenBank in 7NHM, or no cross-references, e.g. 5L2G:
{
entries(entry_ids:["7NHM", "5L2G"]){
polymer_entities {
rcsb_id
rcsb_polymer_entity_container_identifiers {
reference_sequence_identifiers {
database_accession
database_name
}
}
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"entry_ids": ["7NHM", "5L2G"]}, input_type="entries", return_data_list=["CoreEntry.rcsb_id", "RcsbPolymerEntityContainerIdentifiersReferenceSequenceIdentifiers.database_accession", "RcsbPolymerEntityContainerIdentifiersReferenceSequenceIdentifiers.database_name"])
query.exec()
Chemical Components
Query for specific items in the chemical component dictionary based on a given list of CCD ids:
{
chem_comps(comp_ids:["NAG", "EBW"]) {
rcsb_id
chem_comp {
type
formula_weight
name
formula
}
rcsb_chem_comp_info {
initial_release_date
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"comp_ids":["NAG", "EBW"]}, input_type="chem_comps", return_data_list=["CoreChemComp.rcsb_id","ChemComp.type","ChemComp.formula_weight","ChemComp.name","ChemComp.formula","RcsbChemCompInfo.initial_release_date"])
query.exec()
Computed Structure Models
This example shows how to get a list of global Model Quality Assessment metrics for AlphaFold structure of Hemoglobin subunit beta:
{
entries(entry_ids: ["AF_AFP68871F1"]) {
rcsb_ma_qa_metric_global {
ma_qa_metric_global {
type
value
}
}
}
}
from rcsbapi.data import Query
query = Query(input_ids={"entry_ids": ["AF_AFP68871F1"]}, input_type="entries", return_data_list=["RcsbMaQaMetricGlobalMaQaMetricGlobal.type", "RcsbMaQaMetricGlobalMaQaMetricGlobal.value"])
query.exec()
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file rcsb_api-0.2.0.tar.gz
.
File metadata
- Download URL: rcsb_api-0.2.0.tar.gz
- Upload date:
- Size: 130.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 82797b0b6093abe095080ef036d2d8b378410480bdf1691e781d78051a31ca9a |
|
MD5 | c88c95615ead3f43bfe58278e50d0aac |
|
BLAKE2b-256 | 657e0051816b9a480da0570415478d04464e5642ee66fa6d8767dee4fbde13e7 |
File details
Details for the file rcsb_api-0.2.0-py2.py3-none-any.whl
.
File metadata
- Download URL: rcsb_api-0.2.0-py2.py3-none-any.whl
- Upload date:
- Size: 130.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 06bc9f23c19ae6ea32b0e3ac019ad9f88cee290243eba112d1d101df3341e8ba |
|
MD5 | 66ee1eb94ac65735524cb5e39a78388b |
|
BLAKE2b-256 | 2d22104d0830be2dc9e9d4651b06510e8a63a14b9c352ba13bc4f8a0e6cb57cd |