Importing and querying UniProt
Project description
PyUniProt is a Python package to access and query UniProt protein data provided by the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).
Data are installed in a (local or remote) RDBMS, which gives bioinformatic algorithms very fast response times to sophisticated queries and high flexibility through the SQLAlchemy database layer. PyUniProt is developed by the Department of Bioinformatics at the Fraunhofer Institute for Algorithms and Scientific Computing SCAI. For more information about PyUniProt, go to the documentation.
This development is supported by the following IMI projects:
Supported databases
PyUniProt uses SQLAlchemy to cover a wide spectrum of RDBMSs (relational database management systems). For best performance, MySQL or MariaDB is recommended. If you cannot install software on your system, SQLite, which needs no further installation, also works. The following RDBMSs are supported (by SQLAlchemy):
Firebird
Microsoft SQL Server
MySQL / MariaDB
Oracle
PostgreSQL
SQLite
Sybase
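SQLAlchemy addresses each of these backends through a database URL of the form dialect+driver://user:password@host/database. The following is a minimal sketch of such URLs using only the standard library. The helper name build_db_url, the pymysql driver and the SQLite file name are assumptions made for illustration; how PyUniProt assembles its own connection string is not shown here.

```python
from urllib.parse import quote_plus, urlsplit


def build_db_url(dialect, user, password, host, database, driver=None):
    """Assemble a SQLAlchemy-style URL: dialect[+driver]://user:password@host/database."""
    scheme = f"{dialect}+{driver}" if driver else dialect
    # Percent-encode the password so characters such as '@' or '/' do not break the URL.
    return f"{scheme}://{user}:{quote_plus(password)}@{host}/{database}"


# Example values taken from the MySQL/MariaDB setup section; the driver is an assumption.
mysql_url = build_db_url("mysql", "pyuniprot_user", "pyuniprot_passwd",
                         "localhost", "pyuniprot", driver="pymysql")

# SQLite needs no server or credentials; the URL only points at a file (hypothetical name).
sqlite_url = "sqlite:///pyuniprot.db"

parts = urlsplit(mysql_url)
print(parts.scheme, parts.hostname, parts.path)
```

Only the dialect (and optionally the driver) changes between backends, which is why the same query code can run against any of the systems listed above.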
Getting Started
This is a quick start tutorial for the impatient.
Installation
PyUniProt can be installed with pip.
pip install pyuniprot
If the installation fails because you lack the required permissions, run the command as superuser (prefix it with sudo on Linux) or …
pip install --user pyuniprot
If you want to make sure you are installing the package under Python 3, use …
python3 -m pip install pyuniprot
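To confirm which interpreter actually received the package, a small check like the following can help. It is only a sketch based on the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is an assumption made for illustration and is not part of PyUniProt.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("pyuniprot")
print(f"pyuniprot {found}" if found else "pyuniprot is not installed in this environment")
```

Run it with the same interpreter you used for pip (for example python3) so that both look at the same environment.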
SQLite
If none of this means anything to you, skip the section MySQL/MariaDB setup and stay with SQLite, which needs no further installation.
Don’t worry! You can always change the configuration later. For more information about switching the database system, see the section Changing database configuration in the documentation on Read the Docs.
MySQL/MariaDB setup
Log in to MySQL as the root user, create a new database, create a user, assign the rights and flush the privileges.
CREATE DATABASE pyuniprot CHARACTER SET utf8 COLLATE utf8_general_ci;
GRANT ALL PRIVILEGES ON pyuniprot.* TO 'pyuniprot_user'@'%' IDENTIFIED BY 'pyuniprot_passwd';
FLUSH PRIVILEGES;
Start a Python shell and set the MySQL configuration. If you have not changed anything in the SQL statements above …
import pyuniprot
pyuniprot.set_mysql_connection()
If you have used your own settings, please adapt the following command to your requirements.
import pyuniprot
pyuniprot.set_mysql_connection(host='localhost', user='pyuniprot_user', passwd='pyuniprot_passwd', db='pyuniprot')
Updating
The update process downloads the uniprot_sprot.xml.gz file provided by the UniProt team on their FTP server download page.
It is strongly recommended to restrict the entries to the specific organisms you are interested in by passing a list of NCBI Taxonomy IDs to the parameter taxids. To identify the correct NCBI Taxonomy IDs, please go to the NCBI Taxonomy web form. In the following example we use 9606 as the identifier for Homo sapiens, 10090 for Mus musculus and 10116 for Rattus norvegicus.
import pyuniprot
pyuniprot.update(taxids=[9606, 10090, 10116])
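To make the effect of taxids more concrete, here is a minimal sketch of filtering a UniProt XML stream by NCBI Taxonomy ID with the standard library. It is not PyUniProt's importer: the function name iter_entry_names is an assumption, and the embedded sample is a tiny hand-written document in the UniProt XML layout whose second entry (X00001, DEMO_YEAST) is made up for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniprot}"

# Tiny hand-written sample in the UniProt XML layout (not real download data).
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P62258</accession>
    <name>1433E_HUMAN</name>
    <organism>
      <dbReference type="NCBI Taxonomy" id="9606"/>
    </organism>
  </entry>
  <entry>
    <accession>X00001</accession>
    <name>DEMO_YEAST</name>
    <organism>
      <dbReference type="NCBI Taxonomy" id="559292"/>
    </organism>
  </entry>
</uniprot>
"""


def iter_entry_names(stream, taxids=None):
    """Yield entry names from a UniProt XML stream, optionally restricted to taxids."""
    wanted = {str(t) for t in taxids} if taxids else None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        ref = elem.find(f"{NS}organism/{NS}dbReference[@type='NCBI Taxonomy']")
        taxid = ref.get("id") if ref is not None else None
        if wanted is None or taxid in wanted:
            yield elem.findtext(NS + "name")
        elem.clear()  # drop the processed entry's children to keep memory use low


# Read the sample the same way a gzip-compressed download would be read.
compressed = gzip.compress(SAMPLE_XML)
with gzip.open(io.BytesIO(compressed)) as fh:
    kept = list(iter_entry_names(fh, taxids=[9606, 10090, 10116]))
print(kept)  # ['1433E_HUMAN']
```

Filtering while streaming means that entries from other organisms never have to be held in memory or written to the database.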
If you want to load all UniProt entries into the database:
import pyuniprot
pyuniprot.update()
The update reuses the downloaded file if it still exists on your system (~/.pyuniprot/data/uniprot_sprot.xml.gz). If you set the parameter force_download, the current file is downloaded from UniProt again.
import pyuniprot
pyuniprot.update(force_download=True)
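The caching rule described above can be sketched as a single decision: fetch the file when a refresh is forced or when no local copy exists. The helper name needs_download is an assumption made for illustration and is not taken from PyUniProt's code; the example uses a temporary directory instead of ~/.pyuniprot/data.

```python
import tempfile
from pathlib import Path


def needs_download(local_file, force_download=False):
    """Return True when the file must be fetched: forced, or no cached copy exists."""
    return force_download or not Path(local_file).is_file()


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "uniprot_sprot.xml.gz"
    print(needs_download(cached))                       # True: nothing cached yet
    cached.write_bytes(b"placeholder")                  # stand-in for a finished download
    print(needs_download(cached))                       # False: the cached copy is reused
    print(needs_download(cached, force_download=True))  # True: forced refresh
```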
Quick start with query functions
Initialize the query object
query = pyuniprot.query()
Get all entries
all_entries = query.entry()
Use parameters like gene_name to find specific entries
>>> entry = query.entry(gene_name='YWHAE', taxid=9606, recommended_short_name='14-3-3E', name='1433E_HUMAN')[0]
>>> entry
14-3-3 protein epsilon
- Entry is the root element in the database. From here you can reach all other data.
>>> entry.accessions
[P62258, B3KY71, D3DTH5, P29360, P42655, Q4VJB6, Q53XZ5, Q63631, Q7M4R4]
>>> entry.functions
["Adapter protein implicated in the regulation of a large spectrum of both ..."]
- If a parameter name ends in s, you can search with a single value or with a tuple of several values, as the following examples show.
>>> alcohol_dehydrogenases = query.entry(ec_numbers='1.1.1.1')
>>> [x.name for x in query.entry(ec_numbers='1.1.1.1')]
['ADHX_RAT', 'ADH1_RAT', 'ADHX_HUMAN', 'ADHX_MOUSE']
>>> query.entry(ec_numbers=('1.1.1.1', '1.1.1.2'))
['Adh5', 'Adh1', 'ADH5', 'Adh5', 'Adh6', 'ADH7', 'Adh7', 'Adh7', 'Adh1']
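One way to picture a parameter that accepts either one value or several is an SQL IN clause built from the normalised input. The sketch below uses an in-memory SQLite table; the table demo_entry, the function find_names and the row DEMO_ENTRY are assumptions made for illustration and do not reflect PyUniProt's schema or query code.

```python
import sqlite3


def find_names(conn, ec_numbers):
    """Return entry names whose EC number equals one value or is in a tuple of values."""
    # Normalise the argument: a single string becomes a one-element tuple.
    values = (ec_numbers,) if isinstance(ec_numbers, str) else tuple(ec_numbers)
    placeholders = ", ".join("?" for _ in values)
    sql = f"SELECT name FROM demo_entry WHERE ec_number IN ({placeholders}) ORDER BY id"
    return [row[0] for row in conn.execute(sql, values)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_entry (id INTEGER PRIMARY KEY, name TEXT, ec_number TEXT)")
conn.executemany(
    "INSERT INTO demo_entry (name, ec_number) VALUES (?, ?)",
    [
        ("ADHX_RAT", "1.1.1.1"),
        ("ADH1_RAT", "1.1.1.1"),
        ("DEMO_ENTRY", "1.1.1.2"),  # made-up row for illustration
    ],
)

print(find_names(conn, "1.1.1.1"))               # ['ADHX_RAT', 'ADH1_RAT']
print(find_names(conn, ("1.1.1.1", "1.1.1.2")))  # ['ADHX_RAT', 'ADH1_RAT', 'DEMO_ENTRY']
```

Values are bound through placeholders rather than formatted into the SQL string, so user input cannot alter the statement.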
As a DataFrame with a limit of 3 and accession numbers starting with Q9 (% is used as a wildcard)
>>> query.accession(as_df=True, limit=3, accession='Q9%')
   id accession  entry_id
0   1    Q9CQV8         1
1  32    Q9GIK8         6
2  33    Q9TQB4         6
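The % wildcard behaves like the pattern character of SQL LIKE. The following sketch reproduces the result above with plain sqlite3; the table name demo_accession and the extra row for P62258 (including its id and entry_id) are assumptions made for illustration, not PyUniProt's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_accession (id INTEGER PRIMARY KEY, accession TEXT, entry_id INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_accession VALUES (?, ?, ?)",
    [
        (1, "Q9CQV8", 1),
        (2, "P62258", 2),   # hypothetical row that must not match 'Q9%'
        (32, "Q9GIK8", 6),
        (33, "Q9TQB4", 6),
    ],
)

# '%' matches any run of characters, so 'Q9%' selects accessions starting with Q9.
rows = conn.execute(
    "SELECT id, accession, entry_id FROM demo_accession "
    "WHERE accession LIKE ? ORDER BY id LIMIT ?",
    ("Q9%", 3),
).fetchall()

for row in rows:
    print(row)
```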
More information
See the installation documentation for more advanced instructions. Also, check the change log at CHANGELOG.rst.
UniProt tools and licence (use of data)
UniProt also provides many online query interfaces on its website.
Please be aware of the UniProt licence.
Links
Universal Protein Resource (UniProt)
PyUniProt
Documented on Read the Docs
Versioned on GitHub
Tested on Travis CI
Distributed by PyPI
Chat on Gitter