Skip to main content

A python package to retrieve bioinformatics given Uniprot IDs.

Project description

pyuniprot

pyuniprot parses a UniProt text file given a Uniprot ID into a python object. All information is made programmatically accessible when programming in python, the most used programming language in bioinformatics.

The text file format of the UniProt different API formats was chosen because we could download the whole dataset as text and run local queries without worrying about internet connection or UniProt API limits.

Through the python object, mostly a dictionary of different categories and each category is wrapped as a dataclass so that inside attributes are easy to access through dot notations. Convenient functions will be provided for some common usage.

Contributions are highly welcomed!

Install

pip install pyuniprotkb

Quick Start

1. Read a local file or download from UniprotKB and save to the current directory

from pyuniprot import Uniprot
uniprot_id = "P01116"
uniprot = Uniprot(
    uniprot_id, local_download_dir='./', save_txt=True)
)  # If './P01116.txt' exists, pyuniprot reads it directly; if not pyuniprot downloads it from UniprotKB first and (optionally) saves it to './'
print(uniprot.category_lines)  # It prints out all the information for a Uniprot ID that UniprotKB has for it.

2. Access category information

print(uniprot.category_lines['AC'])
# result: AC(primary_accession='P01116;', secondary_accessions=['A8K8Z5', 'B0LPF9', 'P01118', 'Q96D10'])

Currently, most information can only be accessed from the category_lines python dictionary. In this dictionary, the keys are just the two-letter codes used by UniprotKB.

They are (in my understanding):

  • ID: UniprotKB identifers
  • AC: accession numbers (the example above)
  • DT: entry brief history
  • DE: protein names/descriptions
  • GN: gene name (by HGNC?)
  • OS,OC,OX (grouped as OSCX in pyuniprot): organism names
  • RN,RP,RC,RX,RA,RT,RL (grouped as RNPCXATL in pyuniprot): references
  • CC: activity comments
  • DR: database cross references
  • PE: protein evidence level
  • KW: protein keywords
  • FT: protein feature tables
  • SQ: protein sequence

3. Get the UniRef data with certain sequence identity threshold. This data from UniProt provids reference clustered sets of sequences.

from pyuniprot import UniRef
uniprot_id = "P36952"
uniref = UniRef(uniprot_id, local_download_dir='./') # default sequence identity threshold is 100
print(uniref.members)  # We get the following result
# [{"memberIdType": "UniParc", "memberId": "UPI0003EAE5A1","organismName": "Homo sapiens", "organismTaxId": 9606, "sequenceLength": 204,"proteinName": "serpin B5 isoform X1", "uniref90Id": "UniRef90_P36952"}]

# If we change the `thresh` keyword argument in the UniRef class to 90 or 50, we can get the data for clusters with 90 or 50 sequence identity.
uniref_90 = UniRef(uniprot_id, thresh=90, local_download_dir='./')
uniref_50 = UniRef(uniprot_id, thresh=50, local_download_dir='./')
print(len(uniref_50.members)) # now we get the following number of members in the cluster
# 218

Contributing

  1. Fork this repository
  2. Create a branch for your fix of an issue/feature/doc (git checkout -b my-fix)
  3. Stage your changes, run pre-commit, and run pytest until every thing looks fine
  4. Commit your changes (git commit -am 'Added some feature/fix/doc')
  5. Push to the branch (git push origin my-fix)
  6. Create new Pull Request

For code quality control, we use pre-commit hooks for formatting and type hints.

Try your best to add test cases for bug fixes and new features. We use pytest for testing.

For documentation, we should try to follow the Google Python docstrings style.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyuniprotkb-0.4.1.tar.gz (21.0 kB view details)

Uploaded Source

Built Distribution

pyuniprotkb-0.4.1-py3-none-any.whl (21.3 kB view details)

Uploaded Python 3

File details

Details for the file pyuniprotkb-0.4.1.tar.gz.

File metadata

  • Download URL: pyuniprotkb-0.4.1.tar.gz
  • Upload date:
  • Size: 21.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for pyuniprotkb-0.4.1.tar.gz
Algorithm Hash digest
SHA256 8ace7f1b7ad75e1bad104df5dc511614b39542c7fae39eb5a8406dccdd100451
MD5 32c8f2cdb1d4f3c810aff8ef55417791
BLAKE2b-256 6df5bcb5cc74923a8080fba87dbda07f38155eff7692f7ffde32004402d98def

See more details on using hashes here.

File details

Details for the file pyuniprotkb-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: pyuniprotkb-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 21.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for pyuniprotkb-0.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3b3bfae4431193eac0f7afceae90520832118953249e5b37c782b4f91948af00
MD5 515f1466597eee46d4d107750278447d
BLAKE2b-256 aef05c40dcb4719208c5bb20b4cc26021d1822039fb1082de5caba450ad40279

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page