A package to convert gene identifiers between different naming conventions

BioRosetta

This is a package to map gene names between different naming conventions.

Motivation: while the R environment has popular packages (e.g. AnnotationDbi) for mapping gene identifiers between naming conventions (e.g. ENSG, NCBI, HGNC), there is no standard solution in Python.

Package Features

import biorosetta as br
  • Source-based system: Instead of relying on a single repository for mapping gene identifiers, biorosetta integrates results from different repositories, or "sources". It supports two types of sources under the same interface:
      (1) Local sources: biorosetta downloads a local copy of the conversion tables from popular repositories (Ensembl Biomart, HGNC Biomart). This is the best option for reproducible conversion outputs that do not change over time (e.g. for scientific article preparation).
      (2) Remote sources: biorosetta queries remote web services (MyGene) to convert gene names. This is the best option for up-to-date conversions.
  • Priority system: The user can specify an order of priority among sources, so that when sources disagree on a conversion, the output of the most trusted source is used.
  • Conversion report: For critical gene mapping applications, biorosetta can optionally generate a report table that lists the mapping output of each source and highlights mismatches between them, so that these results can be investigated further.
  • Multi-hit policies: When multiple possible mapping outputs ("hits") are found, the user can choose the policy for integrating them:
      "all": concatenates all the output IDs with a pipe ("|") symbol (e.g. "foo|bar|baz").
      "consensus": selects the ID that appears most frequently across sources. E.g. if source A outputs "foo|bar" and source B outputs "bar|baz", then "bar" is selected as the final output.
  • Gene name synonyms: Gene symbols have many synonyms. When available within the source, biorosetta integrates the synonym information for one-way mapping from gene symbol to any other ID.
  • Speed: Biorosetta performs vectorized operations to achieve maximum efficiency.
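
The priority and multi-hit behaviour described above can be illustrated with a small standalone sketch. This is not biorosetta's implementation or API: the function names `integrate_hits` and `build_report`, the data layout, and the tie-breaking rule (ties go to the higher-priority source) are assumptions made for illustration only.

```python
from collections import Counter


def integrate_hits(hits_by_source, priority, policy="consensus"):
    """Combine the hits that several sources returned for ONE input ID.

    hits_by_source: dict mapping source name -> pipe-separated hits ("" = no hit)
    priority: list of source names, most trusted first
    (Hypothetical helper, not part of biorosetta.)
    """
    # Split each source's output into a list of hits, in priority order.
    ordered = []
    for source in priority:
        raw = hits_by_source.get(source, "")
        ordered.append([hit for hit in raw.split("|") if hit])

    # Flatten and de-duplicate; hits from more trusted sources come first.
    unique = list(dict.fromkeys(hit for hits in ordered for hit in hits))
    if not unique:
        return None

    if policy == "all":
        return "|".join(unique)
    if policy == "consensus":
        # Count in how many sources each hit appears (once per source).
        counts = Counter(hit for hits in ordered for hit in set(hits))
        # max() keeps the first maximal element, so ties are resolved in
        # favour of the hit seen earliest in priority order (an assumption).
        return max(unique, key=lambda hit: counts[hit])
    raise ValueError(f"unknown policy: {policy!r}")


def build_report(outputs, priority):
    """One row per input ID: each source's raw output plus a mismatch flag.

    outputs: dict mapping source name -> {input ID -> pipe-separated hits}
    """
    ids = list(dict.fromkeys(i for s in priority for i in outputs.get(s, {})))
    rows = []
    for gene_id in ids:
        per_source = {s: outputs.get(s, {}).get(gene_id, "") for s in priority}
        # Sources without a hit are ignored when looking for disagreement.
        distinct = {value for value in per_source.values() if value}
        rows.append({"input": gene_id, **per_source, "mismatch": len(distinct) > 1})
    return rows


# The example from the feature list: source A says "foo|bar", B says "bar|baz".
hits = {"A": "foo|bar", "B": "bar|baz"}
print(integrate_hits(hits, ["A", "B"], policy="consensus"))  # bar
print(integrate_hits(hits, ["A", "B"], policy="all"))        # foo|bar|baz
```

Keeping the per-source outputs alongside the integrated result is what makes a mismatch report possible: the integrated value alone would hide the fact that the sources disagreed.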

Currently implemented sources and gene identifiers

Sources:

  • Ensembl Biomart (local)
  • HGNC Biomart (local)
  • MyGene (remote)

Gene Identifiers:

  • "ensg": Ensembl gene ID (all sources)
  • "entr": NCBI gene ID (entrezgene, all sources)
  • "symb": Gene name (symbol, all sources)
  • "ensp": Ensembl protein ID (ENSP, Ensembl Biomart only)
  • "hgnc": HGNC ID (Ensembl Biomart and HGNC Biomart only)
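
As a quick reference, the support matrix above can be written down as a lookup table. The sketch below is illustrative only: the source keys (`"ensembl_biomart"`, `"hgnc_biomart"`, `"mygene"`) and the helper `sources_supporting` are hypothetical labels, not biorosetta names, and "all sources" is assumed to mean the three sources named in the feature list.

```python
# Identifier types per source, transcribed from the list above.
# The source keys are hypothetical labels chosen for this sketch.
SUPPORTED_IDS = {
    "ensembl_biomart": {"ensg", "entr", "symb", "ensp", "hgnc"},
    "hgnc_biomart": {"ensg", "entr", "symb", "hgnc"},
    "mygene": {"ensg", "entr", "symb"},
}


def sources_supporting(id_in, id_out):
    """Return the sources that know both the input and the output ID type."""
    return [
        source
        for source, id_types in SUPPORTED_IDS.items()
        if id_in in id_types and id_out in id_types
    ]


print(sources_supporting("ensg", "ensp"))  # ['ensembl_biomart']
print(sources_supporting("symb", "hgnc"))  # ['ensembl_biomart', 'hgnc_biomart']
```

A check like this shows in advance which conversions can be cross-validated by several sources and which depend on a single one.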

Usage

See the up-to-date documentation and examples in the GitHub repo.
