Skip to main content

Map biological sequence variants (mutations) to their equivalent chromosome, cDNA, gene, exon, protein, and RNA positions.

Project description

pyvariant

What is it?

pyvariant is a Python package for mapping biological sequence variants (mutations) to their equivalent chromosome, cDNA, gene, exon, protein, and RNA positions.

How to get it

The easiest way to get pyvariant is using pip:

pip install pyvariant

The source code is hosted on GitHub at: https://github.com/mattdoug604/pyvariant

How to use it

Before you can use pyvariant, you will need to download the necessary genomic data.

To download and install data from Ensembl, run:

pyvariant install --species <species-name> --release <Ensembl-release-number>

For example:

pyvariant install --species 'homo sapiens' --release 100

At the time of writing, installing a human dataset takes roughly 30-45 minutes and 1.5G of storage space. However, the actual time and space required to install a dataset will depend entirely on the size of the dataset, your computer, internet speed, etc.

By default, the data is downloaded to a platform-specific data directory that is generally only accessible by the user (e.g. /home/<you>/.local/share/pyvariant/). If you want the data to be accessible to other users, or your home directory does have enough storage space, you may to specify a different directory to download to with the --cache option:

pyvariant install --species homo_sapiens --release 100 --cache /path/to/cache/

For more options, run:

pyvariant install --help

Alternatively, you can run the installation from inside a Python process:

>>> from pyvariant import EnsemblRelease
>>> ensembl100 = EnsemblRelease(species='homo_sapiens', release=100, cache_dir="/path/to/cache/")
>>> ensembl100.install()

Once the data is installed, the EnsemblRelease object provides methods for getting information about different features, getting the DNA/RNA/protein sequence at a position, and converting between positions, etc.

Map between feature types

The main use of the package is for converting between equivalent cDNA, DNA, exon, protein, and RNA positions. For example:

cDNA to DNA:

>>> ensembl100.to_dna("ENST00000310581:c.-124C>T")
[DnaSubstitution(refseq='C', altseq='T', contig_id='5', start=1295113, start_offset=0, end=1295113, end_offset=0, strand='-')]

Protein to DNA:

>>> ensembl100.to_dna("BRCA1:p.Q1458*")
[DnaDelins(refseq='CAG', altseq='TAA', contig_id='17', start=43076598, start_offset=0, end=43076600, end_offset=0, strand='-'), DnaDelins(refseq='CAG', altseq='TGA', contig_id='17', start=43076598, start_offset=0, end=43076600, end_offset=0, strand='-'), DnaSubstitution(refseq='C', altseq='T', contig_id='17', start=43076600, start_offset=0, end=43076600, end_offset=0, strand='-')]

Exon to RNA:

>>> ensembl100.to_rna("ENST00000266970:e.7::ENST00000360299:e.2")
[RnaFusion(breakpoint1=RnaPosition(contig_id='12', start=972, start_offset=0, end=2240, end_offset=0, strand='+', gene_id='ENSG00000123374', gene_name='CDK2', transcript_id='ENST00000266970', transcript_name='CDK2-201'), breakpoint2=RnaPosition(contig_id='12', start=63, start_offset=0, end=317, end_offset=0, strand='+', gene_id='ENSG00000111540', gene_name='RAB5B', transcript_id='ENST00000360299', transcript_name='RAB5B-201'))]

You can also limit mapping to the canonical transcript only:

>>> ensembl100 = EnsemblRelease(species='homo_sapiens', release=100, canonical_transcript=["ENST00000000233"])
>>> ensembl100.to_cdna("7:g.127589084", canonical=False)
[CdnaPosition(contig_id='7', start=69, start_offset=0, end=69, end_offset=0, strand='+', gene_id='ENSG00000004059', gene_name='ARF5', transcript_id='ENST00000000233', transcript_name='ARF5-201', protein_id='ENSP00000000233'), CdnaPosition(contig_id='7', start=69, start_offset=0, end=69, end_offset=0, strand='+', gene_id='ENSG00000004059', gene_name='ARF5', transcript_id='ENST00000415666', transcript_name='ARF5-202', protein_id='ENSP00000412701')]
>>> ensembl100.to_cdna("7:g.127589084", canonical=True)
[CdnaPosition(contig_id='7', start=69, start_offset=0, end=69, end_offset=0, strand='+', gene_id='ENSG00000004059', gene_name='ARF5', transcript_id='ENST00000000233', transcript_name='ARF5-201', protein_id='ENSP00000000233')]

Normalize a variant to each possible type:

>>> ensembl100.to_all("ENST00000269305:c.376-2A>G")
[{'cdna': CdnaSubstitution(_core=EnsemblRelease(species=homo_sapiens, release=100), refseq='A', altseq='G', contig_id='17', start=376, start_offset=-2, end=376, end_offset=-2, strand='-', gene_id='ENSG00000141510', gene_name='TP53', transcript_id='ENST00000269305', transcript_name='TP53-201', protein_id='ENSP00000269305'), 'dna': DnaSubstitution(_core=EnsemblRelease(species=homo_sapiens, release=100), refseq='A', altseq='G', contig_id='17', start=7675238, start_offset=0, end=7675238, end_offset=0, strand='-'), 'exon': None, 'protein': None, 'rna': RnaSubstitution(_core=EnsemblRelease(species=homo_sapiens, release=100), refseq='A', altseq='G', contig_id='17', start=566, start_offset=-2, end=566, end_offset=-2, strand='-', gene_id='ENSG00000141510', gene_name='TP53', transcript_id='ENST00000269305', transcript_name='TP53-201')}]

Check if two variants are equivalent

Get the notation(s) that represent both variants:

>>> x = ensembl100.same("ENSP00000358548:p.Q61K", "NRAS:c.181C>A")
>>> x.keys()
dict_keys(['cdna', 'dna', 'exon', 'protein', 'rna'])
>>> x["dna"]
[DnaSubstitution(refseq='C', altseq='A', contig_id='1', start=114713909, start_offset=0, end=114713909, end_offset=0, strand='-')]

...or get the notation(s) that are unique to each variant:

>>> x = ensembl100.diff("ENSP00000358548:p.Q61K", "NRAS:c.181C>A")
>>> x.keys()
dict_keys(['cdna', 'dna', 'exon', 'protein', 'rna'])
>>> x["dna"]
([DnaDelins(refseq='CAA', altseq='AAG', contig_id='1', start=114713907, start_offset=0, end=114713909, end_offset=0, strand='-')], [])

Fetch sequences

Get the mutated reference sequence, within a given window:

>>> ensembl100.sequence("ENST00000635293:c.1044A>C", window=50)
CGCCTCTTTCAGAGACTTTTAACTTCAACATCTGTCCCTACCCAGCAGGC

The sequence can also be normalized to a specific strand of the genome:

>>> ensembl100.sequence("ENST00000635293:c.1044A>C", window=50, strand='+')
GCCTGCTGGGTAGGGACAGATGTTGAAGTTAAAAGTCTCTGAAAGAGGCG

Get the sequence surrounding a fusion breakpoint:

>>> ensembl100.sequence("ENST00000399410:r.2871::ENST00000561813:r.317", window=50)
ACAGTGCAGGGAAGCAACTGCAGAGGCTGTGCAATCTTGCACAAATATCT

Retrieve feature information

pyvariant also has functions for retrieving general information about various features. For example:

Get a list of transcript IDs for a gene:

>>> ensembl100.transcript_ids("BRCA2")
['ENST00000380152', 'ENST00000470094', 'ENST00000528762', 'ENST00000530893', 'ENST00000533776', 'ENST00000544455', 'ENST00000614259', 'ENST00000665585', 'ENST00000666593', 'ENST00000670614', 'ENST00000671466']

For a complete list of methods, run:

>>> help(EnsemblRelease)

Notes

Variant naming standards

This package follows HGVS nomenclature recommendations for representing variants.

Offset positions

When describing variants in an intron or UTR, it can be more informative to describe the position relative to a transcript, rather than the genome. These positions are described with a "+" or "-". For example, "TERT:c.-125" means "a position 125 nucleotides 5’ of the ATG translation initiation codon." See the HGVS nomenclature documentation for more information.

Protein duplications

By default, the package assumes that protein duplications are the result of a nucleotide insertion, as opposed to a delins. This behaviour can be turned of by defining the environmental variable PYVARIANT_GET_ALL_PROTEIN_DUPS.

License

This package is distributed with the MIT license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyvariant-2.2.2.tar.gz (64.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyvariant-2.2.2-py3-none-any.whl (50.2 kB view details)

Uploaded Python 3

File details

Details for the file pyvariant-2.2.2.tar.gz.

File metadata

  • Download URL: pyvariant-2.2.2.tar.gz
  • Upload date:
  • Size: 64.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pyvariant-2.2.2.tar.gz
Algorithm Hash digest
SHA256 19b9af4edd81315545d81181e7b3a197b1719256871033a73d35bfcadc97e36c
MD5 a1c7358c852593f34d4ce28b2c9e31b4
BLAKE2b-256 febbff8b788f1d38bfc9222e1accc883569ff51faaf20a020f6e93f862b04356

See more details on using hashes here.

File details

Details for the file pyvariant-2.2.2-py3-none-any.whl.

File metadata

  • Download URL: pyvariant-2.2.2-py3-none-any.whl
  • Upload date:
  • Size: 50.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for pyvariant-2.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 67686bdc439af33dee1a2c173757a4f837a1dacc9b1c9acc69d866ec3b0014df
MD5 3e67d00fcf4fd60ce8ad7b1c0d13e685
BLAKE2b-256 0661c520480ec0061fc28a1e74ed00f3c27cb35c00f6ec7d45c41f9909ac9a12

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page