An ANTLR-based parser for colloquial protein variant nomenclature
Protein Variant Nomenclature Parser
This repository contains a Python library for parsing and validating colloquial protein variant nomenclature strings, like BRAF V600E, that commonly appear in manuscripts.
Features
- Parse protein variant nomenclature strings in the following formats:
  - Single amino acid substitution, e.g. BRAF V600E, BRAFV600E, BRAFᵛ⁶⁰⁰ᵉ
  - Range of amino acid substitutions, e.g. BRAFVK600_601>E
- Extract the components of the nomenclature string, such as gene name, prefix amino acid, position or range, and suffix amino acid
- Validate whether a given string conforms to the expected format
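The superscript form listed above (BRAFᵛ⁶⁰⁰ᵉ) can be reduced to the plain form before any further processing. The following is a minimal sketch of one way to do that with Unicode compatibility normalization from the standard library. The function name `normalize_superscripts` is hypothetical, and this is not necessarily how the library's ANTLR grammar handles such input.

```python
import unicodedata


def normalize_superscripts(text: str) -> str:
    """Fold superscript characters to plain uppercase ASCII, leaving other text alone."""
    out = []
    for ch in text:
        folded = unicodedata.normalize("NFKC", ch)
        # Only characters changed by NFKC (such as superscripts) are upper-cased,
        # so mixed-case gene symbols elsewhere in the string stay untouched.
        out.append(folded.upper() if folded != ch else ch)
    return "".join(out)


print(normalize_superscripts("BRAFᵛ⁶⁰⁰ᵉ"))
```

Upper-casing only the folded characters is a deliberate choice in this sketch: superscript letters only exist in lower case for most of the alphabet, while gene symbols may legitimately contain lower-case letters.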
Usage
For parsing:
from protein_variant_nomenclature_parser.parser import parse
mutation_string = "BRAF V600E"
parsed_components = parse(mutation_string)
print(parsed_components)
ProteinVariant(gene='BRAF', amino_acid_before='V', number_or_range=NumberOrRange(start=600, end=None), amino_acid_after='E')
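To illustrate how the components in the printed result might be consumed, here is a small sketch. The two dataclasses are hypothetical stand-ins that only mirror the field names visible in the output above, so the example runs without the library installed; the library's real classes may differ in detail. The helper `describe` is also an illustrative name, not part of the package.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins mirroring the printed repr; not the library's own classes.
@dataclass
class NumberOrRange:
    start: int
    end: Optional[int] = None


@dataclass
class ProteinVariant:
    gene: str
    amino_acid_before: str
    number_or_range: NumberOrRange
    amino_acid_after: str


def describe(variant: ProteinVariant) -> str:
    """Render a parsed variant back into a spaced, colloquial string."""
    pos = variant.number_or_range
    if pos.end is None:
        # A missing end means a single position, e.g. V600E.
        change = f"{variant.amino_acid_before}{pos.start}{variant.amino_acid_after}"
    else:
        # A range is written as start_end>replacement, e.g. VK600_601>E.
        change = f"{variant.amino_acid_before}{pos.start}_{pos.end}>{variant.amino_acid_after}"
    return f"{variant.gene} {change}"


print(describe(ProteinVariant("BRAF", "V", NumberOrRange(600), "E")))
```

The check on `end is None` is what distinguishes a single substitution from a range in this sketch.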
For validation:
from protein_variant_nomenclature_parser.parser import parse
from protein_variant_nomenclature_parser.parser import InvalidProteinVariantError
mutation_string = "INVALID V600E"
try:
    parse(mutation_string)
except InvalidProteinVariantError:
    print(f"{mutation_string} is not valid")
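When many candidate strings need checking, the try/except pattern above can be wrapped in a helper. The sketch below takes the parse function and the exception type as arguments so that it stays self-contained; `split_valid` is a hypothetical name and not part of the library.

```python
from typing import Callable, Iterable, List, Tuple, Type


def split_valid(
    strings: Iterable[str],
    parse_fn: Callable[[str], object],
    error_type: Type[Exception],
) -> Tuple[List[str], List[str]]:
    """Return (valid, invalid) lists, using parse_fn as the validity check."""
    valid: List[str] = []
    invalid: List[str] = []
    for text in strings:
        try:
            parse_fn(text)
        except error_type:
            # parse_fn rejected the string, so record it as invalid.
            invalid.append(text)
        else:
            valid.append(text)
    return valid, invalid
```

With the library installed, this would presumably be called as `split_valid(candidates, parse, InvalidProteinVariantError)`, using the two names imported in the validation example.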
Supported Nomenclature
The parser supports all HUGO gene names.
The parser supports amino acid single-letter codes and the stop codon (*).
The parser also supports variants written with no space between the gene name and the substitution (e.g. BRAFV600E), which unfortunately comes up sometimes.
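Without a space, the boundary between gene name and substitution is ambiguous unless the gene names are known in advance. The sketch below shows one way to resolve this with a gene list and a regular expression. It is an illustration only: the library itself is ANTLR-based, `KNOWN_GENES` is a tiny stand-in for the full HUGO set, the amino acid alphabet is assumed to be the 20 standard one-letter codes plus `*`, and all names here are hypothetical.

```python
import re
from typing import Optional, Tuple

# Tiny illustrative gene list; a real implementation would need the full HUGO set.
KNOWN_GENES = ("BRAF", "KRAS", "EGFR")

# Assumed alphabet: the 20 standard one-letter amino acid codes plus "*" for stop.
AA = "[ACDEFGHIKLMNPQRSTVWY*]"

# Matches "V600E" (single position) or "VK600_601>E" (range).
SUBSTITUTION = re.compile(
    rf"(?P<before>{AA}+)(?P<start>\d+)(?:_(?P<end>\d+)>)?(?P<after>{AA})"
)

Parsed = Tuple[str, str, int, Optional[int], str]


def parse_sketch(text: str) -> Optional[Parsed]:
    """Return (gene, before, start, end, after), or None if the string does not fit."""
    text = text.strip()
    # Try longer gene names first so a shorter name cannot shadow a longer one.
    for gene in sorted(KNOWN_GENES, key=len, reverse=True):
        if text.startswith(gene):
            rest = text[len(gene):].lstrip()  # the space after the gene is optional
            break
    else:
        return None  # no known gene name at the start
    match = SUBSTITUTION.fullmatch(rest)
    if match is None:
        return None
    end = match.group("end")
    # A single-position substitution must have exactly one residue before the number.
    if end is None and len(match.group("before")) != 1:
        return None
    return (
        gene,
        match.group("before"),
        int(match.group("start")),
        int(end) if end is not None else None,
        match.group("after"),
    )


print(parse_sketch("BRAFV600E"))
```

Anchoring on a known gene name is what makes `BRAFV600E` unambiguous here; a pattern alone could not tell where the gene ends and the residue letter begins.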
Installation
From PyPI
pip install protein-variant-nomenclature-parser
From Source
To install the library from source, clone the repository and install it using pip:
git clone https://github.com/yourusername/protein-variant-nomenclature-parser.git
cd protein-variant-nomenclature-parser
make install
Docker container
A Docker image is available:
docker pull jeffquinnmsk/protein-variant-nomenclature-parser:latest
License
This project is licensed under the MIT License. See the LICENSE file for more information.
Download files
File details
Details for the file protein-variant-nomenclature-parser-0.4.0.tar.gz.
File metadata
- Download URL: protein-variant-nomenclature-parser-0.4.0.tar.gz
- Upload date:
- Size: 55.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9403567539b6bf038abe8dbdb73f63a3316f67220ee9240a2bb49c7e02d17d38
MD5 | c9e9f3a185aff64b32cc4276973ea60e
BLAKE2b-256 | 054a29391374bbb692c2e834f92b4ef09421c77c952a04229cd0bf058550ba09
File details
Details for the file protein_variant_nomenclature_parser-0.4.0-py3-none-any.whl.
File metadata
- Download URL: protein_variant_nomenclature_parser-0.4.0-py3-none-any.whl
- Upload date:
- Size: 54.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 388936c3052324ce6a41d99788277f32a95fc7775ea174d2884c1712e9062ea1
MD5 | 47f776c082b06b7522474c4c18918414
BLAKE2b-256 | 2acc9d052891cda51ca333fb0b711dcea43531ab687b06d4b250a8c4545fe58f