VICC normalization routine for variations

Variation Normalization

The Variation Normalizer parses and translates free-text descriptions of genomic variations into computable objects conforming to the Variation Representation Specification (VRS), enabling consistent and accurate variant harmonization across a diversity of genomic knowledge resources.


Live OpenAPI endpoint


Installation

Install from PyPI:

python3 -m pip install variation-normalizer

About

Variation Normalization works in four main steps: tokenization, classification, validation, and translation. During tokenization, the input string is split on whitespace and each piece is parsed to determine its token type. During classification, the sequence of tokens is matched against the token order that each classification accepts. Validation then checks, for example, that the reference nucleotide or amino acid matches the expected value and that the position exists on the given transcript. Finally, translation returns a VRS Allele object.
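The four steps can be illustrated with a deliberately small toy pipeline. This is only a sketch of the idea, not the package's implementation: the function names, the TOY_REFERENCE lookup, and the shape of the returned dict are all hypothetical, and the result is a simplified stand-in rather than a real VRS Allele.

```python
import re
from dataclasses import dataclass

# Hypothetical toy reference: expected amino acid at a (gene, position).
# Real validation would consult sequence data rather than a hard-coded dict.
TOY_REFERENCE = {("BRAF", 600): "V"}

# Matches free-text protein substitutions such as "V600E".
PROTEIN_SUB = re.compile(r"^([A-Z])(\d+)([A-Z])$")


@dataclass
class Token:
    text: str
    kind: str  # "gene", "protein_substitution", or "unknown"


def tokenize(query):
    """Split on whitespace and assign each piece a token type."""
    known_genes = {gene for gene, _ in TOY_REFERENCE}
    tokens = []
    for part in query.split():
        if PROTEIN_SUB.match(part):
            kind = "protein_substitution"
        elif part in known_genes:
            kind = "gene"
        else:
            kind = "unknown"
        tokens.append(Token(part, kind))
    return tokens


def classify(tokens):
    """A classification is defined by the ordered token types it accepts."""
    if [t.kind for t in tokens] == ["gene", "protein_substitution"]:
        return "protein_substitution"
    return None


def validate(tokens):
    """Check that the stated reference residue matches the toy reference."""
    ref, pos, _alt = PROTEIN_SUB.match(tokens[1].text).groups()
    return TOY_REFERENCE.get((tokens[0].text, int(pos))) == ref


def translate(tokens):
    """Build a simplified dict with 0-based, half-open coordinates."""
    ref, pos, alt = PROTEIN_SUB.match(tokens[1].text).groups()
    return {
        "gene": tokens[0].text,
        "start": int(pos) - 1,
        "end": int(pos),
        "ref": ref,
        "alt": alt,
    }


def run_pipeline(query):
    """Run all four steps; return None if classification or validation fails."""
    tokens = tokenize(query)
    if classify(tokens) is None or not validate(tokens):
        return None
    return translate(tokens)
```

In this sketch, run_pipeline("BRAF V600E") yields a dict, while a query whose reference residue disagrees with the toy reference (for example "BRAF A600E") is rejected at the validation step.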

Variation Normalization is limited to the following types of variants:

  • HGVS expressions and text representations (ex: BRAF V600E):
    • protein (p.): substitution, deletion, insertion, deletion-insertion
    • coding DNA (c.): substitution, deletion, insertion, deletion-insertion
    • genomic (g.): substitution, deletion, ambiguous deletion, insertion, deletion-insertion, duplication
  • gnomAD-style VCF (chr-pos-ref-alt, ex: 7-140753336-A-T)
    • genomic (g.): substitution, deletion, insertion
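As an illustration of the gnomAD-style VCF format listed above, the following sketch parses a chr-pos-ref-alt string and assigns it one of the three supported genomic types. The regular expression and the length-based heuristic are assumptions made for this example and are not taken from the package's own classifier.

```python
import re

# Hypothetical pattern for chr-pos-ref-alt, with an optional "chr" prefix.
GNOMAD_VCF = re.compile(
    r"^(?:chr)?([0-9]{1,2}|X|Y)-([0-9]+)-([ACGT]+)-([ACGT]+)$", re.IGNORECASE
)


def parse_gnomad_vcf(query):
    """Parse a gnomAD-style VCF string into its parts, or return None."""
    match = GNOMAD_VCF.match(query.strip())
    if match is None:
        return None
    chrom, pos, ref, alt = match.groups()
    ref, alt = ref.upper(), alt.upper()

    # Simple heuristic: compare allele lengths and shared leading bases.
    if len(ref) == 1 and len(alt) == 1:
        kind = "substitution"
    elif len(ref) > len(alt) and ref.startswith(alt):
        kind = "deletion"
    elif len(alt) > len(ref) and alt.startswith(ref):
        kind = "insertion"
    else:
        kind = "other"

    return {
        "chromosome": chrom.upper(),
        "position": int(pos),
        "ref": ref,
        "alt": alt,
        "kind": kind,
    }
```

For the example from the list, parse_gnomad_vcf("7-140753336-A-T") is classified as a substitution on chromosome 7.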

Variation Normalizer accepts input from GRCh37 or GRCh38 assemblies.

We are working towards adding more types of variations, coordinates, and representations.

VRS Versioning

The variation-normalization repo depends on VRS models, and therefore each variation-normalizer package on PyPI uses a particular version of VRS. The correspondences between packages may be summarized as:

| variation-normalization branch | variation-normalizer version | gene-normalizer version | VRS version |
| --- | --- | --- | --- |
| main | >=0.14.Z | >=0.9.Z | 2.0 |

Previous VRS Versioning

The correspondences between the packages that are no longer maintained may be summarized as:

| variation-normalization branch | variation-normalizer version | gene-normalizer version | VRS version |
| --- | --- | --- | --- |
| vrs-1.3 | 0.6.Z | 0.1.Z | 1.3 |

Available Endpoints

/to_vrs

Returns a list of validated VRS Variations.

/normalize

Returns a VRS Variation aligned to the prioritized transcript. The Variation Normalizer relies on Common Operations On Lots-of Sequences Tool (cool-seq-tool) for retrieving the prioritized transcript data. More information on the transcript selection algorithm can be found here.

If a genomic variation query includes a gene (e.g., BRAF g.140753336A>T), the associated cDNA representation is returned, because the gene provides additional strand context. If no gene is given, the GRCh38 representation is returned.
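A small helper for building request URLs against these two endpoints might look like the sketch below. It assumes a local instance served at http://localhost:8001/variation/ (the address used by the Docker setup in this README) and assumes that the free-text query is passed in a query parameter named q; check the live OpenAPI documentation for the actual parameter names before relying on this.

```python
from urllib.parse import quote, urljoin

# Assumed base URL of a locally running instance.
BASE_URL = "http://localhost:8001/variation/"

ENDPOINTS = ("to_vrs", "normalize")


def build_query_url(endpoint, query, base_url=BASE_URL):
    """Return the URL for an endpoint with the query percent-encoded.

    The parameter name "q" is an assumption made for this example.
    """
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    return f"{urljoin(base_url, endpoint)}?q={quote(query, safe='')}"


# A genomic query with a gene, and the same variant without one.
with_gene = build_query_url("normalize", "BRAF g.140753336A>T")
without_gene = build_query_url("normalize", "7-140753336-A-T")
```

Percent-encoding matters here because genomic expressions contain characters such as > and spaces that are not safe in a URL.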

Development

Clone the repo:

git clone https://github.com/cancervariants/variation-normalization.git
cd variation-normalization

For a development install, we recommend using Pipenv. See the pipenv docs for directions on installing pipenv in your compute environment.

Once installed, from the project root dir, just run:

pipenv shell
pipenv update && pipenv install --dev

Required resources

Variation Normalization relies on some local data caches which you will need to set up. We provide instructions on how to set up your development environment using Docker.

SeqRepo

Variation Normalization relies on seqrepo, which you must download yourself.

Variation Normalizer uses seqrepo to retrieve sequences at given positions on a transcript.

From the root directory:

pip install seqrepo
sudo mkdir /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i 2024-12-20/  # Replace with latest version using `seqrepo list-remote-instances` if outdated

If you get an error similar to the one below:

PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2024-12-20._fkuefgd' -> '/usr/local/share/seqrepo/2024-12-20/'

Move the temporary directory into place manually. The random suffix (._fkuefgd in this example) will differ, so use the path from your own error message:

sudo mv /usr/local/share/seqrepo/2024-12-20._fkuefgd /usr/local/share/seqrepo/2024-12-20
exit

Use the SEQREPO_ROOT_DIR environment variable to set the path of an already existing SeqRepo directory. The default is /usr/local/share/seqrepo/latest.
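The lookup described above can be sketched in a few lines of Python. The helper name seqrepo_root_dir is hypothetical and is not part of the package; only the environment variable name and the default path come from this README.

```python
import os
from pathlib import Path

# Default location stated in this README.
DEFAULT_SEQREPO_ROOT_DIR = "/usr/local/share/seqrepo/latest"


def seqrepo_root_dir(environ=None):
    """Return the SeqRepo directory, preferring SEQREPO_ROOT_DIR if it is set.

    `environ` defaults to os.environ; passing a dict makes this easy to test.
    An empty value is treated the same as an unset variable.
    """
    env = os.environ if environ is None else environ
    return Path(env.get("SEQREPO_ROOT_DIR") or DEFAULT_SEQREPO_ROOT_DIR)
```

For example, seqrepo_root_dir({"SEQREPO_ROOT_DIR": "/data/seqrepo/2024-12-20"}) returns that path, while an empty mapping falls back to the default.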

UTA

You must download uta_20241220.pgd.gz from https://dl.biocommons.org/uta/ using a web browser and move it to the root of the repository.

Docker Installation (Preferred)

We recommend installing the Variation Normalizer using Docker.

Requirements

Build, (re)create, and start containers

docker volume create uta_vol
docker compose up

[!IMPORTANT] This assumes you have a local SeqRepo installed at /usr/local/share/seqrepo/2024-12-20. If you have it installed elsewhere, please update the SEQREPO_ROOT_DIR environment variable in compose.yaml.
If you're using Docker Desktop, you'll want to go to Settings -> Resources -> File sharing and add /usr/local/share/seqrepo under the Virtual file shares section. Otherwise, you will get the following error: OSError: Unable to open SeqRepo directory /usr/local/share/seqrepo/2024-12-20.

[!TIP] If you want a clean slate, run docker compose down -v to remove containers and volumes, then docker compose up --build to rebuild and start fresh containers.

Point your browser to http://localhost:8001/variation/.

Code QC

Code style is managed by Ruff and checked prior to commit.

To perform formatting and check style:

python3 -m ruff format . && python3 -m ruff check --fix .

We use pre-commit to run conformance tests.

This ensures:

  • Style correctness
  • No large files
  • No AWS credentials are committed
  • No private keys are committed

Pre-commit must be installed before your first commit. Use the following command:

pre-commit install

Testing

From the root directory of the repository:

pytest tests/

Dependency management

Production runtime dependencies need to be updated in three places:

  • pyproject.toml declares dependencies for the wheel that's published to PyPI
  • requirements.txt declares dependencies for our Elastic Beanstalk-based deployment
    • Note that it can be trivially regenerated with the command uv pip compile pyproject.toml -o requirements.txt --no-annotate
  • Pipfile is used as a backup for Elastic Beanstalk dependency management

Note that dev/testing dependencies only need to be updated in pyproject.toml.

Creating a new release

  1. Version number must be updated manually. It's declared under project.version in pyproject.toml. Ensure that the version value for the Docker image in compose.yaml is similarly updated.
  2. Once a commit with an updated version is merged to the staging branch, create a new tag + GitHub release (from the staging branch). This triggers the PyPI and GHCR publishing workflows. Presently, new commits to staging should not be merged to main.
