VICC normalization routine for diseases

Disease Normalizer

Services and guidelines for normalizing disease terms

Installation

The Disease Normalizer is available via PyPI:

pip install disease-normalizer[etl,pg]

The [etl,pg] argument tells pip to install the extra packages required by the disease.etl package and by the PostgreSQL data storage implementation, alongside the default DynamoDB data storage implementation.

External requirements

The Disease Normalizer can retrieve most required data itself. The exception is disease terms from OMIM, for which a source file must be acquired manually and placed in the disease/data/omim folder within the library root. To access OMIM data, users must submit a request to OMIM. Once the request is approved, rename the relevant OMIM file (mimTitles.txt) according to the convention omim_YYYYMMDD.tsv, where YYYYMMDD is the date on which the file was generated, and place it in the disease/data/omim folder.
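The renaming step can be scripted. The following is a minimal sketch, not part of the Disease Normalizer itself: stage_omim_file is a hypothetical helper, and it assumes you already know the library root and the date on which the OMIM file was generated.

```python
import shutil
from datetime import date
from pathlib import Path


def stage_omim_file(downloaded: Path, library_root: Path, generated: date) -> Path:
    """Copy an OMIM mimTitles.txt file to disease/data/omim/omim_YYYYMMDD.tsv.

    Hypothetical helper for illustration; the library does not ship it.
    """
    target_dir = library_root / "disease" / "data" / "omim"
    target_dir.mkdir(parents=True, exist_ok=True)
    # Naming convention from the README: omim_YYYYMMDD.tsv
    target = target_dir / f"omim_{generated:%Y%m%d}.tsv"
    shutil.copyfile(downloaded, target)
    return target
```

For example, stage_omim_file(Path("mimTitles.txt"), Path("."), date(2023, 5, 1)) would write ./disease/data/omim/omim_20230501.tsv. The date here is only a placeholder.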

Database Initialization

The Disease Normalizer supports two data storage options:

  • DynamoDB, a NoSQL service provided by AWS. This is our preferred storage solution. In addition to cloud deployment, Amazon provides a tool for running the service locally (DynamoDB Local). Once it is downloaded, you can start the service by running java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb in a terminal (add a -port <VALUE> option to use a different port).
  • PostgreSQL, a well-known relational database technology. After starting the Postgres server process, ensure that a database has been created (we typically name ours disease_normalizer).
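Before loading data, it can help to confirm that the local database is actually accepting connections. The sketch below uses only the Python standard library; is_listening is a hypothetical helper, not a Disease Normalizer function, and it only tests that a TCP connection succeeds, not that the service behind the port is DynamoDB or PostgreSQL.

```python
import socket
from urllib.parse import urlsplit


def is_listening(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"URL must include a host and port: {url!r}")
    try:
        # Open and immediately close a connection to the endpoint.
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Calling is_listening("http://localhost:8000") checks a local DynamoDB instance on its usual port; a PostgreSQL connection URL with an explicit port can be checked the same way.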

By default, the Disease Normalizer expects to find a DynamoDB instance listening at http://localhost:8000. Alternative locations can be specified in two ways:

The first way is to set the --db_url command-line option to the URL endpoint.

disease_norm_update --update_all --db_url="http://localhost:8001"

The second way is to set the DISEASE_NORM_DB_URL environment variable to the URL endpoint.

export DISEASE_NORM_DB_URL="http://localhost:8001"

To use a PostgreSQL instance instead of DynamoDB, provide a PostgreSQL connection URL instead, e.g.

export DISEASE_NORM_DB_URL="postgresql://postgres@localhost:5432/disease_normalizer"
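The configuration rules above can be summarized in a few lines of Python. This is an illustrative sketch only: resolve_db_url and backend_for are hypothetical names, the precedence of --db_url over the environment variable is an assumption, and the library's real selection logic may differ.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

# Default DynamoDB endpoint named in the README.
DEFAULT_DB_URL = "http://localhost:8000"


def resolve_db_url(cli_value: Optional[str] = None,
                   env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the database URL from a CLI value, the environment, or the default."""
    env = os.environ if env is None else env
    # Assumption: an explicit --db_url value wins over DISEASE_NORM_DB_URL.
    return cli_value or env.get("DISEASE_NORM_DB_URL") or DEFAULT_DB_URL


def backend_for(url: str) -> str:
    """Guess the storage backend from the URL scheme (assumed heuristic)."""
    scheme = urlsplit(url).scheme
    return "postgresql" if scheme.startswith("postgres") else "dynamodb"
```

With no option and no environment variable set, resolve_db_url() falls back to the default DynamoDB endpoint, and backend_for reports "postgresql" only for URLs whose scheme starts with "postgres".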

Adding and refreshing data

Use the disease_norm_update command in a shell to update the database.

Update source(s)

The Disease Normalizer currently uses data from the following sources:

As described above, all source data other than OMIM can be acquired automatically.

To update one source, simply set --sources to the source you wish to update. The normalizer will check to see if local source data is up-to-date, acquire the most recent data if not, and use it to populate the database.

For example, run the following to acquire the latest NCIt data if necessary, and update the NCIt disease records in the normalizer database:

disease_norm_update --sources="ncit"

To update multiple sources, you can use the --sources option with the source names separated by spaces.

Update all sources

To update all sources, use the --update_all flag:

disease_norm_update --update_all

Create Merged Concept Groups

The normalize endpoint relies on merged concept groups.

To create merged concept groups, use the --update_merged flag with the --update_all flag.

python3 -m disease.cli --update_all --update_merged
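When updates are driven from a script, the command line can be assembled programmatically. The sketch below uses only the options shown in this README (--sources, --update_all, --update_merged, --db_url); build_update_command is a hypothetical helper, and the source name "omim" in the usage note is an assumption modeled on "ncit".

```python
from typing import List, Optional, Sequence


def build_update_command(sources: Optional[Sequence[str]] = None,
                         update_all: bool = False,
                         update_merged: bool = False,
                         db_url: Optional[str] = None) -> List[str]:
    """Build an argument list for the disease_norm_update command."""
    # Exactly one of `sources` and `update_all` must be chosen.
    if bool(sources) == update_all:
        raise ValueError("pass either sources or update_all=True, not both")
    command = ["disease_norm_update"]
    if update_all:
        command.append("--update_all")
    else:
        # Source names are separated by spaces within a single option value.
        command.append("--sources=" + " ".join(sources))
    if update_merged:
        command.append("--update_merged")
    if db_url:
        command.append(f"--db_url={db_url}")
    return command
```

The resulting list can be passed to subprocess.run(command, check=True). For instance, build_update_command(sources=["ncit", "omim"]) yields the two-element list ["disease_norm_update", "--sources=ncit omim"].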

Starting the disease normalization service

Once the Disease Normalizer database has been loaded, run the following from the project root:

uvicorn disease.main:app --reload

Next, view the OpenAPI docs on your local machine:

http://127.0.0.1:8000/disease

Developer instructions

The following sections include instructions specifically for developers.

Installation

For a development install, we recommend using Pipenv. See the pipenv docs for directions on installing pipenv in your environment.

To get started, clone the repo and initialize the environment:

git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
pipenv shell
pipenv update
pipenv install --dev

Alternatively, install the pg, etl, dev, and test dependency groups in a virtual environment:

git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
python3 -m virtualenv venv
source venv/bin/activate
pip install -e ".[pg,etl,dev,test]"

Init coding style tests

Code style is managed by Ruff and Black and checked prior to commit.

The pre-commit hooks check for:

  • Code style
  • File endings
  • Added large files
  • AWS credentials
  • Private keys

Before your first commit, run:

pre-commit install

Running unit tests

Tests are provided via pytest.

pytest

By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the DISEASE_TEST environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.

export DISEASE_TEST=true
pytest

Sometimes, sources will update their data, and our test fixtures and data will become incorrect. The tests/scripts/ subdirectory includes scripts to rebuild data files, although most fixtures will need to be updated manually.
