VICC normalization routines for diseases

Disease Normalizer

Services and guidelines for normalizing disease terms

Installation

The Disease Normalizer is available via PyPI:

pip install disease-normalizer[etl,pg]

The [etl,pg] argument tells pip to install the packages required by the disease.etl package and by the PostgreSQL data storage implementation, alongside the default DynamoDB data storage implementation.

External requirements

The Disease Normalizer can retrieve most required data itself, using the wags-tails library. The exception is disease terms from OMIM, for which a source file must be manually acquired and placed in the omim folder within the data directory (by default, ~/.local/share/wags_tails/omim/). To access OMIM data, users must submit a request to OMIM. Once the request is approved, the relevant OMIM file (mimTitles.txt) should be renamed according to the convention omim_YYYYMMDD.tsv, where YYYYMMDD is the date on which the file was generated, and placed in that folder.
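The renaming step can be scripted. The following is a minimal sketch and not part of the Disease Normalizer itself: install_omim_file is a hypothetical helper, and the generation date is supplied by the caller rather than read from the file.

```python
import shutil
from datetime import date
from pathlib import Path

# Default wags-tails location for OMIM data, as described above.
DEFAULT_OMIM_DIR = Path.home() / ".local" / "share" / "wags_tails" / "omim"


def install_omim_file(
    src: Path, generated: date, data_dir: Path = DEFAULT_OMIM_DIR
) -> Path:
    """Copy a downloaded mimTitles.txt to data_dir as omim_YYYYMMDD.tsv.

    Hypothetical helper for illustration only. `generated` is the date on
    which OMIM generated the file, as provided by the caller.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / f"omim_{generated:%Y%m%d}.tsv"
    shutil.copyfile(src, target)
    return target
```

For example, install_omim_file(Path("mimTitles.txt"), date(2023, 11, 28)) would write the file to omim_20231128.tsv inside the default omim folder.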

Database Initialization

The Disease Normalizer supports two data storage options:

DynamoDB, a NoSQL service provided by AWS. This is our preferred storage solution. In addition to cloud deployment, Amazon also provides a downloadable tool for running the service locally. Once it is downloaded, you can start the service by running java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb in a terminal (add a -port <VALUE> option to use a different port). By default, data will be added to a table named disease_normalizer, but an alternate table name can be given under the environment variable DISEASE_DYNAMO_TABLE.
PostgreSQL, a well-known relational database technology. After starting the Postgres server process, ensure that a database has been created (we typically name ours disease_normalizer).

By default, the Disease Normalizer expects to find a DynamoDB instance listening at http://localhost:8000. Alternative locations can be specified in two ways:

The first way is to set the --db_url command-line option to the URL endpoint.

disease_norm_update --update_all --db_url="http://localhost:8001"

The second way is to set the DISEASE_NORM_DB_URL environment variable to the URL endpoint.

export DISEASE_NORM_DB_URL="http://localhost:8001"

To use a PostgreSQL instance instead of DynamoDB, provide a PostgreSQL connection URL instead, e.g.

export DISEASE_NORM_DB_URL="postgresql://postgres@localhost:5432/disease_normalizer"
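The configuration rules above can be summarized in a short sketch. This is not the library's actual implementation: the function names are hypothetical, and both the precedence of --db_url over the environment variable and the detection of PostgreSQL by URL scheme are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_DB_URL = "http://localhost:8000"  # default DynamoDB endpoint
DEFAULT_DYNAMO_TABLE = "disease_normalizer"  # default DynamoDB table name


def resolve_db_url(
    cli_value: Optional[str] = None, env: Mapping[str, str] = os.environ
) -> str:
    """Pick the database URL.

    Assumption: an explicit --db_url value wins over DISEASE_NORM_DB_URL,
    and the documented default is used when neither is set.
    """
    return cli_value or env.get("DISEASE_NORM_DB_URL") or DEFAULT_DB_URL


def storage_backend(db_url: str) -> str:
    """Assumption: a PostgreSQL instance is recognized by its URL scheme."""
    return "postgresql" if db_url.startswith("postgresql://") else "dynamodb"


def dynamo_table_name(env: Mapping[str, str] = os.environ) -> str:
    """Read the table name from DISEASE_DYNAMO_TABLE, else use the default."""
    return env.get("DISEASE_DYNAMO_TABLE", DEFAULT_DYNAMO_TABLE)
```

Passing the environment as a plain mapping keeps the sketch easy to exercise without modifying the real shell environment.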

Adding and refreshing data

Use the disease_norm_update command in a shell to update the database.

If you encounter an error message like the following, refer to the installation instructions above:

"Encountered ModuleNotFoundError attempting to import Mondo. Are ETL dependencies installed?"

Update source(s)

The Disease Normalizer currently uses data from the following sources:

The National Cancer Institute Thesaurus (NCIt)
The Mondo Disease Ontology
The Online Mendelian Inheritance in Man (OMIM)
OncoTree
The Disease Ontology

As described above, all source data other than OMIM can be acquired automatically.

To update one source, simply set --sources to the source you wish to update. The normalizer will check to see if local source data is up-to-date, acquire the most recent data if not, and use it to populate the database.

For example, run the following to acquire the latest NCIt data if necessary, and update the NCIt disease records in the normalizer database:

disease_norm_update --sources="ncit"

To update multiple sources, you can use the --sources option with the source names separated by spaces.
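When the update is launched from a script, the space-separated source list has to stay inside a single argument. The sketch below builds such a command; build_update_command is a hypothetical helper, "ncit" is the source name used in this document, and "mondo" is an assumed name for the Mondo source.

```python
import shlex
from typing import List, Sequence


def build_update_command(sources: Sequence[str]) -> List[str]:
    """Build an argv list for disease_norm_update with space-separated sources."""
    if not sources:
        raise ValueError("at least one source name is required")
    # The source names are joined with spaces inside one --sources argument.
    return ["disease_norm_update", f"--sources={' '.join(sources)}"]


# "ncit" appears in this document; "mondo" is an assumed source name.
command = build_update_command(["ncit", "mondo"])

# Print the equivalent shell command, with quoting applied where needed.
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) to run the update.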

Update all sources

To update all sources, use the --update_all flag:

disease_norm_update --update_all

Create Merged Concept Groups

The normalize endpoint relies on merged concept groups.

To create merged concept groups, use the --update_merged flag with the --update_all flag.

python3 -m disease.cli --update_all --update_merged

Starting the disease normalization service

Once the Disease Normalizer database has been loaded, from the project root, run the following:

uvicorn disease.main:app --reload

Next, view the OpenAPI docs on your local machine:

http://127.0.0.1:8000/disease
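Once the service is running, it can also be queried from code. The sketch below uses only the standard library; the normalize path under /disease and the q query parameter are assumptions that should be checked against the OpenAPI docs, and build_query_url and fetch are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000/disease"  # local address given above


def build_query_url(
    term: str, endpoint: str = "normalize", base_url: str = BASE_URL
) -> str:
    """Build a query URL for a disease term.

    Assumption: the endpoint accepts the term in a `q` query parameter.
    """
    return f"{base_url}/{endpoint}?{urlencode({'q': term})}"


def fetch(term: str) -> dict:
    """Send the query to a running service and decode the JSON response."""
    with urlopen(build_query_url(term)) as response:
        return json.load(response)
```

Calling fetch("von Hippel-Lindau disease") requires the uvicorn process started above to be running.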

Developer instructions

The following sections include instructions specifically for developers.

Installation

For a development install, we recommend using Pipenv. See the Pipenv docs for directions on installing Pipenv in your environment.

To get started, clone the repo and initialize the environment:

git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
pipenv shell
pipenv update
pipenv install --dev

Alternatively, install the pg, etl, dev, and test dependency groups in a virtual environment:

git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
python3 -m virtualenv venv
source venv/bin/activate
pip install -e ".[pg,etl,dev,test]"

Init coding style tests

Code style is managed by Ruff and checked prior to commit.

This performs checks for:

Code style
File endings
Added large files
AWS credentials
Private keys

Before your first commit, run:

pre-commit install

Running unit tests

Tests are provided via pytest.

pytest

By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the DISEASE_TEST environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.

export DISEASE_TEST=true
pytest
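To avoid leaving DISEASE_TEST set in an interactive shell, the variable can be applied to the test process only. This is a minimal sketch with hypothetical helper names (build_test_env, run_tests); it assumes pytest is installed in the active environment.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping


def build_test_env(base: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return a copy of the environment with DISEASE_TEST enabled."""
    env = dict(base)
    env["DISEASE_TEST"] = "true"
    return env


def run_tests() -> int:
    """Run pytest in a child process that sees DISEASE_TEST=true."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest"], env=build_test_env(), check=False
    )
    return result.returncode
```

Because build_test_env copies the mapping, the parent process environment is left unchanged.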

Sometimes, sources will update their data, and our test fixtures and data will become incorrect. The tests/scripts/ subdirectory includes scripts to rebuild data files, although most fixtures will need to be updated manually.

Release history

0.4.0.dev3 pre-release

Dec 20, 2023

0.4.0.dev2 pre-release

Dec 4, 2023

This version

0.4.0.dev1 pre-release

Nov 28, 2023

0.4.0.dev0 pre-release

Oct 25, 2023

0.3.1

Jan 11, 2023

0.3.0

Nov 2, 2022

0.2.21.dev0 pre-release

May 12, 2023

0.2.20

May 7, 2023

0.2.19

Jan 11, 2023

0.2.18

Jan 10, 2023

0.2.17

Jan 6, 2023

0.2.16

Nov 2, 2022

0.2.15

Oct 31, 2022

0.2.14

Sep 23, 2022

0.2.13

Aug 24, 2022

0.2.12

Jan 24, 2022

0.2.11

Nov 18, 2021

0.2.10

Sep 7, 2021

0.2.9

Sep 7, 2021

0.2.7

Apr 15, 2021

0.2.4

Mar 31, 2021

0.2.3

Mar 30, 2021

0.2.2

Mar 29, 2021

0.2.1

Mar 29, 2021

0.2.0

Mar 12, 2021

0.1.1

Mar 3, 2021

Download files

Source Distribution

disease-normalizer-0.4.0.dev1.tar.gz (43.5 kB)

Uploaded Nov 28, 2023 Source

Built Distribution

disease_normalizer-0.4.0.dev1-py3-none-any.whl (51.3 kB)

Uploaded Nov 28, 2023 Python 3

Hashes for disease-normalizer-0.4.0.dev1.tar.gz
Algorithm	Hash digest
SHA256	`7813d9b56b195ed9a92225d036f2e9562440a8b31b6027479d3f0c77bbbe4452`
MD5	`91a902b72af52443a8a23c326fdc304b`
BLAKE2b-256	`1133fd8bf95e84bf15c14c9b554d4fb381d7a631a786123e726b640a363be0b4`

Hashes for disease_normalizer-0.4.0.dev1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`10fd09c72a05b7f7ec2f299c1592b75170c7c457a7166c17df944f1acd240e33`
MD5	`62e20139f56a82194994e67cef7c4043`
BLAKE2b-256	`adfd9b6b493b9bf995796f543ee1bda4895380d312645978d6fe0f09bde943a3`