VICC normalization routine for diseases
Disease Normalizer
Services and guidelines for normalizing disease terms
Installation
The Disease Normalizer is available via PyPI:
pip install disease-normalizer[etl,pg]
The [etl,pg] extras tell pip to install the packages required by the disease.etl package and by the PostgreSQL data storage implementation, in addition to those for the default DynamoDB data storage implementation.
External requirements
The Disease Normalizer can retrieve most required data itself. The exception is disease terms from OMIM, for which a source file must be manually acquired and placed in the disease/data/omim folder within the library root. To access OMIM data, users must first submit a request to OMIM. Once approved, the relevant OMIM file (mimTitles.txt) should be renamed according to the convention omim_YYYYMMDD.tsv, where YYYYMMDD indicates the date that the file was generated, and placed in the appropriate location.
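The renaming step can be scripted. The following is a minimal sketch, assuming the generation date is known to the user and that the target folder is disease/data/omim as described above. The helper names omim_target_path and stage_omim_file are hypothetical and are not part of the Disease Normalizer.

```python
import shutil
from datetime import date
from pathlib import Path


def omim_target_path(data_dir: Path, generated: date) -> Path:
    """Build the expected destination path: <data_dir>/omim_YYYYMMDD.tsv."""
    return data_dir / f"omim_{generated:%Y%m%d}.tsv"


def stage_omim_file(download: Path, data_dir: Path, generated: date) -> Path:
    """Copy a downloaded mimTitles.txt into the data folder under the expected name."""
    target = omim_target_path(data_dir, generated)
    # Create the data folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(download, target)
    return target


# Example call (paths and date are placeholders; adjust to your own download):
# stage_omim_file(Path("mimTitles.txt"), Path("disease/data/omim"), date(2023, 1, 1))
```

Copying rather than moving keeps the original download intact in case the date in the file name needs to be corrected later.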
Database Initialization
The Disease Normalizer supports two data storage options:
- DynamoDB, a NoSQL service provided by AWS. This is our preferred storage solution. In addition to cloud deployment, Amazon provides a downloadable tool for running the service locally (DynamoDB Local). Once downloaded, you can start the service by running
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
in a terminal (add a -port <VALUE> option to use a different port).
- PostgreSQL, a well-known relational database technology. After starting the Postgres server process, ensure that a database has been created (we typically name ours disease_normalizer).
By default, the Disease Normalizer expects to find a DynamoDB instance listening at http://localhost:8000. Alternative locations can be specified in two ways:
The first way is to set the --db_url command-line option to the URL endpoint.
disease_norm_update --update_all --db_url="http://localhost:8001"
The second way is to set the DISEASE_NORM_DB_URL environment variable to the URL endpoint.
export DISEASE_NORM_DB_URL="http://localhost:8001"
To use a PostgreSQL instance instead of DynamoDB, provide a PostgreSQL connection URL, e.g.
export DISEASE_NORM_DB_URL="postgresql://postgres@localhost:5432/disease_normalizer"
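The lookup described above can be illustrated with a small sketch. It assumes that an explicit command-line value takes precedence over the environment variable, and that the backend can be told apart by the URL scheme; both are assumptions made for illustration, and the functions resolve_db_url and backend_for are hypothetical rather than part of the library.

```python
import os
from urllib.parse import urlparse

# Default DynamoDB endpoint, as stated in the text above.
DEFAULT_DB_URL = "http://localhost:8000"


def resolve_db_url(cli_value=None, env=None):
    """Pick an endpoint: explicit value, then DISEASE_NORM_DB_URL, then the default.

    The precedence order is an assumption, not documented behavior.
    """
    env = os.environ if env is None else env
    return cli_value or env.get("DISEASE_NORM_DB_URL") or DEFAULT_DB_URL


def backend_for(url):
    """Guess the storage backend from the URL scheme (illustrative only)."""
    scheme = urlparse(url).scheme
    return "postgresql" if scheme.startswith("postgres") else "dynamodb"


url = resolve_db_url()
print(url, backend_for(url))
```

A check like this can be useful in deployment scripts to confirm which storage backend a given environment will point at before running an update.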
Adding and refreshing data
Use the disease_norm_update command in a shell to update the database.
Update source(s)
The Disease Normalizer currently uses data from the following sources:
- The National Cancer Institute Thesaurus (NCIt)
- The Mondo Disease Ontology
- The Online Mendelian Inheritance in Man (OMIM)
- OncoTree
- The Disease Ontology
As described above, all source data other than OMIM can be acquired automatically.
To update one source, set --normalizer to the source you wish to update. The normalizer will check whether local source data is up-to-date, acquire the most recent data if not, and use it to populate the database.
For example, run the following to acquire the latest NCIt data if necessary, and update the NCIt disease records in the normalizer database:
disease_norm_update --normalizer="ncit"
To update multiple sources, use the --normalizer option with the source names separated by spaces.
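When updates are driven from a script, the command line can be assembled programmatically. This is a minimal sketch: the helper build_update_command is hypothetical, and of the source names used here only "ncit" appears in this document; "mondo" is an assumed name for the Mondo source.

```python
import shlex


def build_update_command(sources, db_url=None):
    """Return an argv list for disease_norm_update covering the given sources."""
    if not sources:
        raise ValueError("at least one source name is required")
    # Source names are passed as a single space-separated value.
    argv = ["disease_norm_update", f"--normalizer={' '.join(sources)}"]
    if db_url:
        argv.append(f"--db_url={db_url}")
    return argv


# "mondo" is an assumed source name; only "ncit" is confirmed in this document.
cmd = build_update_command(["ncit", "mondo"])
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting issues with the space-separated source names.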
Update all sources
To update all sources, use the --update_all flag:
disease_norm_update --update_all
Create Merged Concept Groups
The normalize endpoint relies on merged concept groups.
To create merged concept groups, use the --update_merged flag with the --update_all flag.
python3 -m disease.cli --update_all --update_merged
Starting the disease normalization service
Once the Disease Normalizer database has been loaded, run the following from the project root:
uvicorn disease.main:app --reload
Next, view the OpenAPI docs in a browser on your local machine (uvicorn serves at http://127.0.0.1:8000 by default).
Developer instructions
The following sections include instructions specifically for developers.
Installation
For a development install, we recommend using Pipenv. See the Pipenv docs for instructions on installing Pipenv in your environment.
To get started, clone the repo and initialize the environment:
git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
pipenv shell
pipenv update
pipenv install --dev
Alternatively, install the pg, etl, dev, and test dependency groups in a virtual environment:
git clone https://github.com/cancervariants/disease-normalization
cd disease-normalization
python3 -m virtualenv venv
source venv/bin/activate
pip install -e ".[pg,etl,dev,test]"
Init coding style tests
Code style is managed by flake8 and checked prior to commit.
We use pre-commit to run conformance tests.
The pre-commit hooks run the following checks:
- Check code style
- Check for added large files
- Detect AWS Credentials
- Detect Private Key
Before your first commit, run:
pre-commit install
Running unit tests
Tests are provided via pytest.
pytest
By default, tests will employ an existing DynamoDB database. For test environments where this is unavailable (e.g. in CI), the DISEASE_TEST environment variable can be set to initialize a local DynamoDB instance with miniature versions of input data files before tests are executed.
export DISEASE_TEST=true
pytest
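The same setup can be driven from Python, for example in a CI helper script. This is a minimal sketch that only sets the DISEASE_TEST variable for the child process and invokes pytest; the function names ci_env and run_ci_tests are hypothetical.

```python
import os
import subprocess
import sys


def ci_env(base=None):
    """Return a copy of the environment with DISEASE_TEST enabled."""
    env = dict(os.environ if base is None else base)
    env["DISEASE_TEST"] = "true"
    return env


def run_ci_tests(extra_args=()):
    """Run pytest in a child process with DISEASE_TEST set; return its exit code."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", *extra_args],
        env=ci_env(),
    )
    return result.returncode
```

Setting the variable only for the child process leaves the calling shell's environment unchanged, so a later local run still uses the existing database.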
Sometimes, sources will update their data, and our test fixtures and data will become incorrect. The tests/scripts/ subdirectory includes scripts to rebuild data files, although most fixtures will need to be updated manually.