This package computes a variety of similarity metrics between concepts present in the UMLS database

Overview

This package computes a variety of semantic similarity metrics between concepts present in the UMLS (Unified Medical Language System) database. It is a Python wrapper built on the Perl modules (UMLS Interface and UMLS Similarity) developed by Dr. Bridget McInnes and Dr. Ted Pedersen, and it gives Python users an accessible interface to them.

Check out the documentation here: https://pyumls-similarity.readthedocs.io/en/latest/

Available Similarity Measures

* The basic path measure --> path
* The undirected path measure --> upath
* Leacock and Chodorow (1998) --> lch
* Wu and Palmer (1994) --> wup
* Zhong, et al. (2002) --> zhong
* Rada, et al. (1989) --> cdist
* Nguyen and Al-Mubaid (2006) --> nam
* Resnik (1996) --> res
* Lin (1998) --> lin
* Jiang and Conrath (1997) --> jcn
* The vector measure --> vector
* Pekar and Staab (2002) --> pks
* Pirro and Euzenat (2010) --> faith
* Maedche and Staab (2001) --> cmatch
* Batet, et al. (2011) --> batet
* Sanchez, et al. (2012) --> sanchez
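
To give a feel for what the path-based measures in this list compute, here is a minimal sketch of path, wup, and lch on a toy is-a hierarchy. The hierarchy and all function names are hypothetical, and the formulas are the common textbook forms; the Perl modules may differ in details such as how depth is counted, so treat the numbers as illustrative only.

```python
import math

# Hypothetical toy is-a hierarchy (child -> parent); these are not real UMLS concepts.
PARENT = {
    "root": None,
    "anatomy": "root",
    "limb": "anatomy",
    "hand": "limb",
    "foot": "limb",
    "skeleton": "anatomy",
    "skull": "skeleton",
}


def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)


def path_length(a, b):
    """Number of nodes on the shortest is-a path between a and b."""
    return depth(a) + depth(b) - 2 * depth(lcs(a, b)) + 1


def path_sim(a, b):
    """Basic path measure: inverse of the path length."""
    return 1.0 / path_length(a, b)


def wup_sim(a, b):
    """Wu and Palmer: depth of the LCS relative to the depths of both concepts."""
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


def lch_sim(a, b, max_depth):
    """Leacock and Chodorow: negative log of path length scaled by taxonomy depth."""
    return -math.log(path_length(a, b) / (2.0 * max_depth))
```

In this toy hierarchy, "hand" and "skull" meet at "anatomy", so the path between them has five nodes, while "hand" and "foot" meet one level lower at "limb" and score higher on every measure.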

Installation

To install PyUMLS_Similarity, run the following command:

pip install PyUMLS-Similarity

Prerequisites

Before using the PyUMLS_Similarity package, ensure that you have the following prerequisites installed and set up:

Strawberry Perl

The package requires Strawberry Perl to run Perl scripts. Download and install it from Strawberry Perl's official website.

MySQL

A local MySQL database instance is required to store and access UMLS data. Download and install MySQL from MySQL's official download page. This package was tested on MySQL 8.1.0.

To work efficiently with the UMLS, you will want to tune your MySQL configuration. A good starting point is the set of MySQL parameters recommended in the UMLS documentation.

UMLS Data

You need to have a local instance of the UMLS installed in MySQL. This involves downloading UMLS data and importing it into your MySQL database. Follow the guidelines provided by the UMLS for obtaining a license and downloading the UMLS data.

UMLS-Interface and UMLS-Similarity Perl Modules

The package depends on the UMLS-Interface and UMLS-Similarity Perl modules. If you are interested in using feature-based semantic similarity metrics, you'll also want to download WordNet and the associated Perl modules. After installing Strawberry Perl, install these modules using cpanm:

cpanm UMLS::Interface --force
cpanm UMLS::Similarity --force
cpanm WordNet::QueryData
cpanm WordNet::Similarity
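
Before moving on, it can help to confirm that Perl and the two required modules are actually visible from Python. The sketch below is one way to do that, assuming `perl` is on your PATH; `check_prerequisites` and `perl_module_available` are hypothetical helpers and not part of PyUMLS_Similarity. It relies only on the standard `perl -MModule -e 1` idiom, which exits with a non-zero status when a module cannot be loaded.

```python
import shutil
import subprocess

# The two Perl modules this package depends on.
PERL_MODULES = ["UMLS::Interface", "UMLS::Similarity"]


def perl_module_available(module, perl="perl"):
    """Return True if `perl -M<module> -e 1` exits cleanly."""
    if shutil.which(perl) is None:
        return False
    result = subprocess.run(
        [perl, f"-M{module}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_prerequisites(modules=PERL_MODULES, perl="perl"):
    """Map each prerequisite name to whether it was found."""
    status = {"perl": shutil.which(perl) is not None}
    for module in modules:
        status[module] = perl_module_available(module, perl)
    return status
```

Calling `check_prerequisites()` returns a dictionary such as `{"perl": True, "UMLS::Interface": True, ...}`; any `False` entry points at the installation step to revisit.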

Usage

IMPORTANT: The first time you run a path-based semantic similarity calculation, the UMLS Interface must build an index of your UMLS instance in MySQL so that path calculations in later runs are efficient. This can take a long time, depending on your hardware and your MySQL configuration. The default source vocabulary (SAB) is the Medical Subject Headings (MSH) in the UMLS Metathesaurus; indexing it was relatively fast on my machine (a few minutes). You can include other SABs, such as SNOMED, LOINC, or CPT, in your UMLS Interface configuration. Be warned, however, that this greatly increases both the memory your process requires and the time needed for indexing. For example, indexing SNOMED took about 2 days.

Below are some examples of how to use the PyUMLS_Similarity package.

Start by initiating an instance of the PyUMLS_Similarity class:

from PyUMLS_Similarity import PyUMLS_Similarity

# MySQL connection details for the database that holds your UMLS data
mysql_info = {
    "username": "root",
    "password": "your_password",
    "hostname": "localhost",
    "socket": "MYSQL",
    "database": "umls"
}

umls_sim = PyUMLS_Similarity(mysql_info=mysql_info)
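
If you would rather not keep the database password in your source code, one option is to build the dictionary from environment variables. This is only a sketch: the `UMLS_MYSQL_*` variable names and the `mysql_info_from_env` helper are hypothetical, while the dictionary keys and default values are the ones used in the example above.

```python
import os


def mysql_info_from_env(env=None):
    """Build the mysql_info dict from environment variables.

    The UMLS_MYSQL_* variable names are hypothetical; the dictionary keys
    match the ones shown in the example above.
    """
    env = os.environ if env is None else env
    return {
        "username": env.get("UMLS_MYSQL_USER", "root"),
        # Required: raises KeyError if the variable is not set.
        "password": env["UMLS_MYSQL_PASSWORD"],
        "hostname": env.get("UMLS_MYSQL_HOST", "localhost"),
        "socket": env.get("UMLS_MYSQL_SOCKET", "MYSQL"),
        "database": env.get("UMLS_MYSQL_DATABASE", "umls"),
    }
```

The result can be passed as `mysql_info` when creating the `PyUMLS_Similarity` instance.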

Computing Multiple Similarity Metrics

You can compute similarity metrics between UMLS concepts as shown below.

You can either provide a list of tuples containing the CUIs to be compared:

cui_pairs = [
    ('C0018563', 'C0037303'),
    ('C0035078', 'C0035078'),
]

Or you can provide a list of tuples containing the medical terms you want to compare:

cui_pairs = [
    ('hand', 'skull'),
    ('Renal failure', 'Kidney failure'),
]
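
Because the same argument can hold either CUIs or free-text terms, it can be useful to check which kind a list contains before sending it off. The helper below is a hypothetical sketch, not part of the package. It relies only on the fact that a UMLS CUI is the letter C followed by seven digits; whether the package accepts lists that mix CUIs and terms is not stated here, so the sketch simply reports that case.

```python
import re

# UMLS CUIs are the letter C followed by seven digits, e.g. C0018563.
CUI_PATTERN = re.compile(r"C\d{7}")


def is_cui(value):
    """Return True if the string looks like a UMLS CUI."""
    return bool(CUI_PATTERN.fullmatch(value.strip()))


def describe_pairs(pairs):
    """Label a list of pairs as 'cui', 'term', or 'mixed'."""
    flags = [is_cui(item) for pair in pairs for item in pair]
    if all(flags):
        return "cui"
    if not any(flags):
        return "term"
    return "mixed"
```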

Then compute the similarity using specific measures:

measures = ['lch', 'wup']
similarity_df = umls_sim.similarity(cui_pairs, measures)

An example output would look something like this:

          Term 1          Term 2     CUI 1     CUI 2    lch    wup
0           hand           skull  C0018563  C0037303  0.500  0.700
1  Renal failure  Kidney failure  C0035078  C0035078  1.000  1.000

Finding Shortest Path

To find the shortest path between concepts:

shortest_path_df = umls_sim.find_shortest_path(cui_pairs)

An example output would look something like this:

          Term 1          Term 2     CUI 1     CUI 2  Path Length  Path
0           hand           skull  C0018563  C0037303            9  C0018563 => C1140618 => C0015385 => C0005898 =...
1  Renal failure  Kidney failure  C0035078  C0035078            1  C0035078

IMPORTANT: This function has not been optimized for performance yet and can lead to long runtimes.
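
The Path column in the example output is a single string of CUIs joined by "=>". If you need the individual concepts, a small helper like the one below can split it. This is a hypothetical sketch that assumes the separator shown in the example and a complete path string; the "..." in the example output looks like display truncation, so read the value from the DataFrame cell rather than from printed output. The example rows also suggest that Path Length counts the concepts on the path, endpoints included, which is what `count_concepts` returns.

```python
def split_path(path):
    """Split a 'CUI => CUI => ...' string into a list of CUIs."""
    return [cui.strip() for cui in path.split("=>") if cui.strip()]


def count_concepts(path):
    """Count the concepts on the path, both endpoints included."""
    return len(split_path(path))
```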

Finding Least Common Subsumer

To find the least common subsumer (LCS) of concepts:

lcs_df = umls_sim.find_least_common_subsumer(cui_pairs)

An example output would look something like this:

          Term 1          Term 2     CUI 1     CUI 2                                 LCS  Min Depth  Max Depth
0           hand           skull  C0018563  C0037303  Anatomy (MeSH Category) (C0002807)          5          5
1  Renal failure  Kidney failure  C0035078  C0035078            Renal failure (C0035078)          1          1
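
In the example output, the LCS column combines the concept name and its CUI in one string, with the CUI in the final pair of parentheses. The sketch below separates the two. `parse_lcs` is a hypothetical helper, and it assumes every LCS value follows the "name (CUI)" layout seen in the example rows.

```python
import re

# Matches "<name> (<CUI>)" where the CUI is the final parenthesised group.
LCS_PATTERN = re.compile(r"^(?P<name>.*)\s\((?P<cui>C\d{7})\)$")


def parse_lcs(value):
    """Return (name, cui) from an LCS cell such as 'Renal failure (C0035078)'."""
    match = LCS_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised LCS value: {value!r}")
    return match.group("name"), match.group("cui")
```

Names that themselves contain parentheses, such as "Anatomy (MeSH Category)", are handled because only the last parenthesised group is treated as the CUI.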

Concurrency

PyUMLS_Similarity can also run tasks concurrently. Each call to the Perl module opens a new connection to the database, and this connection overhead is the most time-consuming part of a call. Running functions sequentially or separately pays that overhead again each time. To save time, multiple functions can be run concurrently via Python's threading module, which essentially removes the overhead of any additional function calls.

tasks = [
    {'function': 'similarity', 'arguments': (cui_pairs, measures)},
    # The trailing comma makes a one-element tuple; (cui_pairs) is just the list itself
    {'function': 'shortest_path', 'arguments': (cui_pairs,)},
    {'function': 'lcs', 'arguments': (cui_pairs,)}
]

results = umls_sim.run_concurrently(tasks)
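
The general pattern behind this kind of dispatch can be sketched in a few lines. The code below is not the package's implementation, and the shape of `results` returned by `run_concurrently` is not documented here. It uses `concurrent.futures.ThreadPoolExecutor` (a higher-level wrapper over threads) in place of the threading module directly, and stand-in functions so it runs without a database. In this sketch, each task's 'arguments' value is unpacked as a tuple of positional arguments.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks_concurrently(tasks, registry):
    """Run each task's function in its own thread and collect results by name.

    `registry` maps a task's 'function' name to a callable.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {
            task["function"]: pool.submit(registry[task["function"]], *task["arguments"])
            for task in tasks
        }
        # result() blocks until the corresponding thread has finished.
        return {name: future.result() for name, future in futures.items()}


# Stand-in callables (hypothetical) so the sketch runs without MySQL or Perl.
def fake_similarity(pairs, measures):
    return [(a, b, m) for a, b in pairs for m in measures]


def fake_shortest_path(pairs):
    return [f"{a} => {b}" for a, b in pairs]


demo_pairs = [("hand", "skull")]
demo_results = run_tasks_concurrently(
    [
        {"function": "similarity", "arguments": (demo_pairs, ["lch", "wup"])},
        {"function": "shortest_path", "arguments": (demo_pairs,)},
    ],
    {"similarity": fake_similarity, "shortest_path": fake_shortest_path},
)
```

Threads suit this workload because the time is spent waiting on an external process and the database rather than on Python computation.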

Acknowledgements

This package is based on the Perl modules developed by Dr. Bridget McInnes and Dr. Ted Pedersen. The package umls-similarity by Donghua Chen also served as inspiration for this package.

Future Developments

Future developments of this package will:

  • allow for calculations of standard similarity metrics such as cosine similarity, the Sørensen-Dice index, Jaccard similarity, and others
  • allow for modifications of the UMLS Interface configuration file

