Concept annotation tool for Electronic Health Records

Medical Concept Annotation Tool

A simple tool for concept annotation from UMLS/SNOMED or any other source. Paper on arXiv.

Demo

A demo application is available at MedCAT. Please note that this was trained on MedMentions and contains a very small portion of UMLS (<1%).

Install using PIP

  1. Install MedCAT

pip install --upgrade medcat

  2. Get the scispacy models:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz

  3. Download the Vocabulary and CDB from the Models section below

  4. How to use:

from medcat.cat import CAT
from medcat.utils.vocab import Vocab
from medcat.cdb import CDB 

vocab = Vocab()
# Load the vocab model you downloaded
vocab.load_dict('<path to the vocab file>')

# Load the cdb model you downloaded
cdb = CDB()
cdb.load_dict('<path to the cdb file>') 

# create cat
cat = CAT(cdb=cdb, vocab=vocab)
cat.train = False

# Test it
doc = "My simple document with kidney failure"
doc_spacy = cat(doc)
# Entities are in
doc_spacy._.ents
# Or to get a json
doc_json = cat.get_json(doc)

# To have a look at the results:
from spacy import displacy
# Note that this will not show all entities, but only the longest ones
displacy.serve(doc_spacy, style='ent')

# To train unsupervised, set the train flag to True and run
# documents through MedCAT
cat.train = True

# And now run cat again, it will train in the background
data = ['<text>', '<another text>']  # replace with your own documents
for text in data:
  _ = cat(text)

# Save the new trained cdb
cdb.save_dict('<save_path>')

# Done

Building a new Concept Database

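from medcat.cat import CAT
from medcat.utils.vocab import Vocab
from medcat.cdb import CDB 
# PrepareCDB is used below; this import path is assumed for this release
from medcat.prepare_cdb import PrepareCDB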

vocab = Vocab()
# Load the vocab model you downloaded
vocab.load_dict('<path to the vocab file>')

# If you have an existing CDB
cdb = CDB()
cdb.load_dict('<path to the cdb file>') 

# You can now add concepts from a CSV file, examples of the files can be found in ./examples
preparator = PrepareCDB(vocab=vocab)
csv_paths = ['<path to your csv_file>', '<another one>', ...] 
# e.g.
csv_paths = ['./examples/simple_cdb.csv']
cdb = preparator.prepare_csvs(csv_paths)

# Save the new CDB for later
cdb.save_dict("<path to a file where it will be saved>")
# Done
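
Before handing a list of CSV paths to the preparator, it can help to confirm that the files exist and to look at their header row, so it can be compared with the files in ./examples. The sketch below uses only the Python standard library. The helper names `existing_csvs` and `csv_header` are hypothetical, and the column names `col_a` and `col_b` in the demo are placeholders, not the format MedCAT expects.

```python
import csv
import tempfile
from pathlib import Path


def existing_csvs(paths):
    """Return the paths that point to existing files, in the original order."""
    return [p for p in paths if Path(p).is_file()]


def csv_header(path):
    """Return the column names from the first row of a CSV file.

    An empty file yields an empty list.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder file; the column names are NOT the MedCAT format
        sample = Path(tmp) / "sample.csv"
        sample.write_text("col_a,col_b\n1,x\n", encoding="utf-8")

        candidates = [str(sample), str(Path(tmp) / "missing.csv")]
        for path in existing_csvs(candidates):
            print(path, csv_header(path))
```

The filtered list can then be passed on as `csv_paths` in the example above.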

If building from source, the requirements are

python >= 3.5

Everything else can be installed with pip from the requirements.txt file by running:

pip install -r requirements.txt
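
A quick way to confirm the environment after installing is to check the interpreter version and whether the packages can be found. This is a minimal sketch using only the standard library. The helper names `python_ok` and `missing_packages` are hypothetical, and the package names checked in the demo (`medcat`, `spacy`) are taken from the imports used earlier in this document, not from requirements.txt itself.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 5)  # minimum version stated above


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("python ok:", python_ok())
    print("missing:", missing_packages(["medcat", "spacy"]))
```

An empty "missing" list only shows that the packages are importable; it does not check their versions.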

Results

| Dataset | SoftF1 | Description |
|---|---|---|
| MedMentions | 0.84 | The whole MedMentions dataset without any modifications or supervised training |
| MedMentions | 0.828 | MedMentions only for concepts that require disambiguation, i.e. names that map to more than one CUI |
| MedMentions | 0.97 | MedMentions filtered by TUI to only concepts that are a disease |

Models

A basic trained vocabulary and CDB are publicly available. They cover the ~35K concepts available in MedMentions, so the model is quite limited and performance might not be the best.

Vocabulary Download - Built from MedMentions

CDB Download - Built from MedMentions

(Note: This was compiled from MedMentions and does not contain any data from NLM, as that data is not publicly available.)

Acknowledgement

Entity extraction was trained on MedMentions. In total it has ~35K entities from UMLS.

The dictionary was compiled from Wiktionary. In total it has ~800K unique words. For now it is NOT publicly available.

