
Concept annotation tool for Electronic Health Records


Medical Concept Annotation Tool

A simple tool for concept annotation from UMLS or any other source.

This is still experimental

How to use

There are a few ways to run CAT

PIP Installation

pip install --upgrade medcat

Please install the language models before running anything:

python -m spacy download en_core_web_sm

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.0/en_core_sci_md-0.2.0.tar.gz
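Before going further, it can help to confirm that both model packages are importable. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and is not part of medcat or spaCy. It relies on pip-installed spaCy models being regular Python packages named after the model.

```python
from importlib.util import find_spec

# Package names of the two language models installed above.
REQUIRED_MODELS = ["en_core_web_sm", "en_core_sci_md"]


def missing_packages(names):
    """Return the names that cannot be found as importable packages.

    Hypothetical helper for illustration, not part of medcat.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODELS)
    if missing:
        print("Missing language models:", ", ".join(missing))
    else:
        print("All language models are installed.")
```

If a model is reported as missing, rerun the matching install command above.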

Building a new Concept Database (.csv) or using an existing one

First, download the vocabulary using the Vocabulary Download link (see Models below).

Now, in Python 3:

from medcat.cat import CAT
from medcat.utils.vocab import Vocab
from medcat.prepare_cdb import PrepareCDB
from medcat.cdb import CDB 

vocab = Vocab()

# Load the vocab model you just downloaded
vocab.load_dict('<path to the vocab file>')

# If you have an existing CDB
cdb = CDB()
cdb.load_dict('<path to the cdb file>') 

# If you need a special CDB you can build one from a .csv file
preparator = PrepareCDB(vocab=vocab)
# csv_paths is a list of one or more paths, e.g. ['<path to your csv_file>', '<another one>']
csv_paths = ['./examples/simple_cdb.csv']
cdb = preparator.prepare_csvs(csv_paths)

# Save the new CDB for later
cdb.save_dict("<path to a file where it will be saved>")

# To annotate documents we do
doc = "My simple document with kidney failure"
cat = CAT(cdb=cdb, vocab=vocab)
cat.train = False
doc_spacy = cat(doc)
# Entities are in
doc_spacy._.ents
# Or to get a json
doc_json = cat.get_json(doc)

# To have a look at the results:
from spacy import displacy
# Note that this will not show all entities, but only the longest ones
displacy.serve(doc_spacy, style='ent')

# To run CAT on a large number of documents, build a list of (doc_id, text) tuples
data = [(1, '<text of the first document>'), (2, '<text of the second document>')]
docs = cat.multi_processing(data)
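The list of (doc_id, text) tuples for a large corpus can be built from files on disk. The following is a minimal sketch, assuming one plain-text document per file; the helper name `build_data`, the folder `./notes` and the choice of the file name (without extension) as doc_id are all hypothetical and not part of medcat.

```python
from pathlib import Path


def build_data(folder, pattern="*.txt"):
    """Return [(doc_id, text), ...] for every matching file in `folder`.

    Hypothetical helper: the doc_id is the file name without its extension.
    """
    data = []
    # Sort the paths so the order of documents is reproducible.
    for path in sorted(Path(folder).glob(pattern)):
        data.append((path.stem, path.read_text(encoding="utf-8")))
    return data


# Usage, with `cat` created as shown earlier (the folder name is an assumption):
# docs = cat.multi_processing(build_data("./notes"))
```

Reading every file into memory up front is the simplest approach, but for very large corpora it may be worth processing the files in smaller batches.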

Training and Fine-tuning

To fine-tune or train everything from the ground up (excluding word-vectors), you can use the following:

# Loading a CDB or creating a new one works as shown above.

# To run the training, pass an open text file.
# Set fine_tune=True to fine-tune; the previous training is then preserved.
with open("<some file with a lot of medical text>", 'r') as f:
    cat.run_training(f, fine_tune=False)
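When the training text is spread over several files, the lines can be streamed one file at a time instead of being concatenated first. The following is a sketch under one stated assumption: that `run_training` only iterates over its first argument line by line, as it would over an open file. This is not confirmed by the text above, so check it against your medcat version. The helper name `iter_lines` is hypothetical.

```python
def iter_lines(paths, encoding="utf-8"):
    """Yield the lines of each file in `paths`, one file after another.

    Hypothetical helper: each file is closed as soon as it is exhausted.
    """
    for path in paths:
        with open(path, "r", encoding=encoding) as f:
            yield from f


# Usage, with `cat` created as shown earlier.
# ASSUMPTION: run_training accepts any iterable of lines, not only a file object.
# cat.run_training(iter_lines(["<file 1>", "<file 2>"]), fine_tune=False)
```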

If building from source, the requirements are:

Python >= 3.5 (tested with 3.7; other Python 3 versions will most likely work)

Everything else can be installed with pip from the requirements.txt file by running:

pip install -r requirements.txt

Results

Dataset | SoftF1 | Description
MedMentions | 0.83 | The whole MedMentions dataset, without any modifications or supervised training
MedMentions | 0.828 | MedMentions, only for concepts that require disambiguation, or names that map to more than one CUI
MedMentions | 0.93 | MedMentions filtered by TUI to only concepts that are a disease

Models

A basic trained model is made public for the vocabulary. It is trained for the 35K entities available in MedMentions. It is quite limited so the performance might not be the best.

Vocabulary Download - Built from MedMentions

(Note: This was compiled from MedMentions and does not contain any data from NLM, as that data is not publicly available.)

Acknowledgement

Entity extraction was trained on MedMentions. In total, it has ~35K entities from UMLS.

The dictionary was compiled from Wiktionary. In total, it has ~800K unique words. For now, it is NOT publicly available.
