
Concept annotation tool for Electronic Health Records

Project description

Concept Annotation Tool

A simple tool for concept annotation from UMLS or any other source.

This is still experimental

How to use

There are a few ways to run CAT, the simplest of which is Docker.

Docker

If you use Docker, the appropriate models are downloaded automatically; you only need to run:

docker build --network=host -t cat -f Dockerfile.MedMen .

Once the image is built, start a container with:

docker run --env-file=./envs/env_medann -p 5000:5000 cat

You can now access the API on

<YOUR_IP>:5000/api

OR a simple frontend

<YOUR_IP>:5000/api_test
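The request format of the /api endpoint is not documented on this page. Purely as a sketch, assuming the endpoint accepts a JSON body with a `text` field (an assumption), a request could be built with the standard library like this:

```python
import json
import urllib.request

# Hypothetical payload; the real field names expected by /api may differ.
payload = json.dumps({"text": "Patient presents with kidney failure"}).encode("utf-8")

# Build (but do not yet send) a request against a locally running container.
req = urllib.request.Request(
    "http://localhost:5000/api",
    data=payload,
    headers={"Content-Type": "application/json"},
)

# With the container running, the response could be read with:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
print(req.get_method())  # POST, since a body is attached
```

Attaching a body makes urllib default to POST, which is the usual convention for annotation endpoints.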

Installation

pip install --upgrade medcat

Please install the language models before running anything:

python -m spacy download en_core_web_sm

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.0/en_core_sci_md-0.2.0.tar.gz

Building a new Concept Database (.csv) or using an existing one

First, download the vocabulary from the Vocabulary Download link in the Models section below.
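The CSV column layout expected when building a concept database is not shown on this page. Purely as an illustration, assuming a minimal two-column format with `cui` and `str` (concept name) columns (both hypothetical names; check examples/simple_cdb.csv for the real layout), such a file could be written with the standard library:

```python
import csv
import io

# Hypothetical rows: one CUI and one name per concept. The real column
# names expected by PrepareCDB may differ from this sketch.
rows = [
    {"cui": "C0035078", "str": "kidney failure"},
    {"cui": "C0011849", "str": "diabetes mellitus"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["cui", "str"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue().splitlines()[0])  # header line: cui,str
```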

Then, in Python 3:

from cat.cat import CAT
from cat.utils.vocab import Vocab
from cat.prepare_cdb import PrepareCDB
from cat.cdb import CDB 

vocab = Vocab()

# Load the vocab model you just downloaded
vocab.load_dict('<path to the vocab file>')

# If you have an existing CDB
cdb = CDB()
cdb.load_dict('<path to the cdb file>') 

# If you need a special CDB you can build one from a .csv file
preparator = PrepareCDB(vocab=vocab)
csv_paths = ['<path to your csv_file>', '<another one>', ...] 
# e.g.
csv_paths = ['./examples/simple_cdb.csv']
cdb = preparator.prepare_csvs(csv_paths)

# Save the new CDB for later
cdb.save_dict("<path to a file where it will be saved>")

# To annotate documents we do
doc = "My simple document with kidney failure"
cat = CAT(cdb=cdb, vocab=vocab)
cat.train = False
doc_spacy = cat(doc)
# Entities are in
doc_spacy._.ents
# Or to get a json
doc_json = cat.get_json(doc)

# To have a look at the results:
from spacy import displacy
# Note that this will not show all entities, but only the longest ones
displacy.serve(doc_spacy, style='ent')

# To run cat on a large number of documents
data = [(<doc_id>, <text>), (<doc_id>, <text>), ...]
docs = cat.multi_processing(data)
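The schema of the JSON returned by cat.get_json is not documented here. Assuming it contains the original text plus an `entities` list with a `cui` and character offsets (all assumed key names), the result could be post-processed with the standard json module:

```python
import json

# Hypothetical output shape; the real keys produced by cat.get_json may differ.
doc_json = json.dumps({
    "text": "My simple document with kidney failure",
    "entities": [
        {"cui": "C0035078", "start": 24, "end": 38, "source_value": "kidney failure"},
    ],
})

parsed = json.loads(doc_json)
for ent in parsed["entities"]:
    # Slice each detected span back out of the original text.
    print(ent["cui"], parsed["text"][ent["start"]:ent["end"]])
```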

Training and Fine-tuning

To fine-tune or train everything from the ground up (excluding word-vectors), you can use the following:

# Loading a CDB or creating a new one works as shown above.

# To run the training do
f = open("<some file with a lot of medical text>", 'r')
# To fine-tune, set fine_tune=True; previous training will be preserved
cat.run_training(f, fine_tune=False)
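run_training is handed an open file over raw medical text, but the expected layout of that file is not specified here. A minimal sketch of preparing such a corpus, assuming one document per line (an assumption):

```python
import os
import tempfile

# Hypothetical corpus: one document per line (assumed layout).
docs = [
    "Patient admitted with acute kidney failure.",
    "History of diabetes mellitus, on metformin.",
]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(docs))
    path = tmp.name

# This handle is what would be passed to cat.run_training(f, ...)
f = open(path, "r")
lines = [line.strip() for line in f]
f.close()
os.unlink(path)

print(len(lines))  # 2
```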

If building from source, the requirements are

python >= 3.5 (tested with 3.7)

Everything else can be installed with pip from the requirements.txt file by running:

pip install -r requirements.txt

Results

Dataset       SoftF1   Description
MedMentions   0.798    The whole MedMentions dataset without any modifications or supervised training
MedMentions   0.786    MedMentions restricted to concepts that require disambiguation, i.e. names that map to more than one CUI
MedMentions   0.92     MedMentions filtered by TUI to only concepts that are a disease

Models

A basic trained model is made public. It was trained on the 35K entities available in MedMentions. It is quite limited, so performance may not be the best.

Vocabulary Download - Built from MedMentions

Trained CDB Download

(Note: this was compiled from MedMentions and does not contain any data from NLM, as that data is not publicly available.)

Acknowledgement

Entity extraction was trained on MedMentions. In total it covers ~35K entities from UMLS.

The dictionary was compiled from Wiktionary. In total it contains ~800K unique words. For now it is NOT made publicly available.

Project details


Release history

This version

0.1.7

Download files

Download the file for your platform.

Source Distribution

medcat-0.1.7.tar.gz (24.7 kB view details)

Uploaded Source

Built Distribution


medcat-0.1.7-py3-none-any.whl (30.2 kB view details)

Uploaded Python 3

File details

Details for the file medcat-0.1.7.tar.gz.

File metadata

  • Download URL: medcat-0.1.7.tar.gz
  • Upload date:
  • Size: 24.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.20.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.0

File hashes

Hashes for medcat-0.1.7.tar.gz
Algorithm Hash digest
SHA256 9730ab2322ef5edb0ac98c284bde36ad6824123f3384de7e4514c19794960373
MD5 d9cb0cf3d14e099c0690ccd987a8691d
BLAKE2b-256 3f51c359dbd0a39698acbaef551fb501444d7e8df426f13f44377add2c5aa631
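The published digests can be verified locally after downloading. A minimal sketch using the standard hashlib module (the file path below is a placeholder):

```python
import hashlib

def sha256_of(path, chunk_size=8192):
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the digest published above, e.g.:
# sha256_of("medcat-0.1.7.tar.gz") == "9730ab2322ef5edb0ac98c284bde36ad6824123f3384de7e4514c19794960373"
```

Streaming in chunks keeps memory use constant even for large archives.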


File details

Details for the file medcat-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: medcat-0.1.7-py3-none-any.whl
  • Upload date:
  • Size: 30.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.20.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.28.1 CPython/3.7.0

File hashes

Hashes for medcat-0.1.7-py3-none-any.whl
Algorithm Hash digest
SHA256 aad0b0ded4d664c7075cdbc4bef7eb404740c1afc6a2556c854cb7742282bf97
MD5 a8d7a39222bfd39d68c35f2488e87f09
BLAKE2b-256 eb805220e77f5bba7be3f852c577044e74811d60f5c78dbdbad06c5552f3150f

