Concept annotation tool for Electronic Health Records
Concept Annotation Tool
A simple tool for concept annotation from UMLS or any other source.
This is still experimental
How to use
There are a few ways to run CAT, the simplest being Docker.
Docker
If using Docker, the appropriate models are downloaded automatically; you only need to run:
docker build --network=host -t cat -f Dockerfile.MedMen .
Once the container is built, start it using:
docker run --env-file=./envs/env_medann -p 5000:5000 cat
You can now access the API at:
<YOUR_IP>:5000/api
Or a simple frontend at:
<YOUR_IP>:5000/api_test
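For example, the API can be queried directly from Python (a minimal sketch; the request payload and response format shown here are assumptions and may differ from the actual demo API):
import requests
# Hypothetical request - the 'text' field name is an assumption,
# check the /api_test frontend for the exact format the API expects
resp = requests.post('http://<YOUR_IP>:5000/api', json={'text': 'Patient presents with kidney failure'})
print(resp.json())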
Installation
pip install --upgrade medcat
Please install the language models before running anything:
python -m spacy download en_core_web_sm
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.0/en_core_sci_md-0.2.0.tar.gz
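To quickly verify that the language models installed correctly, you can try loading them in Python (a small sanity check, not part of the official instructions):
import spacy
# Both models should load without errors if the installation succeeded
spacy.load('en_core_web_sm')
spacy.load('en_core_sci_md')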
Building a new Concept Database (.csv) or using an existing one
First download the vocabulary from the Vocabulary Download link (see the Models section below).
Now, in Python 3+:
from cat.cat import CAT
from cat.utils.vocab import Vocab
from cat.prepare_cdb import PrepareCDB
from cat.cdb import CDB
vocab = Vocab()
# Load the vocab model you just downloaded
vocab.load_dict('<path to the vocab file>')
# If you have an existing CDB
cdb = CDB()
cdb.load_dict('<path to the cdb file>')
# If you need a special CDB you can build one from a .csv file
preparator = PrepareCDB(vocab=vocab)
csv_paths = ['<path to your csv_file>', '<another one>', ...]
# e.g.
csv_paths = ['./examples/simple_cdb.csv']
cdb = preparator.prepare_csvs(csv_paths)
# Save the new CDB for later
cdb.save_dict("<path to a file where it will be saved>")
# To annotate documents we do
doc = "My simple document with kidney failure"
cat = CAT(cdb=cdb, vocab=vocab)
cat.train = False
doc_spacy = cat(doc)
# Entities are in
doc_spacy._.ents
# Or to get a json
doc_json = cat.get_json(doc)
# To have a look at the results:
from spacy import displacy
# Note that this will not show all entities, only the longest ones
displacy.serve(doc_spacy, style='ent')
# To run cat on a large number of documents
data = [(<doc_id>, <text>), (<doc_id>, <text>), ...]
docs = cat.multi_processing(data)
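For example, the (doc_id, text) pairs can be built from a folder of plain-text files (a minimal sketch; using the file name as the document id is an assumption for illustration):
import os
# Hypothetical sketch: one document per file, file name used as the document id
folder = '<folder with text files>'
data = []
for f_name in os.listdir(folder):
    with open(os.path.join(folder, f_name), 'r') as f:
        data.append((f_name, f.read()))
docs = cat.multi_processing(data)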
Training and Fine-tuning
To fine-tune or train everything from the ground up (excluding word-vectors), you can use the following:
# Loading a CDB or creating a new one is done as above.
# To run the training do
f = open("<some file with a lot of medical text>", 'r')
# If you want to fine-tune, set it to True; old training will be preserved
cat.run_training(f, fine_tune=False)
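After training you would typically switch back to inference mode and save the updated concept database for later use (a short sketch that reuses the save call shown above):
# Training is done; disable training mode and persist the updated CDB
cat.train = False
cdb.save_dict("<path to a file where the trained CDB will be saved>")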
If building from source, the requirements are:
python >= 3.5 [tested with 3.7, but most likely works with 3+]
Everything else can be installed using pip from the requirements.txt file, by running:
pip install -r requirements.txt
Results
Dataset | SoftF1 | Description |
---|---|---|
MedMentions | 0.798 | The whole MedMentions dataset without any modifications or supervised training |
MedMentions | 0.786 | MedMentions restricted to concepts that require disambiguation, i.e. names that map to more than one CUI |
MedMentions | 0.92 | MedMentions filtered by TUI to only concepts that are a disease |
Models
A basic trained model is made public. It is trained for the 35K entities available in MedMentions. It is quite limited, so the performance might not be the best.
Vocabulary Download - Built from MedMentions
Trained CDB Download
(Note: This was compiled from MedMentions and does not have any data from NLM, as that data is not publicly available.)
Acknowledgement
Entity extraction was trained on MedMentions. In total it has ~35K entities from UMLS.
The dictionary was compiled from Wiktionary. In total it has ~800K unique words. For now it is NOT made publicly available.