Concept annotation tool for Electronic Health Records

Project description

Medical Concept Annotation Tool

MedCAT can be used to extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS. Paper on arXiv.

News

  • New Minor Release [20. October 2021]: Introducing model packs, new faster multiprocessing for large datasets (100M+ documents) and improved MetaCAT.
  • New Release [1. August 2021]: Upgraded MedCAT to use spaCy v3. New scispaCy models have to be downloaded; all old CDBs (compatible with MedCAT v1) will work without any changes.
  • New Feature and Tutorial [8. July 2021]: Integrating 🤗 Transformers with MedCAT for biomedical NER+L
  • General [1. April 2021]: MedCAT is upgraded to v1. Unfortunately, this introduces breaking changes with older models (MedCAT v0.4), as well as potential problems with all code that used the MedCAT package. MedCAT v0.4 is available on the legacy branch and will be supported (bug fixes only) until 1. July 2021; after that it will remain available but will no longer be updated.
  • Paper: What’s in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization

Demo

A demo application is available at MedCAT. This was trained on MIMIC-III and all of SNOMED-CT.

Tutorial

A guide on how to use MedCAT is available in the tutorial folder. Read more about MedCAT on Towards Data Science.

Related Projects

  • MedCATtrainer - an interface for building, improving and customising a given Named Entity Recognition and Linking (NER+L) model (MedCAT) for biomedical domain text.
  • MedCATservice - implements the MedCAT NLP application as a service behind a REST API.
  • iCAT - A docker container for CogStack/MedCAT/HuggingFace development in isolated environments.

Install using PIP (Requires Python 3.6+)

  1. Upgrade pip: pip install --upgrade pip
  2. Install MedCAT
     • For macOS/Linux: pip install --upgrade medcat
     • For Windows (see PyTorch documentation): pip install --upgrade medcat -f https://download.pytorch.org/whl/torch_stable.html
  3. Quickstart (MedCAT v1.2+):
from medcat.cat import CAT

# Download the model_pack from the models section in the GitHub repo.
cat = CAT.load_model_pack('<path to downloaded zip file>')

# Test it
text = "My simple document with kidney failure"
entities = cat.get_entities(text)
print(entities)

# To run unsupervised training over documents
data_iterator = <your iterator>  # any iterable that yields document texts (strings)
cat.train(data_iterator)
# Once done, save the whole model_pack
cat.create_model_pack('<save path>')
  4. Quick start with separate models: New models (MedCAT v1.2+) need the spaCy en_core_web_md model, while older ones use the scispaCy models. Install the one you need, or all of them if you are not sure. If you use model packs, you do not need to download these models:
python -m spacy download en_core_web_md
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz
from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.cat import CAT
from medcat.meta_cat import MetaCAT

# Load the vocab model you downloaded
vocab = Vocab.load('<path to the vocab file>')
# Load the cdb model you downloaded
cdb = CDB.load('<path to the cdb file>')

# Download the mc_status model from the models section below and unzip it
mc_status = MetaCAT.load('<path to the unzipped mc_status directory>')
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab, meta_cats=[mc_status])

# Test it
text = "My simple document with kidney failure"
entities = cat.get_entities(text)
print(entities)

# To run unsupervised training over documents
data_iterator = <your iterator>  # any iterable that yields document texts (strings)
cat.train(data_iterator)
# Once done, you can make the current pipeline into a model_pack
cat.create_model_pack('<save path>')
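
The data_iterator placeholder in the snippets above can be any iterable of document strings. As a minimal sketch, assuming your documents are stored as UTF-8 .txt files in one folder, a generator like the following could be used. The function name iter_documents and the folder layout are illustrative assumptions, not part of MedCAT:

```python
from pathlib import Path
from typing import Iterator


def iter_documents(folder: str, pattern: str = "*.txt") -> Iterator[str]:
    """Yield the text of each matching file in `folder`, one document at a time.

    Hypothetical helper (not part of MedCAT): files are read lazily in sorted
    order, so the whole corpus never has to fit in memory at once.
    """
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty files
            yield text


# Hypothetical usage with a CAT instance loaded as in the quickstart:
# data_iterator = iter_documents("<path to folder with documents>")
# cat.train(data_iterator)
```

Because the generator is consumed once, create a fresh one for every training pass.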

Models

A basic trained model is made public. It contains ~ 35K concepts available in MedMentions.

ModelPacks

  • MedMentions with Status (Is Concept Affirmed or Negated/Hypothetical) Download

Separate models

  • Vocabulary Download - Built from MedMentions

  • CDB Download - Built from MedMentions

  • MetaCAT Status Download - Built from a sample of MIMIC-III; detects whether an annotation is Affirmed (Positive) or Other (Negated or Hypothetical)

(Note: This was compiled from MedMentions and does not have any data from NLM as that data is not publicly available.)
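
When the Status MetaCAT model is part of the pipeline, each detected entity can carry a Status meta-annotation. The sketch below shows one way to keep only affirmed entities. It assumes the output of cat.get_entities is a dictionary with an "entities" mapping whose values hold a "meta_anns" entry of the form {"Status": {"value": ..., "confidence": ...}}; check this against the output of your own model. The helper name, the confidence threshold and the sample data are illustrative, not real model output:

```python
from typing import Any, Dict, List


def affirmed_entities(output: Dict[str, Any],
                      min_confidence: float = 0.5) -> List[Dict[str, Any]]:
    """Return the entities whose Status meta-annotation is 'Affirmed'.

    Hypothetical helper: the dictionary layout is an assumption about the
    shape of the get_entities output, not a documented guarantee.
    """
    kept = []
    for ent in output.get("entities", {}).values():
        status = ent.get("meta_anns", {}).get("Status", {})
        if (status.get("value") == "Affirmed"
                and status.get("confidence", 0.0) >= min_confidence):
            kept.append(ent)
    return kept


# Hand-written sample in the assumed shape (placeholder CUIs, made-up scores).
sample = {
    "entities": {
        0: {"pretty_name": "Kidney Failure", "cui": "<cui-1>",
            "meta_anns": {"Status": {"value": "Affirmed", "confidence": 0.9}}},
        1: {"pretty_name": "Diabetes", "cui": "<cui-2>",
            "meta_anns": {"Status": {"value": "Other", "confidence": 0.8}}},
    }
}

print([ent["pretty_name"] for ent in affirmed_entities(sample)])
# prints ['Kidney Failure']
```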

SNOMED-CT and UMLS

If you have access to UMLS or SNOMED-CT and can provide some proof (a screenshot of the UMLS profile page is perfect, feel free to redact all information you do not want to share), contact us - we are happy to share the pre-built CDB and Vocab for those databases.

Acknowledgements

Entity extraction was trained on MedMentions. In total it has ~ 35K entities from UMLS.

The vocabulary was compiled from Wiktionary. In total it has ~ 800K unique words.

Powered By

A big thank you goes to spaCy and Hugging Face - who made life a million times easier.

Citation

@ARTICLE{Kraljevic2021-ln,
  title="Multi-domain clinical natural language processing with {MedCAT}: The Medical Concept Annotation Toolkit",
  author="Kraljevic, Zeljko and Searle, Thomas and Shek, Anthony and Roguski, Lukasz and Noor, Kawsar and Bean, Daniel and Mascio, Aurelie and Zhu, Leilei and Folarin, Amos A and Roberts, Angus and Bendayan, Rebecca and Richardson, Mark P and Stewart, Robert and Shah, Anoop D and Wong, Wai Keong and Ibrahim, Zina and Teo, James T and Dobson, Richard J B",
  journal="Artif. Intell. Med.",
  volume=117,
  pages="102083",
  month=jul,
  year=2021,
  issn="0933-3657",
  doi="10.1016/j.artmed.2021.102083"
}

Project details


This version

1.2.5

Download files

Source Distribution

medcat-1.2.5.tar.gz (12.2 MB view details)

Uploaded Source

Built Distribution

medcat-1.2.5-py3-none-any.whl (124.7 kB view details)

Uploaded Python 3

File details

Details for the file medcat-1.2.5.tar.gz.

File metadata

  • Download URL: medcat-1.2.5.tar.gz
  • Upload date:
  • Size: 12.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.5.0 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for medcat-1.2.5.tar.gz
Algorithm Hash digest
SHA256 9cec94a1dc74511a07d2c9da8d1254224f3c65f4961050624733c857cddfef11
MD5 40a78d72ed144ef4c7ce3a9cfaa69c5f
BLAKE2b-256 a2e52f54c031c5addf8f1dd1cbf0ec5e3af206cf1e300ca6061ae7ea2d0f4cfd

File details

Details for the file medcat-1.2.5-py3-none-any.whl.

File metadata

  • Download URL: medcat-1.2.5-py3-none-any.whl
  • Upload date:
  • Size: 124.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.5.0 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for medcat-1.2.5-py3-none-any.whl
Algorithm Hash digest
SHA256 dc41f6b2feb5aaa9e40711bda84e4a8266418e518396ba5deb44a80d6714c28b
MD5 ca4f1c96d2ff79e08bfa9f8e3022e4cb
BLAKE2b-256 f565cce85ebcb1f6c07c12b2c12811b484ab890b81eafe7b5d9045190676c0ce
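
The SHA256 digests listed above can be checked against a downloaded file before installing it. A minimal sketch using Python's standard hashlib module follows; the function names and the local file path in the usage comment are illustrative assumptions:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()


# Hypothetical usage, assuming the wheel was downloaded to the current folder:
# ok = matches_sha256(
#     "medcat-1.2.5-py3-none-any.whl",
#     "dc41f6b2feb5aaa9e40711bda84e4a8266418e518396ba5deb44a80d6714c28b",
# )
# print("hash OK" if ok else "hash mismatch, do not install")
```

pip can also enforce hashes itself through a requirements file used with its --require-hashes option.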
