Skip to main content

Concept annotation tool for Electronic Health Records

Project description

Medical oncept Annotation Tool

Build Status Documentation Status Latest release pypi Version

MedCAT can be used to extract information from Electronic Health Records (EHRs) and link it to biomedical ontologies like SNOMED-CT and UMLS. Paper on arXiv.

Official Docs here

Discussion Forum discourse

Available Models

We have 4 public models available:

  1. UMLS Small (A modelpack containing a subset of UMLS (disorders, symptoms, medications...). Trained on MIMIC-III)
  2. SNOMED International (Full SNOMED modelpack trained on MIMIC-III)
  3. UMLS Dutch v1.10 (a modelpack provided by UMC Utrecht containing UMLS entities with Dutch names trained on Dutch medical wikipedia articles and a negation detection model repository/paper trained on EMC Dutch Clinical Corpus).
  4. UMLS Full. >4MM concepts trained self-supervsied on MIMIC-III. v2022AA of UMLS.

To download any of these models, please follow this link and sign into your NIH profile / UMLS license. You will then be redirected to the MedCAT model download form. Please complete this form and you will be provided a download link.



A demo application is available at MedCAT. This was trained on MIMIC-III and all of SNOMED-CT.


A guide on how to use MedCAT is available at MedCAT Tutorials. Read more about MedCAT on Towards Data Science.


Since MedCAT is primarily a library, logging has been effectively disabled by default. The idea is that the user of the library should have the choice of what, where, and how to log the information from a specific library they are using.

The idea is that the user can directly modify the logging behaviour of either the entire library or a certain set of modules within as they wish. We have provided a convenience method to add default handlers that log into the console as well as medcat.log (medcat.add_default_log_handlers).

Some details as to how one can configure the logging are described in the MedCAT Tutorials. PS: Currently (temporarily!) the tutorial is in the tutorials folder.


Entity extraction was trained on MedMentions In total it has ~ 35K entites from UMLS

The vocabulary was compiled from Wiktionary In total ~ 800K unique words

Powered By

A big thank you goes to spaCy and Hugging Face - who made life a million times easier.


  title="Multi-domain clinical natural language processing with {MedCAT}: The Medical Concept Annotation Toolkit",
  author="Kraljevic, Zeljko and Searle, Thomas and Shek, Anthony and Roguski, Lukasz and Noor, Kawsar and Bean, Daniel and Mascio, Aurelie and Zhu, Leilei and Folarin, Amos A and Roberts, Angus and Bendayan, Rebecca and Richardson, Mark P and Stewart, Robert and Shah, Anoop D and Wong, Wai Keong and Ibrahim, Zina and Teo, James T and Dobson, Richard J B",
  journal="Artif. Intell. Med.",

Project details

Release history Release notifications | RSS feed

This version


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

medcat-1.6.0.tar.gz (10.8 MB view hashes)

Uploaded source

Built Distribution

medcat-1.6.0-py3-none-any.whl (167.1 kB view hashes)

Uploaded py3

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page