
Clinical NLP Transformers (cnlp_transformers)

Transformers for Clinical NLP

This library was created to add abstractions on top of the Huggingface Transformers library for many clinical NLP research use cases. Primary use cases include

  1. simplifying multiple tasks related to fine-tuning of transformers for building models for clinical NLP research, and
  2. creating inference APIs that will allow downstream researchers easier access to clinical NLP outputs.

This library is not intended to serve as a home for clinical NLP applications. If you build something cool that uses our model definitions, the best practice is to depend on cnlp_transformers as a library rather than treating it as your workspace. This library is also not intended as a deployment-ready tool for scalable clinical NLP. There is a lot of interest in developing methods and tools that are smaller and can process millions of records, and this library can potentially be used for research along those lines, but it will probably never be heavily optimized or packaged for production use. However, it should contain plenty of examples and useful code for people interested in that type of deployment.

Install

[!IMPORTANT] When installing the library's dependencies, PyTorch will probably be installed with CUDA 12.6 support by default on Linux, and without CUDA support on other platforms. If you would like to run the library in CPU-only mode or with a specific version of CUDA, install PyTorch to your desired specifications in your virtual environment before installing cnlp-transformers. See here if using uv.

Static installation

If you are installing just to fine-tune or run the REST APIs, you can install without cloning using uv:

uv pip install cnlp-transformers

Or with pip:

pip install cnlp-transformers

If you prefer, prebuilt Docker images are also available to run the REST APIs in a network. An example Docker Compose configuration is also available for reference.

Editable installation

If you want to modify code (e.g., for developing new models), then install locally:

  1. Clone this repository:

    # Either the HTTPS method...
    $ git clone https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers.git
    # ...or the SSH method
    $ git clone git@github.com:Machine-Learning-for-Medical-Language/cnlp_transformers.git
    
  2. Enter the repo: cd cnlp_transformers

  3. Follow the instructions here to set up your Python environment.

Fine-tuning

The main entry point for fine-tuning is the cnlp_transformers/src/cnlpt/train_system.py script. Run it with no arguments to see an extensive list of available options, which inherit from and extend the Huggingface training options.

Workflow

To use the library for fine-tuning, you'll need to take the following steps:

  1. Write your dataset to one of the following formats in a folder with train, dev, and test files:

    1. csv or tsv: The first row should contain column names separated by commas or tabs. The name text has special meaning as the input string. Likewise, if there are columns named text_a and text_b, they will be interpreted as two parts of a transformer input string separated by a <sep>-token equivalent. All other columns are treated as potential targets; their names can be passed to the train_system.py script as --task_name arguments. For tagging targets, the field must consist of space-delimited labels, one per space-delimited token in the text field. For relation extraction targets, the field must be a comma-delimited list of relation tuples, where each tuple is (<offset 1>, <offset 2>, <label>) and the offsets are token indices into the space-delimited tokens of the text field.

    2. json: The file format must be the following:

      { 
        "data": [
          { 
            "text": "<text of instance>",
            "id": "<instance id>",
            "<sub-task 1 name>": "<instance label>",
            "<sub-task 2 name>": "<instance label>",
            // ... other labels
          },
          // ...
          {
            // instance N
          },
        ],
        "metadata": {
          "version": "<optional dataset versioning>",
          "task": "<overall task/dataset name>",
          "subtasks": [
            {
              "task_name": "<sub-task 1 name>",
              "output_mode": "<sub-task output mode (e.g. tagging, relex, classification)>",
            },
            // ...
            {
              "task_name": "<sub-task n name>",
              "output_mode": "<sub-task output mode (e.g. tagging, relex, classification)>",
            }
          ]
        }
      }
      

      Instance labels should be formatted the same way as in the csv/tsv example above; see specifically the formats for tagging and relations. The metadata field can either be included in the train/dev/test files or provided as a separate metadata.json file.

  2. Run train_system.py with a --model-type (one of cnn, lstm, hier, or proj) and a --data-dir (the path to the folder you created in step 1). Optionally specify one or more --task names to train on; by default, all tasks are trained.
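
To make the csv/tsv format from step 1 concrete, here is a minimal sketch that writes a toy dataset folder. The task column names (polarity, tags) and label values are hypothetical, chosen for illustration only; the only column name the library treats specially is text:

```python
import csv
from pathlib import Path

# Hypothetical two-task dataset: a sentence-level "polarity" label and a
# token-level "tags" column with one space-delimited label per token.
rows = [
    {"text": "denies chest pain", "polarity": "-1", "tags": "O B-sym I-sym"},
    {"text": "reports mild headache", "polarity": "1", "tags": "O O B-sym"},
]

data_dir = Path("my_dataset")
data_dir.mkdir(exist_ok=True)
for split in ("train", "dev", "test"):  # same toy rows in every split
    with open(data_dir / f"{split}.tsv", "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["text", "polarity", "tags"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)

# Sanity check: each tagging field has one label per whitespace token.
for row in rows:
    assert len(row["tags"].split()) == len(row["text"].split())
```

You could then point --data-dir at my_dataset and, per the description above, pass the column names (e.g., polarity) as task names.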

Step-by-step finetuning examples

We provide the following step-by-step examples of fine-tuning on clinical NLP tasks:

1. Classification task: using the Drug Reviews (Druglib.com) dataset

2. Sequence tagging task: using the ChemProt dataset

Fine-tuning options

Run cnlpt train --help to see all the available options. In addition to inherited Huggingface Transformers options, there are options to do the following:

  • Select different models: --model hier uses a hierarchical transformer layer on top of a specified encoder model. We recommend using a very small encoder: --encoder microsoft/xtremedistil-l6-h256-uncased so that the full model fits into memory.
  • Run simple baselines (--model cnn|lstm --tokenizer roberta-base; since there is no corresponding HF model, you must specify the tokenizer explicitly)
  • Use a different layer's CLS token for the classification (e.g., --layer 10)
  • Probabilistically freeze encoder weights while leaving all classifier weights unfrozen: --freeze alone freezes all encoder weights; --freeze <float>, given a value between 0 and 1, freezes that fraction of the encoder weights
  • Classify based on a token embedding instead of the CLS embedding (--token; applies only to the event/entity classification setting, and requires the input to have XML-style tags (<e>, </e>) around the tokens representing the event/entity)
  • Use a class-weighted loss function (--class_weights)

Running REST APIs

This library supports serving a REST API for your model with a single /process endpoint to process text and generate predictions, via the cnlpt rest command.

Run cnlpt rest --help to see available options. The only required option is --model, which must be either a HuggingFace repository or a local directory containing your model. By default, the model will be served at http://localhost:8000.

For example, to run our negation detection model from HuggingFace:

cnlpt rest --model mlml-chip/negation_pubmedbert_sharpseed

Once the application is running, you can either interact with it via the web interface at http://localhost:8000/docs or manually send requests to the /process endpoint:

>>> import requests
>>> from pprint import pprint
>>> sent = "The patient has a sore knee and headache but denies nausea and has no anosmia."
>>> ents = [(18, 27), (32, 40), (52, 58), (70, 77)]
>>> doc = {"text": sent, "entity_spans": ents}
>>> resp = requests.post("http://localhost:8000/process", json=doc)
>>> pprint(resp.json())
[{'Negation': {'prediction': '-1',
               'probs': {'-1': 0.9997619986534119, '1': 0.0002379878715146333}},
  'text': 'The patient has a <e>sore knee</e> and headache but denies nausea '
          'and has no anosmia.'},
 {'Negation': {'prediction': '-1',
               'probs': {'-1': 0.9995606541633606, '1': 0.0004393413255456835}},
  'text': 'The patient has a sore knee and <e>headache</e> but denies nausea '
          'and has no anosmia.'},
 {'Negation': {'prediction': '1',
               'probs': {'-1': 0.007858583703637123, '1': 0.9921413660049438}},
  'text': 'The patient has a sore knee and headache but denies <e>nausea</e> '
          'and has no anosmia.'},
 {'Negation': {'prediction': '1',
               'probs': {'-1': 0.0071166763082146645, '1': 0.9928833246231079}},
  'text': 'The patient has a sore knee and headache but denies nausea and has '
          'no <e>anosmia</e>.'}]
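
The entity_spans in the request above are character offsets into text, which you can verify directly:

```python
sent = "The patient has a sore knee and headache but denies nausea and has no anosmia."
ents = [(18, 27), (32, 40), (52, 58), (70, 77)]

# Each (start, end) pair slices one entity mention out of the input string.
mentions = [sent[start:end] for start, end in ents]
print(mentions)  # ['sore knee', 'headache', 'nausea', 'anosmia']
```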

You can also serve multiple models at once by providing a router prefix for each model, e.g.:

cnlpt rest --model /negation=mlml-chip/negation_pubmedbert_sharpseed --model /temporal=mlml-chip/thyme2_colon_e2e
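
Assuming each router prefix namespaces that model's endpoints (so the negation model above would answer at /negation/process; this is an inference from the prefix syntax, not confirmed here), the request URLs can be composed as:

```python
# Assumption: each `--model /prefix=...` mounts that model's /process
# endpoint under the given prefix, e.g. /negation/process.
BASE = "http://localhost:8000"

def process_url(prefix: str) -> str:
    return f"{BASE}{prefix}/process"

urls = [process_url(p) for p in ("/negation", "/temporal")]
print(urls)
```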

Citing cnlp_transformers

Please use the following BibTeX entry to cite cnlp_transformers if you use it in a publication:

@misc{cnlp_transformers,
  author       = {CNLPT},
  title        = {Clinical {NLP} {Transformers} (cnlp\_transformers)},
  year         = {2021},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers}},
}

Publications using cnlp_transformers

Please send us any citations that used this library!

  1. Chen S, Guevara M, Ramirez N, Murray A, Warner JL, Aerts HJWL, et al. Natural Language Processing to Automatically Extract the Presence and Severity of Esophagitis in Notes of Patients Undergoing Radiotherapy. JCO Clin Cancer Inform. 2023 Jul;(7):e2300048.
  2. Li Y, Miller T, Bethard S, Savova G. Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information [Internet]. arXiv.org. 2024 [cited 2025 May 22]. Available from: https://arxiv.org/abs/2410.12774v1
  3. Wang L, Li Y, Miller T, Bethard S, Savova G. Two-Stage Fine-Tuning for Improved Bias and Variance for Large Pretrained Language Models. In: Rogers A, Boyd-Graber J, Okazaki N, editors. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) [Internet]. Toronto, Canada: Association for Computational Linguistics; 2023 [cited 2025 May 22]. p. 15746–61. Available from: https://aclanthology.org/2023.acl-long.877/
  4. Miller T, Bethard S, Dligach D, Savova G. End-to-end clinical temporal information extraction with multi-head attention. Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:313–9.
  5. Yoon W, Ren B, Thomas S, Kim C, Savova G, Hall MH, et al. Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction [Internet]. arXiv; 2025 [cited 2025 May 22]. Available from: http://arxiv.org/abs/2502.10388
  6. Wang L, Zipursky AR, Geva A, McMurry AJ, Mandl KD, Miller TA. A computable case definition for patients with SARS-CoV2 testing that occurred outside the hospital. JAMIA Open. 2023 Oct 1;6(3):ooad047.
  7. Bitterman DS, Goldner E, Finan S, Harris D, Durbin EB, Hochheiser H, et al. An End-to-End Natural Language Processing System for Automatically Extracting Radiation Therapy Events From Clinical Texts. Int J Radiat Oncol Biol Phys. 2023 Sep 1;117(1):262–73.
  8. McMurry AJ, Gottlieb DI, Miller TA, Jones JR, Atreja A, Crago J, et al. Cumulus: A federated EHR-based learning system powered by FHIR and AI. medRxiv. 2024 Feb 6;2024.02.02.24301940.
  9. LCD benchmark: long clinical document benchmark on mortality prediction for language models. J Am Med Inform Assoc [Internet]. [cited 2025 Jan 23]. Available from: https://academic.oup.com/jamia/article-abstract/32/2/285/7909835
