Sentence tokenizer for text from clinical notes.

clinitokenizer

clinitokenizer is a sentence tokenizer for clinical text. It splits unstructured clinical text (such as Electronic Medical Records) into individual sentences.

General English sentence tokenizers often fail to parse the medical abbreviations, jargon, and other conventions common in medical records (see "Motivating Examples" section below). clinitokenizer is trained specifically on medical record data and can perform better in these situations. Conversely, for text outside the clinical domain, a more general sentence tokenizer may yield better results.
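To make the failure mode concrete, here is a minimal sketch of a single-rule splitter that breaks on sentence-ending punctuation followed by whitespace. `naive_sent_split` is a hypothetical helper written for this illustration; it is not part of clinitokenizer or nltk, and real general-purpose tokenizers are more sophisticated than this one rule. The point is only that a dosage abbreviation such as "m.g." looks like a sentence boundary to a punctuation-based rule.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    Hypothetical, deliberately simple rule used only for illustration.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = (
    "He was asked if he was taking any medications. "
    "Patient is currently taking 5 m.g. Tylenol."
)
naive = naive_sent_split(text)
# naive == ['He was asked if he was taking any medications.',
#           'Patient is currently taking 5 m.g.',
#           'Tylenol.']
```

The second sentence is cut in two at the abbreviation, which is the kind of error a tokenizer trained on clinical text is meant to avoid.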

The model was trained on multiple datasets provided by i2b2 (now n2c2). Please visit the n2c2 site to request access to the datasets.

Quickstart

from clinitokenizer.tokenize import clini_tokenize

text = "He was asked if he was taking any medications. Patient is currently taking 5 m.g. Tylenol."
sents = clini_tokenize(text)
# sents = ['He was asked if he was taking any medications.', 'Patient is currently taking 5 m.g. Tylenol.']

You can use clinitokenizer as a drop-in replacement for nltk's sent_tokenize function:

# to swap in clinitokenizer, replace the nltk import...
# from nltk.tokenize import sent_tokenize

# ... with the following clinitokenizer import:
from clinitokenizer.tokenize import clini_tokenize as sent_tokenize

# and tokenizing should work in the same manner!
# (`text` is the string defined in the Quickstart above)
sents = sent_tokenize(text)
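Because both functions take a string and return a list of sentences, downstream code can accept the tokenizer as a parameter instead of importing one directly. The sketch below assumes that calling convention; `split_notes` and `period_tokenize` are hypothetical names introduced here, and `period_tokenize` is only a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Dict, List


def split_notes(
    notes: Dict[str, str],
    sent_tokenize: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Apply any str -> list[str] sentence tokenizer to each note body."""
    return {note_id: sent_tokenize(body) for note_id, body in notes.items()}


def period_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (hypothetical): split on every period."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


notes = {"note-1": "Pt denies chest pain. Vitals stable."}
result = split_notes(notes, period_tokenize)
# result == {'note-1': ['Pt denies chest pain.', 'Vitals stable.']}
```

To use clinitokenizer here, pass `clini_tokenize` (imported as in the Quickstart) in place of `period_tokenize`.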

Technical Details

clinitokenizer uses a bert-large Transformer model fine-tuned on sentences from Electronic Medical Records in the i2b2/n2c2 dataset. The model was fine-tuned, and runs inference, with the Simple Transformers library, and it is hosted on HuggingFace.

The model can run on GPU or CPU, and switches automatically depending on whether a GPU is available.
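clinitokenizer makes this choice internally, so no configuration is needed. For readers who want to know ahead of time which device is likely to be used, the following is a minimal sketch of the usual availability check, assuming PyTorch is the underlying backend. `pick_device` is a hypothetical helper and not part of clinitokenizer's API; it falls back to "cpu" when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu".

    Hypothetical helper for illustration; clinitokenizer performs its own
    device selection and does not expose this function.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so assume CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Inference would likely run on: {device}")
```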

Tradeoffs and Considerations

clinitokenizer uses a large neural network (about 1.2 GB) which will be downloaded and cached on-device on first run. This initial setup may take a few minutes, but should only happen once.

Compared to other off-the-shelf sentence tokenizers (e.g. nltk), clinitokenizer runs slower (especially on machines without a GPU) and consumes more memory. If near-instant tokenization is the goal, a machine with a GPU or a different tokenizer may be a better choice.
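One way to judge whether the speed tradeoff matters for a given workload is to time candidate tokenizers on a sample of your own text. The sketch below is a generic timing helper, not part of clinitokenizer; `time_tokenizer` and `stand_in` are hypothetical names, and the stand-in tokenizer exists only so the example runs without a model.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_tokenizer(
    sent_tokenize: Callable[[str], List[str]],
    texts: Iterable[str],
) -> Tuple[int, float]:
    """Return (number of sentences produced, elapsed seconds)."""
    start = time.perf_counter()
    n_sents = 0
    for text in texts:
        n_sents += len(sent_tokenize(text))
    elapsed = time.perf_counter() - start
    return n_sents, elapsed


def stand_in(text: str) -> List[str]:
    """Hypothetical stand-in: split on a period followed by a space."""
    return text.split(". ")


sample = ["Pt stable. No acute distress.", "Follow up in 2 weeks."]
n_sents, elapsed = time_tokenizer(stand_in, sample)
print(f"{n_sents} sentences in {elapsed:.6f} s")
```

When timing clinitokenizer itself, make one warm-up call first so that the one-time model download and load are not counted in the measurement.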

clinitokenizer is optimized for natural-language text in the clinical domain. Therefore, when tokenizing more general English sentences or for tasks in a different domain, other generalized tokenizers may perform better.


Download files

Source Distribution

clinitokenizer-0.0.3.tar.gz (7.7 kB)

Uploaded Source

Built Distribution

clinitokenizer-0.0.3-py3-none-any.whl (8.0 kB)

Uploaded Python 3

File details

Details for the file clinitokenizer-0.0.3.tar.gz.

File metadata

  • Download URL: clinitokenizer-0.0.3.tar.gz
  • Upload date:
  • Size: 7.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.9.9

File hashes

Hashes for clinitokenizer-0.0.3.tar.gz
Algorithm Hash digest
SHA256 634d6e5d4685ae0b47c4c375decd69f310b776bcead62c829d111b692f3c4c4c
MD5 42aff544c2743d1c9a9854fc50eea946
BLAKE2b-256 1411cd4b0be2e1f62f43958d7ca3539e8e1617fe419c90974557e3fc8bff90d6

File details

Details for the file clinitokenizer-0.0.3-py3-none-any.whl.

File hashes

Hashes for clinitokenizer-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 bb8fb133638aae04906b1d0d5e8541d7542200097f3459b8b10e31bacfc65909
MD5 0b067dcbf2e4202442427c82b5aebab6
BLAKE2b-256 28703d25c72557fc6aaeaa67716f69f35647c6e65b8fd22cae4b0c342b480a1f
