Sentence tokenizer for text from clinical notes.

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

clinitokenizer

clinitokenizer is a sentence tokenizer for clinical text. It splits unstructured text from clinical documents (such as Electronic Medical Records) into individual sentences.

General English sentence tokenizers often fail to correctly parse the medical abbreviations, jargon, and other conventions common in medical records (see the "Motivating Examples" section below). clinitokenizer is trained specifically on medical record data and can perform better in these situations. Conversely, for text outside the clinical domain, a more general sentence tokenizer may yield better results.

The model has been trained on multiple datasets provided by i2b2 (now n2c2). Please visit the n2c2 site to request access to the dataset.

Installation

pip install clinitokenizer

Quickstart

from clinitokenizer.tokenize import clini_tokenize

text = "He was asked if he was taking any medications. Patient is currently taking 5 m.g. Tylenol."
sents = clini_tokenize(text)
# sents = ['He was asked if he was taking any medications.',
#         'Patient is currently taking 5 m.g. Tylenol.']

You can use clinitokenizer as a drop-in replacement for nltk's sent_tokenize function:

# to swap in clinitokenizer, replace the nltk import...
# from nltk.tokenize import sent_tokenize

# ... with the following clinitokenizer import:
from clinitokenizer.tokenize import clini_tokenize as sent_tokenize

# and tokenizing should work in the same manner!
text = "He was asked if he was taking any medications. Patient is currently taking 5 m.g. Tylenol."
tokenized_sents = sent_tokenize(text)
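
Because both functions take a string and return a list of sentence strings, downstream code can accept the tokenizer as a parameter instead of importing one directly. The following is a minimal sketch that assumes only that call shape; `split_notes` and `period_stub` are hypothetical names used for illustration and are not part of clinitokenizer or nltk.

```python
from typing import Callable, Dict, List

# Any callable with the shape str -> list[str] fits here, for example
# clinitokenizer's clini_tokenize or nltk's sent_tokenize.
SentenceTokenizer = Callable[[str], List[str]]


def split_notes(notes: Dict[str, str], tokenize: SentenceTokenizer) -> Dict[str, List[str]]:
    """Apply a sentence tokenizer to every note, keyed by note id (hypothetical helper)."""
    return {note_id: tokenize(text) for note_id, text in notes.items()}


def period_stub(text: str) -> List[str]:
    """Stand-in tokenizer for demonstration only; it is NOT clinically aware."""
    parts = [p.strip() for p in text.split(". ") if p.strip()]
    # Restore the period that split() consumed on all but the last piece.
    return [p if p.endswith(".") else p + "." for p in parts]


if __name__ == "__main__":
    print(split_notes({"note-1": "Pt is stable. No fever."}, period_stub))
```

To use the real model, pass `clini_tokenize` (imported as in the Quickstart) in place of `period_stub`.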

Technical Details

clinitokenizer uses a bert-large Transformer model fine-tuned on sentences from Electronic Medical Records in the i2b2/n2c2 dataset. The model was fine-tuned with, and runs inference through, the Simple Transformers library, and it is hosted on HuggingFace.

The model can be run on GPU or CPU, and will automatically switch depending on availability of GPU.
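
The package's own device-selection logic is not reproduced here. As an illustration of how this kind of automatic switch is commonly written with PyTorch, the sketch below checks for CUDA and falls back to CPU; `pick_device` is a hypothetical helper, and the check for an installed `torch` is an assumption added so the snippet also runs where PyTorch is absent.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when a CUDA-capable GPU is usable, otherwise "cpu" (hypothetical helper)."""
    # If PyTorch is not installed, there is nothing to query, so assume CPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Inference would run on: {pick_device()}")
```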

Tradeoffs and Considerations

clinitokenizer uses a large neural network (about 1.2 GB) which will be downloaded and cached on-device on first run. This initial setup may take a few minutes, but should only happen once.

Compared to other off-the-shelf sentence tokenizers (e.g. nltk), clinitokenizer runs slower and consumes more memory on CPU, so if near-instant tokenization is the goal, a GPU-based machine or another tokenizer may be a better choice. On a machine with a GPU, the time difference is negligible.
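
To check whether the speed difference matters for a given workload, the tokenizers can be timed on a sample of notes. The sketch below is one simple way to do that, assuming only that a tokenizer is a callable from a string to a list of strings; `time_tokenizer` is a hypothetical helper, and no measured results are implied.

```python
import time
from typing import Callable, List, Sequence, Tuple


def time_tokenizer(
    tokenize: Callable[[str], List[str]], texts: Sequence[str]
) -> Tuple[float, int]:
    """Return (elapsed seconds, total sentences) for one pass over texts."""
    start = time.perf_counter()
    total = sum(len(tokenize(text)) for text in texts)
    elapsed = time.perf_counter() - start
    return elapsed, total


if __name__ == "__main__":
    # A trivial stand-in tokenizer; swap in clini_tokenize or sent_tokenize.
    elapsed, total = time_tokenizer(lambda t: t.split(". "), ["a. b", "c"])
    print(f"{total} sentences in {elapsed:.6f} s")
```

When timing clinitokenizer itself, call it once before measuring so that the one-time model download and load do not dominate the result.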

clinitokenizer is optimized for natural-language text in the clinical domain. Therefore, when tokenizing more general English sentences or for tasks in a different domain, other generalized tokenizers may perform better.

Additional Configuration

See the CliniTokenize class for more configuration options (more documentation coming soon).

Motivating Examples

Below are some examples of clinical text comparing clinitokenizer to nltk.tokenize.sent_tokenize:

"He was asked if he was taking any medications. Patient is currently taking 5 m.g. Tylenol."

notes: The challenge here is not mistaking m.g. for the end of a sentence.

nltk output:

He was asked if he was taking any medications.
Patient is currently taking 5 m.g.
Tylenol.

clinitokenizer output:

He was asked if he was taking any medications. 
Patient is currently taking 5 m.g. Tylenol.

"Pt. has hx of alcohol use disorder He is recovering."

notes: The challenge here is a typo after 'disorder': the period is missing. Can the tokenizer semantically identify the new sentence?

nltk output:

Pt.
has hx of alcohol use disorder He is recovering.

clinitokenizer output:

Pt. has hx of alcohol use disorder 
He is recovering.

"Pt. has hx of alcohol use disorder but He is recovering."

notes: The opposite of the previous example: here, there is an accidental capitalization. Can the tokenizer semantically identify that it is NOT a new sentence?

nltk output:

Pt.
has hx of alcohol use disorder but He is recovering.

clinitokenizer output:

Pt. has hx of alcohol use disorder but He is recovering.

"Past Medical History: Patient has PMH of COPD."

notes: "Past Medical History" is a section header. Even though the line is technically a single sentence according to English grammar, when extracting section headers it may be important to identify them as distinct from the sentences under that header.

nltk output:

Past Medical History: Patient has PMH of COPD.

clinitokenizer output:

Past Medical History: 
Patient has PMH of COPD.
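
The failure pattern in these examples comes from treating punctuation as the only boundary signal. The sketch below uses a deliberately naive regex splitter as a stand-in baseline; it is not nltk's algorithm (nltk uses the more sophisticated Punkt model), and `naive_split` and `compare` are hypothetical names. On the four inputs above it happens to produce the same splits as the nltk output listed.

```python
import re
from typing import Callable, Dict, List

EXAMPLES = [
    "He was asked if he was taking any medications. Patient is currently taking 5 m.g. Tylenol.",
    "Pt. has hx of alcohol use disorder He is recovering.",
    "Pt. has hx of alcohol use disorder but He is recovering.",
    "Past Medical History: Patient has PMH of COPD.",
]


def naive_split(text: str) -> List[str]:
    """Split wherever '.', '!' or '?' is followed by whitespace (punctuation-only baseline)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def compare(text: str, tokenizers: Dict[str, Callable[[str], List[str]]]) -> Dict[str, List[str]]:
    """Run each named tokenizer on the same text and collect the outputs."""
    return {name: tokenize(text) for name, tokenize in tokenizers.items()}


if __name__ == "__main__":
    for example in EXAMPLES:
        for name, sentences in compare(example, {"naive": naive_split}).items():
            print(name, sentences)
```

Adding `"clinitokenizer": clini_tokenize` to the dictionary passed to `compare` would show both outputs side by side for each example.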

Release history

- 0.0.5 (this version): Sep 2, 2022
- 0.0.4: May 27, 2022
- 0.0.3: May 27, 2022

Download files

Source Distribution

clinitokenizer-0.0.5.tar.gz (8.6 kB)

Uploaded Sep 2, 2022 Source

Built Distribution

clinitokenizer-0.0.5-py3-none-any.whl (8.7 kB)

Uploaded Sep 2, 2022 Python 3

File details

Details for the file clinitokenizer-0.0.5.tar.gz.

File metadata

Download URL: clinitokenizer-0.0.5.tar.gz
Upload date: Sep 2, 2022
Size: 8.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.9.13

File hashes

Hashes for clinitokenizer-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`aeda8ab8d6fa68e4e2ec3de62e3796e2fcaa97c35371665d39f0a11cfc639e87`
MD5	`95c6e9d6c6df286a51f5527eed01d1ff`
BLAKE2b-256	`1a6df10cd8d05ad1b8e8d178512c87444791a945087faa0e7c54746cb5cca1e4`

File details

Details for the file clinitokenizer-0.0.5-py3-none-any.whl.

File metadata

Download URL: clinitokenizer-0.0.5-py3-none-any.whl
Upload date: Sep 2, 2022
Size: 8.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.9.13

File hashes

Hashes for clinitokenizer-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8d68d9b6195266b2e20edadbb4530d81815706bacc68cb5ef80f2676b2ce5c1a`
MD5	`f64a722baed303f62b97e81ba741dbde`
BLAKE2b-256	`3e1b4821ad9e7b7dd281fc991a45b7dca21cda64e12fee806c8eb8e097dfeade`
