Long document classification with language models

Project description

BERT Long Document Classification

An easy-to-use interface to fully trained BERT-based models for multi-class and multi-label long document classification.

Pre-trained models are currently available for two clinical note (EHR) phenotyping tasks: smoker identification and obesity detection.

To sustain future development and improvements, we interface with pytorch-transformers for all language model components of our architectures. Additionally, there is a blog post describing the architecture.

| Model | Dataset | # Labels | Evaluation F1 |
|---|---|---|---|
| n2c2_2006_smoker_lstm | I2B2 2006: Smoker Identification | 4 | 0.981 |
| n2c2_2008_obesity_lstm | I2B2 2008: Obesity and Co-morbidities Identification | 15 | 0.997 |
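
The `_lstm` suffix in the model names above suggests that per-chunk BERT representations are combined by an LSTM, but this README does not spell out how a long document is split. The following is only a minimal sketch of one common approach, assuming fixed-size, non-overlapping windows of token IDs. The function `chunk_tokens` and the default `chunk_size` of 510 (BERT's 512-position limit minus two special tokens) are illustrative assumptions and not part of this package's API.

```python
from typing import List, Sequence


def chunk_tokens(token_ids: Sequence[int], chunk_size: int = 510) -> List[List[int]]:
    """Split a token sequence into consecutive windows of at most chunk_size items.

    Hypothetical helper for illustration; not taken from bert_document_classification.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(token_ids[start:start + chunk_size])
        for start in range(0, len(token_ids), chunk_size)
    ]


if __name__ == "__main__":
    # Stand-in token IDs; a real pipeline would obtain these from a tokenizer.
    fake_ids = list(range(1200))
    chunks = chunk_tokens(fake_ids)
    print([len(chunk) for chunk in chunks])  # [510, 510, 180]
```

In a design like this, each window would be encoded separately and the sequence of window representations passed to a recurrent layer, which is what lets the input length exceed the limit of a single BERT pass.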

Installation

Install with pip:

pip install bert_document_classification

or directly:

pip install git+https://github.com/AndriyMulyar/bert_document_classification

Use

Maps text documents of arbitrary length to binary vectors indicating labels.

from bert_document_classification.models import SmokerPhenotypingBert
from bert_document_classification.models import ObesityPhenotypingBert

# Run prediction on the GPU
smoking_classifier = SmokerPhenotypingBert(device='cuda', batch_size=10)

# Or run prediction on the CPU
obesity_classifier = ObesityPhenotypingBert(device='cpu', batch_size=10)

smoking_classifier.predict(["I'm a document! Make me long and the model can still perform well!"])
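
Because `predict` is described as mapping documents to binary label vectors, a small helper can turn such a vector into readable label names. This is a sketch under assumptions: the label names and their order below are hypothetical placeholders (this README does not list them), and the vector is assumed to be a plain sequence of 0/1 values rather than whatever type `predict` actually returns.

```python
from typing import List, Sequence

# Hypothetical label names; the real names and their order are not given in this README.
EXAMPLE_LABELS = ["label_a", "label_b", "label_c", "label_d"]


def decode_labels(binary_vector: Sequence[int], label_names: Sequence[str]) -> List[str]:
    """Return the names whose position in binary_vector holds a truthy flag."""
    if len(binary_vector) != len(label_names):
        raise ValueError("vector length must match the number of label names")
    return [name for flag, name in zip(binary_vector, label_names) if flag]


if __name__ == "__main__":
    print(decode_labels([0, 1, 0, 1], EXAMPLE_LABELS))  # ['label_b', 'label_d']
```

If the classifier returns tensors, they would first need converting to a Python list (for example with `.tolist()`) before being passed to a helper like this.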

More examples.

Notes

  • For training you will need a GPU.
  • For bulk inference where speed is not a concern, a machine with plenty of available memory and CPU cores will likely suffice.
  • Model downloads are cached in ~/.cache/torch/bert_document_classification/. Try clearing this folder if you have issues.
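
Clearing the cache folder mentioned above can be scripted. The sketch below assumes the cache lives at the path given in the note and that removing the directory is enough to trigger a fresh download on next use; `clear_model_cache` is a hypothetical helper, not a function provided by this package.

```python
import shutil
from pathlib import Path

# Cache location as stated in the note above.
DEFAULT_CACHE = Path.home() / ".cache" / "torch" / "bert_document_classification"


def clear_model_cache(cache_dir: Path = DEFAULT_CACHE) -> bool:
    """Delete the cache directory and report whether anything was removed.

    Hypothetical helper for illustration; not part of bert_document_classification.
    """
    cache_dir = Path(cache_dir)
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


if __name__ == "__main__":
    # Only report the cache state here; call clear_model_cache() to actually delete it.
    state = "exists" if DEFAULT_CACHE.is_dir() else "does not exist"
    print(f"{DEFAULT_CACHE} {state}")
```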

Acknowledgement

If you found this project useful, consider citing our extended abstract accepted at NeurIPS 2019 ML4Health.

Format bibtex citation

Implementation, development and training in this project were supported by funding from the Mark Dredze Lab at Johns Hopkins University.

Download files

Source Distribution

bert_document_classification-1.0.0.tar.gz (16.3 kB view details)

Uploaded Source

Built Distribution

File details

Details for the file bert_document_classification-1.0.0.tar.gz.

File metadata

  • Download URL: bert_document_classification-1.0.0.tar.gz
  • Upload date:
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.6.8

File hashes

Hashes for bert_document_classification-1.0.0.tar.gz
Algorithm Hash digest
SHA256 74e91b3932fa34cb9008170d57c219e65a0178b800ea6928f601c6153f193450
MD5 3d1a7e85dd8fb3e5709e3a34f6e2317b
BLAKE2b-256 04cf7d774c7b9eef0f0f8299ca0a3942133c1460d9a6262e6eb0ccb07f90419d
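
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of_file` and `matches_expected` are illustrative helper names, and the example assumes the file was saved to the current directory under its published name.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, block_size: int = 1 << 20) -> str:
    """Compute the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Digest copied from the listing above; the local file location is an assumption.
    archive = Path("bert_document_classification-1.0.0.tar.gz")
    expected = "74e91b3932fa34cb9008170d57c219e65a0178b800ea6928f601c6153f193450"
    if archive.is_file():
        print("hash OK" if matches_expected(archive, expected) else "hash MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```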

File details

Details for the file bert_document_classification-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: bert_document_classification-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 18.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.6.8

File hashes

Hashes for bert_document_classification-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4d4559fa8e15d2fb800cedfdc79c14266d7b325c31ed084564ddec3707217480
MD5 ceebce09c73cabbd6a834976d6fbffc0
BLAKE2b-256 f9e0bfce41dcb17179d538c46093e04a8925b63c913dae9a269aca51b0e2d701
