Long document classification with language models
:book: BERT Long Document Classification :book:
An easy-to-use interface to fully trained BERT-based models for multi-class and multi-label long document classification.
Pre-trained models are currently available for two clinical note (EHR) phenotyping tasks: smoker identification and obesity detection.
To sustain future development and improvements, we interface with pytorch-transformers for all language model components of our architectures. Additionally, there is a blog post describing the architecture.
Model | Dataset | # Labels | Evaluation F1 |
---|---|---|---|
n2c2_2006_smoker_lstm | I2B2 2006: Smoker Identification | 4 | 0.981 |
n2c2_2008_obesity_lstm | I2B2 2008: Obesity and Co-morbidities Identification | 15 | 0.997 |
Installation
Install with pip:
pip install bert_document_classification
or directly:
pip install git+https://github.com/AndriyMulyar/bert_document_classification
Use
Maps text documents of arbitrary length to binary vectors indicating labels.
from bert_document_classification.models import SmokerPhenotypingBert
from bert_document_classification.models import ObesityPhenotypingBert
smoking_classifier = SmokerPhenotypingBert(device='cuda', batch_size=10)  # defaults to GPU prediction
obesity_classifier = ObesityPhenotypingBert(device='cpu', batch_size=10)  # or CPU, if you prefer
smoking_classifier.predict(["I'm a document! Make me long and the model can still perform well!"])
More examples.
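Since predictions are described as binary vectors indicating labels, a small helper can turn such a vector into readable label names. The following is a minimal sketch, not part of this package: it assumes each prediction has already been converted to a plain sequence of 0/1 values (one per label), and the label names are hypothetical placeholders, since the actual label order is defined by the trained model.

```python
from typing import List, Sequence

# Hypothetical label names, for illustration only; the real label order
# is defined by the trained model, not by this snippet.
EXAMPLE_LABELS = ["label_a", "label_b", "label_c", "label_d"]


def decode_label_vector(vector: Sequence[int], labels: Sequence[str]) -> List[str]:
    """Return the names of the labels whose entry in the binary vector is 1."""
    if len(vector) != len(labels):
        raise ValueError("vector and labels must have the same length")
    return [name for name, flag in zip(labels, vector) if int(flag) == 1]


def decode_batch(vectors: Sequence[Sequence[int]], labels: Sequence[str]) -> List[List[str]]:
    """Decode one binary vector per document."""
    return [decode_label_vector(vector, labels) for vector in vectors]


# Two documents: the first has two active labels, the second has none.
print(decode_batch([[0, 1, 0, 1], [0, 0, 0, 0]], EXAMPLE_LABELS))
```

In a multi-label task several entries of a vector may be 1 at once, which is why the helper returns a list of names rather than a single label.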
Notes
- For training you will need a GPU.
- For bulk inference where speed is not a concern, a machine with plenty of available memory and CPU cores will likely suffice.
- Model downloads are cached in ~/.cache/torch/bert_document_classification/. Try clearing this folder if you have issues.
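Clearing the cache folder can also be done from Python. The following is a minimal sketch using only the standard library; the function name `clear_model_cache` is hypothetical and not part of this package, and the default path is the cache location given in the note above.

```python
import shutil
from pathlib import Path

# Cache location taken from the note above.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "torch" / "bert_document_classification"


def clear_model_cache(cache_dir: Path = DEFAULT_CACHE_DIR) -> bool:
    """Delete the cache folder and its contents.

    Returns True if a folder was removed, False if there was nothing to remove.
    (Hypothetical helper, not part of bert_document_classification.)
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_model_cache()` with no arguments targets the folder named above. Presumably the models are downloaded again the next time a classifier is constructed, so expect the first call afterwards to be slower.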
Acknowledgement
If you found this project useful, consider citing our extended abstract accepted at NeurIPS 2019 ML4Health.
Format bibtex citation
Implementation, development and training in this project were supported by funding from the Mark Dredze Lab at Johns Hopkins University.