Long document classification with language models
BERT Long Document Classification
An easy-to-use interface to fully trained BERT-based models for multi-class and multi-label long document classification.
Pre-trained models are currently available for two clinical note (EHR) phenotyping tasks: smoker identification and obesity detection.
To sustain future development and improvements, we interface pytorch-transformers for all language model components of our architectures. Additionally, there is a blog post describing the architecture.
Model | Dataset | # Labels | Evaluation F1 |
---|---|---|---|
n2c2_2006_smoker_lstm | I2B2 2006: Smoker Identification | 4 | 0.981 |
n2c2_2008_obesity_lstm | I2B2 2008: Obesity and Co-morbidities Identification | 15 | 0.997 |
Installation
Install with pip:
pip install bert_document_classification
or directly:
pip install git+https://github.com/AndriyMulyar/bert_document_classification
Use
The models map text documents of arbitrary length to binary vectors indicating which labels apply.
from bert_document_classification.models import SmokerPhenotypingBert
from bert_document_classification.models import ObesityPhenotypingBert

# Load the pre-trained classifiers; prediction defaults to the GPU.
smoking_classifier = SmokerPhenotypingBert(device='cuda', batch_size=10)
# Pass device='cpu' to predict on the CPU instead.
obesity_classifier = ObesityPhenotypingBert(device='cpu', batch_size=10)

# predict() takes a list of document strings.
smoking_classifier.predict(["I'm a document! Make me long and the model can still perform well!"])
More examples.
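Since predictions are binary vectors with one position per label, a small helper can turn a vector back into readable label names. The sketch below is illustrative only: `decode_labels` and the placeholder names in `SMOKER_LABELS` are hypothetical and not part of this package, and the exact return type and label order of `predict` should be checked against the trained model before relying on it.

```python
from typing import List, Sequence

# Hypothetical placeholder names (assumption): the real label order is
# defined by the trained model, not by this sketch.
SMOKER_LABELS = ["label_0", "label_1", "label_2", "label_3"]


def decode_labels(binary_vector: Sequence[int], label_names: Sequence[str]) -> List[str]:
    """Return the names whose position in the binary vector is set to 1."""
    if len(binary_vector) != len(label_names):
        raise ValueError("vector length must match the number of labels")
    return [name for bit, name in zip(binary_vector, label_names) if int(bit) == 1]


# A multi-label vector with the first and third labels active.
active = decode_labels([1, 0, 1, 0], SMOKER_LABELS)
```

For a multi-class task exactly one position would be set, while a multi-label task such as the 15-label obesity model may set several at once.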
Notes
- For training you will need a GPU.
- For bulk inference where speed is not a concern, a machine with plenty of available memory and CPU cores will likely suffice.
- Model downloads are cached in ~/.cache/torch/bert_document_classification/. Try clearing this folder if you have issues.
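Clearing the model cache can be scripted rather than done by hand. The following is a minimal sketch using only the Python standard library; `clear_model_cache` is a hypothetical helper, not a function provided by this package, and the default path is the cache location given in the note above.

```python
import shutil
from pathlib import Path

# Cache location as stated in the notes above.
DEFAULT_CACHE = Path.home() / ".cache" / "torch" / "bert_document_classification"


def clear_model_cache(cache_dir: Path = DEFAULT_CACHE) -> bool:
    """Delete the cache directory and return True if anything was removed."""
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        # Nothing cached yet (or already cleared).
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_model_cache()` with no arguments removes the default folder; presumably the models are then downloaded again the next time a classifier is constructed.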
Acknowledgement
If you find this project useful, consider citing our extended abstract accepted at NeurIPS 2019 ML4Health.
Implementation, development and training in this project were supported by funding from the Mark Dredze Lab at Johns Hopkins University.