Adapt Transformer-based language models to new text domains
Project description
This toolkit improves the performance of HuggingFace transformer models on downstream NLP tasks by adapting them to the target domain of those tasks (e.g. BERT -> LawBERT).
The overall Domain Adaptation framework can be broken down into three phases:
- Data Selection
Select a relevant subset of documents from the in-domain corpus that is likely to be beneficial for domain pre-training (see below)
- Vocabulary Augmentation
Extend the vocabulary of the transformer model with domain-specific terminology
- Domain Pre-Training
Continue pre-training the transformer model on the in-domain corpus so it learns the linguistic nuances of the target domain
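To make the Data Selection phase concrete, here is a minimal sketch that ranks candidate documents by their overlap with a set of domain terms and keeps the top-k. The term list, scoring function, and whitespace tokenization are illustrative assumptions — the toolkit's actual selection uses more sophisticated similarity and diversity metrics.

```python
# Hypothetical domain terminology; in practice this would be derived from the target corpus.
DOMAIN_TERMS = {"plaintiff", "defendant", "statute", "tort", "liability"}

def relevance(doc: str) -> float:
    """Score a document by the fraction of its tokens that are domain terms."""
    tokens = [t.strip(".,").lower() for t in doc.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in DOMAIN_TERMS)
    return hits / len(tokens)

def select_documents(corpus: list[str], keep: int) -> list[str]:
    """Keep the `keep` most domain-relevant documents for domain pre-training."""
    return sorted(corpus, key=relevance, reverse=True)[:keep]

corpus = [
    "The plaintiff filed a tort claim under the statute.",
    "I had pasta for dinner last night.",
    "Liability of the defendant was established at trial.",
]
selected = select_documents(corpus, keep=2)
```

The irrelevant document is filtered out, leaving only text worth spending pre-training compute on.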
After a model is domain-adapted, it can be fine-tuned on the downstream NLP task of choice, like any pre-trained transformer model.
Components
This toolkit provides two classes, DataSelector and VocabAugmentor, to simplify the Data Selection and Vocabulary Augmentation steps respectively.
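As an illustration of what a Vocabulary Augmentation step does, the stand-in below — not the library's actual VocabAugmentor API — finds the most frequent corpus terms missing from a base vocabulary and appends them up to a target vocabulary size. Function name, cutoff, and whitespace tokenization are all simplifying assumptions.

```python
from collections import Counter

def augment_vocab(base_vocab: list[str], corpus: list[str], target_size: int) -> list[str]:
    """Append the most frequent unseen corpus terms until the vocab reaches target_size.

    This mimics the idea behind Vocabulary Augmentation; the real VocabAugmentor
    operates on subword tokenizers rather than whitespace tokens.
    """
    known = set(base_vocab)
    counts = Counter(
        tok for doc in corpus for tok in doc.lower().split() if tok not in known
    )
    new_terms = [term for term, _ in counts.most_common(target_size - len(base_vocab))]
    return base_vocab + new_terms

base = ["the", "a", "of", "and"]
corpus = ["the statute of limitations", "the statute was amended", "tort and statute"]
vocab = augment_vocab(base, corpus, target_size=6)
```

After augmentation, frequent in-domain terms are first-class vocabulary entries instead of being split into many generic subwords.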
Installation
This package requires Python 3.6+ and can be installed with pip:
pip install transformers-domain-adaptation
Features
- Compatible with the HuggingFace ecosystem:
transformers 4.x
tokenizers
datasets
Usage
Please refer to our Colab guide!
Results
TODO