Aims to be a convenient NLP library, with help from HuggingFace
Welcome to that-nlp-library
Install
pip install that_nlp_library
It is advised that you manually install torch (with the CUDA version compatible with your GPU, if you have one). Typically it's:
pip3 install torch --index-url https://download.pytorch.org/whl/cu118
Visit the PyTorch page for more information.
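After installing, a quick sanity check (standard PyTorch, nothing library-specific) confirms whether your GPU is visible:

```python
import torch

# True if this torch build has CUDA support and a GPU is visible
print(torch.cuda.is_available())
```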
High-Level Overview
Supervised Learning
For supervised learning, the main pipeline contains two parts:
Text Data Controller: TextDataController (for text processing)
The controller provides a sequence of optional processing steps, applied in order (all of them are demonstrated in the second example below); you can skip any step you don't need.
Here is an example of the Text Controller for a classification task (predicting Division Name), without any text preprocessing. The code will also tokenize your text field.
```python
from transformers import RobertaTokenizer
from that_nlp_library.text_main import TextDataController  # assumed import path

tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  label_names='Division Name',
                                  sup_types='classification',
                                 )
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer, max_length=100, shuffle_trn=True)
```
And here is an example where all processing steps are applied:
```python
from functools import partial
from underthesea import text_normalize
import nlpaug.augmenter.char as nac

# define the augmentation function
def nlp_aug(x, aug=None):
    results = aug.augment(x)
    if not isinstance(x, list): return results[0]
    return results

aug = nac.KeyboardAug(aug_char_max=3, aug_char_p=0.1, aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug, aug=aug)
```
```python
# initialize the TextDataController
tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  # metadatas
                                  metadatas='Title',
                                  # label
                                  label_names='Division Name',
                                  sup_types='classification',
                                  # fix a typo in the raw labels: 'Initmates' -> 'Intimates'
                                  label_tfm_dict={'Division Name': lambda x: x if x!='Initmates' else 'Intimates'},
                                  # row filter
                                  filter_dict={'Review Text': lambda x: x is not None,
                                               'Division Name': lambda x: x is not None,
                                              },
                                  # text transformation
                                  content_transformations=[text_normalize, str.lower],
                                  # validation split
                                  val_ratio=0.2,
                                  stratify_cols=['Division Name'],
                                  # upsampling
                                  upsampling_list=[('Division Name', lambda x: x=='Intimates')],
                                  # text augmentation
                                  content_augmentations=nearby_aug_func
                                 )
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer, max_length=100, shuffle_trn=True)
```
For an in-depth tutorial on the Text Controller for Supervised Learning (TextDataController), please visit here.
This library also provides a streamed version of the Text Controller (TextDataControllerStreaming), allowing you to work with data without having it entirely on your hard drive. You can still perform all the processing steps of the non-streamed version, except for the train/validation split (which means you have to define your validation set beforehand) and upsampling.
For more details on streaming, visit how to create a streamed dataset and how to train a model with a streamed dataset.
If you are curious about the time and space efficiency of the streamed versus the non-streamed version, visit the benchmark here.
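As a rough sketch, the streamed flow could look like the following. The module path and constructor arguments here are assumptions that mirror the non-streamed example above; the streaming tutorial has the real API.

```python
from datasets import load_dataset
from transformers import RobertaTokenizer
from that_nlp_library.text_main_streaming import TextDataControllerStreaming  # assumed module path

# stream the raw data instead of materializing it; train/validation must be
# pre-split, because the streamed controller cannot split or upsample
streamed = load_dataset('csv',
                        data_files={'train': 'sample_data/train.csv',
                                    'validation': 'sample_data/val.csv'},
                        streaming=True)
tdc = TextDataControllerStreaming(streamed,                 # assumed constructor signature
                                  main_text='Review Text',
                                  label_names='Division Name',
                                  sup_types='classification')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer, max_length=100)
```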
Model and ModelController
The library can perform the following:

- Classification (simple tutorial)
- Multiheads, where each head can be either classification or regression
  - "Multihead" is when your model needs to predict multiple outputs at once; for example, given a sentence (e.g. a review on an e-commerce site), you have to predict what category the sentence is about, the sentiment of the sentence, and maybe the rating of the sentence.
  - For the above example, this is a 3-head problem: classification (for category), classification (for sentiment), and regression (for rating from 1 to 5).
- For 2-head classification where there's a hierarchical relationship between the first output and the second (e.g. the first output is the level-1 clothing category, and the second output is the level-2 clothing subcategory), you can utilize two approaches designed for this use case: training with conditional probability, or with deep hierarchical classification. A generic sketch of the conditional-probability idea follows this list.
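The sketch below is illustrative only and is not this library's API: it shows one common way to wire two heads so that the level-2 prediction is conditioned on the level-1 distribution, i.e. P(l1, l2) = P(l1) * P(l2 | l1).

```python
import torch
import torch.nn as nn

class TwoHeadHierarchical(nn.Module):
    """Illustrative only -- not this library's API."""
    def __init__(self, hidden_size: int, n_level1: int, n_level2: int):
        super().__init__()
        self.head1 = nn.Linear(hidden_size, n_level1)             # level-1 category
        self.head2 = nn.Linear(hidden_size + n_level1, n_level2)  # level-2, conditioned on level-1

    def forward(self, pooled):                       # pooled: (batch, hidden_size)
        logits1 = self.head1(pooled)
        p1 = logits1.softmax(dim=-1)                 # P(l1)
        # condition the level-2 head on the level-1 probabilities,
        # so the joint factorizes as P(l1) * P(l2 | l1)
        logits2 = self.head2(torch.cat([pooled, p1], dim=-1))
        return logits1, logits2
```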
Decoupling of Text Controller and Model Controller
In this library, you can use TextDataController on its own to handle all the text processing and get back the final processed HuggingFace DatasetDict. Conversely, if you already have your own processed DatasetDict, you can skip the text controller and use only the ModelController to train your model. There's a quick tutorial on this decoupling here.
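In outline, and purely as a hypothetical sketch (the module path and the ModelController argument names below are assumptions; the tutorial shows the actual signature), the decoupled path looks like:

```python
from datasets import load_dataset
from that_nlp_library.model_main import ModelController  # assumed module path

# your own DatasetDict, built however you like -- no TextDataController involved
my_ddict = load_dataset('csv', data_files={'train': 'train.csv',
                                           'validation': 'val.csv'})
# ... apply your own tokenization with my_ddict.map(...) ...

# then hand it straight to the model controller for training
# ('data_store' is a hypothetical argument name)
controller = ModelController(model, data_store=my_ddict)
```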
Language Modeling
For language modeling, the main pipeline also contains two parts:
Text Data Controller for Language Model: TextDataLMController
Similar to TextDataController, TextDataLMController also provides a list of processing steps (except for label processing, upsampling, and text augmentation). The controller also allows tokenization either line-by-line or with token concatenation. Visit the tutorial here.
There's also a streamed version (TextDataLMControllerStreaming).
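A minimal sketch of the LM flow, assuming TextDataLMController mirrors the supervised from_csv pattern (the module path and the line_by_line flag name are assumptions based on the tokenization options described above):

```python
from transformers import RobertaTokenizer
from that_nlp_library.text_main_lm import TextDataLMController  # assumed module path

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc_lm = TextDataLMController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                       main_text='Review Text')
# line_by_line=True tokenizes each line separately; False would concatenate
# tokens into fixed-length chunks (flag name is an assumption)
tdc_lm.process_and_tokenize(tokenizer, line_by_line=True, max_length=100)
```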
Language Model Controller: ModelLMController
The library can train a masked language model (BERT, RoBERTa, …) or a causal language model (GPT), either from scratch or from an existing pretrained language model.
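For instance, with standard HuggingFace model classes (the controller wiring itself is omitted; see the library tutorials), the two starting points look like:

```python
from transformers import AutoModelForMaskedLM, RobertaConfig, RobertaForMaskedLM

# start from an existing pretrained masked LM ...
model = AutoModelForMaskedLM.from_pretrained('roberta-base')

# ... or train from scratch with a freshly initialized config
config = RobertaConfig(vocab_size=50265)
model = RobertaForMaskedLM(config)
```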
Hidden States Extraction
The library also allows you to extract the hidden states of your choice, for further analysis.
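As a generic illustration (plain HuggingFace, not necessarily this library's extraction API), per-layer hidden states can be pulled out like this:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMaskedLM.from_pretrained('roberta-base')

# hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape (batch, seq_len, hidden_size)
inputs = tokenizer("a sample review", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
last_hidden = outputs.hidden_states[-1]   # pick any layer you care about
```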
Documentation