Aims to be a convenient NLP library, built with help from HuggingFace

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Programming Language
- Python :: 3.9
- Python :: 3.10

Project description

Welcome to that-nlp-library

Install

pip install that_nlp_library

It is advised that you manually install torch (with a compatible CUDA version if you have a GPU). Typically it’s

pip3 install torch --index-url https://download.pytorch.org/whl/cu118

Visit the PyTorch page for more information.
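After installing, a quick check can confirm whether the installed torch build sees a GPU. This is a minimal sketch: the helper name `describe_torch_install` is hypothetical, while `torch.__version__` and `torch.cuda.is_available()` are standard PyTorch attributes. The check also works when torch is not installed at all.

```python
import importlib.util


def describe_torch_install():
    """Report whether torch is importable and whether it can see a CUDA device."""
    # find_spec returns None when the package is not installed.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False, "version": None, "cuda": False}
    import torch  # imported lazily so the check also works without torch

    return {
        "installed": True,
        "version": torch.__version__,
        "cuda": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_install())
```

If `cuda` is reported as `False` on a machine with a GPU, the installed wheel was most likely built for CPU or for a different CUDA version.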

High-Level Overview

Supervised Learning

For supervised learning, the main pipeline contains 2 parts:

Text Data Controller: `TextDataController` (for text processing)

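The controller offers a series of processing steps, applied in a fixed order; you can skip any of them. The full example further below uses metadata, label transformation, row filtering, text transformation, a validation split, upsampling, and text augmentation.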

Here is an example of the Text Controller for a classification task (predict Division Name), without any text preprocessing. The code will also tokenize your text field.

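from transformers import RobertaTokenizer

# Module path assumed; check the library documentation if this import fails
from that_nlp_library.text_main import TextDataController

tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  label_names='Division Name',
                                  sup_types='classification',
                                 )
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer,max_length=100,shuffle_trn=True)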

And here is an example when all processings are applied

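from functools import partial

import nlpaug.augmenter.char as nac
from transformers import RobertaTokenizer
from underthesea import text_normalize

# Module path assumed; check the library documentation if this import fails
from that_nlp_library.text_main import TextDataController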

# define the augmentation function
def nlp_aug(x,aug=None):
    results = aug.augment(x)
    if not isinstance(x,list): return results[0]
    return results
aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug,aug=aug)

# initialize the TextDataController
tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  
                                  # metadatas
                                  metadatas='Title',
                                  
                                  # label
                                  label_names='Division Name',
                                  sup_types='classification',
                                  label_tfm_dict={'Division Name': lambda x: x if x!='Initmates' else 'Intimates'},
                                  
                                  # row filter
                                  filter_dict={'Review Text': lambda x: x is not None,
                                               'Division Name': lambda x: x is not None,
                                              },
                                              
                                  # text transformation
                                  content_transformation=[text_normalize,str.lower],
                                  
                                  # validation split
                                  val_ratio=0.2,
                                  stratify_cols=['Division Name'],
                                  
                                  # upsampling
                                  upsampling_list=[('Division Name',lambda x: x=='Intimates')],
                                  
                                  # text augmentation
                                  content_augmentations=nearby_aug_func
                                 )

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tdc.process_and_tokenize(tokenizer,max_length=100,shuffle_trn=True)
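The `nlp_aug` wrapper in the example above assumes that the augmenter's `augment` method returns a list, and unwraps it when a single string was passed in. The sketch below illustrates that contract with a stand-in augmenter; `UpperAug` is hypothetical and not part of nlpaug.

```python
from functools import partial


class UpperAug:
    """Hypothetical stand-in for an nlpaug augmenter (not part of nlpaug)."""

    def augment(self, data):
        # Mimic the assumed contract: always return a list of augmented texts.
        texts = data if isinstance(data, list) else [data]
        return [t.upper() for t in texts]


def nlp_aug(x, aug=None):
    results = aug.augment(x)
    # Unwrap the single result when a single string was passed in.
    if not isinstance(x, list):
        return results[0]
    return results


# partial fixes the augmenter, leaving a one-argument function of the text.
aug_func = partial(nlp_aug, aug=UpperAug())

print(aug_func("nice dress"))            # NICE DRESS
print(aug_func(["nice dress", "soft"]))  # ['NICE DRESS', 'SOFT']
```

Binding the augmenter with `partial` yields a function that takes only the text, which is the shape passed as `content_augmentations` in the example.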

For an in-depth tutorial on the Text Controller for Supervised Learning (TextDataController), please see the documentation.
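As a rough mental model of the optional, ordered processing steps, the following plain-Python sketch applies a row filter, a label transformation, and content transformations to a few records. It is not the library's implementation: `run_pipeline` is a hypothetical helper, the parameter names are borrowed from the example for readability, and the order of steps shown here is an assumption.

```python
def run_pipeline(rows, filter_dict=None, label_tfm_dict=None, content_tfms=None,
                 main_text="Review Text"):
    """Apply optional steps in a fixed order; a step left as None is skipped."""
    # 1. Row filter: keep a row only if every predicate accepts its column value.
    if filter_dict:
        rows = [r for r in rows if all(f(r[c]) for c, f in filter_dict.items())]
    # 2. Label transformation: rewrite label values column by column.
    if label_tfm_dict:
        rows = [{**r, **{c: f(r[c]) for c, f in label_tfm_dict.items()}} for r in rows]
    # 3. Content transformations: apply each function to the main text, in order.
    for tfm in content_tfms or []:
        rows = [{**r, main_text: tfm(r[main_text])} for r in rows]
    return rows


rows = [
    {"Review Text": "Love This Top", "Division Name": "Initmates"},
    {"Review Text": None, "Division Name": "General"},
    {"Review Text": "Runs Small", "Division Name": "General"},
]

processed = run_pipeline(
    rows,
    filter_dict={"Review Text": lambda x: x is not None},
    label_tfm_dict={"Division Name": lambda x: "Intimates" if x == "Initmates" else x},
    content_tfms=[str.strip, str.lower],
)
print(processed)
```

Filtering before the text transformations matters here: `str.lower` would fail on the row whose review text is `None`.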

This library also provides a streamed version of the Text Controller (TextDataControllerStreaming), allowing you to work with data without having it entirely on your hard drive. You can still perform all the processing steps of the non-streamed version, except for the Train/Validation split (which means you have to define your validation set beforehand) and Upsampling.

For more details on streaming, and for a benchmark comparing the time and space efficiency of the streamed and non-streamed versions, see the documentation.
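The idea behind streamed processing can be sketched with plain Python generators. This is only an illustration of the concept, not the library's streaming code; `stream_rows` and `process_stream` are hypothetical names.

```python
from itertools import islice


def stream_rows(n):
    """Hypothetical data source that yields one record at a time."""
    for i in range(n):
        yield {"text": f"Review number {i}", "label": i % 2}


def process_stream(rows, transform):
    # Row-wise steps need only the current record, so they work on a stream.
    for row in rows:
        yield {**row, "text": transform(row["text"])}


# Nothing is materialized until records are requested.
stream = process_stream(stream_rows(1_000_000), str.lower)
first_two = list(islice(stream, 2))
print(first_two)
```

A stratified split or upsampling, by contrast, needs label counts over the whole dataset, which is plausibly why those two steps are not offered in streaming mode.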

Model and `ModelController`

The library supports the following tasks:

- Classification
- Regression
- Multilabel classification
- Multihead, where each head can be either classification or regression
  - A “multihead” model predicts multiple outputs at once. For example, given a sentence (e.g. a review on an e-commerce site), you predict which category the sentence is about, its sentiment, and possibly its rating.
  - This example is a 3-head problem: classification (for category), classification (for sentiment), and regression (for rating from 1 to 5).
- 2-head hierarchical classification: when the first and second outputs are hierarchically related (e.g. the first output is the level 1 clothing category, and the second output is the level 2 clothing subcategory), you can use one of two dedicated approaches: training with conditional probability, or deep hierarchical classification.
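To make the conditional-probability idea concrete, here is a minimal sketch of one common formulation, in which the joint probability of a (level 1, level 2) pair is P(level 1) multiplied by P(level 2 | level 1). This is an assumption for illustration and not necessarily how the library implements it; `joint_probs` and the logit values are made up.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_probs(l1_logits, l2_logits_per_parent):
    """Return P(level1=i, level2=j) = P(level1=i) * P(level2=j | level1=i)."""
    p_l1 = softmax(l1_logits)
    joint = {}
    for i, child_logits in enumerate(l2_logits_per_parent):
        # Each parent has its own distribution over its subcategories.
        p_l2_given_l1 = softmax(child_logits)
        for j, p in enumerate(p_l2_given_l1):
            joint[(i, j)] = p_l1[i] * p
    return joint


# Two level-1 categories; the first has 2 subcategories, the second has 3.
joint = joint_probs([2.0, 0.5], [[1.0, 0.0], [0.2, 0.1, 3.0]])
best = max(joint, key=joint.get)
print(best, round(sum(joint.values()), 6))
```

Because each subcategory distribution is scaled by its parent's probability, the joint values sum to 1 and a subcategory can never outscore its own parent.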

Decoupling of Text Controller and Model Controller

In this library, you can use TextDataController on its own to handle all the text processing and have the final processed HuggingFace DatasetDict returned to you. Conversely, if you already have your own processed DatasetDict, you can skip the text controller and use only the ModelController for training. The documentation includes a quick tutorial on this decoupling.

Language Modeling

For language modeling, the main pipeline also contains 2 parts:

Text Data Controller for Language Model: `TextDataLMController`

Similarly to TextDataController, TextDataLMController provides a list of processing steps (except for Label Processing, Upsampling and Text Augmentation). The controller also allows tokenization either line-by-line or by token concatenation. See the tutorial in the documentation.

There’s also a streamed version (TextDataLMControllerStreaming).
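The difference between the two tokenization modes can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `tokenize` is a hypothetical whitespace tokenizer standing in for a real subword tokenizer, and dropping the leftover tokens in the concatenation mode is one common convention.

```python
def tokenize(line):
    # Hypothetical whitespace tokenizer standing in for a real subword tokenizer.
    return line.split()


def line_by_line(lines, max_length):
    """One sequence per line, truncated to max_length tokens."""
    return [tokenize(line)[:max_length] for line in lines]


def concat_and_chunk(lines, block_size):
    """Concatenate all tokens, then cut into equal blocks (remainder dropped)."""
    tokens = [tok for line in lines for tok in tokenize(line)]
    usable = len(tokens) // block_size * block_size
    return [tokens[i:i + block_size] for i in range(0, usable, block_size)]


lines = ["the dress fits well", "too small", "love the soft fabric a lot"]
print(line_by_line(lines, max_length=4))
print(concat_and_chunk(lines, block_size=4))
```

Line-by-line keeps sentence boundaries but yields sequences of uneven length (which then need padding), while concatenation produces full blocks at the cost of mixing lines within a block.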

Language Model Controller: `ModelLMController`

The library can train a masked language model (BERT, RoBERTa …) or a causal language model (GPT), either from scratch or from an existing pretrained language model.
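The two objectives differ in how inputs and labels are built from the same tokens. The sketch below is a simplified illustration with hypothetical helper names; in particular, real BERT-style masking also sometimes keeps or randomly replaces the selected tokens, which is omitted here.

```python
import random

MASK = "[MASK]"
IGNORE = None  # placeholder for positions that do not contribute to the loss


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide some tokens; the model predicts only the hidden ones."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


def clm_example(tokens):
    """Causal LM: each position predicts the next token."""
    return tokens[:-1], tokens[1:]


tokens = "the dress fits really well".split()
print(mlm_example(tokens, mask_prob=0.5))
print(clm_example(tokens))
```

In the masked case the model sees context on both sides of each hidden token; in the causal case it only sees what came before.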

Hidden States Extraction

The library also allows you to extract the hidden states of your choice for further analysis.

Documentation

Visit https://anhquan0412.github.io/that-nlp-library/

Release history

- 0.2.2: May 3, 2024
- 0.2.1: Jan 2, 2024 (this version)
- 0.2.0: Jan 1, 2024
- 0.1.9: Dec 30, 2023
- 0.1.8: Nov 15, 2023
- 0.1.7: Oct 27, 2023
- 0.1.6: Oct 10, 2023
- 0.1.5: Oct 5, 2023
- 0.1.4: Sep 25, 2023
- 0.1.3: Sep 25, 2023
- 0.1.2: Sep 24, 2023
- 0.1.1: Sep 19, 2023
- 0.1.0: Sep 10, 2023
- 0.0.2: Jul 9, 2023
- 0.0.1: May 15, 2023

Download files

Source Distribution

that-nlp-library-0.2.1.tar.gz (47.1 kB), uploaded Jan 2, 2024

Built Distribution

that_nlp_library-0.2.1-py3-none-any.whl (60.2 kB), uploaded Jan 2, 2024 (Python 3)

Hashes for that-nlp-library-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`cf4211bee2618e1ecf8e5bbbd274f7ad04bbf47f93ebf80943f77d84572be54a`
MD5	`4e4cd8781cb02f4186a2164bcd5bee30`
BLAKE2b-256	`2ceebb10897d1d0c0a4a7e97b1eff0d65836f6d8a4dbd31cb6b059a45afb7043`

Hashes for that_nlp_library-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c8e0541650d887dfcc5b268bb798c5890812d7c1d893a0bfaa3ada3ee1745099`
MD5	`49c3deace242bdbff47eed3b7dbe0d55`
BLAKE2b-256	`0b80f763bbef69486f26eb026bcac7373465af2976a6a6de3212d9ac5f67ee71`
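A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: the helper names are hypothetical, and the local file path is an assumption about where the wheel was saved.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    # Normalize whitespace and case before comparing hex digests.
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    wheel = "that_nlp_library-0.2.1-py3-none-any.whl"  # assumed local path
    expected = "c8e0541650d887dfcc5b268bb798c5890812d7c1d893a0bfaa3ada3ee1745099"
    if os.path.exists(wheel):
        print(matches_published(wheel, expected))
```

Reading in chunks keeps memory use constant regardless of file size.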
