A package for creating and training language models for text classification based on BERT. The package includes pre-trained models and a feature for testing the trained models.

Development Status
- 1 - Planning
Intended Audience
- Developers
Operating System
Programming Language
- Python :: 3

Project description

Introduction

This package lets you create and train BERT-based language models for text classification, with preset parameters suited to training on smaller datasets. It also ships six pre-trained models, each covering 27 categories, for experimentation, along with a function for testing language models.

The models were trained in these languages, each with its corresponding validation accuracy:

English | 92%
French | 88%
German | 92%
Italian | 89%
Portuguese | 92%
Spanish | 93%

Each model comprises 27 categories.

Where to get it

pip install categorium

Functions

The package defines the following functions:

select_csv_file(language): This function loads the category names and order from CSV files.
select_language_model(language): This function loads the pre-trained language models.
select_token(language): This function loads the tokenizer to be used.
test_models(model_trained, tokenize, csv_cat, text): This function tests the trained models.
train_main(): This function tokenizes the text data using the BERT tokenizer and trains the model.

Usage

Guide to use the functions

To use a pre-trained model, import select_language_model() and pass the language as its argument. For instance, select_language_model('english') loads the English language model. If no trained model exists for the language you pass, a message is displayed saying so. The same applies to select_csv_file() and select_token().
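The lookup-with-message behaviour described above can be pictured with a small stand-alone sketch. It does not call categorium: the set of languages is taken from the list in this description, while `resolve_language`, the lowercasing step, and the wording of the message are hypothetical choices made only for illustration, not the package's own implementation.

```python
# Hypothetical illustration only; this is NOT the categorium implementation.
# The supported languages are the six listed in this description.
SUPPORTED = {"english", "french", "german", "italian", "portuguese", "spanish"}


def resolve_language(language):
    """Return the language key if supported, otherwise print a message and return None."""
    # Lowercasing is a choice of this sketch, not documented package behaviour.
    key = language.strip().lower()
    if key not in SUPPORTED:
        print(f"There is no trained model for the language: {language}")
        return None
    return key


print(resolve_language("english"))
print(resolve_language("klingon"))
```

A caller can treat the None result as a signal to stop before trying to load a model or tokenizer.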

Guide to use the test models function

To test a model, import test_models() and call it with the parameters (model_trained, tokenize, csv_cat, text). model_trained is the trained model. tokenize is the tokenizer created for it. csv_cat is the file with the category names and their order. text is the text to be categorised.

Example of using a pre-trained model

import pandas as pd
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

from categorium import select_language_model, select_csv_file, select_token

# Load the label names (the CSV holds the categories and their order)
df = pd.read_csv(select_csv_file('english'), index_col=False)

# Load the pre-trained model
model = TFBertForSequenceClassification.from_pretrained(select_language_model('english'))

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(select_token('english'))

# Text to classify
text = "Insert example text"

# Tokenize the text and get the model's prediction (logits)
inputs = tokenizer(text, return_tensors='tf')
outputs = model(inputs)[0]

# Get the predicted category index
predicted_index = tf.argmax(outputs, axis=1).numpy()[0]

# Get the predicted category label
predicted_label = df['cat'].unique()[predicted_index]

# Print the predicted category label
print(predicted_label)
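The last two steps of the example, taking the index of the highest score and looking it up in the ordered category list, can be sketched without TensorFlow. The category names and scores below are made up for illustration, and `pick_label` is a hypothetical helper, not part of categorium.

```python
# Made-up category order and scores, for illustration only.
categories = ["sports", "politics", "technology"]
logits = [0.3, 2.1, -0.7]


def pick_label(scores, labels):
    """Return the label at the position of the highest score."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Index of the maximum score, equivalent to an argmax over one row.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]


print(pick_label(logits, categories))  # prints "politics"
```

The index is only meaningful if the labels are in the same order the model was trained with, which is why the category order is loaded from the CSV file rather than typed by hand.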

Example of using the package function to test the models

import pandas as pd
from transformers import BertTokenizer, TFBertForSequenceClassification

from categorium import select_language_model, select_csv_file, select_token, test_models

# Load the label names
df = pd.read_csv(select_csv_file('english'), index_col=False)

# Load the pre-trained model
model = TFBertForSequenceClassification.from_pretrained(select_language_model('english'))

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(select_token('english'))

# Text to classify
text = "Insert example text"

# Run the package's test function on the text
test_models(model, tokenizer, df, text)

The following dependencies must be installed to use the training feature:

TensorFlow
Transformers
Pandas
NumPy
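Before starting a training run, it can help to confirm that these packages are importable. The sketch below uses only the standard library; the import names (`tensorflow`, `transformers`, `pandas`, `numpy`) are the usual ones for the listed dependencies, and `missing_modules` is a hypothetical helper, not part of categorium.

```python
import importlib.util

# Usual import names for the dependencies listed above.
REQUIRED = ["tensorflow", "transformers", "pandas", "numpy"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing dependencies:", ", ".join(missing))
else:
    print("All training dependencies are available.")
```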

Guide to use the training feature

To use the training function, create a folder named 'category' in the directory of the script that calls train_main(). In this folder, place xlsx files that contain the text in the first column and the corresponding category in the second column. Then run the script to start training. Training generates a folder named training_files in the same directory as the train_main() function, containing all the final files for the trained model.
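The folder layout described above can be checked before training starts. This is a minimal sketch using only the standard library; `find_training_files` is a hypothetical helper and is not provided by categorium. The demo runs against a throwaway directory so that it does not depend on any real data.

```python
import tempfile
from pathlib import Path


def find_training_files(script_dir):
    """Return the .xlsx files inside the 'category' folder next to the script."""
    category_dir = Path(script_dir) / "category"
    if not category_dir.is_dir():
        raise FileNotFoundError(f"Expected a folder at {category_dir}")
    return sorted(category_dir.glob("*.xlsx"))


# Demo on a temporary directory that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "category").mkdir()
    (Path(tmp) / "category" / "news.xlsx").touch()
    (Path(tmp) / "category" / "notes.txt").touch()  # ignored: not an xlsx file
    print([p.name for p in find_training_files(tmp)])
```

In a real script, the directory passed in would be the one containing the script that calls train_main().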

Release history

0.2.2 | Nov 30, 2023
0.2.1 | Nov 29, 2023
0.2.0 | Aug 22, 2023
0.1.0 | Jul 3, 2023
0.0.4 | Jun 28, 2023
0.0.3 | Jun 28, 2023

Download files

Source Distribution

categorium-0.2.2.tar.gz (23.3 MB)

Uploaded Nov 30, 2023

Hashes for categorium-0.2.2.tar.gz

Algorithm | Hash digest
SHA256 | `e730884731f699eba65ec82a640455d305f41650431f2796be31027e502c5772`
MD5 | `1babbae81e1ac9c1a9eda9eba729fb94`
BLAKE2b-256 | `ea8e891b6b74ad98faa36bebbf4e2e36281083782ea31b4cea01a8f8718de11c`
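A downloaded archive can be compared against the SHA256 digest listed above. The sketch below uses only the standard library; `sha256_of` is a hypothetical helper, and the local file name is an assumption about where the archive was saved.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest as listed for categorium-0.2.2.tar.gz.
EXPECTED = "e730884731f699eba65ec82a640455d305f41650431f2796be31027e502c5772"
# Assumed local path of the downloaded archive.
ARCHIVE = "categorium-0.2.2.tar.gz"

if os.path.exists(ARCHIVE):
    print("match" if sha256_of(ARCHIVE) == EXPECTED else "MISMATCH")
else:
    print(f"{ARCHIVE} not found; download it first.")
```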