Hub for Portuguese NLP resources
Project description
PT Pump Up Client
Use Cases
Train Semantic Role Labeller
from pt_pump_up.benchmarking import TrainerFactory
from pt_pump_up.benchmarking.training_strategies.SemanticRoleLabellingStrategy import SemanticRoleLabellingStrategy
from datasets import load_dataset

propbank_br = load_dataset("liaad/Propbank-BR", "flatten")

for model_name in ['neuralmind/bert-base-portuguese-cased',
                   'neuralmind/bert-large-portuguese-cased',
                   'PORTULAN/albertina-100m-portuguese-ptpt-encoder',
                   'PORTULAN/albertina-900m-portuguese-ptpt-encoder']:
    repository_name = f"SRL-{model_name.split('/')[1]}"

    trainer = TrainerFactory.create(
        nlp_task="SRL",
        repository_name=repository_name,
        model_name=model_name,
        label_names=propbank_br['train'].features['frames'].feature.names,
        max_epochs=30,
        lr=1e-5,
        train_dataset=propbank_br['train'],
        eval_dataset=propbank_br['test'],
    )

    trainer.train()

    SemanticRoleLabellingStrategy.create_pipeline(
        hf_repo=repository_name,
        model=trainer.model,
        tokenizer=trainer.tokenizer,
    )
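Once create_pipeline has pushed a model to the Hub, it can be loaded back for inference. The snippet below is a minimal sketch, assuming the published repository behaves as a standard token-classification checkpoint; the repository name used here is the example produced by the authors and linked in the sentiment-analysis section further down.

from transformers import pipeline

# Minimal inference sketch (assumption: the SRL model was pushed to the Hub
# as a standard token-classification checkpoint).
srl = pipeline("token-classification",
               model="liaad/propbank_br_srl_bert_base_portuguese_cased",
               aggregation_strategy="simple")

print(srl("O menino comeu o bolo."))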
Train Sentiment Analyser
from datasets import load_dataset

# Using the PT-Pump-Up library is not mandatory, but it will make your life easier:
# it reuses code previously developed for similar NLP tasks that is already tested and validated.
from pt_pump_up.benchmarking import TrainerFactory

# Load a dataset from huggingface/datasets.
# stanfordnlp/imdb is an English sentiment-analysis dataset, and it is as simple
# as a dataset can be: it has only two columns, text and label.
imdb = load_dataset("stanfordnlp/imdb")

# Four Transformer models can be adapted for sentiment analysis in Portuguese:
# - neuralmind/bertimbau, base (110M) and large (335M) versions (larger, more computationally expensive architecture)
# - PORTULAN/albertina, 100m and 900m versions, available in PT-PT and PT-BR
for model_name in ['neuralmind/bert-base-portuguese-cased',
                   'neuralmind/bert-large-portuguese-cased',
                   'PORTULAN/albertina-100m-portuguese-ptpt-encoder',
                   'PORTULAN/albertina-900m-portuguese-ptpt-encoder']:
    # Specify the repository name for each model to be trained.
    # The trained model will be available on the Hugging Face Hub under that name.
    # Ex: f"dataset-SRL-{model_name.split('/')[1]}" produced https://huggingface.co/liaad/propbank_br_srl_bert_base_portuguese_cased
    repository_name = "<<REPOSITORY_NAME>>"

    trainer = TrainerFactory.create(
        # Sentiment analysis is a text classification task.
        nlp_task="Text Classification",
        repository_name=repository_name,
        model_name=model_name,
        # label_names is a list of strings with the possible labels in the dataset.
        # If the dataset is correctly loaded, you can read the label names from
        # dataset['train'].features[<<LABEL_COLUMN_NAME>>].names for a plain label
        # column (use .feature.names when the column is a sequence of labels, as in SRL).
        # In this case, the label column name is 'label'.
        # If the dataset is not properly loaded, pass a list of strings with the
        # possible labels instead, e.g. ['Positive', 'Negative'].
        label_names=imdb['train'].features['label'].names,
        max_epochs=30,
        lr=1e-5,
        train_dataset=imdb['train'],
        eval_dataset=imdb['test'],
    )

    trainer.train()
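After training, the fitted model can be wrapped in a standard transformers pipeline for a quick sanity check. This is a minimal sketch, assuming trainer.model and trainer.tokenizer are the usual transformers objects, as the SRL example above suggests:

from transformers import pipeline

# Minimal sketch: wrap the trained model in a text-classification pipeline.
# Assumption: trainer.model / trainer.tokenizer are standard transformers objects.
classifier = pipeline("text-classification",
                      model=trainer.model,
                      tokenizer=trainer.tokenizer)

print(classifier("Este filme é fantástico!"))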
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
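The usual route is to install from PyPI rather than download the archives directly; assuming the distribution name matches the files below, that is:

pip install pt-pump-up==0.0.11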
Source Distribution
pt_pump_up-0.0.11.tar.gz (13.4 kB)
Built Distribution
pt_pump_up-0.0.11-py3-none-any.whl (17.7 kB)
File details
Details for the file pt_pump_up-0.0.11.tar.gz.
File metadata
- Download URL: pt_pump_up-0.0.11.tar.gz
- Upload date:
- Size: 13.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8bab371565b7774732c489947fd3bbfc0ecfdee61d0899cb999ec6fafc35ebd1
MD5 | 933a7017d8cf720cdb0ffbfef8928c08
BLAKE2b-256 | eb58e3414ea41a97b9a897c1f3bc2217bd9644b0647118aa1beb5beb55bf8f4d
File details
Details for the file pt_pump_up-0.0.11-py3-none-any.whl.
File metadata
- Download URL: pt_pump_up-0.0.11-py3-none-any.whl
- Upload date:
- Size: 17.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | d406fcc8906413d3905378da2d23717ab737de0d7630cc42de5bb07bfc43682b
MD5 | 4c852fd291788c9fbe2f8875c377349b
BLAKE2b-256 | 6f313285436b9333aac4fce1596c4b4daa5c05f2fd0e449bacde01c916f50369
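The digests above can be used to check that a downloaded archive was not corrupted or tampered with. A minimal sketch using Python's standard hashlib (the local file path is hypothetical):

import hashlib

# Compare a downloaded archive's SHA256 against the published digest.
# The expected value is the SHA256 for pt_pump_up-0.0.11.tar.gz from the table above.
expected = "8bab371565b7774732c489947fd3bbfc0ecfdee61d0899cb999ec6fafc35ebd1"
with open("pt_pump_up-0.0.11.tar.gz", "rb") as fh:
    digest = hashlib.sha256(fh.read()).hexdigest()
assert digest == expected, "hash mismatch: do not install this file"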