Hub for Portuguese NLP resources

Project description

PT Pump Up Client

Use Cases

Train Semantic Role Labeller

from pt_pump_up.benchmarking import TrainerFactory
from pt_pump_up.benchmarking.training_strategies.SemanticRoleLabellingStrategy import SemanticRoleLabellingStrategy
from datasets import load_dataset

propbank_br = load_dataset("liaad/Propbank-BR", "flatten")

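# Note the comma after every entry: without it, Python silently joins adjacent string literals into one name.
for model_name in [
    'neuralmind/bert-base-portuguese-cased',
    'neuralmind/bert-large-portuguese-cased',
    'PORTULAN/albertina-100m-portuguese-ptpt-encoder',
    'PORTULAN/albertina-900m-portuguese-ptpt-encoder',
]: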

    repository_name = f"SRL-{model_name.split('/')[1]}"

    trainer = TrainerFactory.create(
        nlp_task="SRL",
        repository_name=repository_name,
        model_name=model_name,
        label_names=propbank_br['train'].features['frames'].feature.names,
        max_epochs=30,
        lr=1e-5,
        train_dataset=propbank_br['train'],
        eval_dataset=propbank_br['test'],
    )

    trainer.train()

    SemanticRoleLabellingStrategy.create_pipeline(
        hf_repo=repository_name,
        model=trainer.model,
        tokenizer=trainer.tokenizer,
    )
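
The repository name in the loop above is derived from the model identifier by dropping the organisation prefix. A minimal standalone sketch of that step follows. `make_repository_name` is a hypothetical helper, not part of pt_pump_up, and unlike the `split('/')[1]` expression above it also accepts identifiers without a slash.

```python
def make_repository_name(model_name: str, prefix: str = "SRL") -> str:
    """Build a Hub repository name such as 'SRL-bert-base-portuguese-cased'.

    Hypothetical helper for illustration; pt_pump_up does not provide it.
    """
    # Keep only the part after the last "/", so names with or without
    # an organisation prefix are both handled.
    short_name = model_name.rsplit("/", 1)[-1]
    return f"{prefix}-{short_name}"


model_names = [
    "neuralmind/bert-base-portuguese-cased",
    "neuralmind/bert-large-portuguese-cased",
    "PORTULAN/albertina-100m-portuguese-ptpt-encoder",
    "PORTULAN/albertina-900m-portuguese-ptpt-encoder",
]

repository_names = [make_repository_name(name) for name in model_names]
```

Building the names up front makes it easy to check that the list really contains four separate models before any training starts.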

Train Sentiment Analyser

from datasets import load_dataset

# Using the PT-Pump-Up library is optional, but it will make your life easier:
# it reuses code that was already developed, tested, and validated for similar NLP tasks.
from pt_pump_up.benchmarking import TrainerFactory


# Load dataset from huggingface/datasets.
# StanfordNLP/IMDB is a dataset for sentiment analysis in English.
# It is as simple as it can be. It has only two columns: text and label.
imdb = load_dataset("stanfordnlp/imdb")

# Four transformer models can be adapted for sentiment analysis in Portuguese:
# - neuralmind/bertimbau, base (110M) and large (335M; bigger, more computationally expensive architecture)
# - PORTULAN/albertina, 100m and 900m versions. Available in PT-PT and PT-BR
# Note the comma after every entry: without it, Python silently joins adjacent string literals into one name.
for model_name in [
    'neuralmind/bert-base-portuguese-cased',
    'neuralmind/bert-large-portuguese-cased',
    'PORTULAN/albertina-100m-portuguese-ptpt-encoder',
    'PORTULAN/albertina-900m-portuguese-ptpt-encoder',
]:
    # You should specify the repository name for each model to be trained.
    # It will be available in the Hugging Face Hub under that name.
    # Ex: f"dataset-SRL-{model_name.split('/')[1]}" produced https://huggingface.co/liaad/propbank_br_srl_bert_base_portuguese_cased
    repository_name = "<<REPOSITORY_NAME>>"

    trainer = TrainerFactory.create(
        # Sentiment Analysis is a Text Classification task.
        nlp_task="Text Classification",
        repository_name=repository_name,
        model_name=model_name,
        # label_names is a list of strings with the possible labels in the dataset.
        # If the dataset is loaded correctly, you can read the label names from dataset['train'].features[<<LABEL_COLUMN_NAME>>].names
        # (or .feature.names when the column holds a sequence of labels, as 'frames' does in the SRL example).
        # In this case, the label column name is 'label'.
        # If it is not properly loaded, pass the list of labels yourself. Ex: ['Positive', 'Negative'], assuming those are the labels.
        label_names=imdb['train'].features['label'].names,
        max_epochs=30,
        lr=1e-5,
        train_dataset=imdb['train'],
        eval_dataset=imdb['test'],
    )

    trainer.train()
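
When label names have to be supplied by hand, as the comments above describe, the order of the list matters, because each label's position usually serves as its integer id. The sketch below illustrates that convention. It assumes, without confirmation from the pt_pump_up documentation, that `label_names` is indexed the same way as the dataset's integer labels. `build_label_maps` is a hypothetical helper, and the label order shown is only an example.

```python
from typing import Dict, List, Tuple


def build_label_maps(label_names: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Return (id2label, label2id) for an ordered list of label names.

    Hypothetical helper for illustration; not part of pt_pump_up.
    """
    if len(set(label_names)) != len(label_names):
        raise ValueError("label names must be unique")
    # The position of each name in the list becomes its integer id.
    id2label = dict(enumerate(label_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


# Assumption: in your dataset, 0 encodes the negative class and 1 the positive one.
label_names = ["Negative", "Positive"]
id2label, label2id = build_label_maps(label_names)
```

If the hand-written order does not match the integer encoding in the dataset, predictions would be reported under the wrong label, so it is worth checking a few rows before training.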

Project details

Release history

0.0.11 (this version): May 4, 2024
0.0.10: May 4, 2024
0.0.9: Mar 20, 2024
0.0.8: Mar 20, 2024
0.0.7: Mar 20, 2024
0.0.6: Mar 13, 2024
0.0.5: Mar 13, 2024
0.0.4: Mar 13, 2024
0.0.3: Mar 10, 2024
0.0.2: Jan 9, 2024
0.0.1: Dec 24, 2023

Download files

Source Distribution

pt_pump_up-0.0.11.tar.gz (13.4 kB)

Uploaded May 4, 2024 Source

Built Distribution

pt_pump_up-0.0.11-py3-none-any.whl (17.7 kB)

Uploaded May 4, 2024 Python 3

Hashes for pt_pump_up-0.0.11.tar.gz
Algorithm	Hash digest
SHA256	`8bab371565b7774732c489947fd3bbfc0ecfdee61d0899cb999ec6fafc35ebd1`
MD5	`933a7017d8cf720cdb0ffbfef8928c08`
BLAKE2b-256	`eb58e3414ea41a97b9a897c1f3bc2217bd9644b0647118aa1beb5beb55bf8f4d`

Hashes for pt_pump_up-0.0.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d406fcc8906413d3905378da2d23717ab737de0d7630cc42de5bb07bfc43682b`
MD5	`4c852fd291788c9fbe2f8875c377349b`
BLAKE2b-256	`6f313285436b9333aac4fce1596c4b4daa5c05f2fd0e449bacde01c916f50369`
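
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the Python standard library. `file_digest` and `matches_published_digest` are hypothetical helper names, and the file path is assumed to point at a local copy of the archive.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_digest(path, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path of the downloaded source distribution.
    archive = Path("pt_pump_up-0.0.11.tar.gz")
    published_sha256 = "8bab371565b7774732c489947fd3bbfc0ecfdee61d0899cb999ec6fafc35ebd1"
    if archive.exists():
        print("SHA256 matches:", matches_published_digest(archive, published_sha256))
```

SHA256 is used here because it is the strongest of the listed algorithms that `hashlib.new` exposes under the same name. The MD5 value can be checked the same way by passing `algorithm="md5"`.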