Hub for Portuguese NLP resources

Project description

PT Pump Up Client

Use Cases

Train Semantic Role Labeller

from pt_pump_up.benchmarking import TrainerFactory
from pt_pump_up.benchmarking.training_strategies.SemanticRoleLabellingStrategy import SemanticRoleLabellingStrategy
from datasets import load_dataset

propbank_br = load_dataset("liaad/Propbank-BR", "flatten")

# Each checkpoint needs its own comma-separated entry in the list.
for model_name in ['neuralmind/bert-base-portuguese-cased', 'neuralmind/bert-large-portuguese-cased', 'PORTULAN/albertina-100m-portuguese-ptpt-encoder', 'PORTULAN/albertina-900m-portuguese-ptpt-encoder']:

    repository_name = f"SRL-{model_name.split('/')[1]}"

    trainer = TrainerFactory.create(
        nlp_task="SRL",
        repository_name=repository_name,
        model_name=model_name,
        label_names=propbank_br['train'].features['frames'].feature.names,
        max_epochs=30,
        lr=1e-5,
        train_dataset=propbank_br['train'],
        eval_dataset=propbank_br['test'],
    )

    trainer.train()

    SemanticRoleLabellingStrategy.create_pipeline(
        hf_repo=repository_name,
        model=trainer.model,
        tokenizer=trainer.tokenizer,
    )
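
The loop above derives each Hub repository name from the checkpoint identifier. A minimal sketch of that naming step in isolation follows. The helper `repository_name_for` is a hypothetical name introduced here for illustration and is not part of pt_pump_up. The last lines show why every list entry needs its own comma: Python merges adjacent string literals.

```python
MODEL_NAMES = [
    "neuralmind/bert-base-portuguese-cased",
    "neuralmind/bert-large-portuguese-cased",
    "PORTULAN/albertina-100m-portuguese-ptpt-encoder",
    "PORTULAN/albertina-900m-portuguese-ptpt-encoder",
]


def repository_name_for(model_name: str, prefix: str = "SRL") -> str:
    """Hypothetical helper: build '<prefix>-<checkpoint>' from 'org/checkpoint'."""
    # Keep only the part after the organisation, as the loop above does.
    checkpoint = model_name.split("/")[1]
    return f"{prefix}-{checkpoint}"


repository_names = [repository_name_for(name) for name in MODEL_NAMES]

# Adjacent string literals without a comma are merged into a single string,
# so a missing comma turns two checkpoints into one invalid name.
merged = ["a/b" "c/d"]
print(repository_names)
print(merged)
```

Building the list one entry per line, as here, makes a missing comma easier to spot in review.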

Train Sentiment Analyser

from datasets import load_dataset

# Using the PT-Pump-Up library is not mandatory, but it simplifies the setup:
# it reuses code that was already developed, tested, and validated for similar NLP tasks.
from pt_pump_up.benchmarking import TrainerFactory


# Load the dataset from the Hugging Face Hub.
# stanfordnlp/imdb is an English dataset for sentiment analysis.
# Its structure is minimal: only two columns, text and label.
imdb = load_dataset("stanfordnlp/imdb")

# Four transformer models can be adapted for sentiment analysis in Portuguese:
# - neuralmind/bertimbau, base (110M) and large (335M) versions (bigger, more computationally expensive architecture)
# - PORTULAN/albertina, 100m and 900m versions, available in PT-PT and PT-BR
for model_name in ['neuralmind/bert-base-portuguese-cased', 'neuralmind/bert-large-portuguese-cased', 'PORTULAN/albertina-100m-portuguese-ptpt-encoder', 'PORTULAN/albertina-900m-portuguese-ptpt-encoder']:
    # You should specify the repository name for each model to be trained.
    # It will be available in the Hugging Face Hub under that name.
    # Ex: f"dataset-SRL-{model_name.split('/')[1]}" produced https://huggingface.co/liaad/propbank_br_srl_bert_base_portuguese_cased
    repository_name = "<<REPOSITORY_NAME>>"

    trainer = TrainerFactory.create(
        # Sentiment Analysis is a Text Classification task.
        nlp_task="Text Classification",
        repository_name=repository_name,
        model_name=model_name,
        # label_names is a list of strings with the possible labels in the dataset.
        # If the dataset is correctly loaded, you can access the label names with dataset['train'].features[<<LABEL_COLUMN_NAME>>].names
        # (or .feature.names when the column holds a sequence of labels, as in the SRL example).
        # In this case, the label column name is 'label'.
        # If it is not properly loaded, pass the list of labels explicitly, e.g. ['Positive', 'Negative'] if those are the dataset's labels.
        label_names=imdb['train'].features['label'].names,
        max_epochs=30,
        lr=1e-5,
        train_dataset=imdb['train'],
        eval_dataset=imdb['test'],
    )

    trainer.train()
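
The comments above mention passing an explicit list of label names when the dataset does not expose them. As a rough sketch of what such a list usually feeds into, the snippet below builds the `id2label` and `label2id` mappings that Hugging Face classification models conventionally carry in their config. Whether `TrainerFactory` builds these mappings internally is an assumption, `build_label_maps` is a hypothetical helper, and the label strings are placeholders.

```python
from typing import Dict, List, Tuple


def build_label_maps(label_names: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Hypothetical helper: map integer class ids to label names and back."""
    if len(set(label_names)) != len(label_names):
        raise ValueError("label names must be unique")
    # The position in the list is the integer id used in the dataset.
    id2label = dict(enumerate(label_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


# Placeholder labels; the order must match the dataset's integer encoding.
label_names = ["Negative", "Positive"]
id2label, label2id = build_label_maps(label_names)
print(id2label)
print(label2id)
```

The list order matters: if the dataset encodes the negative class as 0, the negative label has to come first, otherwise predictions are reported under the wrong name.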

Download files

Source Distribution

pt_pump_up-0.0.10.tar.gz (13.4 kB)

Uploaded Source

Built Distribution

pt_pump_up-0.0.10-py3-none-any.whl (17.7 kB)

Uploaded Python 3

File details

Details for the file pt_pump_up-0.0.10.tar.gz.

File metadata

  • Download URL: pt_pump_up-0.0.10.tar.gz
  • Upload date:
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.10

File hashes

Hashes for pt_pump_up-0.0.10.tar.gz
Algorithm Hash digest
SHA256 70e88dfd02db42ab0e712b146441c4541085cb673f362fce47a9e86ace20626b
MD5 869c7e5ccbbc042109dc4d36a391cad1
BLAKE2b-256 1df4caf0c725b1dccf728a3d2e0fa706ee89d687a53667a6726c69ae442b4763
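
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the Python standard library. The helper name `sha256_of_file` and the assumption that the archive sits in the current working directory are illustrative and not part of the package.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for pt_pump_up-0.0.10.tar.gz.
EXPECTED_SHA256 = "70e88dfd02db42ab0e712b146441c4541085cb673f362fce47a9e86ace20626b"


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed location of the downloaded archive (hypothetical path).
archive = Path("pt_pump_up-0.0.10.tar.gz")
if archive.exists():
    if sha256_of_file(archive) == EXPECTED_SHA256:
        print("SHA256 digest matches")
    else:
        print("SHA256 digest mismatch")
```

pip can perform the same check at install time through its hash-checking mode, which takes the expected digests from a requirements file.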

File details

Details for the file pt_pump_up-0.0.10-py3-none-any.whl.

File metadata

  • Download URL: pt_pump_up-0.0.10-py3-none-any.whl
  • Upload date:
  • Size: 17.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.10

File hashes

Hashes for pt_pump_up-0.0.10-py3-none-any.whl
Algorithm Hash digest
SHA256 9bd62fd16fd6725b1154043dbd152a8195231edd27952a65624cae4c5b2c62b6
MD5 4d013e19995d4a5707ed5b9ed1e2ab1f
BLAKE2b-256 c5dd5b5ffc38910e573a06014389b1493f64334151ddac7e17d1c96fd6acd7a0
