Hub for Portuguese NLP resources

Project description

PT Pump Up Client

Use Cases

Train Semantic Role Labeller

from pt_pump_up.benchmarking import TrainerFactory
from pt_pump_up.benchmarking.training_strategies.SemanticRoleLabellingStrategy import SemanticRoleLabellingStrategy
from datasets import load_dataset

# Load the PropBank-BR SRL corpus (flattened frames) from the Hugging Face Hub.
propbank_br = load_dataset("liaad/Propbank-BR", "flatten")

for model_name in ['neuralmind/bert-base-portuguese-cased',
                   'neuralmind/bert-large-portuguese-cased',
                   'PORTULAN/albertina-100m-portuguese-ptpt-encoder',
                   'PORTULAN/albertina-900m-portuguese-ptpt-encoder']:

    # Each fine-tuned model is published to the Hugging Face Hub under this name.
    repository_name = f"SRL-{model_name.split('/')[1]}"

    trainer = TrainerFactory.create(
        nlp_task="SRL",
        repository_name=repository_name,
        model_name=model_name,
        label_names=propbank_br['train'].features['frames'].feature.names,
        max_epochs=30,
        lr=1e-5,
        train_dataset=propbank_br['train'],
        eval_dataset=propbank_br['test'],
    )

    trainer.train()

    # Bundle the fine-tuned model and tokenizer into an inference pipeline
    # stored under the chosen Hub repository.
    SemanticRoleLabellingStrategy.create_pipeline(
        hf_repo=repository_name,
        model=trainer.model,
        tokenizer=trainer.tokenizer,
    )
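
Once a run finishes, the published checkpoint can be queried like any other Hub model. The sketch below is a minimal, hypothetical usage example: it assumes the SRL checkpoint was pushed as a standard token-classification model, and uses liaad/propbank_br_srl_bert_base_portuguese_cased (the example repository cited in the next section) as the model name.

from transformers import pipeline

# Hypothetical inference sketch: load the published SRL checkpoint as a
# token-classification pipeline (assumes it was pushed in that format).
srl = pipeline(
    "token-classification",
    model="liaad/propbank_br_srl_bert_base_portuguese_cased",
    aggregation_strategy="simple",
)

# Each returned entry carries a predicted frame label and the matching span of text.
print(srl("O menino comeu a maçã."))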

Train Sentiment Analyser

from datasets import load_dataset

# Using the PT-Pump-Up library is not mandatory, but it will make your life easier:
# it reuses code developed for similar NLP tasks that has already been tested and validated.
from pt_pump_up.benchmarking import TrainerFactory


# Load a dataset from huggingface/datasets.
# stanfordnlp/imdb is an English sentiment-analysis dataset and is about as
# simple as it gets: it has only two columns, text and label.
imdb = load_dataset("stanfordnlp/imdb")

# There are 4 transformer models that can be adapted for sentiment analysis in Portuguese:
# - neuralmind/bertimbau, base (110M) and large (335M) variants (a bigger, more computationally expensive architecture)
# - PORTULAN/albertina, 100m and 900m variants, available for both PT-PT and PT-BR
for model_name in ['neuralmind/bert-base-portuguese-cased',
                   'neuralmind/bert-large-portuguese-cased',
                   'PORTULAN/albertina-100m-portuguese-ptpt-encoder',
                   'PORTULAN/albertina-900m-portuguese-ptpt-encoder']:
    # Specify a repository name for each model to be trained;
    # the trained model will be available on the Hugging Face Hub under that name.
    # E.g. f"dataset-SRL-{model_name.split('/')[1]}" produced
    # https://huggingface.co/liaad/propbank_br_srl_bert_base_portuguese_cased
    repository_name = "<<REPOSITORY_NAME>>"

    trainer = TrainerFactory.create(
        # Sentiment Analysis is a Text Classification task.
        nlp_task="Text Classification",
        repository_name=repository_name,
        model_name=model_name,
        # label_names is a list of strings with the possible labels in the dataset.
        # If the dataset is loaded correctly, you can read them with
        # dataset['train'].features[<<LABEL_COLUMN_NAME>>].names;
        # here, the label column name is 'label'.
        # If not properly loaded, pass the labels explicitly as a list of strings,
        # e.g. ['Positive', 'Negative'], assuming those are the labels.
        label_names=imdb['train'].features['label'].names,
        max_epochs=30,
        lr=1e-5,
        train_dataset=imdb['train'],
        eval_dataset=imdb['test'],
    )

    trainer.train()
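
As with SRL, the resulting classifier can then be loaded for inference through the standard transformers API. A minimal sketch, assuming the model was published as a text-classification checkpoint under the <<REPOSITORY_NAME>> placeholder chosen above:

from transformers import pipeline

# Hypothetical inference sketch: <<REPOSITORY_NAME>> is the placeholder for
# whichever Hub repository name was chosen in the training loop above.
classifier = pipeline("text-classification", model="<<REPOSITORY_NAME>>")

# Prints the predicted label and its confidence score.
print(classifier("Este filme é fantástico!"))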

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages in the Python Packaging User Guide.
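
If you just want to use the library, you can typically skip the manual download and install the latest release directly from PyPI:

pip install pt_pump_up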

Source Distribution

pt_pump_up-0.0.11.tar.gz (13.4 kB)

Built Distribution

pt_pump_up-0.0.11-py3-none-any.whl (17.7 kB)

File details

Details for the file pt_pump_up-0.0.11.tar.gz.

File metadata

  • Download URL: pt_pump_up-0.0.11.tar.gz
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.10

File hashes

Hashes for pt_pump_up-0.0.11.tar.gz

  • SHA256: 8bab371565b7774732c489947fd3bbfc0ecfdee61d0899cb999ec6fafc35ebd1
  • MD5: 933a7017d8cf720cdb0ffbfef8928c08
  • BLAKE2b-256: eb58e3414ea41a97b9a897c1f3bc2217bd9644b0647118aa1beb5beb55bf8f4d
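
To verify a download against these digests, Python's standard library is enough. A minimal sketch, assuming the tar.gz was saved to the current directory:

import hashlib

# Recompute the SHA256 of the downloaded archive and compare it with the
# digest published above (hypothetical verification snippet, not part of pt_pump_up).
expected = "8bab371565b7774732c489947fd3bbfc0ecfdee61d0899cb999ec6fafc35ebd1"

with open("pt_pump_up-0.0.11.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == expected, "Hash mismatch: the download may be corrupted."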

File details

Details for the file pt_pump_up-0.0.11-py3-none-any.whl.

File metadata

  • Download URL: pt_pump_up-0.0.11-py3-none-any.whl
  • Size: 17.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.10

File hashes

Hashes for pt_pump_up-0.0.11-py3-none-any.whl

  • SHA256: d406fcc8906413d3905378da2d23717ab737de0d7630cc42de5bb07bfc43682b
  • MD5: 4c852fd291788c9fbe2f8875c377349b
  • BLAKE2b-256: 6f313285436b9333aac4fce1596c4b4daa5c05f2fd0e449bacde01c916f50369
