
Weak POS Tagging Model

Project description

This package combines two models for labeling parts of speech. The first model uses files containing lists of words belonging to a given grammatical class. The second model applies rules to break ambiguity when the first model cannot settle on a single class for a word.

The package takes sentences as input in the form of strings and outputs a string with the POS tags of the words in the sentence.

Installation

The Weak POS Tagger can be installed from PyPI:

pip install weak_postagger

Usage

POS Tag Classes Accepted

This project uses Portuguese part of speech classes. The accepted classes and their tokens are as follows:

'verbos': 'VERB',
'adjetivos': 'ADJ',
'adverbios': 'ADV',
'artigos': 'ART',
'conjuncoes': 'CONJ',
'interjeicoes': 'INT',
'substantivos': 'SUBS',
'pronomes': 'PRON',
'numeros': 'NUM',
'preposicoes': 'PREP',
'participios': 'PART'
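
For reference, this mapping can be written out as a Python dictionary. This is only an illustrative constant (the name used inside the package may differ), and the .txt file naming in the last line follows the examples in the List Based Model section below:

POS_CLASS_TO_TAG = {
    'verbos': 'VERB',
    'adjetivos': 'ADJ',
    'adverbios': 'ADV',
    'artigos': 'ART',
    'conjuncoes': 'CONJ',
    'interjeicoes': 'INT',
    'substantivos': 'SUBS',
    'pronomes': 'PRON',
    'numeros': 'NUM',
    'preposicoes': 'PREP',
    'participios': 'PART',
}

# One word-list file per class, e.g. 'substantivos.txt' (see List Based Model below).
expected_files = [f'{class_name}.txt' for class_name in POS_CLASS_TO_TAG]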

Text Preprocessing

All text is preprocessed by default using the following operations (a minimal sketch of their effect follows this list):

  • Case lowering
  • Adding space around punctuation
  • Removing non-ASCII characters
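
The sketch below approximates what these default operations do; it is an illustration under straightforward assumptions, not the package's internal code:

import re
import string

def default_preprocess(text: str) -> str:
    """Approximate the default preprocessing: lowercase, pad punctuation, drop non-ASCII."""
    text = text.lower()                                                    # case lowering
    text = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", text)  # space around punctuation
    text = text.encode('ascii', errors='ignore').decode('ascii')          # remove non-ASCII characters
    return ' '.join(text.split())                                         # collapse extra whitespace

print(default_preprocess("Uma banana verde!"))  # -> 'uma banana verde !'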

It can also optionally perform the following tokenizations:

  • E-mails
  • URLs
  • Numbers
  • Codes

To use the optional preprocessing, the user needs to pass a list containing EMAIL, URL, NUMBER and/or CODE. It can be passed when instantiating the class, as shown below:

tokenization_options = ['EMAIL', 'CODE']
weak_postag = WeakPOSTagging('directory_path/', tokenization_options)

List Based Model

Files

To label a string using the list based model, the user needs to create a directory containing one text file for each part of speech class they want to use. The name of each file must contain the name of the part of speech class, and the contents of the file must be words belonging to that class. For example, a file called substantivos.txt could contain the following words:

carro  
mesa  
banana  

A second file called adjetivos.txt could contain the words:

azul  
lento  
calmo  
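
Such a directory can be created by hand or programmatically; the sketch below writes the two example files above (the directory name is only a placeholder):

import os

word_lists = {
    'substantivos.txt': ['carro', 'mesa', 'banana'],
    'adjetivos.txt': ['azul', 'lento', 'calmo'],
}

os.makedirs('directory_path', exist_ok=True)
for file_name, words in word_lists.items():
    # One word per line, as in the examples above.
    with open(os.path.join('directory_path', file_name), 'w', encoding='utf-8') as handle:
        handle.write('\n'.join(words) + '\n')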

Weak Labeling

The default weak labeling pipeline consists of two steps:

  • The first step is the labeling module ListWeakModel, built from the files passed to the class.
  • The second step is the label correction module RuleBasedDisambiguation.

To use the default pipeline, the user first instantiates the class, passing the path of the directory where the files are stored, and then uses it to label a sentence:

weak_postaging = WeakPOSTagging('directory_path/')
sentence = "Uma banana verde"
labeled_sentence = weak_postaging.label_sentence(sentence)

And the user should receive back the result:

'ART SUBS ADJ'

The user can also specify which optional text preprocessing will be applied to the sentence:

tokenization_options = ['EMAIL', 'CODE']
weak_postag = WeakPOSTagging('directory_path/', tokenization_options)
sentence = "o meu contato é research@email.com"
labeled_sentence = weak_postag.label_sentence(sentence)

And the user should receive back the result:

'ART PRON VERB SUBS'

Setting a custom pipeline

A custom labeling pipeline can be created by clearing the default pipeline and adding each step to it in the order they need to be executed.

weak_postaging = WeakPOSTagging('directory_path/')
weak_postaging.clear()

list_model_1 = ListWeakModel('directory/')
list_model_2 = ListWeakModel('directory_two/')
rule_weak_model = RuleBasedDisambiguation()

weak_postaging.add_pipeline_step(list_model_1) \
              .add_pipeline_step(rule_weak_model) \
              .add_pipeline_step(list_model_2) \
              .add_pipeline_step(rule_weak_model)

It is important to note that the first step should always be a list based one.
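
Once assembled, the custom pipeline is used in the same way as the default one, for example:

sentence = "Uma banana verde"
labeled_sentence = weak_postaging.label_sentence(sentence)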

Contribute

If this is the first time you are contributing to this project, create the virtual environment using the following command:

conda env create -f env/environment.yml

Then activate the environment:

conda activate weakpostag_env

To test your modifications, build the package and install the resulting wheel:

pip install dist\weak_postagger-0.0.1-py3-none-any.whl --force-reinstall

Then run the tests:

pytest

