Program designed to lemmatize the various verbal inflections present in the Brazilian Portuguese language quickly and efficiently.

Verb lemmatizer for the Brazilian Portuguese language

This program aims to return the infinitive form of a verb quickly and accurately in Brazilian Portuguese (pt-BR) texts.

Quantitative information about the dataset

Total number of verbs: 9,233
Number of regular verbs: 8,941
Number of irregular verbs: 292
Total number of verbal inflections: 3,419,728

Installation

Install this package with pip:

pip install pt-br-verbs-lemmatizer

For more information about this package, see its page on PyPI.

Usage Examples

This package was designed to be integrated with other NLP tools. To determine whether a word is a verb, we highly recommend using a spaCy model trained on a Portuguese corpus.

Simple usage

from pt_br_verbs_lemmatizer import lemmatize

verb = 'apresentava'

verb_lemma = lemmatize(verb)

print(verb_lemma)

Output:

'apresentar'

Execution time

from pt_br_verbs_lemmatizer import lemmatize
import time

verb = 'apresentá-lo-ia'

t1 = time.time()

verb_lemma = lemmatize(verb)

time.sleep(0.1)

t2 = time.time()

duration = round((t2-t1-0.1),8)

print(verb_lemma)
print(f'Duration: {duration} seconds')

Output:

'apresentar'
'Duration: 0.00047889 seconds'

How it was built

First, we downloaded the Base TeP 2.0 database, which gave us X number of verbs after filtering.
Next, we scraped the 5,000 verbs listed on the page of most popular Portuguese verbs at https://www.conjugacao.com.br/verbos-populares/.
We compared that list with the one from Base TeP 2.0 and added the verbs that were missing.
Then we scraped the inflections of every verb we had collected, also from the conjugacao website.
During scraping we took some additional steps: we added a number of extra inflection endings to cover almost every scenario (misspellings excepted).
For example, alongside the endings -lo, -o, -no, etc., we added their feminine forms -la, -a, -na, etc.
Finally, we designed a dictionary structure that stores all of these verbs and supports very fast lookups, and then filled it. The result is available in the "dataset" folder.
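
The feminine-ending expansion and the dictionary lookup described above could be sketched roughly as follows. This is only an illustration under assumptions: the names FEMININE_OF, expand_clitics, and build_index, as well as the flat dict layout, are hypothetical and are not taken from the package's actual "dataset" structure.

```python
# Hypothetical sketch: NOT the package's real data layout or code.
# Masculine enclitic endings mapped to their feminine counterparts.
FEMININE_OF = {"-lo": "-la", "-o": "-a", "-no": "-na"}


def expand_clitics(forms):
    """Return the given forms plus a feminine variant for each masculine enclitic form."""
    expanded = set(forms)
    for form in forms:
        for masc, fem in FEMININE_OF.items():
            if form.endswith(masc):
                expanded.add(form[: -len(masc)] + fem)
    return expanded


def build_index(conjugations):
    """Map every inflected form (lower-cased) to its infinitive for fast dict lookup."""
    index = {}
    for infinitive, forms in conjugations.items():
        for form in expand_clitics(forms):
            index[form.lower()] = infinitive
    return index


# Tiny illustrative sample; the real dataset holds millions of inflections.
sample = {
    "apresentar": ["apresentar", "apresentava", "apresentá-lo", "apresentava-o"],
    "ter": ["ter", "tinha", "tê-lo"],
}
index = build_index(sample)

print(index.get("apresentá-la"))   # apresentar
print(index.get("tê-la"))          # ter
print(index.get("apresentava-a"))  # apresentar
```

A flat dict like this keeps only one infinitive per inflected form, so a form shared by two verbs would need some tie-breaking rule. How the package handles that case is not stated here.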

Observation: Our dataset may contain some incorrect inflections. We tried many approaches to cover as many cases as possible but, as we do not have a Portuguese grammar teacher on board, we may have made some mistakes. To be clear, though, the dataset contains more than just the common inflections. If you notice any wrong word or run into any trouble while using this package, please contact us!

Tests against the giant: spaCy's lemmatizer with a model trained on Portuguese

The tests below compare spaCy's lemmatization results, and the time it takes to return each lemma, with those of our program.

Installing and importing spaCy

pip install -U spacy
pip install -U spacy-lookups-data
python -m spacy download pt_core_news_lg

import time

import spacy
from pt_br_verbs_lemmatizer import lemmatize

# Load only the pipeline components needed for POS tagging and lemmatization.
nlp = spacy.load('pt_core_news_lg', enable=["tok2vec", "lemmatizer", "morphologizer"])

texto = '''Hoje vou jogar bola e espero que você esteja saindo com seus amigos também.
Gostaria de abrir a janela, será que você vê o céu? Quero apresentá-la para meus pais.
Eu tinha duas casas, agora só consigo ter uma. Eu apresentá-la-ia para vocês ontem!
Olhando para ele que observava ela.'''

doc = nlp(texto)

for token in doc:
    # Only compare tokens that spaCy tags as verbs.
    if token.pos_ == 'VERB':
        print('Verb identified:', token.orth_)

        # Time spaCy: read the lemma, sleep 0.1 s, then subtract the sleep.
        t1 = time.time()
        verb_lemma_spacy = token.lemma_
        time.sleep(0.1)
        t2 = time.time()
        duration_spacy = round((t2 - t1 - 0.1), 8)
        print('spaCy:', verb_lemma_spacy, duration_spacy, 'seconds.')

        # Time our lemmatizer on the same surface form, in the same way.
        t1 = time.time()
        verb_lemma_mine = lemmatize(token.orth_)
        time.sleep(0.1)
        t2 = time.time()
        duration_mine = round((t2 - t1 - 0.1), 8)
        print('Mine:', verb_lemma_mine, duration_mine, 'seconds.')

        print('-' * 40)

Full output

'''Verb identified: jogar
  spaCy: jogar 0.00021591 seconds.
  Mine: jogar 0.00191703 seconds.
  ----------------------------------------
  Verb identified: espero
  spaCy: esperar 0.00014153 seconds.
  Mine: esperar 0.00021949 seconds.
  ----------------------------------------
  Verb identified: saindo
  spaCy: sair 0.00013509 seconds.
  Mine: sair 0.0001792 seconds.
  ----------------------------------------
  Verb identified: Gostaria
  spaCy: Gostaria 0.00014081 seconds.
  Mine: gostar 0.00018969 seconds.
  ----------------------------------------
  Verb identified: abrir
  spaCy: abrir 0.0001389 seconds.
  Mine: abrir 0.00023022 seconds.
  ----------------------------------------
  Verb identified: será
  spaCy: ser 0.00020018 seconds.
  Mine: ser 0.00017014 seconds.
  ----------------------------------------
  Verb identified: vê
  spaCy: ver 6.261e-05 seconds.
  Mine: ver 0.00018539 seconds.
  ----------------------------------------
  Verb identified: Quero
  spaCy: querer 0.00096145 seconds.
  Mine: querer 0.0001966 seconds.
  ----------------------------------------
  Verb identified: apresentá-la
  spaCy: apresentá-la 0.00013962 seconds.
  Mine: apresentar 0.00027146 seconds.
  ----------------------------------------
  Verb identified: tinha
  spaCy: ter 0.00013342 seconds.
  Mine: ter 0.00016847 seconds.
  ----------------------------------------
  Verb identified: consigo
  spaCy: consigo 0.00016179 seconds.
  Mine: conseguir 0.00019159 seconds.
  ----------------------------------------
  Verb identified: ter
  spaCy: ter 0.00014439 seconds.
  Mine: ter 0.00023308 seconds.
  ----------------------------------------
  Verb identified: apresentá-la-ia
  spaCy: apresentá-la-ia 5.569e-05 seconds.
  Mine: apresentar 0.00023594 seconds.
  ----------------------------------------
  Verb identified: Olhando
  spaCy: Olhando 0.00017633 seconds.
  Mine: olhar 0.00023808 seconds.
  ----------------------------------------
  Verb identified: observava
  spaCy: observar 0.00013556 seconds.
  Mine: observar 0.00020494 seconds.
  ----------------------------------------'''

As we can see, spaCy's lookup times are usually somewhat lower, although ours are close, but it often returns the wrong lemma: in this run it left Gostaria, apresentá-la, consigo, apresentá-la-ia, and Olhando unchanged. Note that token.lemma_ is read from a doc that nlp(texto) has already processed, so the spaCy timings above reflect attribute access rather than the lemmatization itself. In my own tests, spaCy gets confused by almost every verb that contains a hyphen "-".

I want to make it clear: spaCy is one of the best NLP libraries available at the moment, if not the best. What I tried to do was improve the replacement of inflected verbs with their infinitives. So, if you want to lemmatize your verbs with much greater accuracy, I suggest combining spaCy and pt-br-verbs-lemmatizer to get the best results on your Brazilian Portuguese texts!
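
One way to combine the two tools might look like the sketch below: spaCy decides which tokens are verbs, and the verb lemmatizer supplies the lemma for those tokens only. The helper hybrid_lemmas, its verb_tags parameter, the FakeToken class, and fake_lemmatize are all hypothetical stand-ins so the example runs without either library installed. Only the attribute names orth_, pos_, and lemma_ come from spaCy's Token.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Stand-in exposing the spaCy Token attributes used below."""
    orth_: str
    pos_: str
    lemma_: str


def hybrid_lemmas(tokens, verb_lemmatizer, verb_tags=("VERB",)):
    """Use verb_lemmatizer for tokens tagged as verbs; keep spaCy's lemma otherwise."""
    result = []
    for token in tokens:
        if token.pos_ in verb_tags:
            result.append(verb_lemmatizer(token.orth_))
        else:
            result.append(token.lemma_)
    return result


# Hypothetical stand-in for pt_br_verbs_lemmatizer.lemmatize (assumption:
# here unknown words are returned unchanged; the real function may differ).
_DEMO = {"apresentá-la": "apresentar", "quero": "querer"}


def fake_lemmatize(word):
    return _DEMO.get(word.lower(), word)


tokens = [
    FakeToken("Quero", "VERB", "querer"),
    FakeToken("apresentá-la", "VERB", "apresentá-la"),
    FakeToken("para", "ADP", "para"),
]
print(hybrid_lemmas(tokens, fake_lemmatize))
# ['querer', 'apresentar', 'para']
```

With the real libraries, the same function could presumably be called as hybrid_lemmas(nlp(texto), lemmatize). Passing verb_tags=("VERB", "AUX") would also route auxiliary verbs through the verb lemmatizer; whether that improves results on a given text is an assumption to verify.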

Tokenizing using spaCy's lemmatizer

texto = '''Tem-se que ter muito cuidado com isso. Tu recomendarias o que? 
Ele apresentava-se como queria. Foi bom tê-lo por perto!
Tu fosse no show ontem? Eu estava olhando e apreciava-a muito.
Esperava-se que ele chegaria mais cedo.'''

doc = nlp(texto)

tokenization = []

print('Verbs:')

t1 = time.time()

for token in doc:
    token_text = token.orth_
    if not (token.is_punct or token.is_space):
        # Replace verbs with spaCy's lemma; keep every other token as written.
        if token.pos_ == 'VERB':
            print(token_text)
            token_text = token.lemma_
        tokenization.append(token_text.lower())

t2 = time.time()

print('\n')
print(tokenization)
print(f'\nTime: {t2-t1}')

Output:

'''Verbs:
  Tem-se
  ter
  Tu
  apresentava-se
  queria
  tê-lo
  olhando
  apreciava-a
  Esperava-se
  chegaria'''


  ['tem-se', 'que', 'ter', 'muito', 'cuidado', 'com', 'isso', 'tu', 
  'recomendarias', 'o', 'que', 'ele', 'apresentar se', 'como', 'querer', 
  'foi', 'bom', 'ter ele', 'por', 'perto', 'tu', 'fosse', 'no', 'show', 
  'ontem', 'eu', 'estava', 'olhar', 'e', 'apreciava-r', 'muito', 'esperava-se', 
  'que', 'ele', 'chegar', 'mais', 'cedo']

  'Time: 0.0021452903747558594'

Tokenizing using our lemmatizer

texto = '''Tem-se que ter muito cuidado com isso. Tu recomendarias o que? 
Ele apresentava-se como queria. Foi bom tê-lo por perto!
Tu fosse no show ontem? Eu estava olhando e apreciava-a muito.
Esperava-se que ele chegaria mais cedo.'''

doc = nlp(texto)

tokenization = []

print('Verbs:')

t1 = time.time()

for token in doc:
    token_text = token.orth_
    if not (token.is_punct or token.is_space):
        # Replace verbs with our lemmatizer's result; keep every other token as written.
        if token.pos_ == 'VERB':
            print(token_text)
            token_text = lemmatize(token_text)
        tokenization.append(token_text.lower())

t2 = time.time()

print('\n')
print(tokenization)
print(f'\nTime: {t2-t1}')

Output:

'''Verbs:
  Tem-se
  ter
  Tu
  apresentava-se
  queria
  tê-lo
  olhando
  apreciava-a
  Esperava-se
  chegaria'''


  ['ter', 'que', 'ter', 'muito', 'cuidado', 'com', 'isso', 'tu', 
  'recomendarias', 'o', 'que', 'ele', 'apresentar', 'como', 'querer', 
  'foi', 'bom', 'ter', 'por', 'perto', 'tu', 'fosse', 'no', 'show', 
  'ontem', 'eu', 'estava', 'olhar', 'e', 'apreciar', 'muito', 'esperar', 
  'que', 'ele', 'chegar', 'mais', 'cedo']

  'Time: 0.0023202896118164062'

The times in these examples are not meant to be precise. For more reliable statistics, the measurement should be repeated many times and averaged, for example.
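
A repeated measurement of that kind could be sketched with the standard library's timeit module, as below. The helper mean_lookup_time and the fake_lemmatize stand-in are hypothetical; the stand-in only exists so the snippet runs on its own and should be swapped for the real lemmatize function. Unlike subtracting a fixed time.sleep(0.1), which can sleep slightly longer than requested, this approach times only the calls themselves.

```python
import statistics
import timeit


def mean_lookup_time(func, word, repeats=5, number=1000):
    """Return (mean, stdev) of seconds per call across `repeats` timing runs."""
    totals = timeit.repeat(lambda: func(word), repeat=repeats, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


# Hypothetical stand-in; replace with:
#   from pt_br_verbs_lemmatizer import lemmatize
_DEMO = {"apresentá-lo-ia": "apresentar"}


def fake_lemmatize(word):
    return _DEMO.get(word, word)


mean_s, stdev_s = mean_lookup_time(fake_lemmatize, "apresentá-lo-ia")
print(f"mean: {mean_s:.3e} s per call (stdev {stdev_s:.3e} s)")
```

The printed numbers depend on the machine, so no expected output is shown.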

Some verbs were not identified as verbs, but our lemmatizer would handle them properly:

print(lemmatize('recomendarias'))
print(lemmatize('tê-lo'))
print(lemmatize('fosse'))
print(lemmatize('estava'))
print(lemmatize('apreciava-a'))

Output:

'''recomendar
    ter
    ir
    estar
    apreciar'''

Authors

@IgorCaetano

Used by

This project is used in the text pre-processing stage in the WOKE project of the Grupo de Estudos e Pesquisa em IA e História ("Study and Research Group on AI and History") at UFSC ("Federal University of Santa Catarina").

Special credits to:

Base TeP 2.0 database
conjugacao.com.br website

References

DIAS-DA-SILVA, B.C.; MORAES, H.R.; OLIVEIRA, M.F.; HASEGAWA, R.; AMORIM, D.A.; PASCHOALINO, C.; NASCIMENTO, A.C. (2000). Construção de um thesaurus eletrônico para o português do Brasil. PROCESSAMENTO COMPUTACIONAL DO PORTUGUÊS ESCRITO E FALADO (PROPOR), Vol. 4, pp. 1-10.

DIAS-DA-SILVA, B.C.; MORAES, H.R. (2003). A construção de um thesaurus eletrônico para o português do Brasil. ALFA, Vol. 47, N. 2, pp. 101-115.
MAZIERO, E.G.; PARDO, T.A.S.; DI FELIPPO, A.; DIAS-DA-SILVA, B.C. (2008). A Base de Dados Lexical e a Interface Web do TeP 2.0 - Thesaurus Eletrônico para o Português do Brasil. VI WORKSHOP EM TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (TIL), pp. 390-392.
OLIVEIRA, H.G.; SANTOS, D.; GOMES, P. (2008). Extracção de relações semânticas entre palavras a partir de um dicionário: primeira avaliação. ENVIADO PARA APRECIAÇÃO A LINGUAMÁTICA 3 (2010).
BARROS, C. D. Antonímia nos adjetivos descritivos do português do Brasil: uma proposta de análise e representação. 2010. 89 f. Dissertação (Mestrado em Linguística) – Universidade Federal de São Carlos, São Carlos, 2010.

VERBOS. In: CONJUGAÇÃO. 7Graus, c2024. Disponível em: https://www.conjugacao.com.br/verbos-populares/. Acesso em: 11 abril 2024.
