
A program designed to quickly and efficiently lemmatize the many verbal inflections of Brazilian Portuguese.

Project description

Verb lemmatizer for the Brazilian Portuguese language

This package aims to return the infinitive form of a verb very quickly and effectively for Brazilian Portuguese (pt-BR) texts.

Quantitative information about the dataset

  • Total number of verbs: 9,233
  • Number of regular verbs: 8,941
  • Number of irregular verbs: 292
  • Total number of verbal inflections: 3,419,728

Installation

Install this package with pip:

pip install pt-br-verbs-lemmatizer

For more information about this package, see its page on PyPI.

Usage Examples

This package was designed to be integrated with other NLP tools. To decide whether a word is a verb, we highly recommend using a spaCy model trained on a Portuguese corpus.

Simple usage

  from pt_br_verbs_lemmatizer import lemmatize

  verb = 'apresentava'

  verb_lemma = lemmatize(verb)

  print(verb_lemma)

Output:

  'apresentar'

Execution time

  from pt_br_verbs_lemmatizer import lemmatize
  import time

  verb = 'apresentá-lo-ia'

  t1 = time.time()

  verb_lemma = lemmatize(verb)

  # the 0.1 s sleep is subtracted back out below; only the lookup is timed
  time.sleep(0.1)

  t2 = time.time()

  duration = round((t2-t1-0.1),8)

  print(verb_lemma)
  print(f'Duration: {duration} seconds')

Output:

  'apresentar'
  'Duration: 0.00047889 seconds'

How it was built

  • First of all, we downloaded the Base TeP 2.0 database, which gave us X verbs after filtering.
  • After that, we went to the list of the most popular Portuguese verbs at https://www.conjugacao.com.br/verbos-populares/ and scraped the 5,000 verbs listed there.
  • We compared that list with the one from Base TeP 2.0, adding the verbs that didn't match.
  • Then we scraped the inflections of all the verbs we had gathered, also using the conjugacao website.
  • Some additional steps were taken during the scraping: we added many inflection endings so as to be prepared for almost every scenario (except misspellings).
  • One example is the feminine form of the clitics -lo, -o, -no, etc., which are -la, -a, -na, etc.
  • Finally, we built a dictionary architecture that stores all those verbs and can be searched very quickly, then filled it with the data; it is available in the "dataset" folder (see the sketch after this list).
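
As an illustration of that architecture, here is a minimal sketch of the kind of nested-dictionary index that makes such a search fast. This is not the package's actual code; every name below is hypothetical, and the real index ships in the "dataset" folder.

    # Illustrative sketch only -- the real index is built by the package
    # and shipped in the "dataset" folder.

    # Bucket inflections by first letter and word length, so a lookup
    # inspects a tiny dictionary instead of millions of entries.
    inflection_index = {
        'a': {
            11: {'apresentava': 'apresentar'},
            12: {'apresentasse': 'apresentar'},
        },
    }

    CLITIC_SUFFIXES = ('-lo', '-la', '-no', '-na', '-o', '-a', '-se')

    def toy_lemmatize(word: str) -> str:
        """Return the infinitive if the inflection is known."""
        word = word.lower()
        for suffix in CLITIC_SUFFIXES:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
                break
        bucket = inflection_index.get(word[:1], {}).get(len(word), {})
        return bucket.get(word, word)

    print(toy_lemmatize('apresentava-a'))  # 'apresentar'

Because each bucket holds only a handful of words, the lookup cost stays nearly constant no matter how many inflections the dataset contains.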

Observation: it is possible to find some wrong verb inflections in our dataset. We tried many approaches to be well prepared, but, as we don't have a Portuguese grammar teacher on board, we may have made some mistakes. To be clear, though, we cover more than just the common verb inflections. If you notice any wrong word or run into any trouble while using this package, please contact us!

Tests against the giant spaCy lemmatizer (Portuguese trained model):

Now let's look at some tests comparing spaCy's lemmatization results and lookup times with our program's.

Installing and importing spaCy

pip install -U spacy
pip install -U spacy-lookups-data
python -m spacy download pt_core_news_lg

    import time

    import spacy
    from pt_br_verbs_lemmatizer import lemmatize

    nlp = spacy.load('pt_core_news_lg', enable=["tok2vec", "lemmatizer", "morphologizer"])

    texto = '''Hoje vou jogar bola e espero que você esteja saindo com seus amigos também.
    Gostaria de abrir a janela, será que você vê o céu? Quero apresentá-la para meus pais.
    Eu tinha duas casas, agora só consigo ter uma. Eu apresentá-la-ia para vocês ontem!
    Olhando para ele que observava ela.'''

    doc = nlp(texto)

    for token in doc:
        if token.pos_ == 'VERB':
            print('Verb identified:', token.orth_)

            t1 = time.time()
            verb_lemma_spacy = token.lemma_
            time.sleep(0.1)
            t2 = time.time()

            duration_spacy = round((t2 - t1 - 0.1), 8)

            print('spaCy:', verb_lemma_spacy, duration_spacy, 'seconds.')

            t1 = time.time()
            verb_lemma_mine = lemmatize(token.orth_)
            time.sleep(0.1)
            t2 = time.time()

            duration_mine = round((t2 - t1 - 0.1), 8)

            print('Mine:', verb_lemma_mine, duration_mine, 'seconds.')

            print('-' * 40)

Full Output:

  '''Verb identified: jogar
    spaCy: jogar 0.00021591 seconds.
    Mine: jogar 0.00191703 seconds.
    ----------------------------------------
    Verb identified: espero
    spaCy: esperar 0.00014153 seconds.
    Mine: esperar 0.00021949 seconds.
    ----------------------------------------
    Verb identified: saindo
    spaCy: sair 0.00013509 seconds.
    Mine: sair 0.0001792 seconds.
    ----------------------------------------
    Verb identified: Gostaria
    spaCy: Gostaria 0.00014081 seconds.
    Mine: gostar 0.00018969 seconds.
    ----------------------------------------
    Verb identified: abrir
    spaCy: abrir 0.0001389 seconds.
    Mine: abrir 0.00023022 seconds.
    ----------------------------------------
    Verb identified: será
    spaCy: ser 0.00020018 seconds.
    Mine: ser 0.00017014 seconds.
    ----------------------------------------
    Verb identified: vê
    spaCy: ver 6.261e-05 seconds.
    Mine: ver 0.00018539 seconds.
    ----------------------------------------
    Verb identified: Quero
    spaCy: querer 0.00096145 seconds.
    Mine: querer 0.0001966 seconds.
    ----------------------------------------
    Verb identified: apresentá-la
    spaCy: apresentá-la 0.00013962 seconds.
    Mine: apresentar 0.00027146 seconds.
    ----------------------------------------
    Verb identified: tinha
    spaCy: ter 0.00013342 seconds.
    Mine: ter 0.00016847 seconds.
    ----------------------------------------
    Verb identified: consigo
    spaCy: consigo 0.00016179 seconds.
    Mine: conseguir 0.00019159 seconds.
    ----------------------------------------
    Verb identified: ter
    spaCy: ter 0.00014439 seconds.
    Mine: ter 0.00023308 seconds.
    ----------------------------------------
    Verb identified: apresentá-la-ia
    spaCy: apresentá-la-ia 5.569e-05 seconds.
    Mine: apresentar 0.00023594 seconds.
    ----------------------------------------
    Verb identified: Olhando
    spaCy: Olhando 0.00017633 seconds.
    Mine: olhar 0.00023808 seconds.
    ----------------------------------------
    Verb identified: observava
    spaCy: observar 0.00013556 seconds.
    Mine: observar 0.00020494 seconds.
    ----------------------------------------'''

So, as we can see, although spaCy has better lookup times (and we are very close to it), it often gets the lemmatized verb wrong. To be honest, in my personal tests, almost every time a verb contains a hyphen ("-"), spaCy gets confused.

I want to make it clear: spaCy is one of the best, if not the best, NLP libraries available at the moment. What I tried to improve was the replacement of an inflected verb with its infinitive. So, if you want to lemmatize your verbs with much more accuracy, I suggest you combine spaCy and pt-br-verbs-lemmatizer to get the best results on your Brazilian Portuguese texts!
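
As an illustration of that mix, here is a minimal sketch: spaCy flags the verbs, and this package maps them to their infinitives. The helper lemmatize_verbs is hypothetical, not part of either library.

    import spacy
    from pt_br_verbs_lemmatizer import lemmatize

    nlp = spacy.load('pt_core_news_lg', enable=["tok2vec", "lemmatizer", "morphologizer"])

    def lemmatize_verbs(text):
        # spaCy decides which tokens are verbs; lemmatize() maps each
        # verb to its infinitive, and other tokens pass through unchanged.
        doc = nlp(text)
        return [lemmatize(token.orth_) if token.pos_ == 'VERB' else token.orth_
                for token in doc]

    print(lemmatize_verbs('Quero apresentá-la para meus pais.'))
    # expected: ['querer', 'apresentar', 'para', 'meus', 'pais', '.']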

Tokenizing using spaCy's lemmatizer

    texto = '''Tem-se que ter muito cuidado com isso. Tu recomendarias o que? 
    Ele apresentava-se como queria. Foi bom tê-lo por perto!
    Tu fosse no show ontem? Eu estava olhando e apreciava-a muito.
    Esperava-se que ele chegaria mais cedo.'''

    doc = nlp(texto)

    tokenization = []

    print('Verbs:')

    t1 = time.time()

    for token in doc:
        token_text = token.orth_
        if not (token.is_punct or token.is_space):
            if token.pos_ == 'VERB':
                print(token_text)
                token_text = token.lemma_
            tokenization.append(token_text.lower())

    t2 = time.time()

    print('\n')
    print(tokenization)
    print(f'\nTime: {t2-t1}')

Output:

    '''Verbs:
      Tem-se
      ter
      Tu
      apresentava-se
      queria
      tê-lo
      olhando
      apreciava-a
      Esperava-se
      chegaria'''


      ['tem-se', 'que', 'ter', 'muito', 'cuidado', 'com', 'isso', 'tu', 
      'recomendarias', 'o', 'que', 'ele', 'apresentar se', 'como', 'querer', 
      'foi', 'bom', 'ter ele', 'por', 'perto', 'tu', 'fosse', 'no', 'show', 
      'ontem', 'eu', 'estava', 'olhar', 'e', 'apreciava-r', 'muito', 'esperava-se', 
      'que', 'ele', 'chegar', 'mais', 'cedo']

      'Time: 0.0021452903747558594'

Tokenizing using our lemmatizer

    texto = '''Tem-se que ter muito cuidado com isso. Tu recomendarias o que? 
    Ele apresentava-se como queria. Foi bom tê-lo por perto!
    Tu fosse no show ontem? Eu estava olhando e apreciava-a muito.
    Esperava-se que ele chegaria mais cedo.'''

    doc = nlp(texto)

    tokenization = []

    print('Verbs:')

    t1 = time.time()

    for token in doc:
        token_text = token.orth_
        if not (token.is_punct or token.is_space):
            if token.pos_ == 'VERB':
                print(token_text)
                token_text = lemmatize(token_text)
            tokenization.append(token_text.lower())

    t2 = time.time()

    print('\n')
    print(tokenization)
    print(f'\nTime: {t2-t1}')

Output:

    '''Verbs:
      Tem-se
      ter
      Tu
      apresentava-se
      queria
      tê-lo
      olhando
      apreciava-a
      Esperava-se
      chegaria'''


      ['ter', 'que', 'ter', 'muito', 'cuidado', 'com', 'isso', 'tu', 
      'recomendarias', 'o', 'que', 'ele', 'apresentar', 'como', 'querer', 
      'foi', 'bom', 'ter', 'por', 'perto', 'tu', 'fosse', 'no', 'show', 
      'ontem', 'eu', 'estava', 'olhar', 'e', 'apreciar', 'muito', 'esperar', 
      'que', 'ele', 'chegar', 'mais', 'cedo']

      'Time: 0.0023202896118164062'

The timings are not meant to be precise in these cases. For a more reliable statistic, we could run the measurement many more times and take the mean, for example.
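
A minimal sketch of such an averaged measurement, using Python's standard timeit module (the run count is arbitrary, and absolute numbers vary by machine):

    import timeit

    from pt_br_verbs_lemmatizer import lemmatize

    # Average over many repetitions instead of trusting a single
    # wall-clock reading.
    runs = 10_000
    total = timeit.timeit(lambda: lemmatize('apresentá-lo-ia'), number=runs)
    print(f'Mean per call: {total / runs:.2e} seconds')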

Some verbs weren't found by spaCy, but we would lemmatize them properly:

    print(lemmatize('recomendarias'))
    print(lemmatize('tê-lo'))
    print(lemmatize('fosse'))
    print(lemmatize('estava'))
    print(lemmatize('apreciava-a'))

Output:

    '''recomendar
       ter
       ir
       estar
       apreciar'''


Used by

  • This project is used in the text pre-processing stage of the WOKE project of the Grupo de Estudos e Pesquisa em IA e História ("Study and Research Group on AI and History") at UFSC (Federal University of Santa Catarina).



Download files

Source distribution: pt_br_verbs_lemmatizer-0.1.0.tar.gz (11.4 MB)

Built distribution: pt_br_verbs_lemmatizer-0.1.0-py3-none-any.whl (11.6 MB)
