Program designed to lemmatize the various verbal inflections present in the Brazilian Portuguese language quickly and efficiently.

Project description

Verb lemmatizer for the Brazilian Portuguese language

This program aims to return the infinitive form of a verb quickly and efficiently.

Quantitative information about the dataset

Total number of verbs: 9,233
Number of regular verbs: 8,941
Number of irregular verbs: 292
Total number of verbal inflections: 3,419,728

Usage Examples

This package is designed to be integrated with other NLP tools and only returns the infinitive form of a verb, i.e. you need another tool to tell whether a word is a verb or not. For that part of the process we highly recommend the spaCy library. Here is an example that optimizes the tokenization of a sentence by using our lemmatizer instead of spaCy's, along with some timing results.
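The example and timings mentioned above are not reproduced in this text, and the package's own function names are not shown here. As a rough sketch of the integration idea only, the code below replaces tokens with their infinitive when a part-of-speech tagger marks them as verbs. `lemmatize_verb` and `TOY_VERB_INDEX` are hypothetical stand-ins for the package's lookup, and the tagged input is written by hand instead of coming from a real tagger.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-in for the package's verb lookup (toy data, not the real dataset).
TOY_VERB_INDEX = {"comprou": "comprar", "comeu": "comer"}


def lemmatize_verb(word: str) -> str:
    """Return the infinitive for a known inflection, or the word unchanged."""
    return TOY_VERB_INDEX.get(word.lower(), word)


def lemmatize_tokens(
    tagged: Iterable[Tuple[str, str]],
    verb_lemmatizer: Callable[[str], str] = lemmatize_verb,
) -> List[str]:
    """Lemmatize only the tokens tagged as verbs; leave every other token as-is."""
    verb_tags = {"VERB", "AUX"}  # Universal POS tags for main and auxiliary verbs
    return [
        verb_lemmatizer(text) if pos in verb_tags else text
        for text, pos in tagged
    ]


# Hand-written (text, POS) pairs standing in for a tagger's output.
tagged_sentence = [("Ela", "PRON"), ("comprou", "VERB"), ("pão", "NOUN")]
print(lemmatize_tokens(tagged_sentence))  # ['Ela', 'comprar', 'pão']
```

With spaCy, the pairs could be produced from a processed document with `[(token.text, token.pos_) for token in doc]`, so the tagger decides what is a verb and the lookup only supplies the infinitive.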

How it was built

First, we downloaded the Base TeP 2.0 database, which gave us X number of verbs after filtering it.
Next, we went to the list of the most popular Portuguese verbs at https://www.conjugacao.com.br/verbos-populares/ and scraped the 5000 verbs listed there.
We compared that list with the one we had from Base TeP 2.0 and added the verbs that were missing.
Then we scraped the inflections of all the verbs we had collected, also from the conjugacao website.
We took some additional steps during the scraping process: we added a set of inflection endings to be prepared for almost every scenario (except misspellings).
One example is the feminine forms of -lo, -o, -no, etc., which are -la, -a, -na, etc.
Finally, we built a dictionary architecture that stores all of those verbs and can be searched very quickly. Then we simply populated it; the result is available in the "dataset" folder.
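The last steps can be pictured as a reverse lookup table from each inflection to its infinitive. The sketch below is a minimal illustration under assumptions, not the package's actual code: the names `FEMININE_ENDINGS`, `feminine_variant` and `build_index`, as well as the toy conjugation data, are hypothetical.

```python
from typing import Dict, List, Optional

# Masculine enclitic endings and their feminine counterparts, as listed above.
FEMININE_ENDINGS = {"-lo": "-la", "-no": "-na", "-o": "-a"}


def feminine_variant(form: str) -> Optional[str]:
    """Return the feminine counterpart of a form ending in -lo, -no or -o."""
    for masculine, feminine in FEMININE_ENDINGS.items():
        if form.endswith(masculine):
            return form[: -len(masculine)] + feminine
    return None


def build_index(conjugations: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every inflection (and its feminine variant) to its infinitive."""
    index: Dict[str, str] = {}
    for infinitive, forms in conjugations.items():
        index.setdefault(infinitive, infinitive)
        for form in forms:
            # setdefault keeps the first verb seen for a form shared by several verbs.
            index.setdefault(form, infinitive)
            variant = feminine_variant(form)
            if variant is not None:
                index.setdefault(variant, infinitive)
    return index


# Toy data standing in for the scraped conjugations (hypothetical sample).
toy_conjugations = {"amar": ["amo", "amou", "amá-lo", "amam-no", "amo-o"]}
index = build_index(toy_conjugations)

print(index["amá-la"])   # amar
print(index["amam-na"])  # amar
print(index.get("xyz"))  # None
```

A plain dictionary gives constant-time lookups on average, which fits the goal of searching very quickly. Some forms belong to more than one verb (for instance "fui" is a form of both "ir" and "ser"); this sketch keeps the first verb seen, and how the package resolves such cases is not stated here.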

Note: Our dataset may contain some incorrect verb inflections. We tried many approaches to be as well prepared as possible but, since we do not have a Portuguese grammar teacher on the team, we may have made some mistakes. To be clear, though, the dataset covers more than just the common verb inflections. If you notice a wrong word or run into trouble while running this package, please contact us!

Tests against the giant spaCy lemmatizer (trained Portuguese model):

Special credits to:

Base TeP 2.0 database
conjugacao.com.br website

Project details

Release history

Version	Release date
0.1.7	Apr 14, 2024
0.1.6	Apr 13, 2024
0.1.5	Apr 13, 2024
0.1.4	Apr 13, 2024
0.1.3	Apr 13, 2024
0.1.2	Apr 13, 2024
0.1.1	Apr 9, 2024
0.1.0	Apr 9, 2024
0.0.99	Mar 14, 2024
0.0.98	Mar 14, 2024
0.0.97	Mar 14, 2024
0.0.95	Mar 14, 2024
0.0.92	Mar 14, 2024
0.0.91 (this version)	Mar 14, 2024
0.0.9.2	Apr 9, 2024
0.0.9.1	Apr 9, 2024
0.0.9	Mar 14, 2024
0.0.8	Mar 14, 2024
0.0.7	Mar 14, 2024
0.0.6	Mar 14, 2024
0.0.5	Mar 14, 2024
0.0.4	Mar 14, 2024
0.0.3	Mar 14, 2024
0.0.2	Mar 14, 2024
0.0.1	Mar 13, 2024

Download files

Source Distribution

pt_br_verbs_lemmatizer-0.0.91.tar.gz (11.7 MB)

Uploaded Mar 14, 2024 Source

Built Distribution

pt_br_verbs_lemmatizer-0.0.91-py3-none-any.whl (12.0 MB)

Uploaded Mar 14, 2024 Python 3

Hashes for pt_br_verbs_lemmatizer-0.0.91.tar.gz

Algorithm	Hash digest
SHA256	`6d71b52dfe24c313b4bac09cc72534b76327beea2ae38b4ac884e1fb65f33a7b`
MD5	`c2d2fd16c67bcd733acd23ade91e8ddb`
BLAKE2b-256	`ea6689f65a8625b1963a407470e05065b0b8abaf78af1f9e04e705f8ec8911ff`

Hashes for pt_br_verbs_lemmatizer-0.0.91-py3-none-any.whl

Algorithm	Hash digest
SHA256	`83a18309484cb591876279ede476455886f1abff56ba1fe887f77de1c993b643`
MD5	`d0ffa212f5be2c45ce7da5050c70dcee`
BLAKE2b-256	`08d9ccdb5769298fe5c4baebd77c9669053d132d2cc49e1be987020ae3a60e48`
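A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `file_sha256` and `matches_published_digest` are hypothetical and not part of the package.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are never fully loaded in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return file_sha256(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("pt_br_verbs_lemmatizer-0.0.91.tar.gz", "6d71b52dfe24c313b4bac09cc72534b76327beea2ae38b4ac884e1fb65f33a7b")` should return `True` for an intact copy of the source distribution, assuming the file sits in the current directory.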