Program designed to lemmatize the various verbal inflections present in the Brazilian Portuguese language quickly and efficiently.
Project description
Verb lemmatizer for the Brazilian Portuguese language
This program returns the infinitive form of a verb quickly and efficiently.
Quantitative information about the dataset
- Total number of verbs: 9,233
- Number of regular verbs: 8,941
- Number of irregular verbs: 292
- Total number of verbal inflections: 3,419,728
Usage Examples
This package is designed to be integrated with other NLP tools and does one thing only: return the infinitive form of a verb. In other words, you need a separate tool to decide whether a word is a verb. For that part of the process we highly recommend the spaCy library. Here is an example that speeds up the processing of a sentence by using our lemmatizer instead of spaCy's, with some timing results.
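The integration pattern described above can be sketched as follows. This is a minimal illustration, not the package's actual API: `lemmatize_verb` and the tiny `TOY_INFLECTIONS` table are hypothetical stand-ins for the package's lookup, and the part-of-speech tags are hard-coded so the sketch runs without spaCy installed. It does not reproduce any timing results.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the package's inflection table (assumption).
TOY_INFLECTIONS = {
    "cantou": "cantar",
    "comeram": "comer",
    "fizeram": "fazer",
}


def lemmatize_verb(word: str) -> str:
    """Return the infinitive for a known inflection, else the word unchanged."""
    return TOY_INFLECTIONS.get(word.lower(), word)


def lemmatize_tokens(
    tagged: List[Tuple[str, str]],
    verb_lemmatizer: Callable[[str], str] = lemmatize_verb,
) -> List[str]:
    """Lemmatize only tokens tagged VERB or AUX; leave every other token as-is."""
    return [
        verb_lemmatizer(text) if pos in {"VERB", "AUX"} else text
        for text, pos in tagged
    ]


# With spaCy, these pairs would come from [(t.text, t.pos_) for t in nlp(sentence)].
tagged_sentence = [("Ela", "PRON"), ("cantou", "VERB"), ("ontem", "ADV")]
print(lemmatize_tokens(tagged_sentence))  # ['Ela', 'cantar', 'ontem']
```

Passing the lemmatizer in as a callable keeps the verb check (done by the tagger) separate from the lookup, so the real package function could presumably be swapped in without changing the rest of the pipeline.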
How it was built
- First, we downloaded the Base TeP 2.0 database, which gave us X verbs after filtering.
- Next, we web scraped the 5,000 verbs from the list of the most popular Portuguese verbs at https://www.conjugacao.com.br/verbos-populares/.
- We compared that list with the one from Base TeP 2.0 and added the verbs that were missing.
- Then we web scraped the inflections of all the collected verbs, also from the conjugacao website.
- During scraping we took some additional steps: we added a number of inflection endings to cover almost every scenario (except misspellings).
- For example, we added the feminine forms of the endings -lo, -o, -no, etc., which are -la, -a, -na, etc.
- Finally, we built a dictionary architecture that stores all these verbs and can be searched very quickly. Then we populated it; the result is available in the "dataset" folder.
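The last steps above can be sketched as a reverse index from inflection to infinitive, extended with feminine clitic variants. This is only an illustration under stated assumptions: the names `FEMININE_CLITIC`, `feminine_variant`, and `build_index` are hypothetical, the input data is a toy sample, and the actual structure stored in the "dataset" folder may differ.

```python
from typing import Dict, Iterable, Optional

# Masculine enclitic endings and their feminine counterparts (from the list above).
FEMININE_CLITIC = {"-lo": "-la", "-o": "-a", "-no": "-na"}


def feminine_variant(inflection: str) -> Optional[str]:
    """Return the feminine-clitic form, or None if there is no masculine clitic."""
    stem, hyphen, clitic = inflection.rpartition("-")
    if not hyphen:
        return None  # no hyphen, so no enclitic pronoun
    feminine = FEMININE_CLITIC.get("-" + clitic)
    return stem + feminine if feminine else None


def build_index(conjugations: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map every inflection (plus feminine-clitic variants) to its infinitive."""
    index: Dict[str, str] = {}
    for infinitive, inflections in conjugations.items():
        for form in inflections:
            index[form] = infinitive
            variant = feminine_variant(form)
            if variant is not None:
                index[variant] = infinitive
    return index


# Toy input (assumption): one verb with a few scraped inflections.
toy = {"amar": ["amo", "amou", "amá-lo", "amou-o", "amam-no"]}
index = build_index(toy)
print(index["amá-la"], index["amam-na"])  # amar amar
```

A plain hash map is used here because each lookup takes constant time on average, which is one simple way to get the fast search the list describes.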
Note: Our dataset may contain some incorrect verb inflections. We tried many approaches to be as thorough as possible, but since we don't have a Portuguese grammar teacher on the team, we may have made some mistakes. To be clear, the dataset covers more than just the common verb inflections. If you notice an incorrect word or run into trouble while running this package, please contact us!
Tests against the giant spaCy lemmatizer (Portuguese trained model):
Special credits to:
Hashes for pt_br_verbs_lemmatizer-0.0.91.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 6d71b52dfe24c313b4bac09cc72534b76327beea2ae38b4ac884e1fb65f33a7b
MD5 | c2d2fd16c67bcd733acd23ade91e8ddb
BLAKE2b-256 | ea6689f65a8625b1963a407470e05065b0b8abaf78af1f9e04e705f8ec8911ff
Hashes for pt_br_verbs_lemmatizer-0.0.91-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 83a18309484cb591876279ede476455886f1abff56ba1fe887f77de1c993b643
MD5 | d0ffa212f5be2c45ce7da5050c70dcee
BLAKE2b-256 | 08d9ccdb5769298fe5c4baebd77c9669053d132d2cc49e1be987020ae3a60e48