A simple lemmatizer based on Unitex word lists
This is a simple module for lemmatization based on the Unitex inflected word list. As such, it needs a Unitex vocabulary file in order to work properly.
So far, I’ve only used it with Portuguese, using the DELAF_PB file provided by NILC.
You can either clone the repository and install with
$ python setup.py install
or install through pip
$ pip install unitexlemmatizer
In order to use the Unitex Lemmatizer, you need to tell it where the word list is:
>>> import unitexlemmatizer as ul
>>> ul.load_unitex_dictionary('/path/to/delaf.dic')
Then you can call the get_lemma function, passing the inflected word and its part-of-speech tag (from the Universal Dependencies tagset):
>>> ul.get_lemma('corpora', 'noun')
'corpus'
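To give an idea of the kind of lookup involved, here is a minimal, self-contained sketch of a lemmatizer built on DELAF-style entries. It is not this module's actual implementation. It assumes the usual Unitex entry layout `inflected,lemma.CODE:features`, where an empty lemma means the lemma equals the inflected form. The names `UD_TO_DELAF`, `parse_delaf_line`, `build_index` and `lookup_lemma` are hypothetical, the tag mapping is a guess, and the sample entries are illustrative rather than copied from DELAF_PB.

```python
from collections import defaultdict

# Hypothetical mapping from Universal Dependencies tags to DELAF class codes.
UD_TO_DELAF = {"noun": "N", "verb": "V", "adj": "A", "adv": "ADV"}


def parse_delaf_line(line):
    """Split 'inflected,lemma.CODE:features' into (inflected, lemma, code).

    Returns None for blank or malformed lines.
    """
    inflected, comma, rest = line.strip().partition(",")
    lemma, dot, grammar = rest.partition(".")
    if not (inflected and comma and dot and grammar):
        return None
    # Drop inflectional features (after ':') and semantic traits (after '+').
    code = grammar.split(":", 1)[0].split("+", 1)[0]
    # Assumption: an empty lemma field means "same as the inflected form".
    return inflected, lemma or inflected, code


def build_index(lines):
    """Map lowercased inflected form -> {class code: lemma}."""
    index = defaultdict(dict)
    for line in lines:
        entry = parse_delaf_line(line)
        if entry is not None:
            inflected, lemma, code = entry
            # Keep the first lemma seen for each (form, code) pair.
            index[inflected.lower()].setdefault(code, lemma)
    return index


def lookup_lemma(index, word, ud_tag):
    """Return the lemma for (word, tag), or the word itself if not found."""
    code = UD_TO_DELAF.get(ud_tag.lower())
    return index.get(word.lower(), {}).get(code, word)


# Illustrative entries only.
sample = [
    "corpora,corpus.N:mp",
    "casas,casa.N:fp",
    "casas,casar.V:P2s",
]
index = build_index(sample)
print(lookup_lemma(index, "corpora", "noun"))  # corpus
print(lookup_lemma(index, "casas", "verb"))    # casar
```

Keying the index by both word form and class code is what lets the part-of-speech tag disambiguate forms such as "casas", which can be a noun or a verb. Unknown words and unmapped tags fall back to returning the word unchanged.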