Package for NLP of indigenous languages
Py-Elotl
Python package for Natural Language Processing (NLP), focused on low-resource languages spoken in Mexico.
This is a project of Comunidad Elotl.
Developed by:
- Paul Aguilar @penserbjorne, paul.aguilar.enriquez@hotmail.com
- Robert Pugh @Lguyogiro, robertpugh408@gmail.com
Requires python>=3.X
- Development Status: Pre-Alpha. Read Classifiers
- pip package: elotl
- GitHub repository: ElotlMX/py-elotl
Installation
Using pip
pip install elotl
From source
git clone https://github.com/ElotlMX/py-elotl.git
cd py-elotl
pip install -e .
Usage
Working with corpora
import elotl.corpus
Listing available corpora
Code:
print("Name\t\tDescription")
list_of_corpus = elotl.corpus.list_of_corpus()
for row in list_of_corpus:
    print(row)
Output:
Name Description
['axolotl', 'Is a Spanish-Nahuatl parallel corpus']
['tsunkua', 'Is a Spanish-otomí parallel corpus']
Loading a corpus
If a non-existent corpus is requested, a value of 0 is returned.
axolotl = elotl.corpus.load('axolotlr')  # deliberately misspelled name
if axolotl == 0:
    print("The name entered does not correspond to any corpus")
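Because a failed lookup returns 0 rather than raising, a misspelled corpus name can go unnoticed until the result is iterated. The following is a minimal sketch of a wrapper that turns the sentinel into an exception. `load_or_raise` and `CorpusNotFoundError` are hypothetical names, not part of elotl, and the loader is passed in as an argument so the sketch runs without the package installed.

```python
from typing import Callable, List, Union


class CorpusNotFoundError(LookupError):
    """Hypothetical exception; elotl itself does not define it."""


def load_or_raise(
    name: str,
    loader: Callable[[str], Union[int, List[list]]],
) -> List[list]:
    """Call loader(name) and raise instead of returning the 0 sentinel."""
    rows = loader(name)
    # Check the type first so that only the integer sentinel 0 triggers the error.
    if isinstance(rows, int) and rows == 0:
        raise CorpusNotFoundError(f"No corpus named {name!r}")
    return rows
```

With the package installed, the call would presumably look like `load_or_raise('axolotl', elotl.corpus.load)`.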
If the name of an existing corpus is given, a list of rows is returned.
axolotl = elotl.corpus.load('axolotl')
for row in axolotl:
    print(row)
['Hay que adivinar: un pozo, a la mitad del cerro, te vas a encontrar.', 'See tosaasaanil, see tosaasaanil. Tias iipan see tepeetl, iitlakotian tepeetl, tikoonextis san see aameyalli.', '', 'Adivinanzas nahuas']
Each row of the list has four fields, at indices 0 through 3:
- non_original_language
- original_language
- variant
- document_name
tsunkua = elotl.corpus.load('tsunkua')
for row in tsunkua:
    print(row[0])  # non_original_language (Spanish)
    print(row[1])  # original_language (Otomí)
    print(row[2])  # variant
    print(row[3])  # document_name
Una vez una señora se emborrachó
nándi na ra t'u̱xú bintí
Otomí del Estado de México (ots)
El otomí de toluca, Yolanda Lastra
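Positional indices are easy to mix up, so it can help to give the four fields names. The sketch below wraps each row in a `NamedTuple`. `CorpusRow` and `to_rows` are hypothetical helpers, not part of elotl, and the sample row is copied from the tsunkua output above rather than loaded from the package.

```python
from typing import List, NamedTuple


class CorpusRow(NamedTuple):
    """Hypothetical wrapper; field names follow the four fields listed above."""
    non_original_language: str
    original_language: str
    variant: str
    document_name: str


def to_rows(raw_rows: List[list]) -> List[CorpusRow]:
    """Convert four-element lists into CorpusRow tuples."""
    return [CorpusRow(*raw) for raw in raw_rows]


# Sample row copied from the tsunkua output shown above.
sample = [[
    "Una vez una señora se emborrachó",
    "nándi na ra t'u̱xú bintí",
    "Otomí del Estado de México (ots)",
    "El otomí de toluca, Yolanda Lastra",
]]
rows = to_rows(sample)
print(rows[0].variant)  # Otomí del Estado de México (ots)
```

Since a `NamedTuple` is still a tuple, index access such as `rows[0][1]` keeps working alongside the named fields.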
Normalizing Nahuatl orthographies
Import the orthography module and load the axolotl Nahuatl corpus.
import elotl.corpus
import elotl.nahuatl.orthography
a = elotl.corpus.load("axolotl")
Create a normalizer object, passing the name of the normalization to use as a parameter.
The following normalizations are currently available:
- sep
- Alphabet often seen in use by the Secretaría de Educación Pública (SEP) and the Instituto Nacional para la Educación de los Adultos (INEA). Important characteristics of this alphabet are the use of "u" for the phoneme /w/, "k" for /k/, and "j" for /h/.
- inali
- Alphabet in use by the Instituto Nacional de Lenguas Indígenas. Uses "w" for /w/, "k" for /k/, and "h" for /h/.
- ack
- Alphabet initially used by Richard Andrews and subsequently by a number of other Nahuatl scholars. Named after Andrews, Campbell, and Karttunen. Uses "hu" for /w/, "c" and "qu" for /k/, and "h" for /h/.
If an unsupported normalization is specified, sep will be used by default.
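The differences between the three alphabets described above can be summarized as a small lookup table. The sketch below is only an illustration of those three phonemes and of the fallback to sep. `SPELLINGS` and `spellings_for` are hypothetical names, and this is not how the `Normalizer` class is implemented, which this document does not show.

```python
# Spellings of three phonemes under each convention, as described above.
# Only /w/, /k/ and /h/ are covered; this is not the library's rule set.
SPELLINGS = {
    "sep": {"w": ["u"], "k": ["k"], "h": ["j"]},
    "inali": {"w": ["w"], "k": ["k"], "h": ["h"]},
    "ack": {"w": ["hu"], "k": ["c", "qu"], "h": ["h"]},
}
DEFAULT_SCHEME = "sep"


def spellings_for(phoneme: str, scheme: str) -> list:
    """Return the graphemes for a phoneme, falling back to sep for unknown schemes."""
    table = SPELLINGS.get(scheme, SPELLINGS[DEFAULT_SCHEME])
    return table[phoneme]


print(spellings_for("k", "ack"))      # ['c', 'qu']
print(spellings_for("h", "unknown"))  # ['j']
```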
You can use the normalize method to convert a text to the selected orthography, and the to_phones method to get its phonemes.
>>> n = elotl.nahuatl.orthography.Normalizer("sep")
>>> n.normalize(a[1][1])
'au in ye yujki in on tlenamakak niman ye ik teixpan on motlalia se tlakatl itech mokaua.'
>>> n.to_phones(a[1][1])
'aw in ye yuʔki in on ƛenamakak niman ye ik teiʃpan on moƛalia se ƛakaƛ itet͡ʃ mokawa.'
>>> n = elotl.nahuatl.orthography.Normalizer("inali")
>>> n.normalize(a[1][1])
'aw in ye yuhki in on tlenamakak niman ye ik teixpan on motlalia se tlakatl itech mokawa.'
>>> n.to_phones(a[1][1])
'aw in ye yuʔki in on ƛenamakak niman ye ik teiʃpan on moƛalia se ƛakaƛ itet͡ʃ mokawa.'
>>> n = elotl.nahuatl.orthography.Normalizer("ack")
>>> n.normalize(a[1][1])
'auh in ye yuhqui in on tlenamacac niman ye ic teixpan on motlalia ce tlacatl itech mocahua.'
>>> n.to_phones(a[1][1])
'aw in ye yuʔki in on ƛenamakak niman ye ik teiʃpan on moƛalia se ƛakaƛ itet͡ʃ mokawa.'
Package structure
The following structure is a reference. As the package grows it will be better documented.
elotl/                            Top-level package
    __init__.py                   Initializes the package
    corpora/                      Corpus data
    corpus/                       Subpackage for loading corpora
    nahuatl/                      Nahuatl language subpackage
        orthography.py            Module to normalize Nahuatl orthography and phonemes
    utils/                        Subpackage with useful functions and files
        fst/                      Finite State Transducer functions
            att/                  Module with static .att files
test/                             Unit test scripts
Development
Requirements
- python3
- HFST
- GNU make
- virtualenv
- Python packages
- setuptools
- wheel
Quick build
virtualenv --python=/usr/bin/python3 venv
source venv/bin/activate
make all
Step by step
Build FSTs
Build the FSTs with make.
make fst
Create a virtual environment and activate it.
virtualenv --python=/usr/bin/python3 venv
source venv/bin/activate
Update pip and generate distribution files.
python -m pip install --upgrade pip
python -m pip install --upgrade setuptools wheel
rm -rf build/ dist/
python setup.py clean sdist bdist_wheel
Testing the package locally
python -m pip install -e .
Send to PyPI
python -m pip install twine
twine upload dist/*
License
Mozilla Public License 2.0 (MPL 2.0)