Package for NLP of Indigenous languages

Py-Elotl

Python package for Natural Language Processing (NLP), focused on low-resource languages spoken in Mexico.

This is a project of Comunidad Elotl.

Requires Python >= 3.X

Installation

Using pip

pip install elotl

From source

git clone https://github.com/ElotlMX/py-elotl.git
cd py-elotl
pip install -e .
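
To check that the installation worked, you can list the bundled corpora from the command line; this uses only the list_of_corpus call documented below:

python -c "import elotl.corpus; print(elotl.corpus.list_of_corpus())"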

Usage

Working with corpora

import elotl.corpus

Listing available corpora

Code:

print("Name\t\tDescription")
list_of_corpus = elotl.corpus.list_of_corpus()
for row in list_of_corpus:
    print(row)

Output:

Name		Description
['axolotl', 'Is a Spanish-Nahuatl parallel corpus']
['tsunkua', 'Is a Spanish-otomí parallel corpus']
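
Since each row is a [name, description] pair (as in the output above), you can also unpack it to print an aligned table. A minimal sketch, assuming every row has exactly these two fields:

import elotl.corpus

print("Name\t\tDescription")
for name, description in elotl.corpus.list_of_corpus():
    print(f"{name}\t\t{description}")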

Loading a corpus

If a non-existent corpus is requested, a value of 0 is returned.

axolotl = elotl.corpus.load('axolotlr')
if axolotl == 0:
    print("The name entered does not correspond to any corpus")

If an existing corpus is entered, a list is returned.

axolotl = elotl.corpus.load('axolotl')
for row in axolotl:
    print(row)
['Hay que adivinar: un pozo, a la mitad del cerro, te vas a encontrar.', 'See tosaasaanil, see tosaasaanil. Tias iipan see tepeetl, iitlakotian tepeetl, tikoonextis san see aameyalli.', '', 'Adivinanzas nahuas']

Each element of the list has four fields:

  • non_original_language
  • original_language
  • variant
  • document_name
tsunkua = elotl.corpus.load('tsunkua')
for row in tsunkua:
    print(row[0])  # non-original language (Spanish)
    print(row[1])  # original language (Otomi)
    print(row[2])  # variant
    print(row[3])  # document name
Una vez una señora se emborrachó
nándi na ra t'u̱xú bintí
Otomí del Estado de México (ots)
El otomí de toluca, Yolanda Lastra
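
With those fields, extracting parallel sentence pairs for downstream NLP work is straightforward. A minimal sketch (the empty-string check is a defensive assumption, since a field can be blank, as in the axolotl row shown earlier):

import elotl.corpus

tsunkua = elotl.corpus.load('tsunkua')
pairs = [(row[0], row[1]) for row in tsunkua if row[0] and row[1]]
print(len(pairs), "Spanish-Otomi sentence pairs")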

Normalizing Nahuatl orthographies

Import the orthography module and load the axolotl Nahuatl corpus.

import elotl.corpus
import elotl.nahuatl.orthography
a = elotl.corpus.load("axolotl")

Create a normalizer object, passing the normalization to be used as a parameter.

The following normalizations are currently available:

  • sep-u-j
  • sep-w-h
  • ack

If an unsupported normalization is specified, sep-u-j will be used by default.
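
For example, a minimal check of that fallback ("foo" is just an arbitrary unsupported name):

>>> n = elotl.nahuatl.orthography.Normalizer("foo")
>>> # n now behaves like Normalizer("sep-u-j"), per the note above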

You can use the normalize method to convert a text to the selected orthography, and the to_phones method to get its phonemic representation.

>>> n = elotl.nahuatl.orthography.Normalizer("sep-u-j")
>>> n.normalize(a[1][1])
'au in ye yujki in on tlenamakak niman ye ik teixpan on motlalia se tlakatl itech mokaua.'
>>> n.to_phones(a[1][1])
'aw in ye yuʔki in on ƛenamakak niman ye ik teiʃpan on moƛalia se ƛakaƛ itet͡ʃ mokawa.'
>>> n = elotl.nahuatl.orthography.Normalizer("sep-w-h")
>>> n.normalize(a[1][1])
'aw in ye yuhki in on tlenamakak niman ye ik teixpan on motlalia se tlakatl itech mokawa.'
>>> n.to_phones(a[1][1])
'aw in ye yuʔki in on ƛenamakak niman ye ik teiʃpan on moƛalia se ƛakaƛ itet͡ʃ mokawa.'
>>> n = elotl.nahuatl.orthography.Normalizer("ack")
>>> n.normalize(a[1][1])
'auh in ye yuhqui in on tlenamacac niman ye ic teixpan on motlalia ce tlacatl itech mocahua.'
>>> n.to_phones(a[1][1])
'aw in ye yuʔki in on ƛenamakak niman ye ik teiʃpan on moƛalia se ƛakaƛ itet͡ʃ mokawa.'
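
To prepare a whole corpus, you can normalize every Nahuatl sentence to a single orthography up front. A minimal sketch using only the calls shown above:

import elotl.corpus
import elotl.nahuatl.orthography

normalizer = elotl.nahuatl.orthography.Normalizer("ack")
axolotl = elotl.corpus.load("axolotl")

# Normalize the Nahuatl side (field 1) of every row.
normalized = [normalizer.normalize(row[1]) for row in axolotl]
print(normalized[1])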

Package structure

The following structure is for reference; it will be documented in more detail as the package grows.

elotl/                              Top-level package
    __init__.py                     Initialize the package
    corpora/                        Corpus data files
    corpus/                         Subpackage to load corpora
    nahuatl/                        Nahuatl language subpackage
        orthography.py              Module to normalize Nahuatl orthography and produce phonemes
    utils/                          Subpackage with useful functions and files
        fst/                        Finite-state transducer functions
            att/                    Static .att files
test/                               Unit test scripts

Development

Build FSTs

Building the FSTs requires HFST to be installed; once it is, build them with make.

make all

Create a virtual environment and activate it.

virtualenv --python=/usr/bin/python3 venv
source venv/bin/activate

Update pip and generate distribution files.

python -m pip install --upgrade pip
python -m pip install --upgrade setuptools wheel
python setup.py clean sdist bdist_wheel

Testing the package locally

python -m pip install -e .

Upload to PyPI

python -m pip install twine
twine upload dist/*

License

Mozilla Public License 2.0 (MPL 2.0)
