
IPA tokeniser


A simple IPA tokeniser. Using it is as simple as:

>>> from ipatok import tokenise
>>> tokenise('ˈtiːt͡ʃə')
['t', 'iː', 't͡ʃ', 'ə']
>>> tokenise('ʃːjeq͡χːʼjer')
['ʃː', 'j', 'e', 'q͡χːʼ', 'j', 'e', 'r']

api

tokenise(string, strict=True) takes an IPA string and returns a list of tokens. A token usually consists of a single letter together with its accompanying diacritics. If two letters are connected by a tie bar, they are also considered a single token. Except for length markers, suprasegmentals are excluded from the output. Whitespace is also ignored.
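The grouping rules described above can be illustrated with a small standalone sketch. This is not ipatok's implementation: the function name `sketch_tokenise`, the character sets, and the use of Unicode general categories to spot diacritics are all assumptions made for illustration, and the sets are far from complete (tone letters, for instance, are not handled).

```python
import unicodedata

# Assumed, incomplete character sets for this sketch (not taken from ipatok).
TIE_BARS = {"\u0361", "\u035c"}  # combining tie bar above / below
SKIPPED = {"\u02c8", "\u02cc", ".", "|", "\u2016", "\u203f"}  # stress, breaks, linking


def sketch_tokenise(string):
    """Group each letter with its diacritics; join letters linked by a tie bar."""
    tokens = []
    join_next = False  # True right after a tie bar: the next letter joins the token
    for char in string:
        if char.isspace() or char in SKIPPED:
            continue  # whitespace and (some) suprasegmentals are dropped
        # Combining marks (Mn) and modifier letters (Lm, which includes the
        # length mark) are treated as belonging to the preceding letter.
        is_modifier = unicodedata.category(char) in ("Mn", "Lm")
        if not tokens and (is_modifier or char in TIE_BARS):
            raise ValueError("diacritic or tie bar without a preceding letter")
        if char in TIE_BARS:
            tokens[-1] += char
            join_next = True
        elif is_modifier:
            tokens[-1] += char
        elif join_next:
            tokens[-1] += char  # second letter of a tie-bar pair
            join_next = False
        else:
            tokens.append(char)  # a plain letter starts a new token
    return tokens
```

For the two example strings at the top of this page, the sketch produces the same groupings as shown there. Unlike the real function, it does no validation against the IPA chart.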

By default the function raises a ValueError if the string does not conform to the IPA spec (the 2015 revision). Invoking it with strict=False makes it accept some common replacements such as g (in place of ɡ) and ɫ.

tokenize(string, strict=True) is an alias for tokenise.
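One way to combine the two modes is to try strict tokenisation first and fall back to the lenient mode only when a ValueError is raised. The helper below is a hypothetical wrapper, not part of ipatok; it takes the tokenising function as an argument and relies only on the `tokenise(string, strict=True)` signature documented above.

```python
def tokenise_with_fallback(tokenise, string):
    """Return (tokens, was_strict) for the given tokenising function.

    `tokenise` is expected to follow the documented signature
    tokenise(string, strict=True) and to raise ValueError on
    non-conforming input when strict is True.
    """
    try:
        return tokenise(string, strict=True), True
    except ValueError:
        # The string is not strictly IPA; retry accepting common replacements.
        return tokenise(string, strict=False), False
```

With the package installed, this could be called as `tokenise_with_fallback(tokenise, some_string)` after `from ipatok import tokenise`. The second element of the result tells you whether the input passed the strict check, which can be useful for flagging transcriptions that need cleaning up.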

installation

This is a standard Python 3 package without dependencies. It is available from the Cheese Shop (PyPI), so you can install it with pip:

pip install ipatok

Alternatively, you can clone this repo (safe to delete afterwards) and run:

python setup.py test
python setup.py install

Of course, all of this can be done inside a virtualenv/venv as well.

similar projects

licence

MIT. Do as you please and praise the snake gods.

