Skip to main content

Converts English tokens into the equivalent Sinhala representation using IPA (International Phonetic Alphabet)

Project description

SEETM (Sinhala-English Equivalent Token Mapper) allows creating equivalent token maps and replace them with a base token to avoid OOV tokens and generate a single feature for all equivalent tokens in a Sinhala-English code-switching dataset in rasa-based conversational AIs.

Features

  • Allows mapping multiple equivalent tokens into a base token
  • Fully supports rasa 2.8.x projects
  • Provides an easy-to-use CLI
  • Provides an efficient server-based GUI
  • Provides a fully-functional custom whitespace tokenizer
  • Fully-supports Sinhala in the GUI

What's Cooking?

  • Mapping suggestions in the SEETM server GUI
  • Automatically generated mappings

Limitations and Known Issues

  • Should manually add the SEETM tokenizer to the rasa pipeline or else the token maps are not taking any effect
  • IPA-based suggestions could contain slight changes based on th IPA mapping origin. (SEETM uses CMU)

Resources and References

📒 Docs: https://seetm.github.io
📦 PyPi: https://pypi.org/project/seetm/1.1.1/
🪵 Full Changelog: https://github.com/SEETM-NLP/seetm/blob/main/CHANGELOG.md

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

seetm-1.1.1.tar.gz (964.8 kB view hashes)

Uploaded Source

Built Distribution

seetm-1.1.1-py3-none-any.whl (990.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page