Polish FST Inverse Text Normalization

pl_itn

Inverse Text Normalization (ITN) is the NLP task of converting the spoken form of a phrase into its written form, for example:

one two three -> 1 2 3
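
To make the task concrete, here is a toy sketch that maps spoken English digit words to digits with a plain lookup table. It only illustrates the input/output relationship: SPOKEN_DIGITS and toy_itn are hypothetical names, and pl_itn itself is based on finite-state transducers, not on a dictionary like this one.

```python
# Toy illustration of inverse text normalization.
# This is NOT how pl_itn works internally; it only shows the kind of mapping involved.
SPOKEN_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def toy_itn(phrase: str) -> str:
    """Replace each known spoken digit with its written form; keep other tokens as-is."""
    return " ".join(SPOKEN_DIGITS.get(token.lower(), token) for token in phrase.split())


print(toy_itn("one two three"))  # prints: 1 2 3
```

A lookup like this cannot handle context-dependent phrases such as times or dates, which is one reason to use grammar-based approaches.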

pl_itn is an open-source Polish ITN Python library and REST API for practical applications.

This project is an implementation of NeMo Inverse Text Normalization for Polish.

Table of contents

Prerequisites
Setup
Docker
Usage
gRPC service
Building custom grammars
Documentation
Contributing
License
References

Prerequisites

For pynini

  • A standards-compliant C++17 compiler (GCC >= 7 or Clang >= 700)
  • A compatible recent version of OpenFst built with the grm extensions (see deps/install_openfst.md)

Setup

Make sure to install the prerequisites first, especially OpenFst.

Install from PyPI

pip install pl_itn

Build from source

pip install .

Editable install for development

pip install -e ".[dev]"

Docker

To build a Docker image containing the pl_itn library, use the pl_itn_lib.dockerfile file.
To build a Docker image with the gRPC service, use the grpc_service.dockerfile file.

docker build -t <IMAGE:TAG> -f <DOCKERFILE> .

Usage

Console app

usage: pl_itn [-h] (-t TEXT | -i) [--tagger TAGGER] [--verbalizer VERBALIZER] [--config CONFIG]
              [--log-level {debug,info}]

Inverse Text Normalization based on Finite State Transducers

options:
  -h, --help            show this help message and exit
  -t TEXT, --text TEXT  Input text
  -i, --interactive     If used, demo will process phrases from stdin interactively.
  --tagger TAGGER
  --verbalizer VERBALIZER
  --config CONFIG       Optionally provide yaml config with tagger and verbalizer paths.
  --log-level {debug,info}
                        return a step back value.

pl_itn -t "jest za pięć druga"
jest 01:55

pl_itn -t "drugi listopada dwa tysiące osiemnastego roku"
2 listopada 2018 roku
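
The console app can also be driven from a script. The sketch below builds the argument list from the options shown in the help text above (-t, --tagger, --verbalizer) and runs it with subprocess. The helper names build_pl_itn_command and run_pl_itn are hypothetical, and the sketch assumes that pl_itn is on PATH and that the normalized text is the only thing written to stdout.

```python
import subprocess
from typing import List, Optional


def build_pl_itn_command(
    text: str,
    tagger: Optional[str] = None,
    verbalizer: Optional[str] = None,
) -> List[str]:
    """Build the argv list for the pl_itn console app (hypothetical helper)."""
    command = ["pl_itn", "-t", text]
    if tagger is not None:
        command += ["--tagger", tagger]
    if verbalizer is not None:
        command += ["--verbalizer", verbalizer]
    return command


def run_pl_itn(
    text: str,
    tagger: Optional[str] = None,
    verbalizer: Optional[str] = None,
) -> str:
    """Run the console app and return its stdout without surrounding whitespace.

    Assumes `pl_itn` is installed and available on PATH.
    """
    result = subprocess.run(
        build_pl_itn_command(text, tagger, verbalizer),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting issues with spaces and Polish characters in the input text.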

Python

>>> from pl_itn import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize("za pięć dwunasta")
'11:55'
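
When normalizing many phrases, it may be worth creating the Normalizer once and reusing it. The sketch below is a small helper around any normalize-style callable; normalize_many is a hypothetical name, and the assumption that constructing a Normalizer is comparatively expensive (it presumably loads the grammars) is not confirmed by this README.

```python
from typing import Callable, Dict, Iterable


def normalize_many(normalize: Callable[[str], str], phrases: Iterable[str]) -> Dict[str, str]:
    """Apply `normalize` once per distinct phrase, keeping first-seen order."""
    results: Dict[str, str] = {}
    for phrase in phrases:
        if phrase not in results:
            results[phrase] = normalize(phrase)
    return results


# Usage with pl_itn (requires the package and its prerequisites to be installed):
#   from pl_itn import Normalizer
#   normalizer = Normalizer()
#   normalize_many(normalizer.normalize, ["za pięć dwunasta", "jest za pięć druga"])
```

Taking the callable as a parameter keeps the helper independent of pl_itn, so it can be exercised with a stub in tests.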

Docker

An existing Docker image containing the pl_itn library is required. For the build command, refer to the Docker section.

docker run --rm -it <IMAGE:TAG> --help

gRPC service

The gRPC service methods are described in the grpc_service/pl_itn_api/api.proto file. Running the service in a Docker container is the suggested approach; for the build command, refer to the Docker section. Inside the container, the service listens on port 10010.

Example of building the image and starting the service.

docker build -t pl_itn_service:test -f grpc_service.dockerfile .
docker run -p 10010:10010 pl_itn_service:test
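
After starting the container, a client script may want to wait until the published port accepts connections. The sketch below uses only the Python standard library; wait_for_port is a hypothetical helper, and the check is TCP-level only, so it does not confirm that the gRPC methods from api.proto are actually ready to serve requests.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (assumes the container was started with -p 10010:10010 as above):
#   wait_for_port("localhost", 10010)
```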

Building custom grammars

Custom grammars can be built using the build_grammar/build_grammar.py script.

There are three demo grammars available:

  • undeclined cardinal numbers (e.g. "jeden", "dwa", "trzy")
  • declined cardinal numbers (e.g. "jednego", "dwóch", "trzech")
  • ordinal numbers (e.g. "pierwszy", "druga", "trzecie")

Normalization types can be included and excluded from the grammar through the config file, which is set by default to build_grammar/grammar_config.yaml.

# cardinals_basic_forms: True
# cardinals_declined: True
# ordinals: True

$ python3 build_grammar/build_grammar.py --grammars-dir all

$ pl_itn \
  --tagger all/tagger.fst \
  --verbalizer all/verbalizer.fst \
  -t "Jeden trzech piąta"

1 3 5

# cardinals_basic_forms: True
# cardinals_declined: False
# ordinals: True

$ python3 build_grammar/build_grammar.py --grammars-dir cardinals_basic_ordinals

$ pl_itn \
  --tagger cardinals_basic_ordinals/tagger.fst \
  --verbalizer cardinals_basic_ordinals/verbalizer.fst \
  -t "Jeden trzech piąta"

1 trzech 5

# cardinals_basic_forms: True
# cardinals_declined: False
# ordinals: False

$ python3 build_grammar/build_grammar.py --grammars-dir only_basic_cardinals

$ pl_itn \
  --tagger only_basic_cardinals/tagger.fst \
  --verbalizer only_basic_cardinals/verbalizer.fst \
  -t "Jeden trzech piąta"

1 trzech piąta
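
The three flag combinations above can be generated rather than edited by hand. The sketch below renders the flags as text in the form shown in the comments; render_grammar_config is a hypothetical helper, and it assumes that grammar_config.yaml is a flat mapping of exactly these three keys, which should be checked against the file shipped in build_grammar/.

```python
def render_grammar_config(
    cardinals_basic_forms: bool = True,
    cardinals_declined: bool = True,
    ordinals: bool = True,
) -> str:
    """Render the three grammar flags as 'key: value' lines (assumed flat YAML layout)."""
    flags = {
        "cardinals_basic_forms": cardinals_basic_forms,
        "cardinals_declined": cardinals_declined,
        "ordinals": ordinals,
    }
    return "".join(f"{name}: {value}\n" for name, value in flags.items())


# Reproduces the second configuration from the examples above.
print(render_grammar_config(cardinals_declined=False), end="")
# prints:
# cardinals_basic_forms: True
# cardinals_declined: False
# ordinals: True
```

The rendered text can then be written to the config file before running build_grammar/build_grammar.py.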

See Documentation for more details.

Documentation

Contributing

License

References

  • K. Gorman. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In Proc. ACL Workshop on Statistical NLP and Weighted Automata, 75-80.
  • Y. Zhang, E. Bakhturina, K. Gorman, and B. Ginsburg. 2021. NeMo Inverse Text Normalization: From Development To Production.
