Polish FST Inverse Text Normalization
pl_itn
Inverse Text Normalization (ITN) is the NLP task of converting the spoken form of a phrase into its written form, for example:
one two three -> 1 2 3
pl_itn is an open-source Polish ITN Python library and REST API for practical applications.
This project is an implementation of NeMo Inverse Text Normalization for Polish.
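To make the spoken-to-written mapping concrete, here is a toy sketch that reproduces the "one two three" example with a plain lookup table. It is only an illustration of the task: pl_itn itself relies on finite-state transducers, and the names `DIGITS` and `toy_itn` are hypothetical and not part of the library.

```python
# Toy illustration of ITN for single English digit words.
# This is NOT how pl_itn works internally; it only shows the input/output shape.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def toy_itn(phrase: str) -> str:
    """Replace each known digit word with its written form; keep other words."""
    return " ".join(DIGITS.get(word.lower(), word) for word in phrase.split())


print(toy_itn("one two three"))
```

A lookup table like this cannot handle context (dates, times, declined forms), which is the kind of problem the grammar-based approach is meant to address.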
Table of contents
Prerequisites
Setup
Docker
Usage
gRPC service
Building custom grammars
Documentation
Contributing
License
References
Prerequisites
For pynini
- A standards-compliant C++17 compiler (GCC >= 7 or Clang >= 700)
- A compatible recent version of OpenFst built with the grm extensions (see deps/install_openfst.md)
Setup
Install the prerequisites first, especially OpenFst.
Install from PyPI
pip install pl_itn
Build from source
pip install .
Editable install for development
pip install -e .[dev]
Docker
To build a Docker image containing the pl_itn library, use the pl_itn_lib.dockerfile file.
To build a Docker image with the gRPC service, use the grpc_service.dockerfile file.
docker build -t <IMAGE:TAG> -f <DOCKERFILE> .
Usage
Console app
usage: pl_itn [-h] (-t TEXT | -i) [--tagger TAGGER] [--verbalizer VERBALIZER] [--config CONFIG]
[--log-level {debug,info}]
Inverse Text Normalization based on Finite State Transducers
options:
-h, --help show this help message and exit
-t TEXT, --text TEXT Input text
-i, --interactive If used, demo will process phrases from stdin interactively.
--tagger TAGGER
--verbalizer VERBALIZER
--config CONFIG Optionally provide yaml config with tagger and verbalizer paths.
--log-level {debug,info}
return a step back value.
pl_itn -t "jest za pięć druga"
jest 01:55
pl_itn -t "drugi listopada dwa tysiące osiemnastego roku"
2 listopada 2018 roku
Python
>>> from pl_itn import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize("za pięć dwunasta")
'11:55'
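When several phrases need normalizing, the call shown above can be wrapped in a small helper. The sketch below is illustrative: `normalize_all` is a hypothetical name, not part of pl_itn, and it assumes only the `Normalizer.normalize(text)` call from the example above. It accepts any callable, so it can be tried without the library installed.

```python
from typing import Callable, Iterable, List


def normalize_all(normalize: Callable[[str], str], phrases: Iterable[str]) -> List[str]:
    """Apply `normalize` to every non-blank phrase, stripping surrounding whitespace."""
    results = []
    for phrase in phrases:
        cleaned = phrase.strip()
        if cleaned:  # skip empty or whitespace-only lines
            results.append(normalize(cleaned))
    return results


# With pl_itn installed, the helper would be used roughly like this:
#   from pl_itn import Normalizer
#   normalizer = Normalizer()
#   normalize_all(normalizer.normalize, ["za pięć dwunasta", "jest za pięć druga"])
```

Creating the `Normalizer` once and reusing it for all phrases avoids repeating whatever setup the constructor performs.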
Docker
An existing Docker image containing the pl_itn library is required. For the build command, refer to the Docker section.
docker run --rm -it <IMAGE:TAG> --help
gRPC Service
gRPC service methods are described in the grpc_service/pl_itn_api/api.proto file. Running the service in a Docker container is the suggested approach. For the build command, refer to the Docker section.
Inside the container, the service listens on port 10010.
Example of building the image and starting the service:
docker build -t pl_itn_service:test -f grpc_service.dockerfile .
docker run -p 10010:10010 pl_itn_service:test
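After starting the container, it can be useful to wait until the published port accepts connections before sending requests. The sketch below uses only the Python standard library; `wait_for_port` is a hypothetical helper, not part of pl_itn, and it checks only that a TCP connection can be opened, not that the gRPC service itself is ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if the port is closed or unreachable
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the port mapping from the command above, a call such as `wait_for_port("localhost", 10010)` would be the natural check. The request and response messages themselves are defined in api.proto and are not reproduced here.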
Building custom grammars
Custom grammars can be built using the build_grammar/build_grammar.py script.
There are three demo grammars available:
- undeclined cardinal numbers in their basic forms (e.g. "jeden", "dwa", "trzy")
- declined cardinal numbers (e.g. "jednego", "dwóch", "trzech")
- ordinal numbers (e.g. "pierwszy", "druga", "trzecie")
Normalization types can be included in or excluded from the grammar through the config file, which defaults to build_grammar/grammar_config.yaml.
# cardinals_basic_forms: True
# cardinals_declined: True
# ordinals: True
$ python3 build_grammar/build_grammar.py --grammars-dir all
$ pl_itn \
--tagger all/tagger.fst \
--verbalizer all/verbalizer.fst \
-t "Jeden trzech piąta"
1 3 5
# cardinals_basic_forms: True
# cardinals_declined: False
# ordinals: True
$ python3 build_grammar/build_grammar.py --grammars-dir cardinals_basic_ordinals
$ pl_itn \
--tagger cardinals_basic_ordinals/tagger.fst \
--verbalizer cardinals_basic_ordinals/verbalizer.fst \
-t "Jeden trzech piąta"
1 trzech 5
# cardinals_basic_forms: True
# cardinals_declined: False
# ordinals: False
$ python3 build_grammar/build_grammar.py --grammars-dir only_basic_cardinals
$ pl_itn \
--tagger only_basic_cardinals/tagger.fst \
--verbalizer only_basic_cardinals/verbalizer.fst \
-t "Jeden trzech piąta"
1 trzech piąta
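The build-then-run pattern in the three examples above can also be scripted. The sketch below only assembles the command lines, using the flags shown in this section (`--grammars-dir`, `--tagger`, `--verbalizer`, `-t`); the function names are hypothetical, and using `sys.executable` in place of `python3` is an assumption made for portability.

```python
import sys
from pathlib import Path
from typing import List


def build_grammar_cmd(grammars_dir: str,
                      script: str = "build_grammar/build_grammar.py") -> List[str]:
    """Command that builds tagger.fst and verbalizer.fst into `grammars_dir`."""
    return [sys.executable, script, "--grammars-dir", grammars_dir]


def pl_itn_cmd(grammars_dir: str, text: str) -> List[str]:
    """Command that runs the pl_itn console app with grammars from `grammars_dir`."""
    directory = Path(grammars_dir)
    return [
        "pl_itn",
        "--tagger", (directory / "tagger.fst").as_posix(),
        "--verbalizer", (directory / "verbalizer.fst").as_posix(),
        "-t", text,
    ]


print(pl_itn_cmd("all", "Jeden trzech piąta"))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root. The grammar config must still be edited between builds to enable or disable the individual normalization types.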
See Documentation for more details.
Documentation
Contributing
License
References
- K. Gorman. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In Proc. ACL Workshop on Statistical NLP and Weighted Automata, 75-80.
- Y. Zhang, E. Bakhturina, K. Gorman, and B. Ginsburg. 2021. NeMo Inverse Text Normalization: From Development To Production.