Skip to main content

A frontend to punctuation prediction for Icelandic text

Project description

README

Punctuation Prediction

A python package that punctuates Icelandic text. The input data is unpunctuated text and punctuated text is returned. The user can choose between two punctuation models, a bidirectional RNN (Punctuator 2) in Tensorflow 2, and a pretrained ELECTRA Transformer, fine-tuned for punctuation prediction, based on a Hugging Face NER recipe. The pretrained ELECTRA model was trained by Jón Friðrik Daðason on data from the Icelandic Gigaword corpus. Both punctuation models are trained/fine-tuned using Gigaword corpus data.

Table of Contents

Installation

To install, first create a conda environment: conda create --name {venv}

Then activate it: conda activate {venv}

Install the requirement(s): conda install tensorflow==2.1.0 pip install tensorflow==2.1.0could also work, if you run into problems.

The transformer models are created with PyTorch, which is needed when reading the models: conda install pytorch

To use the ELECTRA model one needs access to Hugging Face functions, installed with:

pip install transformers

Then finally, run:

pip install punctuator-isl

Running

The program can be run either from a command line or from inside a python script.

To run it on a command line:

$ punctuate input.txt output.txt

The default model is the biRNN model, you can also specify another model, e.g.:

$ punctuate input.txt output.txt --electra

The input uses stdin and the output stdout. Both files are encoded in UTF-8.

Empty lines in the input are treated as sentence boundaries.

Which of the two models to be used can be specified on the command line. The default is biRNN.

Model Description
biRNN The Punctuator 2 model in Tensorflow.
ELECTRA The ELECTRA Transformer (HuggingFace)

For a short help message of how to use the package, type punctuate -h or punctuate --help.

The input text should be like directly from automatic speech recognition, without capitalizations or punctuations.

The first time the program is run the punctuation models are downloaded into punctuation_models, in the user home directory for Linux and Mac users, and in APPDATA for Windows users. The user can put a new model path in path_config.json inside the punctuator directory is another location is desired.

Example

In this case, the default model is used. An input string is specified and the punctuate function returns a punctuated string, words that appear after an end-of-sentence punctuation mark are capitalized.

$ echo "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu" | punctuate
$ Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.

Python module

The punctuate function

from punctuator-is import punctuate

# A string to be punctuated
s = "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu"

punctuated = punctuate(s, model='biRNN')

print(punctuated)

The program should output:

Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.

License

This code is licensed under the MIT license.

Authors/Credit

Reykjavik University

Main authors: Helga Svala Sigurðardóttir - helgas@ru.is, Inga Rún Helgadóttir - ingarun@ru.is

Acknowledgements

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

punctuator-lvl9-inga-1.1.26.tar.gz (12.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

punctuator_lvl9_inga-1.1.26-py3-none-any.whl (12.1 kB view details)

Uploaded Python 3

File details

Details for the file punctuator-lvl9-inga-1.1.26.tar.gz.

File metadata

  • Download URL: punctuator-lvl9-inga-1.1.26.tar.gz
  • Upload date:
  • Size: 12.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7

File hashes

Hashes for punctuator-lvl9-inga-1.1.26.tar.gz
Algorithm Hash digest
SHA256 2897f2d88e69554fb9eb472c8ade068a1d35e71bace253262bafdcec6c0ac301
MD5 138eb8a752848da3aa4056c12419120c
BLAKE2b-256 7e28b2edb323abae766629e66e409f22d2b046970870c759ff9f7f91f7e63fad

See more details on using hashes here.

File details

Details for the file punctuator_lvl9_inga-1.1.26-py3-none-any.whl.

File metadata

  • Download URL: punctuator_lvl9_inga-1.1.26-py3-none-any.whl
  • Upload date:
  • Size: 12.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7

File hashes

Hashes for punctuator_lvl9_inga-1.1.26-py3-none-any.whl
Algorithm Hash digest
SHA256 fcef89c83be39068a9e814f1222e11280d36435473fc5769167a484e53f4d3bf
MD5 91153ab7a215b28ff18c4a2cbe671131
BLAKE2b-256 b8ad1152cf6546620e9ac62339224612247114ee5ad6317178a9b7ba1981ebf0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page