Skip to main content

A frontend to punctuation prediction for Icelandic text

Project description

README

Punctuation Prediction

A python package that punctuates Icelandic text. The input data is unpunctuated text and punctuated text is returned. The user can choose between two punctuation models, a bidirectional RNN (Punctuator 2) in Tensorflow 2, and a pretrained ELECTRA Transformer, fine-tuned for punctuation prediction, based on a Hugging Face NER recipe. The pretrained ELECTRA model was trained by Jón Friðrik Daðason on data from the Icelandic Gigaword corpus. Both punctuation models are trained/fine-tuned using Gigaword corpus data.

Table of Contents

Installation

To install, first create a conda environment: conda create --name {venv}

Then activate it: conda activate {venv}

Install the requirement(s): conda install tensorflow==2.1.0 pip install tensorflow==2.1.0could also work, if you run into problems.

The transformer models are created with PyTorch, which is needed when reading the models: conda install pytorch

To use the ELECTRA model one needs access to Hugging Face functions, installed with:

pip install transformers

Then finally, run:

pip install punctuator-isl

Running

The program can be run either from a command line or from inside a python script.

To run it on a command line:

$ punctuate input.txt output.txt

The default model is the biRNN model, you can also specify another model, e.g.:

$ punctuate input.txt output.txt --electra

The input uses stdin and the output stdout. Both files are encoded in UTF-8.

Empty lines in the input are treated as sentence boundaries.

Which of the two models to be used can be specified on the command line. The default is biRNN.

Model Description
biRNN The Punctuator 2 model in Tensorflow.
ELECTRA The ELECTRA Transformer (HuggingFace)

For a short help message of how to use the package, type punctuate -h or punctuate --help.

The input text should be like directly from automatic speech recognition, without capitalizations or punctuations.

The first time the program is run the punctuation models are downloaded into punctuation_models, in the user home directory for Linux and Mac users, and in APPDATA for Windows users. The user can put a new model path in path_config.json inside the punctuator directory is another location is desired.

Example

In this case, the default model is used. An input string is specified and the punctuate function returns a punctuated string, words that appear after an end-of-sentence punctuation mark are capitalized.

$ echo "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu" | punctuate
$ Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.

Python module

The punctuate function

from punctuator-is import punctuate

# A string to be punctuated
s = "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu"

punctuated = punctuate(s, model='biRNN')

print(punctuated)

The program should output:

Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.

License

This code is licensed under the MIT license.

Authors/Credit

Reykjavik University

Main authors: Helga Svala Sigurðardóttir - helgas@ru.is, Inga Rún Helgadóttir - ingarun@ru.is

Acknowledgements

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

punctuator-lvl9-inga-1.1.29.tar.gz (13.7 kB view hashes)

Uploaded Source

Built Distribution

punctuator_lvl9_inga-1.1.29-py3-none-any.whl (13.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page