Skip to main content

A frontend to punctuation prediction for Icelandic text

Project description

README

Punctuation Prediction

A python package that punctuates Icelandic text. The input data is unpunctuated text and punctuated text is returned. The user can choose between two punctuation models, a bidirectional RNN (Punctuator 2) in Tensorflow 2, and a pretrained ELECTRA Transformer, fine-tuned for punctuation prediction, based on a Hugging Face NER recipe. The pretrained ELECTRA model was trained by Jón Friðrik Daðason on data from the Icelandic Gigaword corpus. Both punctuation models are trained/fine-tuned using Gigaword corpus data.

Table of Contents

Installation

To install, first create a conda environment: conda create --name {venv}

Then activate it: conda activate {venv}

Install the requirement(s): conda install tensorflow==2.1.0 pip install tensorflow==2.1.0could also work, if you run into problems.

The transformer models are created with PyTorch, which is needed when reading the models: conda install pytorch

To use the ELECTRA model one needs access to Hugging Face functions, installed with:

pip install transformers

Then finally, run:

pip install punctuator-isl

Running

The program can be run either from a command line or from inside a python script.

To run it on a command line:

$ punctuate input.txt output.txt

The default model is the biRNN model, you can also specify another model, e.g.:

$ punctuate input.txt output.txt --electra

The input uses stdin and the output stdout. Both files are encoded in UTF-8.

Empty lines in the input are treated as sentence boundaries.

Which of the two models to be used can be specified on the command line. The default is biRNN.

Model Description
biRNN The Punctuator 2 model in Tensorflow.
ELECTRA The ELECTRA Transformer (HuggingFace)

For a short help message of how to use the package, type punctuate -h or punctuate --help.

The input text should be like directly from automatic speech recognition, without capitalizations or punctuations.

The first time the program is run the punctuation models are downloaded into punctuation_models, in the user home directory for Linux and Mac users, and in APPDATA for Windows users. The user can put a new model path in path_config.json inside the punctuator directory is another location is desired.

Example

In this case, the default model is used. An input string is specified and the punctuate function returns a punctuated string, words that appear after an end-of-sentence punctuation mark are capitalized.

$ echo "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu" | punctuate
$ Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.

Python module

The punctuate function

from punctuator-is import punctuate

# A string to be punctuated
s = "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu"

punctuated = punctuate(s, model='biRNN')

print(punctuated)

The program should output:

Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.

License

This code is licensed under the MIT license.

Authors/Credit

Reykjavik University

Main authors: Helga Svala Sigurðardóttir - helgas@ru.is, Inga Rún Helgadóttir - ingarun@ru.is

Acknowledgements

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

punctuator-lvl9-inga-1.1.29.tar.gz (13.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

punctuator_lvl9_inga-1.1.29-py3-none-any.whl (13.2 kB view details)

Uploaded Python 3

File details

Details for the file punctuator-lvl9-inga-1.1.29.tar.gz.

File metadata

  • Download URL: punctuator-lvl9-inga-1.1.29.tar.gz
  • Upload date:
  • Size: 13.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7

File hashes

Hashes for punctuator-lvl9-inga-1.1.29.tar.gz
Algorithm Hash digest
SHA256 4b1c6737b4514e20f6e646eb64ca10f86c0586615bd5aa7eff7d79c4b28ae7ea
MD5 34ba265fa8a5ca3f4a4e1aebf5c35a71
BLAKE2b-256 4d36cfe9c6a47c3aa9906d64da5d45918e66c57353af410a886b7ebf6a6ab108

See more details on using hashes here.

File details

Details for the file punctuator_lvl9_inga-1.1.29-py3-none-any.whl.

File metadata

  • Download URL: punctuator_lvl9_inga-1.1.29-py3-none-any.whl
  • Upload date:
  • Size: 13.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.7

File hashes

Hashes for punctuator_lvl9_inga-1.1.29-py3-none-any.whl
Algorithm Hash digest
SHA256 f5d531a875206d356d13d6b86aa861d3850ba966c70a892122a359014d87d8ea
MD5 83d7cf5c2ee420a3c8c940f6dd2ce876
BLAKE2b-256 7abe4b354dd44da3f162017b8df393495785da2fc146ddfb66e9f4cefda27afb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page