A frontend to punctuation prediction for Icelandic text

Punctuation Prediction

A Python package that punctuates Icelandic text. It takes unpunctuated text as input and returns punctuated text. The user can choose between two punctuation models: a bidirectional RNN (Punctuator 2) implemented in TensorFlow 2, and a pretrained ELECTRA Transformer fine-tuned for punctuation prediction, based on a Hugging Face NER recipe. The pretrained ELECTRA model was trained by Jón Friðrik Daðason on data from the Icelandic Gigaword corpus. Both punctuation models are trained or fine-tuned on Gigaword corpus data.
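The NER-style formulation can be pictured as token classification: each input word receives a label naming the punctuation mark, if any, that follows it. The following is a minimal sketch of turning such labels back into text. The label names (`O`, `COMMA`, `PERIOD`, `QUESTION`) and the function `apply_labels` are hypothetical illustrations and are not taken from the package.

```python
# Hypothetical label set; the labels used by the actual models may differ.
LABEL_TO_MARK = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def apply_labels(words, labels):
    """Append the punctuation mark named by each label to its word."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    return " ".join(
        word + LABEL_TO_MARK[label] for word, label in zip(words, labels)
    )


words = ["næsti", "fundur", "er", "í", "næstu", "viku"]
labels = ["O", "O", "O", "O", "O", "PERIOD"]
print(apply_labels(words, labels))
```

In this picture the model only has to predict one label per word, which is why a recipe written for named entity recognition can be reused for punctuation.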

Installation

To install, first create a conda environment: conda create --name {venv}

Then activate it: conda activate {venv}

Install the requirement: conda install tensorflow==2.1.0

If you run into problems, pip install tensorflow==2.1.0 could also work.

The transformer models are created with PyTorch, which is needed when reading the models: conda install pytorch

To use the ELECTRA model, the Hugging Face transformers library is needed. Install it with:

pip install transformers

Then finally, run:

pip install punctuator-isl

Running

The program can be run either from the command line or from inside a Python script.

To run it on a command line:

$ punctuate input.txt output.txt

The default model is the biRNN model. You can also specify another model, e.g.:

$ punctuate input.txt output.txt --electra

By default, the input is read from stdin and the output is written to stdout. Both are encoded in UTF-8.

Empty lines in the input are treated as sentence boundaries.
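As an illustration of what treating empty lines as boundaries can look like, here is a minimal sketch that groups consecutive non-empty lines into segments. The function name `split_on_empty_lines` is hypothetical, and joining the lines of a segment with a single space is an assumption; the README does not show how the package handles this internally.

```python
def split_on_empty_lines(text):
    """Group consecutive non-empty lines into segments split at empty lines."""
    segments, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # An empty line closes the segment collected so far.
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments


sample = "næsti fundur er\ní næstu viku\n\nverður áfram fylgst með"
for segment in split_on_empty_lines(sample):
    print(segment)
```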

The model to use can be specified on the command line. The default is biRNN.

Model     Description
biRNN     The Punctuator 2 model in TensorFlow.
ELECTRA   The ELECTRA Transformer (Hugging Face).

For a short help message on how to use the package, type punctuate -h or punctuate --help.

The input text should resemble raw output from automatic speech recognition, without capitalization or punctuation.

The first time the program is run, the punctuation models are downloaded into punctuation_models, located in the user's home directory on Linux and Mac, and in APPDATA on Windows. The user can put a new model path in path_config.json inside the punctuator directory if another location is desired.
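The lookup order described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `resolve_model_dir` and the JSON key `model_path` are hypothetical, and the package's own logic and config layout may differ.

```python
import json
import os
import platform
from pathlib import Path


def resolve_model_dir(config_file=None, system=None, environ=None):
    """Return the directory where the models would be stored.

    Hypothetical helper; the "model_path" config key is an assumption.
    """
    system = system or platform.system()
    environ = os.environ if environ is None else environ

    # A path given in the config file takes precedence over the defaults.
    if config_file is not None and Path(config_file).is_file():
        with open(config_file, encoding="utf-8") as f:
            custom = json.load(f).get("model_path")
        if custom:
            return Path(custom)

    # Windows: APPDATA; Linux and macOS: the user's home directory.
    if system == "Windows":
        base = Path(environ["APPDATA"])
    else:
        base = Path.home()
    return base / "punctuation_models"


print(resolve_model_dir())
```

Passing `system` and `environ` explicitly makes the function easy to check for a platform other than the one it runs on.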

Example

In this case, the default model is used. An input string is specified and the punctuate function returns a punctuated string. Words that appear after an end-of-sentence punctuation mark are capitalized.

$ echo "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu" | punctuate
Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.
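The capitalization rule seen in this output, where only the first word of each sentence is capitalized and proper nouns stay lowercase, can be sketched as below. The function `capitalize_sentences` and the set of sentence-ending marks are assumptions made for illustration, not the package's implementation.

```python
# Assumed set of end-of-sentence marks.
SENTENCE_END = (".", "?", "!")


def capitalize_sentences(text):
    """Capitalize the first word of the text and each word after ., ? or !."""
    result = []
    capitalize_next = True
    for word in text.split():
        if capitalize_next:
            # Uppercase only the first character; leave the rest untouched.
            word = word[:1].upper() + word[1:]
        capitalize_next = word.endswith(SENTENCE_END)
        result.append(word)
    return " ".join(result)


print(capitalize_sentences("næsti fundur er í næstu viku. að sögn kristínar, hópstjóra"))
```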

Python module

The punctuate function

from punctuator import punctuate

# A string to be punctuated
s = "næsti fundur er fyrirhugaður í næstu viku að sögn kristínar jónsdóttur hópstjóra náttúruvárvöktunar hjá veðurstofu íslands verður áfram fylgst grannt með jarðhræringum á svæðinu"

punctuated = punctuate(s, model='biRNN')

print(punctuated)

The program should output:

Næsti fundur er fyrirhugaður í næstu viku. Að sögn kristínar jónsdóttur, hópstjóra náttúruvöktunar hjá veðurstofu íslands, verður áfram fylgst grannt með jarðhræringum á svæðinu.

License

This code is licensed under the MIT license.

Authors/Credit

Reykjavik University

Main authors: Helga Svala Sigurðardóttir - helgas@ru.is, Inga Rún Helgadóttir - ingarun@ru.is

Acknowledgements

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
