
Punctuation restoration library


Punctuation restoration

Adds punctuation and capitalization to text that has none.

Works on Danish, German and English.

Models are hosted on Hugging Face! ❤️ 🤗


Installation

pip install punctfix

Usage

It's quite simple to use!

>>> from punctfix import PunctFixer
>>> fixer = PunctFixer(language="da")

>>> example_text = "mit navn det er rasmus og jeg kommer fra firmaet alvenir det er mig som har trænet denne lækre model"
>>> print(fixer.punctuate(example_text))
Mit navn det er Rasmus og jeg kommer fra firmaet Alvenir. Det er mig som har trænet denne lækre model.

>>> example_text = "en dag bliver vi sku glade for, at vi nu kan sætte punktummer og kommaer i en sætning det fungerer da meget godt ikke"
>>> print(fixer.punctuate(example_text))
En dag bliver vi sku glade for, at vi nu kan sætte punktummer og kommaer i en sætning. Det fungerer da meget godt, ikke?

Note that, by default, the input text will be normalized. See the next section for more details.
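To give a feel for what normalization means here, the following is a minimal sketch that lowercases text and strips everything except letters, digits and whitespace. The helper rough_normalize is hypothetical and is not part of punctfix; the library's own normalization may treat some characters (hyphens, apostrophes and so on) differently.

```python
import re


def rough_normalize(text: str) -> str:
    """Hypothetical approximation of input normalization (NOT punctfix's own code)."""
    lowered = text.lower()
    # \w matches Unicode letters, digits and underscore, so æ, ø, å, ü, ß survive.
    stripped = re.sub(r"[^\w\s]", "", lowered)
    # Underscore is matched by \w but is neither a letter nor a digit, so drop it too.
    stripped = stripped.replace("_", "")
    # Collapse runs of whitespace into single spaces.
    return " ".join(stripped.split())


print(rough_normalize("Det fungerer da meget godt, ikke?"))
# det fungerer da meget godt ikke
```

Text that already looks like this passes through unchanged, which is the kind of input the models expect.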

Parameters for PunctFixer

  • Pass device="cuda" or device="cpu" to indicate where to run inference. Default is device="cpu".
  • To handle long sequences, the text is processed in overlapping chunks of words. Both the chunk size and the overlap can be modified. For higher speed but lower accuracy, use a chunk size of 150-200 and very little overlap, for example 5-10. The defaults are word_chunk_size=100 and word_overlap=70, which makes inference a bit slow. The default parameters will be updated when we have some results on variations.
  • Supported languages are "en" for English, "da" for Danish and "de" for German. Default is language="da".
  • Note that the fixer has been trained on normalized text (lowercase letters and numbers) and will normalize input text by default. You can instantiate the model with skip_normalization=True to disable this, but doing so might yield errors on some input text.
  • To raise a warning every time the input is normalized, set warn_on_normalization=True.
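The speed trade-off between chunk size and overlap can be illustrated with a small sketch. It assumes a plain sliding window over words whose stride is the chunk size minus the overlap; the helper split_into_windows is hypothetical, and punctfix's internal chunking may differ in its details. The point is only to show how many model passes each setting roughly implies.

```python
from typing import List


def split_into_windows(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Hypothetical sliding-window split; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not words:
        return []
    stride = chunk_size - overlap  # how far the window advances each step
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
        start += stride
    return windows


words = [f"w{i}" for i in range(1000)]  # stand-in for a 1000-word transcript
default_windows = split_into_windows(words, chunk_size=100, overlap=70)
fast_windows = split_into_windows(words, chunk_size=200, overlap=10)
print(len(default_windows), len(fast_windows))
# 31 6
```

Under this assumption the default settings need about five times as many windows as the fast settings for the same text, which is consistent with the defaults being slower. The fast settings from the list above would be passed as PunctFixer(language="da", word_chunk_size=200, word_overlap=10).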

Contribute

If you encounter problems, feel free to open an issue in the repo and we will fix it. Even better, create an issue and then a PR that fixes it! ;-)

Happy punctuating!


