Punctuation restoration library
Punctuation restoration
Adds punctuation and capitalization for a given text without punctuation.
Works on Danish, German and English.
Models hosted on huggingface! ❤️ 🤗
Installation
pip install punctfix
Usage
It's quite simple to use!
>>> from punctfix import PunctFixer
>>> fixer = PunctFixer(language="da")
>>> example_text = "mit navn det er rasmus og jeg kommer fra firmaet alvenir det er mig som har trænet denne lækre model"
>>> print(fixer.punctuate(example_text))
Mit navn det er Rasmus og jeg kommer fra firmaet Alvenir. Det er mig som har trænet denne lækre model.
>>> example_text = "en dag bliver vi sku glade for, at vi nu kan sætte punktummer og kommaer i en sætning det fungerer da meget godt ikke"
>>> print(fixer.punctuate(example_text))
En dag bliver vi sku glade for, at vi nu kan sætte punktummer og kommaer i en sætning. Det fungerer da meget godt, ikke?
Note that, by default, the input text will be normalized. See the next section for more details.
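To make the idea of normalization concrete, here is a minimal sketch of what "lowercase letters and numbers" could look like in practice. The function name `normalize_sketch` and its exact rules are assumptions made for illustration only; the library's own normalization may differ in its details.

```python
import re


def normalize_sketch(text: str) -> str:
    """Hypothetical normalizer: lowercase, keep only letters, digits and spaces."""
    lowered = text.lower()
    # Replace every character that is not a letter, digit or whitespace
    # (underscore included) with a space. In Python 3, \w is Unicode-aware,
    # so Danish and German letters such as æ, ø, å, ü and ß are kept.
    cleaned = re.sub(r"[^\w\s]|_", " ", lowered)
    # Collapse the runs of whitespace left behind by removed punctuation.
    return " ".join(cleaned.split())


print(normalize_sketch("Det fungerer da meget godt, ikke?"))
```

Text that has already been cleaned this way passes through unchanged, which is the kind of input the note above says the models expect.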
Parameters for PunctFixer
- Pass `device="cuda"` or `device="cpu"` to indicate where to run inference. Default is `device="cpu"`.
- To handle long sequences, we use a chunk size and an overlap. These can be modified. For higher speed but lower accuracy, use a chunk size of 150-200 and very little overlap, e.g. 5-10. These parameters are set with default values `word_chunk_size=100`, `word_overlap=70`, which makes it run a bit slow. The default parameters will be updated when we have some results on variations.
- Supported languages are "en" for English, "da" for Danish and "de" for German. Default is `language="da"`.
- Note that the fixer has been trained on normalized text (lowercase letters and numbers) and will by default normalize input text. You can instantiate the model with `skip_normalization=True` to disable this, but this might yield errors on some input text.
- To raise warnings every time the input is normalized, set `warn_on_normalization=True`.
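The speed trade-off between chunk size and overlap can be illustrated with a small sketch of overlapping word windows. This is not the library's implementation: the helper `split_into_chunks` is hypothetical, and how punctfix combines predictions inside the overlapping regions is not covered here. Only the default values 100 and 70 are taken from the parameter list above.

```python
from typing import List


def split_into_chunks(
    words: List[str], chunk_size: int = 100, overlap: int = 70
) -> List[List[str]]:
    """Hypothetical helper: split words into windows that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(250)]
print(len(split_into_chunks(words)))           # defaults: 100 words, 70 overlap
print(len(split_into_chunks(words, 200, 10)))  # larger chunks, small overlap
```

For this 250-word input the default settings produce 6 windows, while a chunk size of 200 with an overlap of 10 produces 2. Fewer windows mean fewer model passes, which is why the larger chunk size is faster, at the cost of less shared context between neighbouring windows.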
Contribute
If you encounter problems, feel free to open an issue in the repo and we will fix it. Even better, create an issue and then a PR that fixes it! ;-)
Happy punctuating!
File details
Details for the file punctfix-0.11.1.tar.gz.
File metadata
- Download URL: punctfix-0.11.1.tar.gz
- Upload date:
- Size: 16.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4b4662991982bfc56a6a4d7b3627a5a79d2cd24e1ba50f17efbaabaddb4fb394 |
| MD5 | 0f19d86d0f319621da5078a2ed0db9cc |
| BLAKE2b-256 | 8ccd858c3eef24c723665139263bd27237e21056f6bf7cff330d123f2eae83b6 |
File details
Details for the file punctfix-0.11.1-py3-none-any.whl.
File metadata
- Download URL: punctfix-0.11.1-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca4d07f52c78ac12b0b4134aea6f0a28c8449a57a61908d4700ac2e4d0ecf3a2 |
| MD5 | 33ecc50f9eea46ecee469915ea5eb3d1 |
| BLAKE2b-256 | b8953fe23dbbafaf15769e387d5aa509268cba93697d07e7fd6f5182b81254ef |