An easy-to-use package to restore punctuation of text.
✏️ rpunct - Restore Punctuation
This repo contains code for punctuation restoration.
This package is intended for direct use as a punctuation restoration model for the general English language. Alternatively, you can use this for further fine-tuning on domain-specific texts for punctuation restoration tasks.
It uses HuggingFace's bert-base-uncased model weights, fine-tuned for punctuation restoration.
Punctuation restoration works on text of arbitrary length. The model runs on a GPU if one is available and falls back to the CPU otherwise.
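One common way to handle text of arbitrary length with a fixed-size model is to split it into overlapping word windows and process each window in turn. The sketch below illustrates that idea together with a GPU-or-CPU check. The function names, the window size of 250 words, and the overlap of 30 words are assumptions made for illustration, not a description of what rpunct does internally.

```python
from typing import List


def split_into_chunks(text: str, chunk_words: int = 250, overlap: int = 30) -> List[str]:
    """Split text into word windows that share `overlap` words with their neighbour.

    Hypothetical helper: the default sizes are illustrative assumptions.
    """
    if overlap >= chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional here; without it the sketch reports "cpu"
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

When the per-window outputs are stitched back together, the overlapping words have to be de-duplicated; the overlap exists so that words near a window boundary are seen with context on both sides.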
Punctuation and casing restored by the model:
- Upper-casing
- Period: .
- Exclamation: !
- Question Mark: ?
- Comma: ,
- Colon: :
- Semi-colon: ;
- Apostrophe: '
- Dash: -
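To make the list above concrete, here is a minimal sketch of how per-word predictions could be turned back into text. It assumes a hypothetical two-character label per word: the first character is the punctuation mark to append (or "O" for none) and the second is "U" to capitalize the word (or "O" to leave it). This label scheme and the `apply_labels` name are assumptions for illustration; the README does not describe the model's actual label set.

```python
from typing import Iterable, Tuple


def apply_labels(pairs: Iterable[Tuple[str, str]]) -> str:
    """Rebuild text from (word, label) pairs using the assumed two-character labels."""
    out = []
    for word, label in pairs:
        punct, case = label[0], label[1]
        if case == "U":
            word = word[:1].upper() + word[1:]  # capitalize only the first letter
        if punct != "O":
            word += punct  # append the predicted trailing mark
        out.append(word)
    return " ".join(out)


pairs = [
    ("now", ",U"), ("a", "OO"), ("team", "OO"), ("led", "OO"),
    ("by", "OO"), ("david", "OU"), ("muller", ".U"),
]
print(apply_labels(pairs))  # Now, a team led by David Muller.
```

This sketch only handles marks that trail a word. Apostrophes and dashes usually sit inside a token, so a real decoder would need extra handling for them.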
🚀 Usage
Below is a quick way to get up and running with the model.
- First, install the package.
pip install rpunct
- Sample Python code.
from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B.
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.
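When several transcripts need punctuation, it can help to create the model once and reuse it. The helper below is a sketch of that pattern: `punctuate_all` is a hypothetical name, and it accepts any callable that maps a string to a string, so it does not depend on rpunct being installed.

```python
from typing import Callable, Dict


def punctuate_all(transcripts: Dict[str, str], punctuate: Callable[[str], str]) -> Dict[str, str]:
    """Apply one punctuation callable to every non-empty transcript."""
    results = {}
    for name, raw in transcripts.items():
        cleaned = " ".join(raw.split())  # collapse newlines and repeated spaces
        results[name] = punctuate(cleaned) if cleaned else ""
    return results
```

With the package installed, a call such as `punctuate_all(transcripts, RestorePuncts().punctuate)` would reuse a single `RestorePuncts` instance for every entry.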
🎯 Accuracy
Here is the number of product reviews we used for fine-tuning the model:
Language | Number of text samples |
---|---|
English | 560,000 |
We found the best convergence at around 3 epochs, which is the model presented here and available for download.
The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:
Accuracy | Overall F1 | Eval Support |
---|---|---|
91% | 90% | 45,990 |
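The README does not say how the accuracy and F1 figures were aggregated. As a rough illustration of what such metrics measure, the sketch below computes per-word accuracy and a macro-averaged F1 over label strings. The function names and the example labels are hypothetical, and the averaging choice is an assumption that may differ from the one behind the table.

```python
from typing import Dict, Sequence


def _check(gold: Sequence[str], pred: Sequence[str]) -> None:
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    _check(gold, pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_label_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    _check(gold, pred)
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = per_label_f1(gold, pred)
    return sum(scores.values()) / len(scores)
```

Macro averaging weights rare labels (for example semicolons) as heavily as common ones, so it can sit well below accuracy when the model struggles on infrequent marks.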
💻🎯 Further Fine-Tuning
To start fine-tuning or training, see the training/train.py file.
Running python training/train.py will replicate the results of this model.
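Fine-tuning on domain-specific text needs labeled examples, and these can be derived from text that is already punctuated. The sketch below strips trailing punctuation and casing from each word and records them as a label. The two-character label format (punctuation mark or "O", then "U" or "O" for casing) and the `text_to_pairs` name are assumptions for illustration and may not match the input format that training/train.py expects.

```python
from typing import List, Tuple

PUNCT = ".!?,:;"  # trailing marks this sketch recognises


def text_to_pairs(text: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (lowercased word, label) training pairs."""
    pairs = []
    for token in text.split():
        punct = "O"
        if len(token) > 1 and token[-1] in PUNCT:
            punct, token = token[-1], token[:-1]  # peel off one trailing mark
        case = "U" if token[:1].isupper() else "O"
        pairs.append((token.lower(), punct + case))
    return pairs


print(text_to_pairs("Now, a team led by David Muller."))
```

The lowercased words form the model input and the labels form the targets, which mirrors the unpunctuated, lowercase input shown in the usage example.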
☕ Contact
Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.