Skip to main content

An easy-to-use package to restore punctuation of text.

Project description

✏️ rpunct - Restore Punctuation

forthebadge

This repo contains code for Punctuation restoration.

This package is intended for direct use as a punctuation restoration model for the general English language. Alternatively, you can use this for further fine-tuning on domain-specific texts for punctuation restoration tasks. It uses HuggingFace's bert-base-uncased model weights that have been fine-tuned for Punctuation restoration.

Punctuation restoration works on arbitrarily large text. And uses GPU if it's available otherwise will default to CPU.

List of punctuations we restore:

  • Upper-casing
  • Period: .
  • Exclamation: !
  • Question Mark: ?
  • Comma: ,
  • Colon: :
  • Semi-colon: ;
  • Apostrophe: '
  • Dash: -

🚀 Usage

Below is a quick way to get up and running with the model.

  1. First, install the package.
pip install rpunct
  1. Sample python code.
from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

🎯 Accuracy

Here is the number of product reviews we used for finetuning the model:

Language Number of text samples
English 560,000

We found the best convergence around 3 epochs, which is what presented here and available via a download.


The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:

Accuracy Overall F1 Eval Support
91% 90% 45,990

💻🎯 Further Fine-Tuning

To start fine-tuning or training please look into training/train.py file. Running python training/train.py will replicate the results of this model.


☕ Contact

Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rpunct-1.0.2.tar.gz (5.6 kB view details)

Uploaded Source

Built Distribution

rpunct-1.0.2-py3-none-any.whl (5.9 kB view details)

Uploaded Python 3

File details

Details for the file rpunct-1.0.2.tar.gz.

File metadata

  • Download URL: rpunct-1.0.2.tar.gz
  • Upload date:
  • Size: 5.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.8

File hashes

Hashes for rpunct-1.0.2.tar.gz
Algorithm Hash digest
SHA256 33415c5efde858b0b4fb3538eb45372fb13bb4440771d106bf798bba8990c8cb
MD5 ee160acf79ab53fe03b5dd8135c9be72
BLAKE2b-256 073267bfdf0c229e26e2fe70df09db89d7d33c495607d9521cc2a7e289af0ac9

See more details on using hashes here.

File details

Details for the file rpunct-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: rpunct-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 5.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.8

File hashes

Hashes for rpunct-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 846390eedf8ab7825d82d13e724ea3c58613a5337f2da9bab2ddcd15d14245a5
MD5 8fbd7f63c303297abc20f089630cbc0f
BLAKE2b-256 969ab8c77cf4105d813e999522c298c25481d3bb2fa41c6efdab156190397647

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page