A Python package for deep multilingual punctuation prediction.

Deep Multilingual Punctuation Prediction

This Python library predicts the punctuation of English, Italian, French, and German texts. We developed it to restore the punctuation of transcribed spoken language.

It uses our "FullStop" model, which we trained on the Europarl dataset. Note that this dataset consists of political speeches, so the model might perform differently on texts from other domains.

The code restores the following punctuation markers: "." "," "?" "-" ":"

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Usage

The PunctuationModel class can process texts of any length. Note that processing very long texts can be time-consuming.

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

Result

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
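Each entry in this list is a [word, label, score] triple. The sketch below shows one way to work with such a list without loading the model: it joins the triples back into a punctuated string and lists the predictions whose score falls below a threshold. The helper names join_labeled_words and low_confidence and the 0.9 threshold are assumptions made for illustration and are not part of the package. The sketch also assumes that the label "0" means "no punctuation after this word", as the example output suggests.

```python
# Output copied from the "Predict Labels" example: [word, label, score].
labeled_words = [
    ["My", "0", 0.9999887], ["name", "0", 0.99998665], ["is", "0", 0.9998579],
    ["Clara", "0", 0.6752215], ["and", "0", 0.99990904], ["I", "0", 0.9999877],
    ["live", "0", 0.9999839], ["in", "0", 0.9999515], ["Berkeley", ",", 0.99800044],
    ["California", ".", 0.99534047], ["Ist", "0", 0.99998784], ["das", "0", 0.99999154],
    ["eine", "0", 0.9999918], ["Frage", ",", 0.99622655], ["Frau", "0", 0.9999889],
    ["Müller", "?", 0.99863917],
]


def join_labeled_words(labeled_words):
    """Append each predicted marker to its word (hypothetical helper).

    Assumes the label "0" means that no punctuation follows the word.
    """
    tokens = []
    for word, label, _score in labeled_words:
        tokens.append(word if label == "0" else word + label)
    return " ".join(tokens)


def low_confidence(labeled_words, threshold=0.9):
    """Return the predictions whose score is below the (assumed) threshold."""
    return [
        (word, label, score)
        for word, label, score in labeled_words
        if score < threshold
    ]


print(join_labeled_words(labeled_words))
# My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

print(low_confidence(labeled_words))
# [('Clara', '0', 0.6752215)]
```

Listing low-scoring predictions can be a cheap way to find positions worth a manual check, such as the word "Clara" in this example.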

Results

Performance differs between the individual punctuation markers because hyphens and colons are, in many cases, optional and can be replaced by either a comma or a full stop. The model achieves the following F1 scores for the different languages:

Label          EN     DE     FR     IT
0              0.991  0.997  0.992  0.989
.              0.948  0.961  0.945  0.942
?              0.890  0.893  0.871  0.832
,              0.819  0.945  0.831  0.798
:              0.575  0.652  0.620  0.588
-              0.425  0.435  0.431  0.421
macro average  0.775  0.814  0.782  0.762
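The macro average row is the unweighted mean of the six per-label F1 scores, so the weak scores for ":" and "-" pull it down as much as the strong scores for "0" and "." pull it up. The short sketch below recomputes that row from the table values. The f1 dictionary and the macro_average helper are names chosen for this example only.

```python
# Per-label F1 scores copied from the table, keyed by language.
# Order of values: "0", ".", "?", ",", ":", "-"
f1 = {
    "EN": [0.991, 0.948, 0.890, 0.819, 0.575, 0.425],
    "DE": [0.997, 0.961, 0.893, 0.945, 0.652, 0.435],
    "FR": [0.992, 0.945, 0.871, 0.831, 0.620, 0.431],
    "IT": [0.989, 0.942, 0.832, 0.798, 0.588, 0.421],
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, however rare it is."""
    return sum(scores) / len(scores)


macro = {lang: round(macro_average(scores), 3) for lang, scores in f1.items()}
print(macro)
# {'EN': 0.775, 'DE': 0.814, 'FR': 0.782, 'IT': 0.762}
```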

References

Please cite us if you find this useful:

@inproceedings{guhr-EtAl:2021:fullstop,
  title={FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  booktitle      = {Proceedings of the Swiss Text Analytics Conference 2021},
  month          = {June},
  year           = {2021},
  address        = {Winterthur, Switzerland},
  publisher      = {CEUR Workshop Proceedings},  
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}
