A Python package for deep multilingual punctuation prediction.
Deep Multilingual Punctuation Prediction
This Python library predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.
It uses our "FullStop" model, which we trained on the Europarl dataset. Please note that this dataset consists of political speeches, so the model might perform differently on texts from other domains.
The code restores the following punctuation markers: "." "," "?" "-" ":"
Install
To get started, install the package from PyPI:
pip install deepmultilingualpunctuation
Usage
The PunctuationModel class can process texts of any length. Note that processing very long texts can be time-consuming.
Restore Punctuation
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)
Result
My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?
Predict Labels
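from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
# Split the raw text into words, then predict one label per word
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)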
[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
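Each entry in the output above is a word, the predicted label, and a confidence score, where the label '0' means that no punctuation follows the word. As a minimal sketch of how such a list could be turned back into text, the helper below appends each non-'0' label to its word. The function name `apply_labels` and the `min_score` threshold are hypothetical and not part of the package; the sample values are copied from the output above.

```python
def apply_labels(labeled_words, min_score=0.0):
    """Rebuild text from [word, label, score] entries (illustrative helper)."""
    parts = []
    for word, label, score in labeled_words:
        # '0' means "no punctuation after this word"; labels whose
        # confidence is below min_score are skipped as well.
        if label != "0" and score >= min_score:
            parts.append(word + label)
        else:
            parts.append(word)
    return " ".join(parts)


# Tail of the prediction output shown above
sample = [
    ["Berkeley", ",", 0.99800044],
    ["California", ".", 0.99534047],
    ["Ist", "0", 0.99998784],
    ["das", "0", 0.99999154],
    ["eine", "0", 0.9999918],
    ["Frage", ",", 0.99622655],
    ["Frau", "0", 0.9999889],
    ["Müller", "?", 0.99863917],
]

print(apply_labels(sample))
# Berkeley, California. Ist das eine Frage, Frau Müller?
```

Raising `min_score` keeps only the labels the model is most confident about. How hyphens should be spaced in the rebuilt text is a design choice this sketch does not address.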
Results
Performance varies across punctuation markers because hyphens and colons are often optional and can be replaced by a comma or a full stop. The model achieves the following F1 scores for each language:
Label | EN | DE | FR | IT |
---|---|---|---|---|
0 | 0.991 | 0.997 | 0.992 | 0.989 |
. | 0.948 | 0.961 | 0.945 | 0.942 |
? | 0.890 | 0.893 | 0.871 | 0.832 |
, | 0.819 | 0.945 | 0.831 | 0.798 |
: | 0.575 | 0.652 | 0.620 | 0.588 |
- | 0.425 | 0.435 | 0.431 | 0.421 |
macro average | 0.775 | 0.814 | 0.782 | 0.762 |
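The macro average row is consistent with an unweighted mean of the six per-label F1 scores, so rare labels such as "-" and ":" count as much as the frequent ones. The short check below recomputes it from the table; the dictionary layout and the function name `macro_average` are illustrative only.

```python
# Per-label F1 scores from the table, in the order: 0 . ? , : -
f1_scores = {
    "EN": [0.991, 0.948, 0.890, 0.819, 0.575, 0.425],
    "DE": [0.997, 0.961, 0.893, 0.945, 0.652, 0.435],
    "FR": [0.992, 0.945, 0.871, 0.831, 0.620, 0.431],
    "IT": [0.989, 0.942, 0.832, 0.798, 0.588, 0.421],
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, regardless of frequency."""
    return sum(scores) / len(scores)


for lang, scores in f1_scores.items():
    print(lang, round(macro_average(scores), 3))
# EN 0.775
# DE 0.814
# FR 0.782
# IT 0.762
```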
References
Please cite us if you found this useful:
@article{guhr-EtAl:2021:fullstop,
title={FullStop: Multilingual Deep Models for Punctuation Prediction},
author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
month = {June},
year = {2021},
address = {Winterthur, Switzerland},
publisher = {CEUR Workshop Proceedings},
url = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}