Lightweight tool for removing punctuation. Supports multiple natural languages.

Project description

Removes punctuation marks from text in various natural languages, including text that may contain non-standard punctuation marks (such as emojis and pictographs). Tested with English, Japanese, Chinese, and Korean.
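To make the idea of language-independent punctuation detection concrete, here is a minimal standard-library sketch. It assumes that a character counts as punctuation when its Unicode general category is P* (punctuation) or S* (symbol). The names looks_like_punct and strip_sketch are hypothetical and are not part of mnl_punct_norm; the package may well use different rules, so see the repository for its actual behavior.

```python
import unicodedata


def looks_like_punct(ch: str) -> bool:
    """Hypothetical check: True if ch is in a Unicode punctuation (P*) or symbol (S*) category."""
    return unicodedata.category(ch)[0] in ("P", "S")


def strip_sketch(text: str, keep: str = "") -> str:
    """Drop every character flagged by looks_like_punct, except those listed in keep."""
    return "".join(ch for ch in text if ch in keep or not looks_like_punct(ch))


# Half-width period, full-width period, hiragana, kanji, and a star symbol.
for sample in [".", "。", "あ", "私", "★"]:
    print(repr(sample), looks_like_punct(sample))

# Keep hyphens, drop the comma and the period.
print(strip_sketch("multi-language support, in strings.", keep="-"))
# prints "multi-language support in strings"
```

Under this assumption, currency and math signs such as "$" and "+" are also treated as punctuation, which may or may not match what the package does.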

For the full documentation, please see the repository:

https://github.com/Rairye/mnl-punct-norm

Code sample:

# Import functions.
from mnl_punct_norm.normalizer import is_punct, strip_punct, replace_punct

# Half-width period used in English, etc.
print("Half-width period is_punct -> {}".format(is_punct(".")))

# Full-width period used in Japanese, etc.
print("Full-width period is_punct -> {}".format(is_punct("。")))

# Hiragana character
print("Hiragana character is_punct -> {}".format(is_punct("あ")))

# Kanji
print("Kanji character is_punct -> {}".format(is_punct("私")))

# Pictograph example
print("★ is_punct -> {}".format(is_punct("★")))

source_str = "This light-weight module, which provides multi-language support, normalizes punctuation in strings."

# Strips all punctuation from source_str.
print(strip_punct(source_str))

# Strips all punctuation from source_str, except for hyphens.
print(strip_punct(source_str, "-"))

# Strips all punctuation from source_str, except for hyphens and commas.
print(strip_punct(source_str, "-,"))

japanese_str = "私は人間(にんげん)です。"

# Strips all punctuation from japanese_str.
print(strip_punct(japanese_str))

# Strips all punctuation from japanese_str, except for parentheses.
print(strip_punct(japanese_str, "()"))

# Replaces all punctuation in source_str with a half-width space.
print(replace_punct(source_str))

# Replaces all punctuation in source_str with " <PUNCT> ".
print(replace_punct(source_str, replacement = " <PUNCT> "))

# Replaces all punctuation in japanese_str with a full-width space (U+3000).
print(replace_punct(japanese_str, replacement = "\u3000"))

# String with multiple punctuation marks. The extra spaces in the string are not normalized by the function.
multiple_punct_str = "Wha … what are you doing!?!?"

# Example in which multiple punctuation marks are used in a row.
print(replace_punct(multiple_punct_str))

# Example in which multiple punctuation marks are used in a row, with replacement passed as " <PUNCT> ".
print(replace_punct(multiple_punct_str, replacement = " <PUNCT> "))
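The sample notes that extra spaces are not normalized by the function. If single-spaced output is wanted, one possible post-processing step is sketched below using only the standard library. The helper name collapse_whitespace is hypothetical and not part of mnl_punct_norm, and the input string is a hand-written stand-in rather than actual output from replace_punct.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse whitespace runs (including full-width U+3000) into one half-width space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hand-written stand-in for a string whose punctuation was already replaced by spaces.
spaced = "Wha    what are you doing    "
print(collapse_whitespace(spaced))  # prints "Wha what are you doing"
```

Note that this also converts full-width spaces to half-width ones, which may not be desirable for Japanese or Chinese text.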

Download files

Source Distribution

mnl_punct_norm-0.0.2.tar.gz (2.7 kB)