Light-weight tool for removing punctuation. Supports multiple natural languages.
Project description
Removes punctuation marks from text in various natural languages, including text that contains non-standard punctuation marks (such as emojis and pictographs). Tested with English, Japanese, Chinese, and Korean.
For the full documentation, please see the repository:
https://github.com/Rairye/mnl-punct-norm
Code sample:
#import functions
from mnl_punct_norm.normalizer import is_punct, strip_punct, replace_punct
#Half-width period used in English, etc.
print("Half-width period is_punct -> {}".format(is_punct(".")))
#Full-width period used in Japanese, etc.
print("Full-width period is_punct -> {}".format(is_punct("。")))
#Hiragana character
print("Hiragana character is_punct -> {}".format(is_punct("あ")))
#Kanji
print("Kanji character is_punct -> {}".format(is_punct("私")))
#Emoji example
print("★ is_punct -> {}".format(is_punct("★")))
source_str = "This light-weight module, which provides multi-language support, normalizes punctuation in strings."
#Strips all punctuation from source_str.
print(strip_punct(source_str))
#Strips all punctuation from source_str, except for hyphens.
print(strip_punct(source_str, "-"))
#Strips all punctuation from source_str, except for hyphens and commas.
print(strip_punct(source_str, "-,"))
japanese_str = "私は人間(にんげん)です。"
#Strips all punctuation from japanese_str.
print(strip_punct(japanese_str))
#Strips all punctuation from japanese_str, except for parentheses.
print(strip_punct(japanese_str, "()"))
#Replaces all punctuation in source_str with a half-width space.
print(replace_punct(source_str))
#Replaces all punctuation in source_str with " ".
print(replace_punct(source_str, replacement = " "))
#Replaces all punctuation in japanese_str with a full-width space (U+3000).
print(replace_punct(japanese_str, replacement = "\u3000"))
#String with multiple punctuation marks. The extra spaces in the string are not normalized by the function.
multiple_punct_str = "Wha ... what are you doing!?!?"
#Example in which multiple punctuation marks are used in a row.
print(replace_punct(multiple_punct_str))
#Example in which multiple punctuation marks are used in a row, with replacement passed as " ".
print(replace_punct(multiple_punct_str, replacement = " "))
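For readers curious how a language-independent punctuation check can work, here is a minimal sketch built on Python's standard `unicodedata` module. It is not the implementation used by mnl_punct_norm, and its results may differ from the package's. The helper names `looks_like_punct` and `strip_marks` are hypothetical, and treating every Unicode symbol category as punctuation is an assumption made for illustration.

```python
import unicodedata


def looks_like_punct(ch: str) -> bool:
    """Hypothetical helper, not part of mnl_punct_norm.

    Returns True when the character's Unicode general category starts
    with "P" (punctuation) or "S" (symbols such as stars and many emoji).
    """
    return unicodedata.category(ch)[0] in ("P", "S")


def strip_marks(text: str, keep: str = "") -> str:
    """Hypothetical helper: drop punctuation-like characters, except those in keep."""
    return "".join(
        ch for ch in text if ch in keep or not looks_like_punct(ch)
    )


japanese_str = "私は人間(にんげん)です。"
print(looks_like_punct("."))   # half-width period, category Po
print(looks_like_punct("。"))  # full-width period, category Po
print(looks_like_punct("あ"))  # hiragana, category Lo
print(strip_marks(japanese_str))
print(strip_marks(japanese_str, keep="()"))
```

Because the check relies on Unicode categories rather than a per-language list, it needs no language-specific tables. The trade-off is that currency and math symbols such as `$` and `+` are also treated as punctuation under this assumption.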
Hashes for mnl_punct_norm-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b84188560664cd24c5f356f4cd210e998f3e3b46c726232bbba713f446c2ee55
MD5 | f411757a04a1bf18c0dd8255b4796b14
BLAKE2b-256 | 59ba018ad5972d6fe9e8a8fb02f73fb48c7498cdbafeaa105d1da0c2fd734f1d
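The published digests can be used to check a downloaded wheel before installing it. The sketch below uses only the standard library. The local file path is an assumption about where the wheel was saved, and the helper names `sha256_of` and `matches_expected` are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest listed for mnl_punct_norm-0.0.3-py3-none-any.whl
EXPECTED_SHA256 = "b84188560664cd24c5f356f4cd210e998f3e3b46c726232bbba713f446c2ee55"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare the file's digest with the expected hex digest."""
    return hmac.compare_digest(sha256_of(path), expected.lower())


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the wheel was saved.
    wheel = Path("mnl_punct_norm-0.0.3-py3-none-any.whl")
    if wheel.exists():
        print("SHA256 matches:", matches_expected(wheel))
```

pip can perform a similar check itself when a requirements file pins hashes and is installed with `--require-hashes`.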