Light-weight tool for normalizing whitespace and accurately tokenizing words. Multiple natural languages supported.
Project description
Light-weight tool for normalizing whitespace and accurately tokenizing words. Multiple natural languages supported. Useful for scrapping, machine learning, and data analysis.
For the full documentation, please see the repository:
https://github.com/Rairye/mnl-ws-norm
Code samples:
split_by_spaces(input_str)
input_str is the string from which words are to be tokenized.
input_str must be passed as a str type.
#Import function
from mnl_ws_norm.normalizer import split_by_spaces
#Source string 1 with half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009). source_str1 = "Hey, everybody, how are you doing?"
#Source string 2 with half-width spaces and a \n character (Unicode: U+000A). source_str2 = "Hey, everybody\nhow are you doing?"
#Source string 3 with half-width spaces and a full-width space (Unicode: U+3000). source_str3 = "Hey, everybody, how are you doing?"
print("source_str1: {}".format(split_by_spaces(source_str1))) print("source_str2: {}".format(split_by_spaces(source_str2))) print("source_str3: {}".format(split_by_spaces(source_str3)))
#There may be some cases where you want to split a string into lines and then split those lines by whitespace character. #In such a case, you can use the splitlines() method.
source_str = "Hey, everybody.\nHow are you doing?\rI am alright."
line_list = source_str.splitlines()
for i in range(len(line_list)): print("Line {}: ".format(i) + str(split_by_spaces(line_list[i])))
norm_spaces(input_str, space_type, remove_extra_spaces = False)
Required arguments -> input_str, space_type
input_str is the string in which the whitespace characters are to be replaced.
input_str must be passed as a str type.
space_type is the string used to replace all whitespace characters in input_str.
space_type must be passed as a str type.
Optional argument -> remove_extra_spaces
By default, extra whitespace characters are not removed from input_str.
Specifying remove_extra_spaces as True removes extra whitespace characters from input_str.
Note: Regardless of the value of remove_extra_spaces, the returned string may have leading/trailing whitespace characters, so you may want to use the strip() method as necessary.
#Import function
from mnl_ws_norm.normalizer import norm_spaces
#Source string with consecutive half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009). source_str = " Hey, everybody, how are you doing? "
#Spaces in source_str are replaced with a half-width space, while extra spaces are ignored. print(norm_spaces(source_str, " "))
#Spaces in source_str are replaced with a half-width space, and extra spaces are removed. print(norm_spaces(source_str, " ", True))
#Spaces in source_str are replaced with a full-width space, while extra spaces are ignored. print(norm_spaces(source_str, " "))
#Spaces in source_str are replaced with a full-width space (Unicode: U+3000), and extra spaces are removed. print(norm_spaces(source_str, " ", True))
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for mnl_ws_norm-0.0.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9d67365f067673423f2c476c626ffb970c4e8254d3260e5ada24bc3a322d81c7 |
|
MD5 | 5938e42242320b66e82f9de26fdb7245 |
|
BLAKE2b-256 | 5670ed123352942cdfdca3b5212e19f7ad351d2f05be1028bdd342f81d4be1c0 |