Light-weight tool for normalizing whitespace and accurately tokenizing words (no regex). Multiple natural languages supported.
Project description
Light-weight tool for normalizing whitespace and accurately tokenizing words (no regex). Multiple natural languages supported. Useful for scraping, machine learning, and data analysis.
For the full documentation, please see the repository:
https://github.com/Rairye/mnl-ws-norm
Code samples:
split_by_spaces(input_str)
input_str is the string from which words are to be tokenized.
input_str must be passed as a str type.
#Import function
from mnl_ws_norm.normalizer import split_by_spaces
#Source string 1 with half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009). The tab is written as the \t escape so it stays visible.
source_str1 = "Hey,\teverybody, how are you doing?"
#Source string 2 with half-width spaces and a \n character (Unicode: U+000A).
source_str2 = "Hey, everybody\nhow are you doing?"
#Source string 3 with half-width spaces and a full-width space (Unicode: U+3000), written as the \u3000 escape.
source_str3 = "Hey,\u3000everybody, how are you doing?"
print("source_str1: {}".format(split_by_spaces(source_str1)))
print("source_str2: {}".format(split_by_spaces(source_str2)))
print("source_str3: {}".format(split_by_spaces(source_str3)))
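To illustrate what regex-free tokenization on whitespace can look like, here is a minimal sketch built only on the standard str.isspace() method. The function name split_on_whitespace_sketch is hypothetical and is not part of mnl_ws_norm; the package's actual implementation, its return type, and the exact set of characters it treats as whitespace may differ.

```python
def split_on_whitespace_sketch(text):
    """Hypothetical stand-in, NOT mnl_ws_norm's split_by_spaces."""
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    tokens = []
    current = []
    for char in text:
        if char.isspace():
            # A whitespace character closes the token collected so far.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(char)
    # Flush the last token if the string does not end with whitespace.
    if current:
        tokens.append("".join(current))
    return tokens


samples = [
    "Hey,\teverybody, how are you doing?",       # tab (U+0009)
    "Hey, everybody\nhow are you doing?",        # line feed (U+000A)
    "Hey,\u3000everybody, how are you doing?",   # full-width space (U+3000)
]
results = [split_on_whitespace_sketch(sample) for sample in samples]
for sample, tokens in zip(samples, results):
    print(repr(sample), "->", tokens)
```

Because str.isspace() follows Unicode character properties, the tab, the line feed, and the full-width space all act as separators in this sketch without any pattern matching.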
#There may be some cases where you want to split a string into lines and then split those lines by whitespace characters.
#In such a case, you can use the splitlines() method.
source_str = "Hey, everybody.\nHow are you doing?\rI am alright."
line_list = source_str.splitlines()
for i in range(len(line_list)):
    print("Line {}: ".format(i) + str(split_by_spaces(line_list[i])))
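The line-splitting step can be checked on its own with the standard library. The sketch below shows that splitlines() treats \n, \r, and \r\n as line boundaries; the built-in str.split() is used as a stand-in tokenizer here (an assumption made so the example runs without the package), so the per-line tokens are not necessarily what split_by_spaces returns.

```python
source_str = "Hey, everybody.\nHow are you doing?\rI am alright."

# splitlines() breaks on \n and \r alike and drops the line terminators.
lines = source_str.splitlines()
print(lines)

# A \r\n pair counts as a single line break, not two.
windows_lines = "first\r\nsecond".splitlines()
print(windows_lines)

# Stand-in tokenizer: str.split() with no arguments splits on runs of whitespace.
tokens_per_line = [line.split() for line in lines]
for number, tokens in enumerate(tokens_per_line):
    print("Line {}: {}".format(number, tokens))
```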
norm_spaces(input_str, space_type, remove_extra_spaces = False)
Required arguments -> input_str, space_type
input_str is the string in which the whitespace characters are to be replaced.
input_str must be passed as a str type.
space_type is the string used to replace all whitespace characters in input_str.
space_type must be passed as a str type.
Optional argument -> remove_extra_spaces
By default, extra whitespace characters are not removed from input_str.
Specifying remove_extra_spaces as True removes extra whitespace characters from input_str.
Note: Regardless of the value of remove_extra_spaces, the returned string may have leading/trailing whitespace characters, so you may want to use the strip() method as necessary.
#Import function
from mnl_ws_norm.normalizer import norm_spaces
#Source string with consecutive half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009). The tab is written as the \t escape so it stays visible.
source_str = "  Hey,\teverybody,  how are you doing?  "
#Whitespace characters in source_str are replaced with a half-width space; extra spaces are left in place.
print(norm_spaces(source_str, " "))
#Whitespace characters in source_str are replaced with a half-width space, and extra spaces are removed.
print(norm_spaces(source_str, " ", True))
#Whitespace characters in source_str are replaced with a full-width space (Unicode: U+3000); extra spaces are left in place.
print(norm_spaces(source_str, "\u3000"))
#Whitespace characters in source_str are replaced with a full-width space (Unicode: U+3000), and extra spaces are removed.
print(norm_spaces(source_str, "\u3000", True))
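The difference between the two modes, and the note about leading/trailing whitespace, can be sketched with a small standalone function. The name normalize_whitespace_sketch and its collapse parameter are hypothetical, and the sketch assumes that "removing extra spaces" means collapsing each run of whitespace to a single replacement string; the package's norm_spaces may behave differently in edge cases.

```python
def normalize_whitespace_sketch(text, replacement, collapse=False):
    """Hypothetical stand-in, NOT mnl_ws_norm's norm_spaces."""
    pieces = []
    previous_was_space = False
    for char in text:
        if char.isspace():
            # Assumption: with collapse=True only the first character
            # of a whitespace run is replaced; the rest are dropped.
            if not (collapse and previous_was_space):
                pieces.append(replacement)
            previous_was_space = True
        else:
            pieces.append(char)
            previous_was_space = False
    return "".join(pieces)


source_str = "  Hey,\teverybody,  how are you doing?  "

kept = normalize_whitespace_sketch(source_str, " ")
collapsed = normalize_whitespace_sketch(source_str, " ", collapse=True)
full_width = normalize_whitespace_sketch(source_str, "\u3000", collapse=True)

print(repr(kept))
print(repr(collapsed))
# Leading/trailing whitespace survives normalization, so strip() it if needed.
print(repr(collapsed.strip()))
print(repr(full_width.strip()))
```

str.strip() with no arguments also removes the full-width space (U+3000), so the same cleanup step works for either replacement character.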
Hashes for mnl_ws_norm-0.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | caa02fd8f31f0028fed14ddcf13fe9ab02c4d0dd166e1fcb72a184ee1fc8e511
MD5 | 23daf786d8364ee1b7a9aa0fecdc945d
BLAKE2b-256 | af0af69f8a7f3a2905adaf2edec03bb0caee63f6e2055fd24e9eb36bb59c4f34