mnl-ws-norm

Light-weight tool for normalizing whitespace and accurately tokenizing words. Multiple natural languages supported.

Project description

Light-weight tool for normalizing whitespace and accurately tokenizing words. Multiple natural languages are supported. Useful for scraping, machine learning, and data analysis.

For the full documentation, please see the repository:

https://github.com/Rairye/mnl-ws-norm

Code samples:

split_by_spaces(input_str)

input_str is the string from which words are to be tokenized.

input_str must be passed as a str type.

# Import the function.
from mnl_ws_norm.normalizer import split_by_spaces

# Source string 1 with half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009).
source_str1 = "Hey, everybody,\thow are you doing?"

# Source string 2 with half-width spaces and a \n character (Unicode: U+000A).
source_str2 = "Hey, everybody\nhow are you doing?"

# Source string 3 with half-width spaces and a full-width space (Unicode: U+3000).
source_str3 = "Hey, everybody,\u3000how are you doing?"

print("source_str1: {}".format(split_by_spaces(source_str1)))
print("source_str2: {}".format(split_by_spaces(source_str2)))
print("source_str3: {}".format(split_by_spaces(source_str3)))

# In some cases you may want to split a string into lines first and then split each line on whitespace.
# In such a case, you can use the splitlines() method.

source_str = "Hey, everybody.\nHow are you doing?\rI am alright."

line_list = source_str.splitlines()

for i, line in enumerate(line_list):
    print("Line {}: {}".format(i, split_by_spaces(line)))
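
For comparison, Python's built-in str.split(), called with no arguments, also treats tabs, line feeds, and the full-width space as separators. The sketch below uses only the standard library. It is not mnl-ws-norm's implementation, and its results may differ from those of split_by_spaces; split_on_unicode_whitespace is a hypothetical helper name used only for illustration.

```python
def split_on_unicode_whitespace(text):
    """Hypothetical stand-in (not part of mnl-ws-norm): split on Unicode whitespace."""
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    # str.split() with no arguments splits on runs of whitespace
    # and never returns empty strings.
    return text.split()


samples = [
    "Hey, everybody,\thow are you doing?",      # tab (U+0009)
    "Hey, everybody\nhow are you doing?",       # line feed (U+000A)
    "Hey, everybody,\u3000how are you doing?",  # full-width space (U+3000)
]

for sample in samples:
    print(split_on_unicode_whitespace(sample))
```

Each sample yields six tokens, because the tab, the line feed, and the full-width space all count as whitespace for str.split().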

norm_spaces(input_str, space_type, remove_extra_spaces = False)

Required arguments -> input_str, space_type

input_str is the string in which the whitespace characters are to be replaced.

input_str must be passed as a str type.

space_type is the string used to replace all whitespace characters in input_str.

space_type must be passed as a str type.

Optional argument -> remove_extra_spaces

By default, extra whitespace characters are not removed from input_str.

Specifying remove_extra_spaces as True removes extra whitespace characters from input_str.

Note: Regardless of the value of remove_extra_spaces, the returned string may have leading/trailing whitespace characters, so you may want to use the strip() method as necessary.

# Import the function.
from mnl_ws_norm.normalizer import norm_spaces

# Source string with consecutive half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009).
source_str = " Hey,  everybody,\thow are you doing? "

# Whitespace in source_str is replaced with a half-width space; extra spaces are left in place.
print(norm_spaces(source_str, " "))

# Whitespace in source_str is replaced with a half-width space, and extra spaces are removed.
print(norm_spaces(source_str, " ", True))

# Whitespace in source_str is replaced with a full-width space (Unicode: U+3000); extra spaces are left in place.
print(norm_spaces(source_str, "\u3000"))

# Whitespace in source_str is replaced with a full-width space (Unicode: U+3000), and extra spaces are removed.
print(norm_spaces(source_str, "\u3000", True))
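
The difference between the two modes of remove_extra_spaces can be approximated with the standard library's re module. This is a rough sketch, not the library's code: norm_spaces_sketch is a hypothetical name, and it assumes that "extra" whitespace means a run of consecutive whitespace characters that collapses to a single replacement.

```python
import re


def norm_spaces_sketch(input_str, space_type, remove_extra_spaces=False):
    """Hypothetical stand-in, not mnl-ws-norm's implementation."""
    if not isinstance(input_str, str) or not isinstance(space_type, str):
        raise TypeError("input_str and space_type must both be str")
    # Assumption: with remove_extra_spaces=True, each run of whitespace
    # collapses to one replacement; otherwise each character is replaced.
    pattern = r"\s+" if remove_extra_spaces else r"\s"
    # A function replacement keeps space_type literal (no escape processing).
    return re.sub(pattern, lambda _match: space_type, input_str)


source_str = "  Hey,\teverybody,  how are you? "

print(repr(norm_spaces_sketch(source_str, " ")))
print(repr(norm_spaces_sketch(source_str, " ", True)))
print(repr(norm_spaces_sketch(source_str, "\u3000", True).strip()))
```

Under this assumption, a run of whitespace at either end of the string still leaves one replacement character behind, which is why a final strip() may be needed.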

Release history

0.0.4: Oct 29, 2021
0.0.3: Oct 29, 2021
0.0.2: Oct 29, 2021
0.0.1: Oct 17, 2021 (this version)

Download files

Source Distribution

mnl_ws_norm-0.0.1.tar.gz (3.4 kB)

Uploaded Oct 17, 2021 Source

Built Distribution

mnl_ws_norm-0.0.1-py3-none-any.whl (4.1 kB)

Uploaded Oct 17, 2021 Python 3

Hashes for mnl_ws_norm-0.0.1.tar.gz

Algorithm	Hash digest
SHA256	`4ff2db7bfee3d127a09718c5e1b8016cb0780cd6d212e3776d7ba63015ad93c3`
MD5	`e750d6ea3e903108635f7d1525a99c6b`
BLAKE2b-256	`09d3709ecec8fe306e83eb4c506084bebee3ac3fb8540aa4d62753a58e4cab23`

Hashes for mnl_ws_norm-0.0.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`9d67365f067673423f2c476c626ffb970c4e8254d3260e5ada24bc3a322d81c7`
MD5	`5938e42242320b66e82f9de26fdb7245`
BLAKE2b-256	`5670ed123352942cdfdca3b5212e19f7ad351d2f05be1028bdd342f81d4be1c0`