Package to clean and normalize text

clean_text_rhoni

The clean_text_rhoni package provides tools to clean and transform text data. It offers methods and functions for removing special characters, accents, and unnecessary spaces from text, as well as converting text to lowercase and snake case style. This package is useful for preparing text data for natural language processing tasks, data analysis, and other applications where clean and normalized text is needed.

Installation

$ pip install clean_text_rhoni

Usage

This package has two main functions for cleaning text:

clean_text function performs a complete text cleaning process on the input text. The cleaning operations include removing leading and trailing spaces, replacing multiple spaces with a single space, converting text to lowercase, removing accents, removing special characters, and removing the tilde from 'ñ'.

clean_text_snake_case function performs the same comprehensive text cleaning process as clean_text, and additionally transforms the cleaned text into snake case style by replacing spaces with underscores. This is useful for creating consistent and readable variable or column names.

from clean_text_rhoni import clean_text, clean_text_snake_case

sample_text = "%ábdc    efghí   %$ñ"

# clean_text()
# run a complete cleaning over a text

cleaned_text = clean_text(sample_text)
print(cleaned_text)  # abdc efghi n

# clean_text_snake_case()
# run a complete cleaning over a text and return the result in snake_case style

snake_case_cleaned_text = clean_text_snake_case(sample_text)
print(snake_case_cleaned_text)  # abdc_efghi_n
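
To make the cleaning steps concrete, here is a minimal standard-library sketch of a pipeline that produces the same results for the sample text above. It is not the package's implementation: the function names `sketch_clean_text` and `sketch_clean_text_snake_case`, the use of `unicodedata`, and the order of the steps are all assumptions made for illustration.

```python
import re
import unicodedata


def sketch_clean_text(text: str) -> str:
    """Illustrative stand-in for clean_text (assumed behavior, not the real code)."""
    # Lowercase first so the later pattern only needs to handle a-z.
    text = text.lower()
    # Decompose accented letters into base letter + combining mark,
    # then drop the marks. This turns 'á' into 'a' and 'ñ' into 'n'.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove everything that is not an ASCII letter, a digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def sketch_clean_text_snake_case(text: str) -> str:
    """Illustrative stand-in for clean_text_snake_case."""
    return sketch_clean_text(text).replace(" ", "_")


sample_text = "%ábdc    efghí   %$ñ"
print(sketch_clean_text(sample_text))             # abdc efghi n
print(sketch_clean_text_snake_case(sample_text))  # abdc_efghi_n
```

Order matters in a pipeline like this one: if the ASCII-only special-character pattern ran before the accent step, 'ñ' would be deleted instead of becoming 'n'.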

You can also access the BaseCleanText class and use its methods separately:

from clean_text_rhoni import BaseCleanText

# create a class instance
instance_base_clean_text = BaseCleanText()

# call the chosen method
instance_base_clean_text.remove_accents("áéíóú")  # 'aeiou'

instance_base_clean_text.replace_underscores_by_spaces("hello_world")  # 'hello world'

The BaseCleanText class has the following methods:

  • transform_to_lowercase(text): Converts the input text to lowercase.

  • remove_leading_trailing_spaces(text): Removes leading and trailing white spaces from the input text.

  • replace_multiple_spaces(text): Replaces each run of multiple spaces in the input text with a single space.

  • remove_special_characters(text): Removes special characters from the input text. Special characters are defined as characters that are neither alphanumeric nor whitespace characters. A regular expression is used to match and remove these characters.

  • remove_accents(text): Removes accents from vowels in the input text. It replaces accented vowel characters (e.g., á, é, í) with their non-accented counterparts (e.g., a, e, i).

  • remove_n_tilde(text): Removes the tilde from the character 'ñ' in the input text, replacing it with a regular 'n'.

  • replace_spaces_by_underscores(text): Replaces spaces with underscores in the input text.

  • replace_underscores_by_spaces(text): Replaces underscores with spaces in the input text.
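
As a rough illustration of how a few of the methods described above could work, here is a standalone sketch built only on the standard library. The `_sketch` function names, the accent mapping, and the regular expression are assumptions chosen to match the descriptions; the actual BaseCleanText methods may cover a different set of characters or use a different pattern.

```python
import re

# Hypothetical accent mapping; the package may handle more characters.
_ACCENT_TABLE = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")


def remove_accents_sketch(text: str) -> str:
    # Replace each accented vowel with its non-accented counterpart.
    return text.translate(_ACCENT_TABLE)


def remove_n_tilde_sketch(text: str) -> str:
    # Replace 'ñ' with a regular 'n', preserving case.
    return text.replace("ñ", "n").replace("Ñ", "N")


def remove_special_characters_sketch(text: str) -> str:
    # Drop characters that are neither alphanumeric nor whitespace.
    # \w also matches '_', so underscores are removed explicitly.
    return re.sub(r"[^\w\s]|_", "", text)


def replace_underscores_by_spaces_sketch(text: str) -> str:
    return text.replace("_", " ")


print(remove_accents_sketch("áéíóú"))                       # aeiou
print(remove_n_tilde_sketch("niño"))                        # nino
print(remove_special_characters_sketch("50% off, today!"))  # 50 off today
print(replace_underscores_by_spaces_sketch("hello_world"))  # hello world
```

Because this pattern is Unicode-aware, a character such as 'ñ' counts as alphanumeric and survives the special-character step in this sketch, so it can still be handled by a separate tilde-removal step afterwards.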

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

clean_text_rhoni was created by rhoni. It is licensed under the terms of the MIT license.

Credits

clean_text_rhoni was created with cookiecutter and the py-pkgs-cookiecutter template.
