
Common Utility functions for development

cutility

Common utils for development

Installation

You can install cutility using pip:

# install cutility
pip install cutility
# latest version
pip install --upgrade cutility

Variables

What is project_root?

  • The directory that holds your src folder is your project_root.

What is data_root?

  • The directory that holds all your data folders is your data_root.
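
As a minimal sketch of the project_root definition above (this is not part of cutility; the helper name `find_project_root` and the `src` marker argument are assumptions made for illustration), the root can be located by walking up from any path until a directory containing `src` is found:

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the closest ancestor of `start` (or `start` itself)
    that contains a directory named `marker`, or None if there is none."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        # A project_root is the directory that holds the src folder.
        if (candidate / marker).is_dir():
            return candidate
    return None
```

A data_root can be any directory you choose; it does not have to live inside the project_root.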

Usage

Data folders and logger

from cutility import cutils, logger

# set the data and config folders as per your preference
cu = cutils.Cutils(
    data_root="path/to/data/folder",
    config_root="path/to/config/folder",  # currently only supports .yml files
    verbose=True,
)

log = logger.Logger()
log.i("This is info message")
# warning, critical, and debug messages are also supported

Getting names_list

I have curated a list of first names and last names from public GitHub databases and compiled it in a GitHub gist. Use this command to download the names data.

wget https://gist.githubusercontent.com/sagarsrc/e6c7361f9ba6a64b2c9ac5bb10f0285a/raw/fbcca7c6821e7aff285271a6ce42361bbe95cc0c/pii_names.json
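
Once the file is downloaded, it needs to be read into a Python list before it can be used as names_list. The structure of pii_names.json is not described here, so the sketch below is only an assumption: it accepts either a flat JSON list of names or a JSON object whose values are lists of names. The helper name `load_names` is hypothetical and not part of cutility.

```python
import json
from pathlib import Path
from typing import List


def load_names(path: str) -> List[str]:
    """Load names from a JSON file into a flat list without duplicates."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)

    # Assumption: the file is either a flat list of names or an object
    # whose values are lists of names. Anything else yields an empty list.
    if isinstance(data, dict):
        groups = [value for value in data.values() if isinstance(value, list)]
    elif isinstance(data, list):
        groups = [data]
    else:
        groups = []

    names: List[str] = []
    seen = set()
    for group in groups:
        for name in group:
            # Keep only strings and preserve first-seen order.
            if isinstance(name, str) and name not in seen:
                seen.add(name)
                names.append(name)
    return names
```

With the file in the working directory, `names_list = load_names("pii_names.json")` would then produce the list. Check the actual layout of the file before relying on this.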

Generic cleaner

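Use this snippet to define a list of cleaning steps that can be applied together. Each entry pairs a cleaning function with its keyword arguments; names_list is the list of names obtained from pii_names.json.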

from cleaners.clean import GenericClean as cc

all_cleaning_steps = [
    # text cleaning
    (cc.clean_emojis, {}),
    (cc.clean_extra_newlines, {}),
    (cc.clean_extra_spaces, {}),
    (cc.clean_hashtags, {}),
    (cc.clean_profile_handle, {}),
    (cc.clean_symbols_except_punctuation, {}),
    (cc.clean_unicode_characters, {}),
    (cc.clean_web_links, {}),
    # pii cleaning
    (cc.replace_contacts, {"repl": " {{CONTACT}} "}),
    (cc.replace_emails, {"repl": " {{EMAIL}} "}),
    (cc.replace_names, {"names_list": names_list, "repl": " {{PERSON_NAME}} "}),
]
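
The list above only declares the steps. One way to run such a list of (function, kwargs) pairs is to fold the text through each step in order. The sketch below is not the library's own runner: `apply_steps` and the three stand-in cleaning functions are hypothetical, and it assumes each cleaning function takes the text as its first positional argument and returns the cleaned text.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[Callable[..., str], Dict[str, Any]]


def apply_steps(text: str, steps: List[Step]) -> str:
    """Apply each (function, kwargs) step to the text, in order."""
    for func, kwargs in steps:
        text = func(text, **kwargs)
    return text


# Hypothetical stand-ins for library cleaning functions, so the sketch
# runs without cutility installed.
def strip_hashtags(text: str) -> str:
    return re.sub(r"#\w+", "", text)


def mask_emails(text: str, repl: str) -> str:
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", repl, text)


def collapse_spaces(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


demo_steps: List[Step] = [
    (strip_hashtags, {}),
    (mask_emails, {"repl": " {{EMAIL}} "}),
    (collapse_spaces, {}),
]

cleaned = apply_steps("Mail me at jane@example.com #urgent", demo_steps)
print(cleaned)  # Mail me at {{EMAIL}}
```

Order matters here: running the space-collapsing step last removes the padding that the replacement strings introduce.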

Text cleaner

Use this snippet to apply simple cleaning functions one at a time.

# Import the TextCleaner class
from cleaners.text_cleaner import TextCleaner

# Create an instance of TextCleaner
tc = TextCleaner()

# Sample text for demonstration
sample_text = "Check out this link: https://example.com. 😎 #Python @user1"

# Step 1: Clean web links
text_without_links = tc.clean_web_links(sample_text)

# Step 2: Clean profile handles
text_without_handles = tc.clean_profile_handle(text_without_links)

# Step 3: Clean hashtags
text_without_hashtags = tc.clean_hashtags(text_without_handles)

# Step 4: Clean emojis
text_without_emojis = tc.clean_emojis(text_without_hashtags)

# Step 5: Clean extra spaces
final_cleaned_text = tc.clean_extra_spaces(text_without_emojis)
# output
'Check out this link: '

PII cleaner

Use this snippet to apply PII cleaning functions one at a time.

from cleaners.pii_cleaner import PiiCleaner
pc = PiiCleaner()
text_with_pii = "John's email is john.doe@example.com, and his phone number is +1 555-1234."

# Replace names with a generic string
text_without_names = pc.replace_names(text_with_pii, names_list=["John", "Doe", "Jane", "Smith"], repl='{{PERSON_NAME}}')

# Replace emails with a generic string
text_without_emails = pc.replace_emails(text_without_names, repl='{{EMAIL}}')

# Replace phone numbers with a generic string
text_without_contacts = pc.replace_contacts(text_without_emails, repl='{{PHONE}}')

print(text_with_pii)
print(text_without_contacts)
# output of the second print
"{{PERSON_NAME}}'s email is {{EMAIL}}, and his phone number is {{PHONE}}."

