Skip to main content

Common Utility functions for development

Project description

cutility

Common utils for development

Installation

You can install TextCleaner using pip:

# install cutility
pip install cutility
# latest version
pip install --upgrade cutility

Variables

What is project_root?

  • Directory that holds your src folder is your project_root

What is data_root?

  • Directory that holds all your data folder is your data_root

Usage

data folders and logger

from cutility import cutils, logger

# add data folder as per your preference
# add config folder as per your preference
cu = cutils.Cutils(
                    data_root=f"path/to/data/folder",
                    config_root=f"path/to/config/folder", # currently only supports .yml files
                    verbose=True
)


log = logger.Logger()
log.i("This is info message")
# also supports warning critical debug messages

Getting names_list

I have curated list of first names and last names from public github databases and compiled it here in a github gist. Use this command to get names data.

wget https://gist.githubusercontent.com/sagarsrc/e6c7361f9ba6a64b2c9ac5bb10f0285a/raw/fbcca7c6821e7aff285271a6ce42361bbe95cc0c/pii_names.json

Generic cleaner

Use this snippet to collectively apply multiple cleaning functions

from cutility.cleaners.clean import GenCleaner as cc

all_cleaning_steps = [
    # text cleaning
    (cc.clean_emojis, {}),
    (cc.clean_extra_newlines, {}),
    (cc.clean_extra_spaces, {}),
    (cc.clean_hashtags, {}),
    (cc.clean_profile_handle, {}),
    (cc.clean_symbols_except_punctuation, {}),
    (cc.clean_unicode_characters, {}),
    (cc.clean_web_links, {}),
    # pii cleaning
    (cc.replace_contacts, {"repl": " {{CONTACT}} "}),
    (cc.replace_emails, {"repl": " {{EMAIL}} "}),
    (cc.replace_names, {"names_list": names_list, "repl": " {{PERSON_NAME}} "}),
]

Text cleaner

Use this snippet to individually apply simple cleaning functions

# Import the TextCleaner class
from cleaners.text_cleaner import TextCleaner

# Create an instance of TextCleaner
tc = TextCleaner()

# Sample text for demonstration
sample_text = "Check out this link: https://example.com. 😎 #Python @user1"

# Step 1: Clean web links
text_without_links = tc.clean_web_links(sample_text)

# Step 2: Clean profile handles
text_without_handles = tc.clean_profile_handle(text_without_links)

# Step 3: Clean hashtags
text_without_hashtags = tc.clean_hashtags(text_without_handles)

# Step 4: Clean emojis
text_without_emojis = tc.clean_emojis(text_without_hashtags)

# Step 5: Clean extra spaces
final_cleaned_text = tc.clean_extra_spaces(text_without_emojis)
# output
'Check out this link: '

PII cleaner

Use this snippet to individually apply PII cleaning functions

from cleaners.pii_cleaner import PiiCleaner
pc = PiiCleaner()
text_with_pii = "John's email is john.doe@example.com, and his phone number is +1 555-1234."

# Replace names with a generic string
text_without_names = pc.replace_names(text_with_pii, names_list=["John", "Doe", "Jane", "Smith"], repl='{{PERSON_NAME}}')

# Replace emails with a generic string
text_without_emails = pc.replace_emails(text_without_names, repl='{{EMAIL}}')

# Replace phone numbers with a generic string
text_without_contacts = pc.replace_contacts(text_without_emails, repl='{{PHONE}}')

print(text_with_pii)
print(text_without_contacts)
# output
"{{PERSON_NAME}}'s email is {{EMAIL}}, and his phone number is {{PHONE}}."

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cutility-0.0.9.tar.gz (13.6 kB view details)

Uploaded Source

File details

Details for the file cutility-0.0.9.tar.gz.

File metadata

  • Download URL: cutility-0.0.9.tar.gz
  • Upload date:
  • Size: 13.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for cutility-0.0.9.tar.gz
Algorithm Hash digest
SHA256 4d240de6a075b5c6fd8335d91e12f953123c5b59895fa3ba7d1976d314afa2d2
MD5 119f794744b2295b4200beb546ac41ab
BLAKE2b-256 b3de6b91e2a76f29a276eb9457ad5cbfdd6aa4d37ad7d0f5d3eff46dbff64221

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page