Common Utility functions for development
cutility
Common utils for development
Installation
You can install cutility using pip:
# install cutility
pip install cutility
# latest version
pip install --upgrade cutility
Variables
What is project_root?
- The directory that holds your src folder is your project_root.
What is data_root?
- The directory that holds all your data folders is your data_root.
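As a rough illustration of the two definitions above, the sketch below builds a throwaway directory tree and derives both roots from it. The folder names (`my_project`, `src`, `data`, `raw`, `processed`) are hypothetical and not something cutility requires, and placing data_root inside project_root is only one possible arrangement.

```python
import tempfile
from pathlib import Path

# Hypothetical layout, created in a temporary directory for illustration only.
base = Path(tempfile.mkdtemp())
src_dir = base / "my_project" / "src"
raw_dir = base / "my_project" / "data" / "raw"
processed_dir = base / "my_project" / "data" / "processed"
for d in (src_dir, raw_dir, processed_dir):
    d.mkdir(parents=True)

# project_root is the directory that holds the src folder
project_root = src_dir.parent
# data_root is the directory that holds all the data folders
data_root = raw_dir.parent

print(project_root.name)  # my_project
print(data_root.name)     # data
```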
Usage
Data folders and logger
from cutility import cutils, logger
# point data_root at your data folder
# point config_root at your config folder
cu = cutils.Cutils(
    data_root="path/to/data/folder",
    config_root="path/to/config/folder",  # currently only supports .yml files
    verbose=True,
)
log = logger.Logger()
log.i("This is info message")
# warning, critical, and debug messages are also supported
Getting names_list
I have curated a list of first names and last names from public GitHub databases and compiled it in a GitHub gist. Use this command to download the names data:
wget https://gist.githubusercontent.com/sagarsrc/e6c7361f9ba6a64b2c9ac5bb10f0285a/raw/fbcca7c6821e7aff285271a6ce42361bbe95cc0c/pii_names.json
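Once downloaded, the file has to be loaded into a Python list before it can be passed as names_list. The layout of pii_names.json is not described here, so the helper below (`load_names`, a hypothetical name that is not part of cutility) is only a sketch: it assumes the JSON is either a flat list of strings or an object whose values are lists of strings, and flattens both into one de-duplicated list.

```python
import json
from pathlib import Path


def load_names(path):
    """Load names from a JSON file into a flat, de-duplicated list.

    Assumption: the file holds either a list of strings or an object
    whose values are lists of strings (for example first and last names).
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        groups = [v for v in data.values() if isinstance(v, list)]
    elif isinstance(data, list):
        groups = [data]
    else:
        groups = []

    names = []
    seen = set()
    for group in groups:
        for name in group:
            # keep only strings, preserving first-seen order
            if isinstance(name, str) and name not in seen:
                seen.add(name)
                names.append(name)
    return names
```

Under those assumptions, `names_list = load_names("pii_names.json")` would give a list suitable for the snippets that follow. If the actual file is structured differently, the flattening step needs adjusting.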
Generic cleaner
Use this snippet to collectively apply multiple cleaning functions
from cutility.cleaners.clean import GenCleaner as cc
all_cleaning_steps = [
    # text cleaning
    (cc.clean_emojis, {}),
    (cc.clean_extra_newlines, {}),
    (cc.clean_extra_spaces, {}),
    (cc.clean_hashtags, {}),
    (cc.clean_profile_handle, {}),
    (cc.clean_symbols_except_punctuation, {}),
    (cc.clean_unicode_characters, {}),
    (cc.clean_web_links, {}),
    # pii cleaning
    (cc.replace_contacts, {"repl": " {{CONTACT}} "}),
    (cc.replace_emails, {"repl": " {{EMAIL}} "}),
    # names_list is the list of names loaded from pii_names.json
    (cc.replace_names, {"names_list": names_list, "repl": " {{PERSON_NAME}} "}),
]
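The list above pairs each cleaning function with its keyword arguments, but how the steps are applied is not shown. One plausible way, sketched below, is to pass the text through each (function, kwargs) pair in list order. `apply_steps` and the two stand-in cleaners are hypothetical and exist only so the example runs without cutility; they are not the library's implementation. The sketch assumes every cleaner takes the text as its first argument and returns the cleaned text.

```python
import re


def strip_hashtags(text):
    # stand-in for a cleaner that needs no extra arguments
    return re.sub(r"#\w+", "", text)


def mask_emails(text, repl):
    # stand-in for a cleaner that takes a replacement string
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", repl, text)


def apply_steps(text, steps):
    """Apply each (function, kwargs) pair to the text, in list order."""
    for func, kwargs in steps:
        text = func(text, **kwargs)
    return text


steps = [
    (strip_hashtags, {}),
    (mask_emails, {"repl": "{{EMAIL}}"}),
]
cleaned = apply_steps("Mail me at jane@example.com #contact", steps)
print(repr(cleaned))
```

Because each step sees the output of the previous one, the order of the list matters. For instance, running a whitespace cleaner last would tidy up gaps left behind by earlier removals.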
Text cleaner
Use this snippet to individually apply simple cleaning functions
# Import the TextCleaner class
from cleaners.text_cleaner import TextCleaner
# Create an instance of TextCleaner
tc = TextCleaner()
# Sample text for demonstration
sample_text = "Check out this link: https://example.com. 😎 #Python @user1"
# Step 1: Clean web links
text_without_links = tc.clean_web_links(sample_text)
# Step 2: Clean profile handles
text_without_handles = tc.clean_profile_handle(text_without_links)
# Step 3: Clean hashtags
text_without_hashtags = tc.clean_hashtags(text_without_handles)
# Step 4: Clean emojis
text_without_emojis = tc.clean_emojis(text_without_hashtags)
# Step 5: Clean extra spaces
final_cleaned_text = tc.clean_extra_spaces(text_without_emojis)
# output
'Check out this link: '
PII cleaner
Use this snippet to individually apply PII cleaning functions
from cleaners.pii_cleaner import PiiCleaner
pc = PiiCleaner()
text_with_pii = "John's email is john.doe@example.com, and his phone number is +1 555-1234."
# Replace names with a generic string
text_without_names = pc.replace_names(text_with_pii, names_list=["John", "Doe", "Jane", "Smith"], repl='{{PERSON_NAME}}')
# Replace emails with a generic string
text_without_emails = pc.replace_emails(text_without_names, repl='{{EMAIL}}')
# Replace phone numbers with a generic string
text_without_contacts = pc.replace_contacts(text_without_emails, repl='{{PHONE}}')
print(text_with_pii)
print(text_without_contacts)
# output (text_without_contacts)
"{{PERSON_NAME}}'s email is {{EMAIL}}, and his phone number is {{PHONE}}."