Preprocessing of required data for customer service purposes
Project description
preprocessing_pgp
preprocessing_pgp, a preprocessing library for any kind of data, is a suite of open source Python modules and preprocessing techniques supporting research and development in machine learning. preprocessing_pgp requires Python 3.6, 3.7, 3.8, 3.9, or 3.10.
Installation
To install the current release:
pip install preprocessing-pgp
To install a specific version (e.g. 0.1.3):
pip install preprocessing-pgp==0.1.3
To upgrade the package to the latest version:
pip install --upgrade preprocessing-pgp
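After installing or upgrading, it can be handy to confirm which version ended up in the environment. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8 onward, so it will not run on 3.6 or 3.7). `installed_version` is a hypothetical helper name, not part of preprocessing_pgp.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


# The distribution name matches the one used with pip above.
print(installed_version("preprocessing-pgp"))
```

The call prints a version string when the package is installed and `None` otherwise.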
Examples
1. Preprocessing Name
python
>>> import preprocessing_pgp as pgp
>>> pgp.preprocess.basic_preprocess_name('Phan Thị Thúy Hằng *$%!@#')
Phan Thị Thúy Hằng
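To give a feel for what this kind of name cleaning involves, here is a minimal stand-alone sketch that keeps letters (including Vietnamese letters with diacritics) and drops everything else. It is an illustration under assumptions, not the implementation of `basic_preprocess_name`; `clean_name_sketch` is a hypothetical helper and the library may apply further rules.

```python
import unicodedata


def clean_name_sketch(name):
    """Keep letters and spaces only, then collapse repeated whitespace."""
    # NFC composes base letters and combining marks into single code points,
    # so accented Vietnamese letters are recognised by str.isalpha().
    name = unicodedata.normalize("NFC", name)
    # Replace every character that is neither a letter nor whitespace with a space.
    kept = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in name)
    # Splitting and re-joining trims the ends and collapses runs of spaces.
    return " ".join(kept.split())


print(clean_name_sketch("Phan Thị Thúy Hằng *$%!@#"))
```

On the input from the example above, this sketch yields the same cleaned name with the trailing symbols removed.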
2. Extracting Phones
python
>>> import pandas as pd
>>> from preprocessing_pgp.phone.extractor import extract_valid_phone
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> extracted_data = extract_valid_phone(phones=data, phone_col='phone')
# OF PHONE CLEANED : 0
Sample of non-clean phones:
Empty DataFrame
Columns: [id, phone, clean_phone]
Index: []
100%|██████████| ####/#### [00:00<00:00, ####it/s]
# OF PHONE 10 NUM VALID : ####
# OF PHONE 11 NUM VALID : ####
0it [00:00, ?it/s]
# OF OLD PHONE CONVERTED : ####
# OF OLD LANDLINE PHONE : ####
100%|██████████| ####/#### [00:00<00:00, ####it/s]
# OF VALID PHONE : ####
# OF INVALID PHONE : ####
Sample of invalid phones:
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| | id | phone | is_phone_valid | is_mobi | is_new_mobi | is_old_mobi | is_new_landline | is_old_landline | phone_convert |
+======+=========+=============+==================+===========+===============+===============+===================+===================+=================+
| 47 | ####### | 083###### | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 317 | ####### | 098###### | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 398 | ####### | 039######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 503 | ####### | 093######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1261 | ####### | 096######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1370 | ####### | 097######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1554 | ####### | 098######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2469 | ####### | 032######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2609 | ####### | 086######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2750 | ####### | 078######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
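The log above mentions 10- and 11-digit numbers and "old" phones being converted. As a rough illustration of what such a conversion can look like, the sketch below strips non-digit characters and rewrites old 11-digit Viettel prefixes (016x) to their 10-digit replacements (03x). This is an assumption-laden sketch, not the logic of `extract_valid_phone`: the mapping is deliberately partial, `normalize_phone_sketch` is a hypothetical helper, and the phone numbers are made up.

```python
import re

# Partial, illustrative mapping of old 11-digit Viettel prefixes to the
# 10-digit prefixes that replaced them (0162 -> 032, ..., 0169 -> 039).
# The table used by the library itself is not shown in this document.
OLD_TO_NEW_PREFIX = {"016%d" % d: "03%d" % d for d in range(2, 10)}


def normalize_phone_sketch(raw):
    """Strip non-digits and convert a known old 11-digit prefix, if present."""
    digits = re.sub(r"[^0-9]", "", raw)
    if len(digits) == 11:
        new_prefix = OLD_TO_NEW_PREFIX.get(digits[:4])
        if new_prefix is not None:
            # Swap the 4-digit old prefix for its 3-digit replacement.
            return new_prefix + digits[4:]
    return digits


print(normalize_phone_sketch("0169 123 4567"))  # 0391234567
print(normalize_phone_sketch("098-765-4321"))   # 0987654321
```

Numbers whose prefix is not in the table are returned with only the non-digit characters removed, so a real validator would still need to check them separately.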
3. Verify Card IDs
python
>>> import pandas as pd
>>> from preprocessing_pgp.card.validation import verify_card
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> verified_data = verify_card(data, card_col='card_id')
##### CLEANSING #####
# CARD ID CONTAINS NON-DIGIT CHARACTERS: 7398
SAMPLE OF CARDS WITH NON-DIGIT CHARACTERS:
card_id splitted_card_id is_valid
####### B####### b####### False
####### C####### c####### False
####### G###### g###### False
####### A######## a######## False
####### ###########k ###########k False
####### ###########k ###########k False
####### C####### c####### False
####### B####### b####### False
####### PT AR####### ptar####### False
####### E######## e######## False
# CARD OF LENGTH 9 OR 12: #######
STATISTIC:
# VALID: #######
# INVALID: #######
# CARD OF LENGTH 8 OR 11: ###
STATISTIC:
# VALID: ###
# INVALID: ###
# CARD WITH OTHER LENGTH: ####
# PASSPORT FOUND: ####
SAMPLE OF PASSPORT:
card_id splitted_card_id is_valid card_length clean_splitted_card_id is_passport
####### B####### b####### True 8 B####### True
####### C####### c####### True 8 C####### True
####### C####### c####### True 8 C####### True
####### B####### b####### True 8 B####### True
####### B####### b####### True 8 B####### True
####### B####### b####### True 8 B####### True
####### C####### c####### True 8 C####### True
####### B####### b####### True 8 B####### True
####### B####### b####### True 8 B####### True
####### B####### b####### True 8 B####### True
##### GENERAL CARD ID REPORT #####
COHORT SIZE: #######
VALID CARD: #######
INVALID CARD: #######
PASSPORT: ####
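The report above suggests that card IDs are lowercased with whitespace removed (the `splitted_card_id` column), then grouped by length, with a single letter followed by seven digits reported as a passport. The stand-alone sketch below mirrors that grouping. It is inferred from the log output only, not from the library's source; `classify_card_sketch` and its labels are hypothetical, and the IDs are made up.

```python
import re


def classify_card_sketch(card_id):
    """Group a raw card ID into the buckets named in the report above."""
    # Mirror the cleaned form seen in the log: no whitespace, lowercase.
    cleaned = re.sub(r"\s+", "", card_id).lower()
    if re.fullmatch(r"[0-9]+", cleaned):
        if len(cleaned) in (9, 12):
            return "length_9_or_12"
        if len(cleaned) in (8, 11):
            return "length_8_or_11"
        return "other"
    # Assumed passport shape: one letter followed by seven digits.
    if re.fullmatch(r"[a-z][0-9]{7}", cleaned):
        return "passport_like"
    return "other"


for sample in ["012345678", "12345678", "B1234567", "PT AR1234567"]:
    print(sample, "->", classify_card_sketch(sample))
```

Falling into a length bucket does not make an ID valid; the report lists valid and invalid counts separately within each bucket, so `verify_card` evidently applies further checks.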
4. Enrich Vietnamese Names (New Features)
python
>>> import pandas as pd
>>> from preprocessing_pgp.name.enrich_name import process_enrich
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> enrich_data, _ = process_enrich(data, name_col='name')
Basic pre-processing names...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 19669.68it/s]
--------------------
0 names have been clean!
--------------------
Filling diacritics to names...
100%|███████████████████████████████████████| 1000/1000 [01:29<00:00, 11.23it/s]
AVG prediction time : 0.0890703010559082s
Applying rule-based postprocess...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 38292.26it/s]
AVG rb time : 2.671933174133301e-05s
>>> enrich_data.columns
Index(['name', 'predict', 'final'], dtype='object')
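The enrichment step fills in diacritics on names that lack them. One way to try it on your own data is to strip the diacritics from names you know to be correct, run the enrichment, and compare the `final` column with the originals. The sketch below covers only the stripping half, using the standard library. `strip_diacritics` is a hypothetical helper, not part of preprocessing_pgp.

```python
import unicodedata


def strip_diacritics(name):
    """Remove Vietnamese diacritics, e.g. to build test input for enrichment."""
    # The letter đ/Đ has no Unicode decomposition, so map it explicitly.
    name = name.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks (Unicode category Mn) and keep the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("Phan Thị Thúy Hằng"))  # Phan Thi Thuy Hang
```

Feeding the stripped names to `process_enrich` and counting how many `final` values match the originals gives a quick, informal accuracy check.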