Preprocessing data required for customer service purposes
Project description
preprocessing_pgp
preprocessing_pgp, a preprocessing library for any kind of data, is a suite of open-source Python modules and preprocessing techniques that supports research and development in machine learning. preprocessing_pgp requires Python 3.6, 3.7, 3.8, 3.9, or 3.10.
Installation
To install the current release:
pip install preprocessing-pgp
To install a specific version (e.g. 0.1.3):
pip install preprocessing-pgp==0.1.3
To upgrade the package to the latest version:
pip install --upgrade preprocessing-pgp
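After installing or upgrading, it can be useful to confirm which release is actually present in the environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of preprocessing_pgp, and `importlib.metadata` is only available on Python 3.8 and later.

```python
from importlib import metadata  # standard library on Python 3.8+
from typing import Optional


def installed_version(dist_name: str = "preprocessing-pgp") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it is not part of preprocessing_pgp.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("preprocessing-pgp is not installed in this environment")
else:
    print("preprocessing-pgp version:", version)
```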
Features
1. Vietnamese Naming Functions
1.1. Preprocessing Names
python
>>> import preprocessing_pgp as pgp
>>> pgp.preprocess.basic_preprocess_name('Phan Thị Thúy Hằng *$%!@#')
Phan Thị Thúy Hằng
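To give a rough idea of what this kind of cleaning involves, here is a minimal standalone sketch. It is not the library's implementation: the function `clean_name_sketch` and its rules (drop symbols and digits, collapse whitespace) are assumptions inferred only from the example above.

```python
import re
import unicodedata


def clean_name_sketch(name: str) -> str:
    """Illustrative stand-in only; not how basic_preprocess_name is implemented."""
    # Compose characters first so accented Vietnamese letters are single
    # code points and count as word characters in the regex below.
    name = unicodedata.normalize("NFC", name)
    # Replace punctuation/symbols, digits and underscores with a space.
    name = re.sub(r"[^\w\s]|[\d_]", " ", name)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(name.split())


print(clean_name_sketch("Nguyen  Van A *$%!@# 123"))  # prints: Nguyen Van A
```

Normalizing to NFC before filtering matters for Vietnamese text: in decomposed form the tone marks are separate combining characters, which the regex would otherwise strip.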
1.2. Enriching Vietnamese Names (New Feature)
python
>>> import pandas as pd
>>> from preprocessing_pgp.name.enrich_name import process_enrich
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> enrich_data, _ = process_enrich(data, name_col='name')
Basic pre-processing names...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 19669.68it/s]
--------------------
0 names have been clean!
--------------------
Filling diacritics to names...
100%|███████████████████████████████████████| 1000/1000 [01:29<00:00, 11.23it/s]
AVG prediction time : 0.0890703010559082s
Applying rule-based postprocess...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 38292.26it/s]
AVG rb time : 2.671933174133301e-05s
>>> enrich_data.columns
Index(['name', 'predict', 'final'], dtype='object')
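Since the log above indicates that enrichment fills in diacritics, one simple sanity check is to confirm that an enriched name differs from the raw name only by its diacritics. The sketch below assumes that the `final` column holds the post-processed name (an interpretation of the log, not a documented guarantee); the helpers `strip_diacritics` and `only_diacritics_changed` are hypothetical, and plain dictionaries stand in for DataFrame rows.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. for comparing raw and enriched names."""
    # U+0111 and U+0110 (d with stroke) have no canonical decomposition,
    # so they are mapped by hand.
    text = text.replace("\u0111", "d").replace("\u0110", "D")
    # Decompose letters into base letter + combining marks, then drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def only_diacritics_changed(raw: str, enriched: str) -> bool:
    """True if the two names are identical once diacritics and case are ignored."""
    return strip_diacritics(raw).lower() == strip_diacritics(enriched).lower()


# Hypothetical rows using the column names shown in the output above.
rows = [
    {"name": "Phan Thi Thuy Hang", "final": "Phan Thị Thúy Hằng"},
    {"name": "Nguyen Van A", "final": "Trần Văn A"},  # deliberately mismatched
]
suspicious = [r for r in rows if not only_diacritics_changed(r["name"], r["final"])]
print(len(suspicious))  # prints: 1
```

Rows flagged this way are not necessarily wrong, but they are a reasonable place to start a manual review.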
2. Extracting Vietnamese Phones
python
>>> import pandas as pd
>>> from preprocessing_pgp.phone.extractor import extract_valid_phone
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> extracted_data = extract_valid_phone(phones=data, phone_col='phone')
# OF PHONE CLEANED : 0
Sample of non-clean phones:
Empty DataFrame
Columns: [id, phone, clean_phone]
Index: []
100%|██████████| ####/#### [00:00<00:00, ####it/s]
# OF PHONE 10 NUM VALID : ####
# OF PHONE 11 NUM VALID : ####
0it [00:00, ?it/s]
# OF OLD PHONE CONVERTED : ####
# OF OLD LANDLINE PHONE : ####
100%|██████████| ####/#### [00:00<00:00, ####it/s]
# OF VALID PHONE : ####
# OF INVALID PHONE : ####
Sample of invalid phones:
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| | id | phone | is_phone_valid | is_mobi | is_new_mobi | is_old_mobi | is_new_landline | is_old_landline | phone_convert |
+======+=========+=============+==================+===========+===============+===============+===================+===================+=================+
| 47 | ####### | 083###### | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 317 | ####### | 098###### | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 398 | ####### | 039######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 503 | ####### | 093######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1261 | ####### | 096######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1370 | ####### | 097######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1554 | ####### | 098######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2469 | ####### | 032######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2609 | ####### | 086######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2750 | ####### | 078######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
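The invalid samples above appear to have 9 or 11 digits rather than the 10 digits of a current mobile number. The sketch below illustrates that kind of length-and-prefix check. It is not the library's logic: `classify_phone_sketch` is hypothetical, the leading-digit rule is a simplification, landlines are ignored, and the old-to-new prefix table is deliberately partial (only the 016x to 03x conversions). Only the flag names are borrowed from the table above.

```python
import re

# Simplification: treat any 10-digit number with one of these leads as a mobile.
NEW_MOBILE_LEADS = {"03", "05", "07", "08", "09"}

# Partial, illustrative table of old 11-digit prefixes and their 10-digit
# replacements; a real converter would need the complete list.
OLD_TO_NEW_PREFIX = {
    "0162": "032", "0163": "033", "0164": "034", "0165": "035",
    "0166": "036", "0167": "037", "0168": "038", "0169": "039",
}


def classify_phone_sketch(raw: str) -> dict:
    """Illustrative classifier; not how extract_valid_phone is implemented."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    flags = {
        "is_phone_valid": False,
        "is_new_mobi": False,
        "is_old_mobi": False,
        "phone_convert": None,
    }
    if len(digits) == 10 and digits[:2] in NEW_MOBILE_LEADS:
        # Assumption: a number already in the new format converts to itself.
        flags.update(is_phone_valid=True, is_new_mobi=True, phone_convert=digits)
    elif len(digits) == 11 and digits[:4] in OLD_TO_NEW_PREFIX:
        converted = OLD_TO_NEW_PREFIX[digits[:4]] + digits[4:]
        flags.update(is_phone_valid=True, is_old_mobi=True, phone_convert=converted)
    return flags


print(classify_phone_sketch("0168 123 4567")["phone_convert"])  # prints: 0381234567
print(classify_phone_sketch("083123456")["is_phone_valid"])     # prints: False
```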
3. Verifying Vietnamese Card IDs
python
>>> import pandas as pd
>>> from preprocessing_pgp.card.validation import verify_card
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> verified_data = verify_card(data, card_col='card_id')
##### CLEANSING #####
# NAN CARD ID: ####
# CARD ID CONTAINS NON-DIGIT CHARACTERS: ####
SAMPLE OF CARDS WITH NON-DIGIT CHARACTERS:
card_id is_valid is_personal_id
####### B####### False False
####### C####### False False
####### G###### False False
####### A######## False False
####### ###########k False False
####### ###########k False False
####### C####### False False
####### B####### False False
####### PT AR####### False False
####### E######## False False
# CARD OF LENGTH 9 OR 12: #######
STATISTIC:
True ######
False #####
Name: is_valid, dtype: int64
# CARD OF LENGTH 8 OR 11: ###
STATISTIC:
True ######
False #####
Name: is_valid, dtype: int64
# CARD WITH OTHER LENGTH: ####
# PASSPORT FOUND: ####
SAMPLE OF PASSPORT:
card_id is_valid card_length clean_card_id is_passport
####### B####### True 8 B####### True
####### C####### True 8 C####### True
####### C####### True 8 C####### True
####### B####### True 8 B####### True
####### B####### True 8 B####### True
####### B####### True 8 B####### True
####### C####### True 8 C####### True
####### B####### True 8 B####### True
####### B####### True 8 B####### True
####### B####### True 8 B####### True
# DRIVER LICENSE FOUND: 41461
SAMPLE OF DRIVER LICENSE:
card_id is_valid is_personal_id ... clean_card_id is_passport is_driver_license
47 0########### True False ... 0########### False True
74 0########### True False ... 0########### False True
170 0########### True False ... 0########### False True
179 0########### True False ... 0########### False True
206 0########### True False ... 0########### False True
282 0########### True False ... 0########### False True
295 0########### True False ... 0########### False True
616 0########### True False ... 0########### False True
663 0########### True False ... 0########### False True
671 0########### True False ... 0########### False True
##### GENERAL CARD ID REPORT #####
COHORT SIZE: #######
STATISTIC:
True ######
False #####
PASSPORT: ####
DRIVER LICENSE: ####
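The report above groups card IDs by shape: missing values, IDs with non-digit characters, digit-only IDs of length 9 or 12, length 8 or 11, other lengths, and passport-like IDs (a letter followed by seven digits in the samples). The sketch below reproduces only that bucketing. `card_shape_sketch` and its labels are hypothetical, and it does not attempt the validity checks that `verify_card` performs.

```python
import re
from typing import Optional

# Shape of the passport samples in the report: one capital letter + 7 digits.
PASSPORT_LIKE = re.compile(r"[A-Z][0-9]{7}")
DIGITS_ONLY = re.compile(r"[0-9]+")


def card_shape_sketch(card_id: Optional[str]) -> str:
    """Bucket a card ID by shape only; not how verify_card is implemented."""
    if card_id is None or not card_id.strip():
        return "missing"
    value = card_id.strip()
    if PASSPORT_LIKE.fullmatch(value):
        return "passport_like"
    if not DIGITS_ONLY.fullmatch(value):
        return "non_digit"
    if len(value) in (9, 12):
        return "length_9_or_12"
    if len(value) in (8, 11):
        return "length_8_or_11"
    return "other_length"


for sample in ["B1234567", "012345678", "12345678", "PT AR1234567", None]:
    print(sample, "->", card_shape_sketch(sample))
```

The passport pattern is checked before the non-digit test so that passport-like IDs are not lumped together with malformed ones.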