Preprocessing of data required for customer service purposes
preprocessing_pgp
preprocessing_pgp, a preprocessing library for any kind of data, is a suite of open source Python modules and preprocessing techniques that supports research and development in Machine Learning. preprocessing_pgp requires Python 3.6, 3.7, 3.8, 3.9, or 3.10.
Installation
To install the current release:
pip install preprocessing-pgp
To install a specific version (e.g. 0.1.3):
pip install preprocessing-pgp==0.1.3
To upgrade the package to the latest version:
pip install --upgrade preprocessing-pgp
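To confirm which release ended up in the active environment, a small check along these lines may help. It is only a sketch: `installed_version` is a hypothetical helper, not part of preprocessing_pgp, and it relies on the standard-library `importlib.metadata` module, which is available from Python 3.8 onward.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of preprocessing_pgp.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version string, or None when the package is missing.
print(installed_version("preprocessing-pgp"))
```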
Examples
1. Preprocessing Name
>>> import preprocessing_pgp as pgp
>>> pgp.preprocess.basic_preprocess_name('Phan Thị Thúy Hằng *$%!@#')
Phan Thị Thúy Hằng
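To give a rough idea of what this kind of cleanup involves, the sketch below removes punctuation, digits, and extra whitespace using only the standard library. It is an assumption about the general technique, not the library's implementation: `strip_non_letters` is a hypothetical name, and `basic_preprocess_name` may apply further rules that are not shown here.

```python
import re
import unicodedata

# Matches anything that is not a letter or whitespace: punctuation, symbols,
# digits, and underscores. Hypothetical pattern, not taken from preprocessing_pgp.
_NON_NAME_CHARS = re.compile(r"[^\w\s]|[\d_]")


def strip_non_letters(name: str) -> str:
    """Keep letters (including Vietnamese diacritics) and collapse whitespace."""
    # Compose characters first so accented letters are single code points
    # that the \w class recognises as letters.
    composed = unicodedata.normalize("NFC", name)
    letters_only = _NON_NAME_CHARS.sub(" ", composed)
    return " ".join(letters_only.split())


print(strip_non_letters("Phan Thị Thúy Hằng *$%!@#"))
```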
2. Extracting Phones
>>> import pandas as pd
>>> from preprocessing_pgp.phone.extractor import extract_valid_phone
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> extracted_data = extract_valid_phone(phones=data, phone_col='phone')
# OF PHONE CLEANED : 0
Sample of non-clean phones:
Empty DataFrame
Columns: [id, phone, clean_phone]
Index: []
100%|██████████| ####/#### [00:00<00:00, ####it/s]
# OF PHONE 10 NUM VALID : ####
# OF PHONE 11 NUM VALID : ####
0it [00:00, ?it/s]
# OF OLD PHONE CONVERTED : ####
# OF OLD LANDLINE PHONE : ####
100%|██████████| ####/#### [00:00<00:00, ####it/s]
# OF VALID PHONE : ####
# OF INVALID PHONE : ####
Sample of invalid phones:
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| | id | phone | is_phone_valid | is_mobi | is_new_mobi | is_old_mobi | is_new_landline | is_old_landline | phone_convert |
+======+=========+=============+==================+===========+===============+===============+===================+===================+=================+
| 47 | ####### | 083###### | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 317 | ####### | 098###### | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 398 | ####### | 039######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 503 | ####### | 093######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1261 | ####### | 096######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1370 | ####### | 097######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1554 | ####### | 098######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2469 | ####### | 032######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2609 | ####### | 086######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2750 | ####### | 078######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
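The log above distinguishes 10-digit numbers, 11-digit numbers, and old numbers that were converted. The sketch below illustrates that idea on single strings with the standard library only. It rests on simplifying assumptions: `normalize_mobile` is a hypothetical function, the prefix table covers only the old Viettel range 0162-0169 (renumbered to 032-039), and landlines are ignored. The rules `extract_valid_phone` actually applies may differ.

```python
import re
from typing import Optional

# Illustrative subset only (assumption): old 11-digit prefixes 0162-0169
# map to the 10-digit prefixes 032-039. A full table would cover more carriers.
OLD_TO_NEW_PREFIX = {f"016{d}": f"03{d}" for d in "23456789"}

# Simplified rule (assumption): current mobile numbers have 10 digits and
# start with 03, 05, 07, 08, or 09.
NEW_MOBILE = re.compile(r"0[35789]\d{8}")


def normalize_mobile(raw: str) -> Optional[str]:
    """Return a 10-digit mobile number, or None if the input does not fit the rules."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dots, dashes, and so on
    if NEW_MOBILE.fullmatch(digits):
        return digits
    if len(digits) == 11:
        new_prefix = OLD_TO_NEW_PREFIX.get(digits[:4])
        if new_prefix is not None:
            # Replace the 4-digit old prefix with its 3-digit successor.
            return new_prefix + digits[4:]
    return None


for sample in ["0912345678", "0168 123 4567", "083123456"]:
    print(sample, "->", normalize_mobile(sample))
```

Returning `None` for anything that does not match keeps invalid inputs easy to filter out afterwards, similar in spirit to the `is_phone_valid` column shown in the table.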
3. Verify Card IDs
>>> import pandas as pd
>>> from preprocessing_pgp.card.validation import verify_card
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> verified_data = verify_card(data, card_col='card_id')
# NON CLEAN CARD ID: ####
# OF VALID CARD LENGTH: ####
# OF POSSIBLE CARD LENGTH: ####
# OF INVALID CARD LENGTH: ####
# CORRECT LENGTH CARD STATISTIC:
True #####
False #####
Name: is_valid, dtype: int64
# POSSIBLE LENGTH CARD STATISTIC:
False #####
True #####
Name: is_valid, dtype: int64
>>> verified_data.head(3)
+----+--------------+------------+---------------+-----------------+
| | card_id | is_valid | card_length | clean_card_id |
+====+==============+============+===============+=================+
| 0 | 035092###### | True | 12 | 035092###### |
+----+--------------+------------+---------------+-----------------+
| 1 | 14226#### | True | 9 | 14226#### |
+----+--------------+------------+---------------+-----------------+
| 2 | 15153#### | True | 9 | 15153#### |
+----+--------------+------------+---------------+-----------------+
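The sample rows show 9-digit and 12-digit card IDs marked as valid. A minimal length check in that spirit could look like the sketch below. It is an assumption, not the library's logic: `check_card_length` and `VALID_LENGTHS` are hypothetical names, only numeric IDs are considered, and `verify_card` may run further checks (the log also mentions "possible" lengths, which this sketch does not model).

```python
import re
from typing import NamedTuple

# Assumption based on the sample output: 9-digit and 12-digit IDs count as valid.
VALID_LENGTHS = {9, 12}


class CardCheck(NamedTuple):
    clean_card_id: str
    card_length: int
    is_valid: bool


def check_card_length(card_id: str) -> CardCheck:
    """Strip non-digit characters and test the remaining length."""
    clean = re.sub(r"\D", "", card_id)
    return CardCheck(clean, len(clean), len(clean) in VALID_LENGTHS)


for sample in ["035092123456", "142-261-234", "1234"]:
    print(check_card_length(sample))
```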
4. Enrich Vietnamese Names (New Features)
>>> import pandas as pd
>>> from preprocessing_pgp.name.enrich_name import process_enrich
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> enrich_data, _ = process_enrich(data, name_col='name')
Basic pre-processing names...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 19669.68it/s]
--------------------
0 names have been clean!
--------------------
Filling diacritics to names...
100%|███████████████████████████████████████| 1000/1000 [01:29<00:00, 11.23it/s]
AVG prediction time : 0.0890703010559082s
Applying rule-based postprocess...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 38292.26it/s]
AVG rb time : 2.671933174133301e-05s
>>> enrich_data.columns
Index(['name', 'predict', 'final'], dtype='object')
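The enrichment step fills diacritics into names that lack them. To see what such input looks like, the sketch below does the reverse with the standard library: it strips diacritics from a correctly written name, which can be handy for building small test cases. `strip_diacritics` is a hypothetical helper and is not part of preprocessing_pgp.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Remove Vietnamese tone and vowel marks, e.g. for building test inputs."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # The letter đ/Đ has no decomposition, so it is mapped explicitly.
    return without_marks.replace("đ", "d").replace("Đ", "D")


print(strip_diacritics("Phan Thị Thúy Hằng"))
```

Names produced this way could be passed in the `name` column to `process_enrich`, with the original spelling kept aside for comparison against the `final` column.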