Preprocessing required data for customer service purpose
Project description
preprocessing_pgp
preprocessing_pgp -- The Preprocessing library for any kind of data -- is a suit of open source Python modules, preprocessing techniques supporting research and development in Machine Learning. preprocessing_pgp requires Python version 3.6, 3.7, 3.8, 3.9, 3.10
Installation
To install the current release:
pip install preprocessing-pgp
To install the release with specific version (e.g. 0.1.3):
pip install preprocessing-pgp==0.1.3
To upgrade package to latest version:
pip install --upgrade preprocessing-pgp
Examples
1. Preprocessing Name
python
>>> import preprocessing_pgp as pgp
>>> pgp.preprocess.basic_preprocess_name('Phan Thị Thúy Hằng *$%!@#')
Phan Thị Thúy Hằng
2. Extracting Phones
python
>>> import pandas as pd
>>> from preprocessing_pgp.phone.extractor import extract_valid_phone
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> extracted_data = extract_valid_phone(phones=data, phone_col='phone')
# OF PHONE CLEANED : 0
Sample of non-clean phones:
Empty DataFrame
Columns: [id, phone, clean_phone]
Index: []
100%|██████████| ####/#### [00:00<00:00, ####it/s]
# OF PHONE 10 NUM VALID : ####
# OF PHONE 11 NUM VALID : ####
0it [00:00, ?it/s]
# OF OLD PHONE CONVERTED : ####
# OF OLD LANDLINE PHONE : ####
100%|██████████| ####/#### [00:00<00:00, ####it/s]
# OF VALID PHONE : ####
# OF INVALID PHONE : ####
Sample of invalid phones:
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| | id | phone | is_phone_valid | is_mobi | is_new_mobi | is_old_mobi | is_new_landline | is_old_landline | phone_convert |
+======+=========+=============+==================+===========+===============+===============+===================+===================+=================+
| 47 | ####### | 083###### | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 317 | ####### | 098###### | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 398 | ####### | 039######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 503 | ####### | 093######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1261 | ####### | 096######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1370 | ####### | 097######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1554 | ####### | 098######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2469 | ####### | 032######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2609 | ####### | 086######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2750 | ####### | 078######## | False | False | False | False | False | False | |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
3. Enrich Vietnamese Names (New Features)
python
>>> import pandas as pd
>>> from preprocessing_pgp.name.enrich_name import process_enrich
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> enrich_data, _ = process_enrich(data, name_col='name')
Basic pre-processing names...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 19669.68it/s]
--------------------
0 names have been clean!
--------------------
Filling diacritics to names...
100%|███████████████████████████████████████| 1000/1000 [01:29<00:00, 11.23it/s]
AVG prediction time : 0.0890703010559082s
Applying rule-based postprocess...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 38292.26it/s]
AVG rb time : 2.671933174133301e-05s
>>> enrich_data.columns
Index(['name', 'predict', 'final'], dtype='object')
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
preprocessing-pgp-0.1.4.tar.gz
(31.7 kB
view details)
Built Distribution
File details
Details for the file preprocessing-pgp-0.1.4.tar.gz
.
File metadata
- Download URL: preprocessing-pgp-0.1.4.tar.gz
- Upload date:
- Size: 31.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.13
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7895923db45d5293e4f57b1c0aed3b6a7a44f4fc5b017ab0a8ccac680de8c002 |
|
MD5 | e9292bd3e78d856202b2a7090dcb5b09 |
|
BLAKE2b-256 | bb8d99dff7fcbf5861535690f6cd8bd98a0842e3cf9b6052fe86c7568e71dc3b |
File details
Details for the file preprocessing_pgp-0.1.4-py3-none-any.whl
.
File metadata
- Download URL: preprocessing_pgp-0.1.4-py3-none-any.whl
- Upload date:
- Size: 35.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.13
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8ed6891a0d0f5cc7e54602363446cc7e6a441770b8d4ffdc2a9b60bc9c465dee |
|
MD5 | 1c4eda2731556a5c57fc8343e6bc5e5b |
|
BLAKE2b-256 | 537f7b37fd69adbb0c084674eb8980dff18b18f2d081eed61beacd0132151e9c |