Preprocessing of data required for customer service purposes

Project description

preprocessing_pgp

preprocessing_pgp (the preprocessing library for any kind of data) is a suite of open-source Python modules and preprocessing techniques that support research and development in machine learning. preprocessing_pgp requires Python 3.6, 3.7, 3.8, 3.9, or 3.10.

Installation

To install the current release:

pip install preprocessing-pgp

To install a specific release (e.g. 0.1.3):

pip install preprocessing-pgp==0.1.3

To upgrade the package to the latest version:

pip install --upgrade preprocessing-pgp
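
After installing or upgrading, it can help to confirm which release is actually present in the environment. The sketch below uses only the standard library's importlib.metadata (available from Python 3.8, so it does not cover the 3.6 and 3.7 interpreters listed above). The helper name installed_version is hypothetical and is not part of preprocessing_pgp.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "preprocessing-pgp") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not a preprocessing_pgp API.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string such as the one pinned with `==`, or None when
# the package is not installed in the current environment.
print(installed_version())
```

A None result usually means pip installed the package into a different interpreter than the one running the check.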

Examples

1. Preprocessing Name

python
>>> import preprocessing_pgp as pgp
>>> pgp.preprocess.basic_preprocess_name('Phan Thị    Thúy    Hằng *$%!@#')
Phan Thị Thúy Hằng
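
For readers who want a feel for what this kind of cleaning involves, here is a minimal standard-library sketch that removes symbols and collapses repeated whitespace while keeping Vietnamese diacritics. It is not the implementation of basic_preprocess_name, whose exact rules are not documented here; the function clean_name and its rules are assumptions made for illustration.

```python
import re
import unicodedata

# Anything that is not a letter, digit, or whitespace, plus the underscore
# (which `\w` would otherwise keep). This rule set is an assumption.
_SYMBOLS = re.compile(r"[^\w\s]|_")


def clean_name(raw: str) -> str:
    """Illustrative name cleaner (hypothetical, not the library's code)."""
    # NFC composes base letters and combining marks so that `\w`
    # recognises accented Vietnamese letters as single word characters.
    text = unicodedata.normalize("NFC", raw)
    text = _SYMBOLS.sub(" ", text)
    # split()/join() trims the ends and collapses runs of whitespace.
    return " ".join(text.split())


print(clean_name("Phan Thị    Thúy    Hằng *$%!@#"))
```

Normalizing to NFC first matters: without it, a name stored in decomposed form would lose its tone marks, because combining marks on their own are not word characters.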

2. Extracting Phones

python
>>> import pandas as pd
>>> from preprocessing_pgp.phone.extractor import extract_valid_phone
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> extracted_data = extract_valid_phone(phones=data, phone_col='phone')
# OF PHONE CLEANED : 0

Sample of non-clean phones:
Empty DataFrame
Columns: [id, phone, clean_phone]
Index: []

100%|██████████| ####/#### [00:00<00:00, ####it/s]

# OF PHONE 10 NUM VALID : ####


# OF PHONE 11 NUM VALID : ####


0it [00:00, ?it/s]

# OF OLD PHONE CONVERTED : ####


# OF OLD LANDLINE PHONE : ####

100%|██████████| ####/#### [00:00<00:00, ####it/s]

# OF VALID PHONE : ####

# OF INVALID PHONE : ####

Sample of invalid phones:
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
|      |      id |       phone | is_phone_valid   | is_mobi   | is_new_mobi   | is_old_mobi   | is_new_landline   | is_old_landline   | phone_convert   |
+======+=========+=============+==================+===========+===============+===============+===================+===================+=================+
|   47 | ####### |   083###### | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
|  317 | ####### |   098###### | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
|  398 | ####### | 039######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
|  503 | ####### | 093######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1261 | ####### | 096######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1370 | ####### | 097######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1554 | ####### | 098######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2469 | ####### | 032######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2609 | ####### | 086######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2750 | ####### | 078######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
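
The log above reports separate counts for 10-digit and 11-digit numbers. As a rough pre-check before running the extractor, the sketch below buckets raw strings by digit count using only the standard library. It does not reproduce the validation rules of extract_valid_phone: digit count alone does not make a number valid, as the 11-character entries in the invalid sample show. The names digits_only and length_bucket are hypothetical, and the sample numbers are made-up placeholders.

```python
import re
from collections import Counter


def digits_only(raw: str) -> str:
    """Drop every character that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", raw)


def length_bucket(raw: str) -> str:
    """Bucket a raw phone string by how many digits it contains.

    Illustrative only: this is a length check, not a validity check.
    """
    n = len(digits_only(raw))
    if n == 10:
        return "10_digits"
    if n == 11:
        return "11_digits"
    return "other"


# Placeholder values, not real phone numbers.
phones = ["0900 000 000", "0120-0000-000", "12345"]
print(Counter(length_bucket(p) for p in phones))
```

With pandas, the same function can be applied to a column via Series.map to see how the input is distributed before comparing against the counts the library prints.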

3. Verify Card IDs

python
>>> import pandas as pd
>>> from preprocessing_pgp.card.validation import verify_card
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> verified_data = verify_card(data, card_col='card_id')

# NON CLEAN CARD ID: ####



# OF VALID CARD LENGTH: ####



# OF POSSIBLE CARD LENGTH: ####



# OF INVALID CARD LENGTH: ####



# CORRECT LENGTH CARD STATISTIC:
True     #####
False    #####
Name: is_valid, dtype: int64



# POSSIBLE LENGTH CARD STATISTIC:
False    #####
True     #####
Name: is_valid, dtype: int64

>>> verified_data.head(3)
+----+--------------+------------+---------------+-----------------+
|    |      card_id | is_valid   |   card_length |   clean_card_id |
+====+==============+============+===============+=================+
|  0 | 035092###### | True       |            12 |    035092###### |
+----+--------------+------------+---------------+-----------------+
|  1 |    14226#### | True       |             9 |       14226#### |
+----+--------------+------------+---------------+-----------------+
|  2 |    15153#### | True       |             9 |       15153#### |
+----+--------------+------------+---------------+-----------------+
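
The sample rows above show 9-digit and 12-digit IDs flagged as valid. A simple length pre-check along those lines can be sketched with the standard library as follows. This is an assumption-based illustration, not the logic of verify_card, which may apply further rules (the log also mentions a "possible" length category that is not modelled here). The name has_expected_length is hypothetical.

```python
import re

# Matches exactly 9 or exactly 12 ASCII digits, the two lengths that
# appear as valid in the sample output. Other rules are not modelled.
_EXPECTED = re.compile(r"[0-9]{9}|[0-9]{12}")


def has_expected_length(card_id: str) -> bool:
    """Return True when the trimmed ID is 9 or 12 digits long.

    Illustrative helper, not part of preprocessing_pgp.
    """
    return _EXPECTED.fullmatch(card_id.strip()) is not None


# Placeholder IDs made of zeros, not real card numbers.
for candidate in ["000000000", " 000000000000 ", "0000000000", "00000000a"]:
    print(repr(candidate), has_expected_length(candidate))
```

IDs are kept as strings throughout, since converting them to integers would drop leading zeros such as the one in the first sample row.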

4. Enrich Vietnamese Names (New Features)

python
>>> import pandas as pd
>>> from preprocessing_pgp.name.enrich_name import process_enrich
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> enrich_data, _ = process_enrich(data, name_col='name')
Basic pre-processing names...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 19669.68it/s]



--------------------
0 names have been clean!
--------------------




Filling diacritics to names...
100%|███████████████████████████████████████| 1000/1000 [01:29<00:00, 11.23it/s]

AVG prediction time : 0.0890703010559082s



Applying rule-based postprocess...
100%|████████████████████████████████████| 1000/1000 [00:00<00:00, 38292.26it/s]

AVG rb time : 2.671933174133301e-05s


>>> enrich_data.columns
Index(['name', 'predict', 'final'], dtype='object')
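
One way to sanity-check the enriched output is to confirm that filling in diacritics did not alter the underlying letters. The standard-library sketch below strips Vietnamese diacritics and compares the base letters of two strings. It assumes that the 'final' column holds the enriched name, which the column listing suggests but does not state. The helpers strip_diacritics and same_base_letters are hypothetical and not part of preprocessing_pgp.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove tone and vowel marks from Vietnamese text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The letter "đ" has no Unicode decomposition, so map it explicitly.
    return base.replace("đ", "d").replace("Đ", "D")


def same_base_letters(original: str, enriched: str) -> bool:
    """True when both strings match once diacritics and case are ignored."""
    return strip_diacritics(original).casefold() == strip_diacritics(enriched).casefold()


print(strip_diacritics("Phan Thị Thúy Hằng"))
print(same_base_letters("phan thi thuy hang", "Phan Thị Thúy Hằng"))
```

Rows where this check fails are not necessarily errors, since a rule-based postprocess step may legitimately change letters, but they are a reasonable place to start a manual review.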
