Skip to main content

Preprocessing required data for customer service purpose

Project description

preprocessing_pgp

PyPI Python License Downloads linting: pylint

preprocessing_pgp -- The Preprocessing library for any kind of data -- is a suit of open source Python modules, preprocessing techniques supporting research and development in Machine Learning. preprocessing_pgp requires Python version 3.6, 3.7, 3.8, 3.9, 3.10


Installation

To install the current release:

pip install preprocessing-pgp

To install the release with specific version (e.g. 0.1.3):

pip install preprocessing-pgp==0.1.3

To upgrade package to latest version:

pip install --upgrade preprocessing-pgp

Features

1. Vietnamese Naming Functions

1.1. Preprocessing Names

python
>>> import preprocessing_pgp as pgp
>>> pgp.preprocess.basic_preprocess_name('Phan Thị    Thúy    Hằng *$%!@#')
Phan Thị Thúy Hằng

1.2. Enrich Vietnamese Names (New Features)

python
>>> import pandas as pd
>>> from preprocessing_pgp.name.enrich_name import process_enrich
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> enrich_data = process_enrich(data, name_col='name')


Cleansing Takes 0m0s


Enrich names takes 5m10s

>>> enrich_data.columns
Index(['name', 'predict', 'final'], dtype='object')

2. Extracting Vietnamese Phones

python
>>> import pandas as pd
>>> from preprocessing_pgp.phone.extractor import extract_valid_phone
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> extracted_data = extract_valid_phone(phones=data, phone_col='phone', print_info=True)
# OF PHONE CLEANED : 0

Sample of non-clean phones:
Empty DataFrame
Columns: [id, phone, clean_phone]
Index: []

100%|██████████| ####/#### [00:00<00:00, ####it/s]

# OF PHONE 10 NUM VALID : ####


# OF PHONE 11 NUM VALID : ####


0it [00:00, ?it/s]

# OF OLD PHONE CONVERTED : ####


# OF OLD LANDLINE PHONE : ####

100%|██████████| ####/#### [00:00<00:00, ####it/s]

# OF VALID PHONE : ####

# OF INVALID PHONE : ####

Sample of invalid phones:
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
|      |      id |       phone | is_phone_valid   | is_mobi   | is_new_mobi   | is_old_mobi   | is_new_landline   | is_old_landline   | phone_convert   |
+======+=========+=============+==================+===========+===============+===============+===================+===================+=================+
|   47 | ####### |   083###### | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
|  317 | ####### |   098###### | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
|  398 | ####### | 039######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
|  503 | ####### | 093######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1261 | ####### | 096######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1370 | ####### | 097######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 1554 | ####### | 098######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2469 | ####### | 032######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2609 | ####### | 086######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+
| 2750 | ####### | 078######## | False            | False     | False         | False         | False             | False             |                 |
+------+---------+-------------+------------------+-----------+---------------+---------------+-------------------+-------------------+-----------------+

3. Verify Vietnamese Card IDs

python
>>> import pandas as pd
>>> from preprocessing_pgp.card.validation import verify_card
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> verified_data = verify_card(data, card_col='card_id', print_info=True)

##### CLEANSING #####


# NAN CARD ID: ####


# CARD ID CONTAINS NON-DIGIT CHARACTERS: ####


SAMPLE OF CARDS WITH NON-DIGIT CHARACTERS:
              card_id  is_valid  is_personal_id
#######      B#######     False           False
#######      C#######     False           False
#######       G######     False           False
#######     A########     False           False
#######  ###########k     False           False
#######  ###########k     False           False
#######      C#######     False           False
#######      B#######     False           False
#######  PT AR#######     False           False
#######     E########     False           False



# CARD OF LENGTH 9 OR 12: #######
STATISTIC:
True     ######
False     #####
Name: is_valid, dtype: int64




# CARD OF LENGTH 8 OR 11: ###
STATISTIC:
True     ######
False     #####
Name: is_valid, dtype: int64



# CARD WITH OTHER LENGTH: ####
# PASSPORT FOUND: ####


SAMPLE OF PASSPORT:
          card_id  is_valid  card_length clean_card_id  is_passport
#######  B#######      True            8      B#######         True
#######  C#######      True            8      C#######         True
#######  C#######      True            8      C#######         True
#######  B#######      True            8      B#######         True
#######  B#######      True            8      B#######         True
#######  B#######      True            8      B#######         True
#######  C#######      True            8      C#######         True
#######  B#######      True            8      B#######         True
#######  B#######      True            8      B#######         True
#######  B#######      True            8      B#######         True




# DRIVER LICENSE FOUND: 41461


SAMPLE OF DRIVER LICENSE:
          card_id  is_valid  is_personal_id  ...  clean_card_id is_passport  is_driver_license
47   0###########      True           False  ...   0###########       False               True
74   0###########      True           False  ...   0###########       False               True
170  0###########      True           False  ...   0###########       False               True
179  0###########      True           False  ...   0###########       False               True
206  0###########      True           False  ...   0###########       False               True
282  0###########      True           False  ...   0###########       False               True
295  0###########      True           False  ...   0###########       False               True
616  0###########      True           False  ...   0###########       False               True
663  0###########      True           False  ...   0###########       False               True
671  0###########      True           False  ...   0###########       False               True


##### GENERAL CARD ID REPORT #####

COHORT SIZE: #######
STATISTIC:
True     ######
False     #####
PASSPORT: ####
DRIVER LICENSE: ####

4. Extract Information in Vietnamese Address

All the region codes traced are retrieve from Đơn Vị Hành Chính Việt Nam

Apart from original columns of dataframe, we also generate columns with specific meanings:

  • cleaned_<address_col> : The cleaned address retrieve from the raw address column
  • level 1 : The raw city extracted from the cleaned address
  • best level 1 : The beautified city traced from extracted raw city
  • level 1 code : The generated city code
  • level 2 : The raw district extracted from the cleaned address
  • best level 2 : The beautified district traced from extracted raw district
  • level 2 code : The generated district code
  • level 3 : The raw ward extracted from the cleaned address
  • best level 3 : The beautified ward traced from extracted raw ward
  • level 3 code : The generated ward code
  • remained address : The remaining address not being extracted
python
>>> import pandas as pd
>>> from preprocessing_pgp.address.extractor import extract_vi_address
>>> data = pd.read_parquet('/path/to/data.parquet')
>>> extracted_data = extract_vi_address(data, address_col='address')
Cleansing takes 0m0s


Extracting takes 0m22s


Code generation takes 0m3s

>>> extracted_data.columns
Index(['address', 'cleaned_address', 'level 1', 'best level 1', 'level 2',
       'best level 2', 'level 3', 'best level 3', 'remained address',
       'level 1 code', 'level 2 code', 'level 3 code'],
      dtype='object')

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

preprocessing-pgp-0.1.19.tar.gz (6.5 MB view details)

Uploaded Source

Built Distribution

preprocessing_pgp-0.1.19-py3-none-any.whl (6.6 MB view details)

Uploaded Python 3

File details

Details for the file preprocessing-pgp-0.1.19.tar.gz.

File metadata

  • Download URL: preprocessing-pgp-0.1.19.tar.gz
  • Upload date:
  • Size: 6.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for preprocessing-pgp-0.1.19.tar.gz
Algorithm Hash digest
SHA256 4ea2e4af3b749c3e663fba7bbfe8fd8f7a9174a142877dd4b438038f1f722b24
MD5 13f3660b75e1cd0f63aadfae63de64ad
BLAKE2b-256 4421110e9d9f4516352b32cfed68abe1b6fc8aef0feb302b3cbaac498b7d935a

See more details on using hashes here.

File details

Details for the file preprocessing_pgp-0.1.19-py3-none-any.whl.

File metadata

File hashes

Hashes for preprocessing_pgp-0.1.19-py3-none-any.whl
Algorithm Hash digest
SHA256 60cb9fd3d4877e81bda31c09f4bc700f58d6bb83f0426bad14ae64549f377d64
MD5 c77bd58806bbf7e8b75857166dccc775
BLAKE2b-256 5e3848c126fd6e5d5a05de6dd2dcabdc58c36553424c6774257c5d1b861185cb

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page