Vietnamese PII detection with regex recognizers, validators, and context scoring.
vipii
vipii is a Python library for detecting Vietnamese personally identifiable information (PII) using regex-based and NER-based recognizers.
Install
pip install vipii
For local development:
pip install -e ".[dev]"
Python API
from vipii import PIIDetector, Pattern

detector = PIIDetector()

# Register a custom pattern for customer IDs such as KH-123456.
detector.add_pattern(
    Pattern(label="CUSTOMER_ID", regex=r"\bKH-\d{6}\b", context_words=["mã khách hàng"])
)

matches = detector.detect(
    "Khách hàng Nguyễn Văn A, số điện thoại 0912 345 678, CCCD 001203000123."
)

# Each match exposes its label, the matched text, and a confidence score.
for match in matches:
    print(match.label, match.text, match.score)
Optional NER
Regex recognizers cover structured PII. For free-form names, locations, organizations, and addresses, enable an external Hugging Face token-classification model:
pip install "vipii[ner]"
vipii scan "Nguyễn Văn A sống tại Hà Nội" --ner-model your-vietnamese-ner-model
from vipii import PIIDetector
detector = PIIDetector(ner_model="your-vietnamese-ner-model")
matches = detector.detect("Nguyễn Văn A sống tại Hà Nội")
The NER layer maps model labels such as PER, LOC, and ORG to PERSON, LOCATION, and
ORGANIZATION. The model is not bundled; choose and evaluate one for your domain before production
use.
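The label mapping described above can be sketched in a few lines. This is an illustration only: `LABEL_MAP` and `normalize_label` are hypothetical names, and the handling of `B-`/`I-` prefixes assumes the model emits BIO-style tags, which the source does not state.

```python
from typing import Optional

# Hypothetical mapping table; vipii's internal table may differ.
LABEL_MAP = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}


def normalize_label(model_label: str) -> Optional[str]:
    """Strip a BIO prefix (B-/I-) if present and map the tag to a PII label.

    Returns None for tags that have no mapping (for example "O" or "MISC").
    """
    if model_label[:2] in ("B-", "I-"):
        model_label = model_label[2:]
    return LABEL_MAP.get(model_label.upper())
```

Returning `None` for unmapped tags lets the caller drop entities the detector does not report instead of guessing a label.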
CLI
vipii scan "Số điện thoại 0912 345 678 và CCCD 001203000123"
vipii scan examples/customer_service.txt
vipii scan examples/customer_service.txt --format json
vipii scan examples/customer_service.txt --redact
vipii scan "CCCD 001203000123" --redact
vipii scan "Mã khách hàng KH-123456" --config examples/custom_recognizers.yml
vipii scan "Nguyễn Văn A sống tại Hà Nội" --ner-model your-vietnamese-ner-model
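The output of `--redact` is not shown here, so the following is only a sketch of how span-based redaction generally works. The `Span` type, the `redact` helper, and the `[LABEL]` placeholder format are all assumptions for illustration, not vipii's API or output format.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with a [LABEL] placeholder.

    Spans are processed right to left so that earlier offsets stay valid
    after a replacement changes the string length. Overlapping spans are
    not handled in this sketch.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


print(redact("CCCD 001203000123", [Span(5, 17, "CCCD")]))  # CCCD [CCCD]
```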
YAML recognizer config
Built-in recognizers are loaded from src/vipii/builtin_recognizers.yml. You can append your own
recognizers from a YAML file without writing Python:
recognizers:
  - name: customer_id
    label: CUSTOMER_ID
    patterns:
      - regex: '\bKH-\d{6}\b'
    context_words: ["mã khách hàng", "customer id"]
    base_score: 0.6
The validator key is optional; set it only to apply one of vipii's built-in validators: cccd, cmnd, phone,
tax_code, bank_card, bank_account, passport, or vehicle_plate.
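A validator is an extra check applied to text that a regex has already matched, so that digit strings which merely look right are rejected. As an example of the idea, payment card numbers are commonly checked with the Luhn checksum. Whether vipii's bank_card validator uses Luhn is an assumption; `luhn_valid` below is a hypothetical helper, not part of the library.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters such as spaces or dashes are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

For instance, `luhn_valid("4111 1111 1111 1111")` is true for that widely used test number, while changing the final digit makes the check fail.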
Built-in recognizers
- CCCD and CMND
- PHONE_NUMBER
- MST
- BANK_CARD
- BANK_ACCOUNT
- PASSPORT
- VEHICLE_PLATE
The recognizers intentionally favor clear structured PII plus nearby Vietnamese context words such as
số điện thoại, cccd, mã số thuế, and biển số xe. Names and free-form addresses can be handled
by the optional NER layer.
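The combination of a structured pattern and nearby context words can be sketched with the standard library alone. The scoring rule here (a fixed `boost` when a context word appears within `window` characters before the match) is an assumption chosen for illustration; vipii's actual scoring formula is not described in this document, and `score_match` is a hypothetical helper.

```python
import re
import unicodedata


def _norm(s: str) -> str:
    # Normalize to NFC so composed and decomposed Vietnamese diacritics compare equal.
    return unicodedata.normalize("NFC", s).lower()


def score_match(text, match, context_words, base_score=0.6, boost=0.25, window=40):
    """Return base_score, raised by `boost` (capped at 1.0) when any context
    word occurs in the `window` characters preceding the match."""
    left = _norm(text[max(0, match.start() - window):match.start()])
    if any(_norm(word) in left for word in context_words):
        return min(1.0, base_score + boost)
    return base_score


pattern = re.compile(r"\bKH-\d{6}\b")
text = "Mã khách hàng KH-123456"
m = pattern.search(text)
print(m.group(), score_match(text, m, ["mã khách hàng", "customer id"]))
```

The same pattern with no context word in front of it keeps only the base score, which is how a context-aware recognizer separates a likely identifier from an incidental look-alike.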
Development
pip install -e ".[dev]"
ruff check .
ruff format --check .
pytest
Publishing
Build and inspect the package before uploading:
python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
Upload to TestPyPI first:
python -m twine upload --repository testpypi dist/*
Then upload the same checked artifacts to PyPI:
python -m twine upload dist/*
Download files
File details
Details for the file vipii-0.1.1.tar.gz.
File metadata
- Download URL: vipii-0.1.1.tar.gz
- Upload date:
- Size: 201.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c896d750d60d2f6828a16def2a720e93b6d7f1f51d30e51064450bb68a66ed33 |
| MD5 | 5ce31c21a6b27b486f8b1185e6128a52 |
| BLAKE2b-256 | 937c8d97c5b2b821db6d8c81f599225e9a4857dd26132865375e0ed9c09c4eba |
File details
Details for the file vipii-0.1.1-py3-none-any.whl.
File metadata
- Download URL: vipii-0.1.1-py3-none-any.whl
- Upload date:
- Size: 17.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ec3715a53c7e756e8cb358fa8bee6a947aded1b7f329e0778b9e87c49a5f348e |
| MD5 | 9666eb3e1788a01416af3e084467687b |
| BLAKE2b-256 | 3d4f753b218d6813b34fc9d5a6187995578d34b7256421ae1dbe4141cb1e1eb3 |
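A downloaded artifact can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch; `sha256_of` and `verify` are hypothetical helper names, and the file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the published value."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (assumes the wheel was saved in the current directory):
# verify(
#     "vipii-0.1.1-py3-none-any.whl",
#     "ec3715a53c7e756e8cb358fa8bee6a947aded1b7f329e0778b9e87c49a5f348e",
# )
```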