Vietnamese PII detection with regex recognizers, validators, and context scoring.

vipii

vipii is a Python library for detecting Vietnamese personally identifiable information (PII) in UTF-8 text. It combines deterministic regex-based recognizers, validator functions, overlap resolution, and Vietnamese context-window scoring to identify structured entities such as national IDs, phone numbers, tax codes, bank identifiers, passports, and vehicle plates.
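To make the overlap-resolution step concrete, here is a minimal sketch of one common strategy: keep the highest-scoring candidate among overlapping spans and break ties by length. The `Candidate` class and `resolve_overlaps` function are hypothetical names for illustration only, and vipii's actual resolution rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate match; not vipii's real match type."""
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    score: float


def resolve_overlaps(candidates):
    """Greedily keep the best candidate among overlapping spans."""
    # Prefer higher score, then longer span, then earlier start.
    ranked = sorted(
        candidates,
        key=lambda c: (-c.score, -(c.end - c.start), c.start),
    )
    kept = []
    for cand in ranked:
        # Half-open spans are disjoint when one ends before the other starts.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    # Return survivors in reading order.
    return sorted(kept, key=lambda c: c.start)


# Illustrative offsets and scores, not real detector output.
candidates = [
    Candidate("PHONE_NUMBER", 10, 22, 0.85),
    Candidate("BANK_ACCOUNT", 10, 20, 0.40),
    Candidate("CCCD", 30, 42, 0.90),
]
print([c.label for c in resolve_overlaps(candidates)])
# → ['PHONE_NUMBER', 'CCCD']
```

The lower-scoring BANK_ACCOUNT candidate is dropped because it shares characters with the PHONE_NUMBER span.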

Install

pip install vipii

For local development:

pip install -e ".[dev]"

Python API

from vipii import PIIDetector, Pattern

detector = PIIDetector()
detector.add_pattern(
    Pattern(label="CUSTOMER_ID", regex=r"\bKH-\d{6}\b", context_words=["mã khách hàng"])
)

matches = detector.detect(
    "Khách hàng Nguyễn Văn A, số điện thoại 0912 345 678, CCCD 001203000123."
)

for match in matches:
    print(match.label, match.text, match.score)
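Matches can be post-processed with ordinary Python. The sketch below groups matched text by label and drops low-scoring results. It relies only on the `label`, `text`, and `score` attributes used above. The `group_by_label` helper, the `min_score` threshold of 0.5, and the sample scores are assumptions for illustration, and stand-in objects are used so the snippet runs without vipii installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_label(matches, min_score=0.5):
    """Group matched text by label, dropping matches below min_score."""
    grouped = defaultdict(list)
    for match in matches:
        if match.score >= min_score:
            grouped[match.label].append(match.text)
    return dict(grouped)


# Stand-in objects with invented scores; real code would pass the
# list returned by detector.detect(...).
sample = [
    SimpleNamespace(label="PHONE_NUMBER", text="0912 345 678", score=0.85),
    SimpleNamespace(label="CCCD", text="001203000123", score=0.90),
    SimpleNamespace(label="BANK_ACCOUNT", text="0912345678", score=0.30),
]
print(group_by_label(sample))
# → {'PHONE_NUMBER': ['0912 345 678'], 'CCCD': ['001203000123']}
```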

Optional NER

Regex recognizers cover structured PII. For free-form names, locations, organizations, and addresses, enable an external Hugging Face token-classification model:

pip install "vipii[ner]"
vipii scan "Nguyễn Văn A sống tại Hà Nội" --ner-model your-vietnamese-ner-model

Or from Python:

from vipii import PIIDetector

detector = PIIDetector(ner_model="your-vietnamese-ner-model")
matches = detector.detect("Nguyễn Văn A sống tại Hà Nội")

The NER layer maps model labels such as PER, LOC, and ORG to PERSON, LOCATION, and ORGANIZATION. The model is not bundled; choose and evaluate one for your domain before production use.
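The label mapping described above can be sketched as a small lookup. This is an illustration, not vipii's internal code: `normalize_ner_label` is a hypothetical helper, and the handling of BIO prefixes (`B-`, `I-`) is an assumption based on how many token-classification models tag their output.

```python
# Mapping taken from the description above; unknown tags pass through.
LABEL_MAP = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}


def normalize_ner_label(raw_label):
    """Strip a BIO prefix such as 'B-' or 'I-' and map to a PII label."""
    prefix, sep, rest = raw_label.partition("-")
    # Only treat the part before '-' as a prefix when it is B or I.
    tag = rest if sep and prefix in {"B", "I"} else raw_label
    tag = tag.upper()
    return LABEL_MAP.get(tag, tag)


print(normalize_ner_label("B-PER"))  # → PERSON
print(normalize_ner_label("LOC"))    # → LOCATION
print(normalize_ner_label("MISC"))   # → MISC
```

Check the label set of whichever model you choose; models that use other schemes (for example BIOES) would need extra prefixes handled.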

CLI

vipii scan "Số điện thoại 0912 345 678 và CCCD 001203000123"
vipii scan examples/customer_service.txt
vipii scan examples/customer_service.txt --format json
vipii scan examples/customer_service.txt --redact
vipii scan "CCCD 001203000123" --redact
vipii scan "Mã khách hàng KH-123456" --config examples/custom_recognizers.yml
vipii scan "Nguyễn Văn A sống tại Hà Nội" --ner-model your-vietnamese-ner-model
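For readers who want to see what span-based redaction involves, here is a minimal sketch. It is not the implementation behind `--redact`: the `redact` function, the `(start, end, label)` tuple shape, and the `[LABEL]` placeholder format are all assumptions made for illustration.

```python
def redact(text, spans):
    """Replace each (start, end, label) span with a '[LABEL]' placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already redacted.
            continue
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    # Keep whatever follows the last redacted span.
    out.append(text[cursor:])
    return "".join(out)


print(redact("CCCD 001203000123", [(5, 17, "CCCD")]))
# → CCCD [CCCD]
```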

YAML recognizer config

Built-in recognizers are loaded from src/vipii/builtin_recognizers.yml. You can append your own recognizers from a YAML file without writing Python:

recognizers:
  - name: customer_id
    label: CUSTOMER_ID
    patterns:
      - regex: '\bKH-\d{6}\b'
        context_words: ["mã khách hàng", "customer id"]
        base_score: 0.6

The validator key is optional. Set it only when you want to apply one of vipii's built-in validators: cccd, cmnd, phone, tax_code, bank_card, bank_account, passport, or vehicle_plate.
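A validator is a function that accepts or rejects a regex candidate after it matches. As an example of the idea, the sketch below implements the Luhn checksum, which is widely used for payment card numbers. Whether vipii's bank_card validator uses Luhn is an assumption, and `luhn_valid` is a hypothetical name.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Allow the common separators found in card numbers.
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, ch in enumerate(reversed(cleaned)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # → True
print(luhn_valid("4111 1111 1111 1112"))  # → False
```

A checksum like this lets a recognizer discard digit runs that merely look like card numbers.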

Built-in recognizers

  • CCCD and CMND (citizen identity card and older people's identity card numbers)
  • PHONE_NUMBER
  • MST (mã số thuế, tax code)
  • BANK_CARD
  • BANK_ACCOUNT
  • PASSPORT
  • VEHICLE_PLATE

The recognizers deliberately target clearly structured PII and score matches using nearby Vietnamese context words such as số điện thoại, cccd, mã số thuế, and biển số xe. Names and free-form addresses are left to the optional NER layer.
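The role of context words can be illustrated with a minimal sketch: start from a base score and raise it when a context word appears within a fixed window around the match. The `context_score` function, the `boost` and `window` parameters, and their default values are assumptions for illustration; vipii's actual scoring formula is not shown here.

```python
import re
import unicodedata


def _norm(s):
    """Normalize to NFC and casefold so accented text compares reliably."""
    return unicodedata.normalize("NFC", s).casefold()


def context_score(text, start, end, context_words,
                  base_score=0.6, boost=0.3, window=40):
    """Raise base_score when a context word appears near text[start:end]."""
    # Inspect `window` characters on each side of the match.
    nearby = _norm(text[max(0, start - window):end + window])
    if any(_norm(word) in nearby for word in context_words):
        return min(1.0, base_score + boost)
    return base_score


text = "Mã khách hàng KH-123456 đã được xác nhận."
match = re.search(r"\bKH-\d{6}\b", text)
score = context_score(text, match.start(), match.end(), ["mã khách hàng"])
print(score > 0.6)  # → True
```

Normalizing to NFC before comparing matters for Vietnamese, because the same accented word can be encoded in composed or decomposed form.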

Development

pip install -e ".[dev]"
ruff check .
ruff format --check .
pytest

Publishing

Build and inspect the package before uploading:

python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*

Upload to TestPyPI first:

python -m twine upload --repository testpypi dist/*

Then upload the same checked artifacts to PyPI:

python -m twine upload dist/*

