Vietnamese PII detection with regex recognizers, validators, and context scoring.

vipii

vipii is a Python library for detecting Vietnamese personally identifiable information (PII) using regex-based and NER-based recognizers.

Install

pip install vipii

For local development:

pip install -e ".[dev]"

For Spark DataFrame support:

pip install "vipii[spark]"

Python API

from vipii import PIIDetector, Pattern

detector = PIIDetector()
detector.add_pattern(
    Pattern(label="CUSTOMER_ID", regex=r"\bKH-\d{6}\b", context_words=["mã khách hàng"])
)

matches = detector.detect(
    "Khách hàng Nguyễn Văn A, số điện thoại 0912 345 678, CCCD 001203000123."
)

for match in matches:
    print(match.label, match.text, match.score)

Concurrent scanning

PIIDetector.detect() runs recognizers concurrently by default when the detector has more than one recognizer. Use max_workers to cap the internal recognizer thread pool, or set max_workers=1 to force sequential recognition:

from vipii import PIIDetector

detector = PIIDetector(max_workers=4)
matches = detector.detect("Số điện thoại 0912 345 678 và CCCD 001203000123")
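
As a rough mental model of recognizer-level concurrency, the sketch below fans one text out to several independent recognizer functions on a thread pool and merges their matches by start offset. It uses only the standard library; the recognizer functions, the tuple shape, and run_recognizers are hypothetical stand-ins, not vipii internals.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in recognizers: each returns (label, start, end) tuples.
def find_phones(text):
    return [("PHONE_NUMBER", m.start(), m.end())
            for m in re.finditer(r"\b0\d{3} ?\d{3} ?\d{3}\b", text)]

def find_cccd(text):
    return [("CCCD", m.start(), m.end())
            for m in re.finditer(r"\b\d{12}\b", text)]

def run_recognizers(text, recognizers, max_workers=4):
    """Run every recognizer on the same text and merge matches by offset."""
    if max_workers == 1 or len(recognizers) < 2:
        batches = [r(text) for r in recognizers]  # sequential path
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            batches = list(pool.map(lambda r: r(text), recognizers))
    return sorted((m for batch in batches for m in batch), key=lambda m: m[1])

sample = "Số điện thoại 0912 345 678 và CCCD 001203000123"
found = run_recognizers(sample, [find_phones, find_cccd])
for label, start, end in found:
    print(label, sample[start:end])
```

The sequential branch mirrors the max_workers=1 case described above: the result is the same, only the scheduling differs.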

When scanning many independent texts, you can run calls to detect() concurrently from your own executor. Configure the detector before starting workers, then treat it as read-only while scans are running; do not call add_pattern(), add_recognizer(), or add_ner_model() concurrently with detection.

from concurrent.futures import ThreadPoolExecutor

from vipii import PIIDetector

texts = [
    "Khách hàng A có số điện thoại 0912 345 678.",
    "Khách hàng B có CCCD 001203000123.",
]
detector = PIIDetector(max_workers=1)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(detector.detect, texts))
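
Because executor.map yields results in the same order as its inputs, each text can be paired with its own matches afterwards. The sketch below shows this with a hypothetical stand-in detect function (a single regex, not vipii's detector) so that it runs without the library installed.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def detect(text):
    """Hypothetical stand-in for a read-only detector: find 12-digit IDs."""
    return [m.group() for m in re.finditer(r"\b\d{12}\b", text)]

texts = [
    "Khách hàng A có số điện thoại 0912 345 678.",
    "Khách hàng B có CCCD 001203000123.",
]

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(detect, texts))

# map() preserves input order, so zip() lines each text up with its matches.
paired = list(zip(texts, results))
for text, matches in paired:
    print(len(matches), text)
```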

PySpark

The Spark adapter is optional and keeps PySpark imports lazy. It can add detected matches or redacted text to a DataFrame text column:

from vipii.spark import with_pii_matches, with_redacted_column

df = spark.createDataFrame(
    [("Số CCCD của tôi là 001203000123",)],
    ["text"],
)

matches_df = with_pii_matches(df, input_col="text", output_col="pii_matches")
redacted_df = with_redacted_column(df, input_col="text", output_col="redacted")
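
Conceptually, redaction replaces each detected span with a placeholder. The following is a minimal sketch of that idea in plain Python, assuming matches are available as (label, start, end) tuples and that placeholders look like <LABEL>; both are assumptions made for illustration, and vipii's actual output format may differ.

```python
def redact(text, spans):
    """Replace each (label, start, end) span with a <LABEL> placeholder."""
    out = []
    cursor = 0
    for label, start, end in sorted(spans, key=lambda s: s[1]):
        if start < cursor:
            continue  # skip spans overlapping one already redacted
        out.append(text[cursor:start])
        out.append(f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "Số CCCD của tôi là 001203000123"
start = text.index("001203000123")
redacted = redact(text, [("CCCD", start, start + 12)])
print(redacted)
```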

Optional NER

Regex recognizers cover structured PII. For free-form names, locations, organizations, and addresses, enable an external Hugging Face token-classification model:

pip install "vipii[ner]"
vipii scan "Nguyễn Văn A sống tại Hà Nội" --ner-model your-vietnamese-ner-model

from vipii import PIIDetector

detector = PIIDetector(ner_model="your-vietnamese-ner-model")
matches = detector.detect("Nguyễn Văn A sống tại Hà Nội")

The NER layer maps model labels such as PER, LOC, and ORG to PERSON, LOCATION, and ORGANIZATION. The model is not bundled; choose and evaluate one for your domain before production use.
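
The label mapping described above might look like the sketch below. The handling of B-/I- prefixes and the pass-through for unknown labels are assumptions for illustration, not confirmed vipii behavior.

```python
# Mapping taken from the description above; anything else is passed through.
NER_LABEL_MAP = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}

def normalize_label(model_label):
    """Strip an optional BIO prefix, then map to a PII label."""
    tag = model_label.upper()
    if tag[:2] in ("B-", "I-"):
        tag = tag[2:]
    return NER_LABEL_MAP.get(tag, tag)

for raw in ["PER", "B-LOC", "I-ORG", "MISC"]:
    print(raw, "->", normalize_label(raw))
```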

To reduce model inference cost, choose an NER strategy:

  • always: run pattern recognizers and NER on the full text.
  • fallback: run NER only when pattern recognizers find no structured PII.
  • uncovered: run pattern recognizers first, then run NER only on text outside detected spans.
  • chunked: split text into chunks, redact structured PII spans, then run NER on useful chunks.
  • never: skip NER even if a model is configured.

vipii scan "Số điện thoại 0912345678" --ner-model your-vietnamese-ner-model --ner-strategy fallback
vipii scan "Số điện thoại 0912345678 của Nguyễn Văn A" --ner-model your-vietnamese-ner-model --ner-strategy uncovered
vipii scan "Số điện thoại 0912345678 của Nguyễn Văn A" --ner-model your-vietnamese-ner-model --ner-strategy chunked

detector = PIIDetector(ner_model="your-vietnamese-ner-model", ner_strategy="fallback")
detector = PIIDetector(ner_model="your-vietnamese-ner-model", ner_strategy="uncovered")
detector = PIIDetector(ner_model="your-vietnamese-ner-model", ner_strategy="chunked")
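
To make the uncovered strategy concrete, the sketch below computes the stretches of text that lie outside already-detected spans, which are the only parts a NER model would then need to see. The (start, end) span shape and the helper name uncovered_segments are hypothetical; this illustrates the idea rather than vipii's implementation.

```python
def uncovered_segments(text, spans):
    """Return (offset, segment) pairs for text not covered by any span."""
    segments = []
    cursor = 0
    for start, end in sorted(spans):
        if start > cursor:
            segments.append((cursor, text[cursor:start]))
        cursor = max(cursor, end)
    if cursor < len(text):
        segments.append((cursor, text[cursor:]))
    # Drop whitespace-only gaps: nothing for a model to label there.
    return [(off, seg) for off, seg in segments if seg.strip()]

text = "Số điện thoại 0912345678 của Nguyễn Văn A"
start = text.index("0912345678")
segments = uncovered_segments(text, [(start, start + 10)])
for offset, segment in segments:
    print(offset, repr(segment))
```

Keeping the offset alongside each segment lets any NER span found inside a segment be shifted back to its position in the original text.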

CLI

vipii scan "Số điện thoại 0912 345 678 và CCCD 001203000123"
vipii scan examples/customer_service.txt
vipii scan examples/customer_service.txt --format json
vipii scan examples/customer_service.txt --redact
vipii scan "CCCD 001203000123" --redact
vipii scan "Mã khách hàng KH-123456" --config examples/custom_recognizers.yml
vipii scan "Nguyễn Văn A sống tại Hà Nội" --ner-model your-vietnamese-ner-model
vipii scan "Số điện thoại 0912345678" --ner-model your-vietnamese-ner-model --ner-strategy fallback
vipii scan "Số điện thoại 0912345678 của Nguyễn Văn A" --ner-model your-vietnamese-ner-model --ner-strategy uncovered
vipii scan "Số điện thoại 0912345678 của Nguyễn Văn A" --ner-model your-vietnamese-ner-model --ner-strategy chunked

YAML recognizer config

Built-in recognizers are loaded from src/vipii/builtin_recognizers.yml. You can append your own recognizers from a YAML file without writing Python:

recognizers:
  - name: customer_id
    label: CUSTOMER_ID
    patterns:
      - regex: '\bKH-\d{6}\b'
        context_words: ["mã khách hàng", "customer id"]
        base_score: 0.6
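
Before adding a pattern to a config file, it can help to sanity-check the regex on a few strings with Python's standard re module. This check is independent of vipii; the sample strings are made up for illustration.

```python
import re

# Same pattern as in the YAML example above.
CUSTOMER_ID = re.compile(r"\bKH-\d{6}\b")

samples = {
    "Mã khách hàng KH-123456": True,    # exactly six digits
    "Mã khách hàng KH-12345": False,    # too few digits
    "Mã khách hàng KH-1234567": False,  # seventh digit breaks the \b boundary
    "XKH-123456": False,                # no word boundary before "KH"
}

for text, expected in samples.items():
    assert bool(CUSTOMER_ID.search(text)) is expected, text
print("all samples behave as expected")
```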

Use validator only when you want one of vipii's built-in validators: cccd, cmnd, phone, email_address, date_of_birth, tax_code, bank_card, bank_account, social_insurance, health_insurance, passport, vehicle_plate, driver_license, ip_address, or device_id.
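
A validator is a check that runs after the regex matches, to reject strings that merely look right. As an illustration of the concept, here is the widely used Luhn checksum for payment card numbers; whether vipii's bank_card validator uses Luhn is an assumption, not something stated here.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdecimal()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # commonly used test card number
print(luhn_valid("4111 1111 1111 1112"))  # same number with the last digit changed
```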

Built-in recognizers

  • CCCD and CMND
  • PHONE_NUMBER
  • EMAIL_ADDRESS
  • DATE_OF_BIRTH
  • MST
  • SOCIAL_INSURANCE_NUMBER
  • HEALTH_INSURANCE_NUMBER
  • BANK_CARD
  • BANK_ACCOUNT
  • PASSPORT
  • VEHICLE_PLATE
  • DRIVER_LICENSE
  • IP_ADDRESS
  • DEVICE_ID

The recognizers intentionally favor clear structured PII plus nearby Vietnamese context words such as số điện thoại (phone number), cccd (citizen ID card), mã số thuế (tax code), and biển số xe (license plate). Names and free-form addresses can be handled by the optional NER layer.
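
Context scoring can be pictured as follows: a regex match starts from a base score and gains confidence when a context word appears shortly before it. The window size, the boost value, and the function name below are assumptions chosen for illustration, not vipii's actual scoring rules.

```python
import re

def score_matches(text, regex, context_words, base_score=0.6,
                  boost=0.3, window=30):
    """Yield (matched_text, score), boosting when context precedes a match."""
    for m in re.finditer(regex, text):
        # Look only at a short window of text just before the match.
        before = text[max(0, m.start() - window):m.start()].lower()
        has_context = any(word.lower() in before for word in context_words)
        score = min(1.0, base_score + boost) if has_context else base_score
        yield m.group(), round(score, 2)

pattern = r"\b0\d{9}\b"
words = ["số điện thoại"]
with_ctx = list(score_matches("Số điện thoại 0912345678", pattern, words))
without_ctx = list(score_matches("Mã đơn 0912345678", pattern, words))
print(with_ctx, without_ctx)
```

The same ten digits score higher after "Số điện thoại" than after an unrelated phrase, which is the effect context words are meant to have.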

Development

pip install -e ".[dev]"
ruff check .
ruff format --check .
pytest

Publishing

Publishing is triggered manually from GitHub Actions. On the release branch, start the Publish workflow with the Run workflow button and enter the version to publish, for example 0.1.3.

To inspect a package locally before publishing:

python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
