Local-first Japanese PII anonymization engine

Besshouka (別称化)

A local-first Japanese PII anonymization engine. Besshouka detects personally identifiable information (PII), payment card data (PCI), and protected health information (PHI) in Japanese text and transforms it using configurable rules — all without sending data to any external service.

Note: Besshouka is in early development (alpha). It is not yet recommended for production use. Contributions to improve accuracy, coverage, and robustness are welcome — see CONTRIBUTING.md.

Why Besshouka?

  • Japanese-native — built specifically for Japanese data patterns: マイナンバー, Japanese phone formats, postal codes, full-width character handling, and GiNZA-powered NER for names, organizations, and locations.
  • Local-first — everything runs on your machine. No cloud APIs, no data leaves the device.
  • Pluggable — add custom regex recognizers via YAML, write your own operators in Python, or plug in any importable function as a custom operator. No forking required.
  • Auditable — every anonymization operation is logged in an audit trail with the original text, the operator used, and the new indices.
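The full-width character handling mentioned above can be illustrated with Unicode NFKC normalization, which folds full-width digits, letters, and the full-width hyphen-minus to their ASCII forms. This is a minimal sketch of one common approach, not a description of Besshouka's internals; the helper name `normalize_width` is hypothetical.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width ASCII variants (digits, letters, hyphen-minus) to half-width.

    NFKC also converts half-width katakana to full-width, so kana ends up in one form.
    """
    return unicodedata.normalize("NFKC", text)


# Full-width digits and U+FF0D FULLWIDTH HYPHEN-MINUS, written as escapes
# so the code points are unambiguous.
raw = "\uff10\uff19\uff10\uff0d\uff11\uff12\uff13\uff14\uff0d\uff15\uff16\uff17\uff18"
print(normalize_width(raw))  # 090-1234-5678
```

NFKC is not always length-preserving (for example, ㈱ expands to three characters), so any pipeline that normalizes before matching has to map offsets back to the original text if it reports indices.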

Quick Start

pip install besshouka

Anonymize text

besshouka anonymize "田中太郎の電話番号は090-1234-5678です"
# Output: <氏名>の電話番号は090-1234-****です

Analyze (detect only)

besshouka analyze --explain "田中太郎の電話番号は090-1234-5678です"

Adjust confidence threshold

Both commands support --threshold / -t to filter by confidence score:

# Anonymize: only anonymize detections with confidence >= threshold (default: 0.5)
besshouka anonymize --threshold 0.3 "番号は123456789018です"

# Analyze: only display detections with confidence >= threshold (default: 0.0)
besshouka analyze --threshold 0.5 --explain "マイナンバーは123456789018です"

Matches scoring below the threshold are still computed internally but are excluded from the output. For example, a 12-digit number that passes the My Number check-digit test but has no nearby context keywords scores 0.4, so it is left untouched at the default anonymization threshold of 0.5.
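The threshold behaviour described above amounts to a simple filter over scored detections. The sketch below assumes a hypothetical `Detection` record and `filter_by_threshold` helper; Besshouka's actual result types may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record; Besshouka's real result type may differ."""
    entity_type: str
    start: int
    end: int
    score: float


def filter_by_threshold(detections, threshold):
    """Keep detections whose score is at or above the threshold (>=, as in the CLI comments)."""
    return [d for d in detections if d.score >= threshold]


text = "番号は123456789018です"
# Score 0.4 mirrors the context-free My Number case described above.
detections = [Detection("MY_NUMBER", 3, 15, 0.4)]

print(filter_by_threshold(detections, 0.5))  # empty: left untouched at the default
print(filter_by_threshold(detections, 0.3))  # kept: would be anonymized
```

Lowering the threshold trades fewer missed identifiers for more false positives, which is why the two commands ship with different defaults.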

Use custom rules

besshouka anonymize \
  --recognizers my_patterns.yaml \
  --rules my_operators.yaml \
  --input document.txt \
  --output anonymized.txt

Programmatic Usage

from besshouka.config.loader import load_recognizer_config, load_operator_config
from besshouka.orchestrator.pipeline import run

rec_config = load_recognizer_config("path/to/recognizers.yaml")
op_config = load_operator_config("path/to/operators.yaml")

ctx = run("田中太郎の電話番号は090-1234-5678です", rec_config, op_config,
          score_threshold=0.5)

print(ctx.engine_result.text)   # anonymized text
print(ctx.engine_result.items)  # audit trail

Architecture

Text In → [Analyzer] → [Anonymizer] → Text Out
| Module | Role |
| --- | --- |
| Analyzer | Detects PII using regex patterns + GiNZA NER |
| Anonymizer | Transforms PII using pluggable operators |
| Orchestrator | Wires analyzer and anonymizer into a pipeline |
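To make the analyzer → anonymizer flow concrete, here is a toy pipeline in plain Python. Every name in it (`Span`, `analyze`, `anonymize`, `run_pipeline`) is hypothetical and stands in for the real modules; the single regex replaces the full recognizer set.

```python
import re
from typing import Callable, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int


# Hypothetical stand-in for the analyzer stage: one regex for mobile numbers.
MOBILE = re.compile(r"0[789]0-\d{4}-\d{4}")


def analyze(text: str) -> list[Span]:
    return [Span("PHONE_NUMBER", m.start(), m.end()) for m in MOBILE.finditer(text)]


def anonymize(text: str, spans: list[Span], operators: dict[str, Callable[[str], str]]) -> str:
    # Apply replacements right to left so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        op = operators.get(span.entity_type)
        if op is not None:
            text = text[:span.start] + op(text[span.start:span.end]) + text[span.end:]
    return text


def run_pipeline(text: str, operators: dict[str, Callable[[str], str]]) -> str:
    return anonymize(text, analyze(text), operators)


operators = {"PHONE_NUMBER": lambda s: "<電話番号>"}
print(run_pipeline("連絡先は090-1234-5678または080-9999-0000です", operators))
# 連絡先は<電話番号>または<電話番号>です
```

Replacing from the end of the string backwards is the usual way to keep detection offsets meaningful when replacements change the text length.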

Each module has its own README with extension guides. See the besshouka/ directory.

Built-in Recognizers

| Pattern | Entity Type |
| --- | --- |
| Mobile phone | PHONE_NUMBER |
| Landline phone | PHONE_NUMBER |
| Toll-free phone | PHONE_NUMBER |
| Email address | EMAIL |
| マイナンバー | MY_NUMBER (check digit + context-aware scoring) |
| Postal code | POSTAL_CODE |
| Credit card | CREDIT_CARD |
| Bank account | BANK_ACCOUNT |
| Driver's license | DRIVERS_LICENSE |
| Passport | PASSPORT |
| Person names | PERSON (GiNZA) |
| Organizations | ORGANIZATION (GiNZA) |
| Locations | LOCATION (GiNZA) |
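The check-digit test behind the MY_NUMBER recognizer can be sketched as follows, assuming the published rule for the 12-digit Individual Number: weight the first 11 digits from the right, sum, take the remainder mod 11, and derive the final digit. The function names are hypothetical, and how Besshouka combines this result with context keywords to produce a score is not shown.

```python
def my_number_check_digit(first_11: str) -> int:
    """Compute the check digit for the first 11 digits of a My Number."""
    if len(first_11) != 11 or not (first_11.isascii() and first_11.isdigit()):
        raise ValueError("expected exactly 11 ASCII digits")
    total = 0
    # n counts digit positions from the right, starting at 1.
    for n, ch in enumerate(reversed(first_11), start=1):
        weight = n + 1 if n <= 6 else n - 5
        total += int(ch) * weight
    remainder = total % 11
    return 0 if remainder <= 1 else 11 - remainder


def is_valid_my_number(candidate: str) -> bool:
    """True if `candidate` is 12 ASCII digits whose last digit matches the check digit."""
    if len(candidate) != 12 or not (candidate.isascii() and candidate.isdigit()):
        return False
    return my_number_check_digit(candidate[:11]) == int(candidate[11])


print(is_valid_my_number("123456789018"))  # True: the sample value used in this README
```

Roughly one in eleven random 12-digit strings passes this test, which is consistent with a bare check-digit match receiving only a low score until context keywords raise it.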

Built-in Operators

| Operator | What it does |
| --- | --- |
| replace | Substitute with a fixed value |
| mask | Mask characters from the end with a symbol |
| redact | Remove entirely |
| hash | Salted SHA-256 hex digest |
| encrypt | Fernet symmetric encryption |
| keep | Pass through unchanged |
| custom | Call any importable Python function |
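Two of these operators are easy to picture with the standard library alone. The sketch below shows what mask and hash plausibly do; the function names, parameter names, and the salt-then-text ordering are assumptions, not Besshouka's documented behaviour.

```python
import hashlib


def mask(text: str, chars_to_mask: int, symbol: str = "*") -> str:
    """Replace the last `chars_to_mask` characters with `symbol`."""
    n = max(0, min(chars_to_mask, len(text)))
    return text[: len(text) - n] + symbol * n


def salted_sha256(text: str, salt: bytes) -> str:
    """Return the hex digest of SHA-256 over salt + UTF-8 text."""
    return hashlib.sha256(salt + text.encode("utf-8")).hexdigest()


print(mask("090-1234-5678", 4))  # 090-1234-****
print(len(salted_sha256("田中太郎", b"example-salt")))  # 64 hex characters
```

With a fixed salt the digest is deterministic, so the same input always maps to the same token. That keeps records joinable after anonymization, but it also means the salt must be kept secret.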

Extending Besshouka

Add a regex recognizer (no code)

Add an entry to your recognizers YAML:

recognizers:
  - name: employee_id
    entity_type: EMPLOYEE_ID
    pattern: 'EMP-[A-Z]{2}\d{6}'
    score: 1.0
    source: custom
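Before adding a pattern to the YAML file, it can help to try it against sample text. The snippet below does that with Python's `re` module, on the assumption that Besshouka compiles patterns with `re` or something compatible; the sample sentence is made up.

```python
import re

# Same pattern as the YAML entry above.
EMPLOYEE_ID = re.compile(r"EMP-[A-Z]{2}\d{6}")

sample = "担当者はEMP-AB123456、予備はEMP-ab123456です"
matches = [(m.group(), m.start(), m.end()) for m in EMPLOYEE_ID.finditer(sample)]
print(matches)  # [('EMP-AB123456', 4, 16)]
```

Note that the pattern has no boundary assertions, so it would also match the first twelve characters of `EMP-AB1234567`. In Python, `\d` on `str` patterns also matches full-width digits unless `re.ASCII` is set. Tighten the pattern with lookarounds or an explicit `[0-9]` if either matters for your data.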

Add a custom operator (no subclassing)

Write a function anywhere importable:

def my_transform(text: str, params: dict) -> str:
    return text[::-1]  # reverse it, or whatever you need

Reference it in your operators YAML:

operators:
  EMPLOYEE_ID:
    method: custom
    function: "my_module.my_transform"

Development

git clone https://github.com/akhi/besshouka.git
cd besshouka
pip install -e ".[dev]"

Running Tests

# All tests (excluding slow GiNZA model tests)
pytest tests/ -m "not slow"

# All tests including GiNZA
pytest tests/

# With coverage
pytest tests/ --cov=besshouka --cov-report=term-missing

Requirements

  • Python >=3.11, <3.14. Python 3.14 is not yet supported because of a PyO3 compatibility issue in SudachiPy (GiNZA's tokenizer). Python 3.13 is recommended.
  • GiNZA / spaCy (for NER)
  • See requirements.txt for full list

License

MIT
