Local-first Japanese PII anonymization engine
Besshouka (別称化)
A local-first Japanese PII anonymization engine. Besshouka detects personally identifiable information (PII), payment card data (PCI), and protected health information (PHI) in Japanese text and transforms it using configurable rules — all without sending data to any external service.
Note: Besshouka is in early development (alpha). It is not yet recommended for production use. Contributions to improve accuracy, coverage, and robustness are welcome — see CONTRIBUTING.md.
Why Besshouka?
- Japanese-native — built specifically for Japanese data patterns: マイナンバー, Japanese phone formats, postal codes, full-width character handling, and GiNZA-powered NER for names, organizations, and locations.
- Local-first — everything runs on your machine. No cloud APIs, no data leaves the device.
- Pluggable — add custom regex recognizers via YAML, write your own operators in Python, or plug in any importable function as a custom operator. No forking required.
- Auditable — every anonymization operation is logged in an audit trail with the original text, the operator used, and the new indices.
Quick Start
pip install besshouka
Anonymize text
besshouka anonymize "田中太郎の電話番号は090-1234-5678です"
# Output: <氏名>の電話番号は090-1234-****です
Analyze (detect only)
besshouka analyze --explain "田中太郎の電話番号は090-1234-5678です"
Adjust confidence threshold
Both commands support --threshold / -t to filter by confidence score:
# Anonymize: only anonymize detections with confidence >= threshold (default: 0.5)
besshouka anonymize --threshold 0.3 "番号は123456789018です"
# Analyze: only display detections with confidence >= threshold (default: 0.0)
besshouka analyze --threshold 0.5 --explain "マイナンバーは123456789018です"
Candidates that score below the threshold are still found internally but are excluded from the output. For example, a 12-digit number that passes the My Number check digit but has no context keywords nearby scores 0.4, so it is left untouched at the default anonymization threshold of 0.5.
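To make the scoring example concrete, here is a minimal sketch of the My Number check-digit rule and of threshold filtering. It is not Besshouka's own code: the function names and the dict-based detection shape are hypothetical, and the weights follow the published check-digit rule for the 12-digit individual number.

```python
def my_number_check_digit(body: str) -> int:
    """Return the check digit for the first 11 digits of a My Number."""
    if len(body) != 11 or not (body.isascii() and body.isdigit()):
        raise ValueError("body must be exactly 11 ASCII digits")
    total = 0
    # n is the position counted from the right-hand end of the 11-digit body.
    for n, ch in enumerate(reversed(body), start=1):
        weight = n + 1 if n <= 6 else n - 5
        total += int(ch) * weight
    remainder = total % 11
    return 0 if remainder <= 1 else 11 - remainder


def has_valid_check_digit(number: str) -> bool:
    """True if `number` is 12 ASCII digits whose last digit is the check digit."""
    if len(number) != 12 or not (number.isascii() and number.isdigit()):
        return False
    return my_number_check_digit(number[:11]) == int(number[11])


def filter_by_threshold(detections: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only detections whose score is at or above the threshold."""
    return [d for d in detections if d["score"] >= threshold]


# Hypothetical detection: valid check digit, no context keyword, score 0.4.
detections = [{"entity_type": "MY_NUMBER", "text": "123456789018", "score": 0.4}]

print(has_valid_check_digit("123456789018"))      # True
print(len(filter_by_threshold(detections, 0.5)))  # 0
print(len(filter_by_threshold(detections, 0.3)))  # 1
```

The sample number used in the commands above passes the check digit, which is why it is reported at all; only the missing context keeps its score under 0.5.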
Use custom rules
besshouka anonymize \
  --recognizers my_patterns.yaml \
  --rules my_operators.yaml \
  --input document.txt \
  --output anonymized.txt
Programmatic Usage
from besshouka.config.loader import load_recognizer_config, load_operator_config
from besshouka.orchestrator.pipeline import run

rec_config = load_recognizer_config("path/to/recognizers.yaml")
op_config = load_operator_config("path/to/operators.yaml")

ctx = run("田中太郎の電話番号は090-1234-5678です", rec_config, op_config,
          score_threshold=0.5)

print(ctx.engine_result.text)   # anonymized text
print(ctx.engine_result.items)  # audit trail
Architecture
Text In → [Analyzer] → [Anonymizer] → Text Out
| Module | Role |
|---|---|
| Analyzer | Detects PII using regex patterns + GiNZA NER |
| Anonymizer | Transforms PII using pluggable operators |
| Orchestrator | Wires analyzer and anonymizer into a pipeline |
Each module has its own README with extension guides. See the besshouka/ directory.
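The analyzer-then-anonymizer flow can be pictured with a small standalone sketch. Everything in it is hypothetical (the `Detection` class, the simplified mobile-phone regex, and the function names) and it does not use Besshouka's modules; it only shows how detected spans can be handed to a transformation step.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


# Simplified, hypothetical pattern for hyphenated Japanese mobile numbers.
MOBILE = re.compile(r"0[789]0-\d{4}-\d{4}")


def analyze(text: str) -> list[Detection]:
    """Analyzer step: find spans that look like PII."""
    return [Detection("PHONE_NUMBER", m.start(), m.end(), 1.0)
            for m in MOBILE.finditer(text)]


def mask_tail(value: str, count: int = 4, symbol: str = "*") -> str:
    """Mask the last `count` characters of `value`."""
    keep = max(len(value) - max(count, 0), 0)
    return value[:keep] + symbol * (len(value) - keep)


def anonymize(text: str, detections: list[Detection]) -> str:
    """Anonymizer step: rewrite each span, right to left so offsets stay valid."""
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + mask_tail(text[d.start:d.end]) + text[d.end:]
    return text


def run_pipeline(text: str) -> str:
    """Orchestrator step: wire the two stages together."""
    return anonymize(text, analyze(text))


print(run_pipeline("田中太郎の電話番号は090-1234-5678です"))
# 田中太郎の電話番号は090-1234-****です
```

The name stays in place here because the sketch has no NER stage. Replacing spans from right to left avoids having to recompute offsets after each substitution.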
Built-in Recognizers
| Pattern | Entity Type |
|---|---|
| Mobile phone | PHONE_NUMBER |
| Landline phone | PHONE_NUMBER |
| Toll-free phone | PHONE_NUMBER |
| Email address | EMAIL |
| マイナンバー | MY_NUMBER (check digit + context-aware scoring) |
| Postal code | POSTAL_CODE |
| Credit card | CREDIT_CARD |
| Bank account | BANK_ACCOUNT |
| Driver's license | DRIVERS_LICENSE |
| Passport | PASSPORT |
| Person names | PERSON (GiNZA) |
| Organizations | ORGANIZATION (GiNZA) |
| Locations | LOCATION (GiNZA) |
Built-in Operators
| Operator | What it does |
|---|---|
| replace | Substitute with a fixed value |
| mask | Mask characters from end with a symbol |
| redact | Remove entirely |
| hash | Salted SHA-256 hex digest |
| encrypt | Fernet symmetric encryption |
| keep | Pass through unchanged |
| custom | Call any importable Python function |
Extending Besshouka
Add a regex recognizer (no code)
Add an entry to your recognizers YAML:
recognizers:
  - name: employee_id
    entity_type: EMPLOYEE_ID
    pattern: 'EMP-[A-Z]{2}\d{6}'
    score: 1.0
    source: custom
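Before adding a pattern to the YAML, it can help to try it against sample text. The sketch below does that with Python's `re` module, assuming the recognizer applies the pattern with standard Python regex semantics; `find_matches` is a hypothetical helper, not part of Besshouka.

```python
import re

# Same pattern as the YAML entry above.
PATTERN = r"EMP-[A-Z]{2}\d{6}"


def find_matches(text: str, flags: int = 0) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for every match in `text`."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(PATTERN, text, flags)]


print(find_matches("担当者はEMP-AB123456です"))  # [(4, 16, 'EMP-AB123456')]

# Full-width digits written with escapes: U+FF11..U+FF16.
wide = "EMP-AB\uff11\uff12\uff13\uff14\uff15\uff16"
print(len(find_matches(wide)))            # 1: \d matches Unicode digits by default
print(len(find_matches(wide, re.ASCII)))  # 0: re.ASCII restricts \d to 0-9
```

In Python, `\d` on a `str` pattern also matches full-width digits, which is often what you want for Japanese text. Use `[0-9]` in the pattern if only ASCII digits should match.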
Add a custom operator (no subclassing)
Write a function anywhere importable:
def my_transform(text: str, params: dict) -> str:
    return text[::-1]  # reverse it, or whatever you need
Reference it in your operators YAML:
operators:
  EMPLOYEE_ID:
    method: custom
    function: "my_module.my_transform"
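A dotted path such as "my_module.my_transform" can be turned into a callable with `importlib`. The following is one plausible way to do that lookup, shown only to clarify what "importable" means; `resolve_function` is a hypothetical helper and may differ from how Besshouka resolves the path.

```python
import importlib
from typing import Callable


def resolve_function(dotted_path: str) -> Callable:
    """Import 'package.module.name' and return the object called `name`."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {dotted_path!r}")
    func = getattr(importlib.import_module(module_name), attr)
    if not callable(func):
        raise TypeError(f"{dotted_path!r} is not callable")
    return func


# Demonstrated with a standard-library function so the example runs as-is.
basename = resolve_function("os.path.basename")
print(basename("/data/document.txt"))  # document.txt

# With the operator above, the equivalent call would be:
#   resolve_function("my_module.my_transform")("EMP-AB123456", {})
```

For the lookup to succeed, the module must be on the Python path of the process running Besshouka, for example installed in the same environment or located in the working directory.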
Development
git clone https://github.com/akhi/besshouka.git
cd besshouka
pip install -e ".[dev]"
Running Tests
# All tests (excluding slow GiNZA model tests)
pytest tests/ -m "not slow"
# All tests including GiNZA
pytest tests/
# With coverage
pytest tests/ --cov=besshouka --cov-report=term-missing
Requirements
- Python >=3.11, <3.14. Python 3.14 is not yet supported because of a PyO3 compatibility issue in SudachiPy (GiNZA's tokenizer). Python 3.13 is recommended.
- GiNZA / spaCy (for NER)
- See requirements.txt for full list
License