Japanese entity parser library for company/corporate name normalization and extraction.
ja-entity-parser
Overview
ja-entity-parser is a Python library for normalization and extraction of Japanese entities such as company names, corporate types, personal names, and addresses.
It combines SudachiPy morphological analysis with custom normalization rules (old/new kanji conversion, bracket/punctuation/control character unification, NFKC, and user dictionary replacements) to accurately extract brand names and legal forms (e.g., 株式会社, 合同会社).
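The normalization steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the function name `normalize_sketch` and the three mapping tables are hypothetical samples, and the order of the steps is an assumption.

```python
import unicodedata

# Hypothetical sample tables for illustration only; the library's real
# mappings are not shown in this document.
OLD_TO_NEW_KANJI = {"國": "国", "學": "学", "會": "会", "澤": "沢"}
BRACKETS = {"〔": "(", "〕": ")", "【": "(", "】": ")"}
USER_REPLACEMENTS = {"(株)": "株式会社"}

# Single-character substitutions can share one translation table.
_CHAR_TABLE = str.maketrans({**BRACKETS, **OLD_TO_NEW_KANJI})


def normalize_sketch(text: str) -> str:
    """Apply NFKC, strip control characters, unify brackets and kanji."""
    # 1. NFKC folds full-width ASCII, half-width katakana, and
    #    compatibility characters such as ㈱ into their plain forms.
    text = unicodedata.normalize("NFKC", text)
    # 2. Drop control and format characters (Unicode categories Cc, Cf).
    #    This also removes tabs and newlines, which suits single-line names.
    text = "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # 3. Unify bracket variants and map old kanji forms to new ones.
    text = text.translate(_CHAR_TABLE)
    # 4. Apply user dictionary replacements (multi-character strings).
    for src, dst in USER_REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text


print(normalize_sketch("〔トヨタ〕株式会社"))  # (トヨタ)株式会社
print(normalize_sketch("㈱ﾄﾖﾀ"))  # 株式会社トヨタ
```

Running NFKC first means the later tables only need to list characters that NFKC leaves untouched, such as 〔 and 【.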
Features
- Japanese text normalization: Old/new kanji conversion, bracket/punctuation/control character unification, NFKC, custom dictionary replacements
- Company/corporate type extraction: Uses SudachiPy and part-of-speech info
- Katakana reading output: Builds brand_kana by concatenating token reading forms
- User dictionary support: Extendable for industry-specific terms
- Testing & extensibility: Comes with pytest-based unit tests
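The katakana reading feature can be illustrated with a short sketch. It assumes only that each token exposes a `reading_form()` method, as SudachiPy morphemes do; the helper `build_brand_kana` and the stand-in class `FakeMorpheme` are hypothetical names, and how the library itself selects tokens is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class HasReading(Protocol):
    """Anything with a SudachiPy-style reading_form() method."""

    def reading_form(self) -> str: ...


def build_brand_kana(morphemes: Iterable[HasReading]) -> str:
    """Concatenate the katakana reading of every token, in order."""
    return "".join(m.reading_form() for m in morphemes)


@dataclass
class FakeMorpheme:
    """Stand-in token so the sketch runs without a dictionary installed."""

    reading: str

    def reading_form(self) -> str:
        return self.reading


tokens = [FakeMorpheme("トヨタ"), FakeMorpheme("ジドウシャ")]
print(build_brand_kana(tokens))  # トヨタジドウシャ

# With SudachiPy and a dictionary package (e.g. sudachidict_core) installed,
# real tokens could be passed in the same way:
#   from sudachipy import Dictionary, SplitMode
#   tokenizer = Dictionary().create()
#   build_brand_kana(tokenizer.tokenize("トヨタ自動車", SplitMode.C))
```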
Installation
pip install ja-entity-parser
Usage
1. Extract company/corporate info (normalize + parse)
from ja_entityparser import corporate_parser
result = corporate_parser("トヨタ自動車株式会社")
print(result)
# {'input': 'トヨタ自動車株式会社', 'legal_form': '株式会社', 'brand_name': 'トヨタ自動車', 'brand_kana': 'トヨタジドウシャ'}
2. Normalization only
from ja_entityparser.normalizer import normalize
text = "〔トヨタ〕株式会社"
print(normalize(text))
# (トヨタ)株式会社
3. Parsing only (pass normalized text)
from ja_entityparser.parser import parse
result = parse("トヨタ自動車株式会社")
print(result)
# {'legal_form': '株式会社', 'brand_name': 'トヨタ自動車', 'brand_kana': 'トヨタジドウシャ'}
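To make the shape of the parse result concrete, here is a deliberately simplified stand-in that splits a legal form off the start or end of a name by plain string matching. It is not how the library works (the library relies on SudachiPy and part-of-speech information); `split_legal_form`, the `LEGAL_FORMS` tuple, and the `None` value for names without a legal form are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical sample list; the library's supported legal forms may differ.
LEGAL_FORMS = ("株式会社", "合同会社", "有限会社", "合名会社", "合資会社")


def split_legal_form(name: str) -> Dict[str, Optional[str]]:
    """Split a leading or trailing legal form from an already normalized name."""
    for form in LEGAL_FORMS:
        if name.startswith(form):
            # Prefix style, e.g. 株式会社サンプル
            return {"legal_form": form, "brand_name": name[len(form):]}
        if name.endswith(form):
            # Suffix style, e.g. トヨタ自動車株式会社
            return {"legal_form": form, "brand_name": name[: -len(form)]}
    # No known legal form found; returning None here is an assumption.
    return {"legal_form": None, "brand_name": name}


print(split_legal_form("トヨタ自動車株式会社"))
# {'legal_form': '株式会社', 'brand_name': 'トヨタ自動車'}
```

Plain matching of this kind cannot produce readings or handle names where the legal form appears mid-string, which is where morphological analysis becomes useful.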
API
- corporate_parser(text: str) -> dict: Normalizes and parses the input. Returns {'input': ..., 'legal_form': ..., 'brand_name': ..., 'brand_kana': ...} (brand_kana when available).
- normalize(text: str) -> str: Normalizes Japanese text.
- parse(text: str) -> dict: Runs morphological analysis and extracts the brand name and legal form (and brand_kana when available).
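Since brand_kana is only present when available, calling code may want to read the result defensively. The sketch below assumes the key can be missing or empty; `CorporateResult` and `to_rows` are hypothetical names, and the parser is passed in as a callable so the example runs without the package installed.

```python
from typing import Callable, Iterable, List, Optional, Tuple, TypedDict


class CorporateResult(TypedDict, total=False):
    """Assumed shape of the returned dict; every key is treated as optional."""

    input: str
    legal_form: str
    brand_name: str
    brand_kana: str


Row = Tuple[str, Optional[str], Optional[str], Optional[str]]


def to_rows(
    names: Iterable[str],
    parser: Callable[[str], CorporateResult],
) -> List[Row]:
    """Parse each name and flatten the result into a tuple for tabular output."""
    rows: List[Row] = []
    for name in names:
        result = parser(name)
        rows.append(
            (
                result.get("input", name),
                result.get("legal_form"),
                result.get("brand_name"),
                result.get("brand_kana"),  # None when no reading is available
            )
        )
    return rows
```

In real use, `corporate_parser` would be passed as the `parser` argument, for example `to_rows(["トヨタ自動車株式会社"], corporate_parser)`.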
License
Apache License 2.0