Japanese entity parser library for company/corporate name normalization and extraction.

ja-entity-parser

Overview

ja-entity-parser is a Python library for normalizing and extracting Japanese entities such as company names, corporate types, personal names, and addresses.
It combines SudachiPy morphological analysis with custom normalization rules (old/new kanji conversion, bracket/punctuation/control character unification, NFKC, and user dictionary replacements) to extract brand names and legal forms (e.g., 株式会社, 合同会社).

Features

  • Japanese text normalization: Old/new kanji conversion, bracket/punctuation/control character unification, NFKC, custom dictionary replacements
  • Company/corporate type extraction: Uses SudachiPy and part-of-speech info
  • Katakana reading output: Builds brand_kana by concatenating token reading forms
  • User dictionary support: Extensible with industry-specific terms
  • Testing & extensibility: Comes with pytest-based unit tests
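
The katakana reading feature can be pictured with a minimal sketch. The `Token` type and `build_kana` helper below are hypothetical names used only for illustration, and the tokens are written by hand instead of coming from an analyzer. With SudachiPy, the reading of each token would typically come from `Morpheme.reading_form()`; how the library wires this up internally is not shown in this README.

```python
from typing import Iterable, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for one morpheme produced by an analyzer."""
    surface: str  # text as it appears in the input
    reading: str  # katakana reading of that text


def build_kana(tokens: Iterable[Token]) -> str:
    """Concatenate the reading of every token into one katakana string."""
    return "".join(token.reading for token in tokens)


# Hand-written tokens standing in for real analyzer output.
tokens = [Token("トヨタ", "トヨタ"), Token("自動車", "ジドウシャ")]
brand_kana = build_kana(tokens)
print(brand_kana)  # トヨタジドウシャ
```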

Installation

```bash
pip install ja-entity-parser
```

Usage

1. Extract company/corporate info (normalize + parse)

```python
from ja_entityparser import corporate_parser

result = corporate_parser("トヨタ自動車株式会社")
print(result)
# {'input': 'トヨタ自動車株式会社', 'legal_form': '株式会社', 'brand_name': 'トヨタ自動車', 'brand_kana': 'トヨタジドウシャ'}
```

2. Normalization only

```python
from ja_entityparser.normalizer import normalize

text = "〔トヨタ〕株式会社"
print(normalize(text))
# (トヨタ)株式会社
```
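
To give a rough idea of what this kind of normalization involves, here is a standard-library sketch of the steps named in the Overview. It is not the library's implementation: the function name `normalize_sketch`, the two small mapping tables, the order of the steps, and the choice to drop control characters are all assumptions made for illustration.

```python
import unicodedata

# Illustrative tables only; the library's real tables are not shown in this README.
BRACKETS = str.maketrans({"〔": "(", "〕": ")", "【": "(", "】": ")"})
OLD_TO_NEW_KANJI = str.maketrans({"國": "国", "學": "学", "會": "会"})


def normalize_sketch(text: str) -> str:
    """Apply NFKC, unify brackets, convert old kanji, and drop control characters."""
    # NFKC folds full-width ASCII and half-width katakana into their standard forms.
    text = unicodedata.normalize("NFKC", text)
    # Map several bracket styles onto plain parentheses.
    text = text.translate(BRACKETS)
    # Replace old-form kanji with their modern equivalents.
    text = text.translate(OLD_TO_NEW_KANJI)
    # Remove control characters (Unicode category "Cc"), e.g. tabs and newlines.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    return text.strip()


print(normalize_sketch("〔トヨタ〕株式会社"))  # (トヨタ)株式会社
print(normalize_sketch("ﾄﾖﾀ"))  # トヨタ
```

User dictionary replacements are left out of the sketch because their format is not described here.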

3. Parsing only (pass normalized text)

```python
from ja_entityparser.parser import parse

result = parse("トヨタ自動車株式会社")
print(result)
# {'legal_form': '株式会社', 'brand_name': 'トヨタ自動車', 'brand_kana': 'トヨタジドウシャ'}
```
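
The shape of the result can be illustrated with a much simpler technique than the one the library uses. The sketch below relies on plain prefix/suffix string matching, not on SudachiPy tokens and part-of-speech information, so it only shows the idea of separating a legal form from a brand name. The `LEGAL_FORMS` tuple, the `split_legal_form` name, and the `None` fallback for names without a legal form are assumptions.

```python
from typing import Dict, Optional

# A few common legal forms; the list the library recognizes may differ.
LEGAL_FORMS = ("株式会社", "合同会社", "有限会社", "合資会社", "合名会社")


def split_legal_form(name: str) -> Dict[str, Optional[str]]:
    """Split a normalized company name into legal form and brand name."""
    for form in LEGAL_FORMS:
        if name.startswith(form):
            # Legal form written before the brand, e.g. 合同会社サンプル.
            return {"legal_form": form, "brand_name": name[len(form):]}
        if name.endswith(form):
            # Legal form written after the brand, e.g. トヨタ自動車株式会社.
            return {"legal_form": form, "brand_name": name[: -len(form)]}
    # No known legal form found: return the whole name as the brand.
    return {"legal_form": None, "brand_name": name}


print(split_legal_form("トヨタ自動車株式会社"))
# {'legal_form': '株式会社', 'brand_name': 'トヨタ自動車'}
```

String matching of this kind cannot produce readings or handle a legal form embedded in the middle of a name, which is where morphological analysis helps.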

API

  • corporate_parser(text: str) -> dict
    • Normalizes and parses the input. Returns {'input': ..., 'legal_form': ..., 'brand_name': ..., 'brand_kana': ...}; brand_kana is included when available.
  • normalize(text: str) -> str
    • Normalizes Japanese text and returns the normalized string.
  • parse(text: str) -> dict
    • Runs morphological analysis on already-normalized text and extracts the brand name and legal form (and brand_kana when available).
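
Because brand_kana is only present when available, calling code may want to read it defensively. The following is a minimal sketch of one way to do that over a batch of names. `parse_all` and `fake_parser` are hypothetical helpers, and `fake_parser` is a hand-written stand-in so the example runs without the library installed.

```python
from typing import Callable, Dict, Iterable, List, Optional


def parse_all(
    names: Iterable[str], parser: Callable[[str], dict]
) -> List[Dict[str, Optional[str]]]:
    """Run a parser over many names and return rows with a fixed set of keys."""
    rows = []
    for name in names:
        result = parser(name)
        rows.append({
            "input": name,
            "legal_form": result.get("legal_form"),
            "brand_name": result.get("brand_name"),
            # brand_kana may be absent, so fall back to None.
            "brand_kana": result.get("brand_kana"),
        })
    return rows


def fake_parser(text: str) -> dict:
    """Hypothetical stand-in that mimics a result without brand_kana."""
    suffix = "株式会社"
    brand = text[: -len(suffix)] if text.endswith(suffix) else text
    return {"input": text, "legal_form": suffix, "brand_name": brand}


rows = parse_all(["トヨタ自動車株式会社"], fake_parser)
print(rows[0]["brand_name"])  # トヨタ自動車
print(rows[0]["brand_kana"])  # None
```

In real use, the `corporate_parser` function imported as in the Usage section would be passed in place of `fake_parser`.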

License

Apache License 2.0
