Japanese entity parser library for company/corporate name normalization and extraction.

ja-entity-parser

Overview

ja-entity-parser is a Python library for normalization and extraction of Japanese entities such as company names, corporate types, personal names, and addresses.
It combines SudachiPy morphological analysis with custom normalization rules (old/new kanji conversion, unification of brackets, punctuation, and control characters, NFKC, and user dictionary replacements) to extract brand names and legal forms (e.g., 株式会社, 合同会社).
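
To make the normalization rules listed above concrete, here is a minimal sketch that uses only the Python standard library. It is not the library's implementation: the function name `sketch_normalize` and the two mapping tables are assumptions made for illustration, and the real rule set is likely larger.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical sample tables; the library's actual rules are not shown in this README.
OLD_TO_NEW_KANJI = {"會": "会", "廣": "広", "國": "国"}
BRACKETS = {"〔": "(", "〕": ")", "【": "(", "】": ")"}


def sketch_normalize(text: str, user_dict: Optional[Dict[str, str]] = None) -> str:
    """Illustrative normalization pass (an assumption, not the library's code)."""
    # NFKC folds full-width ASCII, half-width katakana, and symbols such as ㈱.
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters (Unicode category Cc) such as tabs and newlines.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Unify bracket variants and map old-form kanji to their modern forms.
    text = text.translate(str.maketrans({**BRACKETS, **OLD_TO_NEW_KANJI}))
    # Apply user-supplied replacements last.
    for old, new in (user_dict or {}).items():
        text = text.replace(old, new)
    return text


print(sketch_normalize("〔トヨタ〕株式会社"))
print(sketch_normalize("ﾄﾖﾀ自動車㈱", {"(株)": "株式会社"}))
```

Running NFKC first means the user dictionary only has to cover the normalized spelling, for example `(株)` rather than both `㈱` and `（株）`.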

Features

  • Japanese text normalization: Old/new kanji conversion, bracket/punctuation/control character unification, NFKC, custom dictionary replacements
  • Company/corporate type extraction: Uses SudachiPy tokenization and part-of-speech information
  • User dictionary support: Extensible with industry-specific terms
  • Testing & extensibility: Comes with pytest-based unit tests

Installation

git clone https://github.com/new-village/ja-entity-parser.git
cd ja-entity-parser
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

1. Extract company/corporate info (normalize + parse)

from ja_entityparser import corporate_parser

result = corporate_parser("トヨタ自動車株式会社")
print(result)
# {'input': 'トヨタ自動車株式会社', 'legal_form': '株式会社', 'brand_name': 'トヨタ自動車'}

2. Normalization only

from ja_entityparser.normalizer import normalize

text = "〔トヨタ〕株式会社"
print(normalize(text))
# (トヨタ)株式会社

3. Parsing only (pass normalized text)

from ja_entityparser.parser import parse

result = parse("トヨタ自動車株式会社")
print(result)
# {'legal_form': '株式会社', 'brand_name': 'トヨタ自動車'}
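
For intuition about the result shape, the split into legal form and brand name can be approximated without morphological analysis by matching known legal forms at either end of the string. This is a simplified stand-in, not how the library works (it relies on SudachiPy and part-of-speech information). The name `sketch_parse`, the `LEGAL_FORMS` list, and the `None` fallback are assumptions.

```python
# Hypothetical, non-exhaustive list of legal forms.
LEGAL_FORMS = ("株式会社", "合同会社", "有限会社", "合資会社", "合名会社")


def sketch_parse(text: str) -> dict:
    """Split a normalized name into legal form and brand name by affix matching."""
    # Try longer forms first so a short form never shadows a longer one.
    for form in sorted(LEGAL_FORMS, key=len, reverse=True):
        if text.endswith(form):    # legal form as a suffix
            return {"legal_form": form, "brand_name": text[: -len(form)]}
        if text.startswith(form):  # legal form as a prefix
            return {"legal_form": form, "brand_name": text[len(form):]}
    # Assumption: with no known legal form, the whole string is the brand name.
    return {"legal_form": None, "brand_name": text}


print(sketch_parse("トヨタ自動車株式会社"))
print(sketch_parse("株式会社サンプル"))
```

Affix matching breaks down when a legal form appears inside a longer name or is abbreviated, which is where a tokenizer-based approach helps.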

API

  • corporate_parser(text: str) -> dict
    • Normalizes and parses the input; returns {'input': ..., 'legal_form': ..., 'brand_name': ...}
  • normalize(text: str) -> str
    • Normalizes Japanese text and returns the normalized string
  • parse(text: str) -> dict
    • Runs morphological analysis and extracts the brand name and legal form; returns {'legal_form': ..., 'brand_name': ...}
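
The relationship between the three functions can be sketched as a simple composition: normalize, then parse, then attach the input. The helper below is hypothetical (`make_corporate_parser` is not part of the library) and takes the two steps as arguments so it runs without the package installed. Whether the library stores the raw or the normalized string under 'input' is not stated here, so keeping the raw string is an assumption.

```python
from typing import Callable


def make_corporate_parser(
    normalize_fn: Callable[[str], str],
    parse_fn: Callable[[str], dict],
) -> Callable[[str], dict]:
    """Build a function that normalizes, parses, and records the raw input."""

    def corporate_parser_sketch(text: str) -> dict:
        parsed = parse_fn(normalize_fn(text))
        # Assumption: 'input' holds the original, un-normalized string.
        return {"input": text, **parsed}

    return corporate_parser_sketch


# Stand-in steps for demonstration only; swap in normalize and parse from the library.
demo = make_corporate_parser(
    str.strip,
    lambda t: {"legal_form": t[-4:], "brand_name": t[:-4]},
)
print(demo(" トヨタ自動車株式会社 "))
```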

License

MIT License
