Japanese entity parser library for company/corporate name normalization and extraction.

Project description

ja-entity-parser

Overview

ja-entity-parser is a Python library for normalization and extraction of Japanese entities: corporate names, personal names, and addresses.
It combines SudachiPy morphological analysis with custom normalization rules (old/new kanji conversion, kanji numeral conversion, bracket/punctuation/control character unification, NFKC, and user dictionary replacements) to accurately parse Japanese text into structured components.
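
As a rough illustration of how such normalization rules can be chained, here is a minimal standard-library sketch. It is not the library's implementation: `sketch_normalize`, `BRACKETS`, and `ABBREVIATIONS` are hypothetical names, and the rule tables are tiny illustrative subsets.

```python
import unicodedata

# Hypothetical rule tables for illustration only; the library's real tables
# are not shown in this README.
BRACKETS = str.maketrans({"〔": "(", "〕": ")"})
ABBREVIATIONS = {"(株)": "株式会社", "(有)": "有限会社"}


def sketch_normalize(text: str) -> str:
    """Apply NFKC, control-character removal, bracket unification,
    and abbreviation expansion, in that order."""
    # NFKC folds fullwidth ASCII ("（" -> "(", "３" -> "3") and "㈱" -> "(株)".
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters (category Cc); this also removes tabs and newlines.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Unify bracket variants that NFKC leaves untouched.
    text = text.translate(BRACKETS)
    # Expand corporate abbreviations after brackets are in their ASCII form.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text


print(sketch_normalize("〔トヨタ〕（株）テスト"))
```

Running NFKC first means the abbreviation table only needs the halfwidth spelling of each abbreviation.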

Features

Japanese text normalization: Old/new kanji conversion, kanji numeral → Arabic, bracket/punctuation/control character unification, NFKC, corporate abbreviation expansion ((株) → 株式会社, etc.)
Corporate name parsing: Legal form extraction and brand name/kana via SudachiPy
Personal name parsing: Family/given name split using SudachiPy POS, whitespace, or surname dictionary
Address parsing: State (prefecture) → city → suburb (town) → house_number (block) using Address Base Registry data; block numbers are normalized to canonical form (halfwidth digits and hyphens). Field names follow libpostal label conventions
User dictionary support: Extensible with industry-specific terms
Testing: 66 pytest-based unit and integration tests

Installation

pip install ja-entity-parser

Usage

1. Parse corporate name

from ja_entityparser import parse_corporate

result = parse_corporate("トヨタ自動車株式会社")
print(result)
# {
#   'input': 'トヨタ自動車株式会社',
#   'normalized': 'トヨタ自動車株式会社',
#   'legal_form': '株式会社',
#   'brand_name': 'トヨタ自動車',
#   'brand_kana': 'トヨタジドウシャ'
# }

# Abbreviations are automatically expanded:
result = parse_corporate("(株)ソフトバンク")
# normalized: '株式会社ソフトバンク'
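
To show what legal form extraction means in practice, the sketch below splits a normalized name on a leading or trailing legal form. It is an assumption-laden illustration, not the library's method: `LEGAL_FORMS` and `split_legal_form` are hypothetical, the list is a small subset, and the kana reading (which the library derives via SudachiPy) is not covered.

```python
from typing import Optional, Tuple

# Hypothetical subset of legal forms, for illustration only.
LEGAL_FORMS = ("株式会社", "有限会社", "合同会社")


def split_legal_form(name: str) -> Tuple[Optional[str], str]:
    """Return (legal_form, brand_name); legal_form is None when none matches."""
    for form in LEGAL_FORMS:
        if name.startswith(form):
            return form, name[len(form):]
        if name.endswith(form):
            return form, name[: -len(form)]
    return None, name


print(split_legal_form("トヨタ自動車株式会社"))
print(split_legal_form("株式会社ソフトバンク"))
```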

2. Parse person name

from ja_entityparser import parse_person

result = parse_person("田中 太郎")
print(result)
# {
#   'input': '田中 太郎',
#   'normalized': '田中 太郎',
#   'family_name': '田中',
#   'given_name': '太郎',
#   'family_name_kana': 'タナカ',
#   'given_name_kana': 'タロウ'
# }
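
The whitespace and surname-dictionary strategies mentioned under Features can be pictured with the following sketch. It assumes a hypothetical in-memory surname set (`SURNAMES`) and a hypothetical helper `split_name`; the POS-based split that relies on SudachiPy is omitted.

```python
from typing import Optional, Tuple

# Hypothetical surname dictionary, for illustration only.
SURNAMES = {"田中", "佐藤", "佐々木"}


def split_name(name: str) -> Tuple[Optional[str], Optional[str]]:
    """Split on whitespace first, then fall back to a surname dictionary."""
    # str.split() with no arguments also splits on the ideographic space U+3000.
    parts = name.split()
    if len(parts) == 2:
        return parts[0], parts[1]
    if len(parts) == 1:
        whole = parts[0]
        # Try the longest candidate prefix first so "佐々木" wins over shorter matches.
        for i in range(len(whole) - 1, 0, -1):
            if whole[:i] in SURNAMES:
                return whole[:i], whole[i:]
    return None, None


print(split_name("田中　太郎"))
print(split_name("佐々木健"))
```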

3. Parse address

The parser splits an address into state, city, suburb, and house_number using the Japanese government's Address Base Registry. Field names follow libpostal label conventions for cross-language address matching. Block numbers are normalized to a canonical halfwidth-digit-and-hyphen form regardless of the input style (fullwidth digits, 丁目/番/号, 番地の, etc.). The original block string is preserved in house_number_raw for auditing.

from ja_entityparser import parse_address

# Example 1: fullwidth digits with 丁目/番/号 → normalized to halfwidth digits and hyphens
result = parse_address("北海道札幌市中央区大通西３丁目１番５号")
print(result)
# {
#   'input': '北海道札幌市中央区大通西３丁目１番５号',
#   'normalized': '北海道札幌市中央区大通西3丁目1番5号',
#   'state': '北海道',
#   'city': '札幌市中央区',
#   'suburb': '大通西',
#   'house_number': '3-1-5',
#   'house_number_raw': '3丁目1番5号'
# }

# Example 2: 番地の format
result = parse_address("愛知県江南市大字小折６２８番地の１")
print(result)
# {
#   'input': '愛知県江南市大字小折６２８番地の１',
#   'normalized': '愛知県江南市大字小折628番地の1',
#   'state': '愛知県',
#   'city': '江南市',
#   'suburb': '大字小折',
#   'house_number': '628-1',
#   'house_number_raw': '628番地の1'
# }

Address data is sourced from the Japanese government's アドレス・ベース・レジストリ (Address Base Registry).
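
The block-number canonicalization described above can be approximated with a few regular expressions. This is a simplified sketch under stated assumptions, not the library's code: `canonical_block` is a hypothetical helper, it expects only the block part of the address as input, and it does not handle kanji numerals such as 三丁目.

```python
import re
import unicodedata


def canonical_block(raw: str) -> str:
    """Convert a block string such as '３丁目１番５号' to '3-1-5'."""
    # NFKC turns fullwidth digits and the fullwidth hyphen into ASCII.
    s = unicodedata.normalize("NFKC", raw)
    # Replace Japanese block separators with hyphens; longer tokens come first
    # so that '番地の' is consumed as a single separator.
    s = re.sub(r"丁目|番地の|番地|番|号|の", "-", s)
    # Unify common dash look-alikes (hyphen, minus, en/em dash, prolonged sound mark).
    s = re.sub(r"[\u2010\u2212\u2013\u2014\u30fc]", "-", s)
    # Collapse repeated hyphens and trim any leading or trailing one.
    return re.sub(r"-+", "-", s).strip("-")


print(canonical_block("３丁目１番５号"))
print(canonical_block("628番地の1"))
```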

4. Normalization only

from ja_entityparser.normalizer import normalize

text = "〔トヨタ〕(株)テスト 三百二十一号"
print(normalize(text))
# (トヨタ)株式会社テスト 321号
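
The kanji numeral step in the example above (三百二十一 → 321) can be sketched as follows. This is a naive illustration with hypothetical helper names (`kanji_to_int`, `convert_kanji_numerals`): it handles units up to 千 only, and it would also rewrite numeral characters that appear inside ordinary words, which a real normalizer would need to guard against.

```python
import re

DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def kanji_to_int(s: str) -> int:
    """Convert a run of kanji numerals (up to the thousands) to an int."""
    total = 0
    current = 0
    for ch in s:
        if ch in DIGITS:
            # Consecutive digits are read positionally, e.g. 二〇二四 -> 2024.
            current = current * 10 + DIGITS[ch]
        else:
            # A unit multiplies the preceding digit; a bare unit counts as 1.
            total += (current or 1) * UNITS[ch]
            current = 0
    return total + current


def convert_kanji_numerals(text: str) -> str:
    """Replace every run of kanji numerals in text with Arabic digits."""
    return re.sub(r"[〇一二三四五六七八九十百千]+",
                  lambda m: str(kanji_to_int(m.group())), text)


print(convert_kanji_numerals("三百二十一号"))
```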

API Reference

Function	Description	Returns
`parse_corporate(text)`	Parse Japanese corporate name	`input`, `normalized`, `legal_form`?, `brand_name`, `brand_kana`
`parse_person(text)`	Parse Japanese person name	`input`, `normalized`, `family_name`?, `given_name`?, `*_kana`?
`parse_address(text)`	Parse Japanese address	`input`, `normalized`, `state`?, `city`?, `suburb`?, `house_number`?, `house_number_raw`?
`normalize(text)`	Normalize Japanese text	`str`
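
Assuming the `?` suffix in the table marks fields that may be missing or empty (the README does not spell this out), calling code may want to read them defensively. The sketch below works on a plain result dictionary copied from the address example; `format_address` is a hypothetical helper, not part of the library.

```python
from typing import Any, Dict


def format_address(result: Dict[str, Any]) -> str:
    """Join whichever address components are present, skipping absent ones."""
    keys = ("state", "city", "suburb", "house_number")
    # dict.get returns None for a missing key, so absent or empty fields are skipped.
    return " / ".join(result[k] for k in keys if result.get(k))


# Result dictionary copied from the address example in this README.
sample = {
    "input": "北海道札幌市中央区大通西３丁目１番５号",
    "normalized": "北海道札幌市中央区大通西3丁目1番5号",
    "state": "北海道",
    "city": "札幌市中央区",
    "suburb": "大通西",
    "house_number": "3-1-5",
    "house_number_raw": "3丁目1番5号",
}
print(format_address(sample))
```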

License

Apache License 2.0

Project details

Release history

1.0.3 (Apr 27, 2026), this version
1.0.2 (Apr 26, 2026)
1.0.1 (Apr 26, 2026)
1.0.0 (Apr 25, 2026)
0.2.1 (Aug 16, 2025)
0.2.0 (Aug 15, 2025)
0.1.0 (Aug 14, 2025)
