A library for standardizing terms with spelling variations using a synonym dictionary.

Project description

yurenizer

This is a Japanese text normalizer that resolves spelling inconsistencies.

The Japanese README is available here:
https://github.com/sea-turt1e/yurenizer/blob/main/README_ja.md

Overview

yurenizer is a tool for detecting and unifying variations in Japanese text notation.
For example, it can unify variations like "パソコン" (pasokon), "パーソナル・コンピュータ" (personal computer), and "パーソナルコンピュータ" into "パーソナルコンピューター".
These rules follow the Sudachi Synonym Dictionary.

Web-based Demo

You can try the web-based demo here: yurenizer Web-demo

Installation

pip install yurenizer

Download Synonym Dictionary

curl -L -o /path/to/synonyms.txt https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt

Usage

Quick Start

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」で、「パーソナル・コンピュータ」と表記することもあります。"
print(normalizer.normalize(text))
# Output: 「パーソナルコンピューター」は「パーソナルコンピューター」の「シノニム」で、「パーソナルコンピューター」と表記することもあります。

Customizing Settings

You can control normalization by specifying NormalizerConfig as an argument to the normalize function.

Example with Custom Settings

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
text = "「東日本旅客鉄道」は「JR東」や「JR-East」とも呼ばれます"
config = NormalizerConfig(
            unify_level="lexeme",
            taigen=True, 
            yougen=False,
            expansion="from_another", 
            other_language=False,
            alias=False,
            old_name=False,
            misuse=False,
            alphabetic_abbreviation=True, # Normalize only alphabetic abbreviations
            non_alphabetic_abbreviation=False,
            alphabet=False,
            orthographic_variation=False,
            misspelling=False
        )
print(f"Output: {normalizer.normalize(text, config)}")
# Output: 「東日本旅客鉄道」は「JR東」や「東日本旅客鉄道」とも呼ばれます

Configuration Details

The settings in yurenizer are organized hierarchically, allowing you to control the scope and target of normalization.


1. unify_level (Normalization Level)

First, specify the level of normalization with the unify_level parameter.

• lexeme: Performs the most comprehensive normalization, targeting all groups (a, b, c) mentioned below.
• word_form: Normalizes by word form, targeting groups b and c.
• abbreviation: Normalizes by abbreviation, targeting group c only.
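
For example, here is a minimal sketch of switching the level on the Quick Start sentence (the comments restate the scope from the list above; the actual output depends on your dictionary version):

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」です。"
# "lexeme": broadest scope, unifies groups a, b, and c
print(normalizer.normalize(text, NormalizerConfig(unify_level="lexeme")))
# "word_form": unifies groups b and c only
print(normalizer.normalize(text, NormalizerConfig(unify_level="word_form")))
# "abbreviation": unifies group c only
print(normalizer.normalize(text, NormalizerConfig(unify_level="abbreviation")))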

2. taigen / yougen (Target Selection)

Use the taigen and yougen flags to control which parts of speech are included in the normalization.

• taigen (default: True): Includes nouns and other substantives in the normalization. Set to False to exclude them.
• yougen (default: False): Includes verbs and other predicates in the normalization. Set to True to include them (they are normalized to their lemma).
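
As a minimal sketch, enabling yougen brings predicates into scope in addition to substantives (the sentence is illustrative and the result depends on the dictionary):

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
# taigen=True is the default; yougen must be enabled explicitly
config = NormalizerConfig(taigen=True, yougen=True)
print(normalizer.normalize("パソコンをチェックする", config))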

3. expansion (Expansion Flag)

The expansion flag determines how synonyms are expanded based on the synonym dictionary's internal control flags.

• from_another: Expands only the synonyms with a control flag value of 0 in the synonym dictionary.
• any: Expands all synonyms regardless of their control flag value.
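
A minimal sketch of the two modes (which entries are actually expanded depends on the control flags recorded in the synonym dictionary):

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
text = "「パソコン」は「パーソナルコンピュータ」とも表記されます。"
# Expand only synonyms whose control flag is 0
print(normalizer.normalize(text, NormalizerConfig(expansion="from_another")))
# Expand all synonyms regardless of the control flag
print(normalizer.normalize(text, NormalizerConfig(expansion="any")))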

4. Detailed Normalization Settings (a, b, c Groups)

a Group: Comprehensive Lexical Normalization

Controls normalization based on vocabulary and semantics using the following settings:

• other_language (default: True): Normalizes non-Japanese terms (e.g., English) to Japanese. Set to False to disable this feature.
• alias (default: True): Normalizes aliases. Set to False to disable this feature.
• old_name (default: True): Normalizes old names. Set to False to disable this feature.
• misuse (default: True): Normalizes misused terms. Set to False to disable this feature.
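
For instance, a sketch that keeps foreign-language spellings and old names as written while leaving the rest of the a group enabled (output not shown because it depends on the dictionary):

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
config = NormalizerConfig(
    unify_level="lexeme",
    other_language=False,  # keep English and other non-Japanese spellings
    old_name=False,        # keep old names
    alias=True,            # still unify aliases
    misuse=True,           # still correct misused terms
)
print(normalizer.normalize("「JR-East」は「東日本旅客鉄道」とも呼ばれます", config))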

b Group: Abbreviation Normalization

Controls normalization of abbreviations using the following settings:

• alphabetic_abbreviation (default: True): Normalizes abbreviations written in alphabetic characters. Set to False to disable this feature.
• non_alphabetic_abbreviation (default: True): Normalizes abbreviations written in non-alphabetic characters (e.g., Japanese). Set to False to disable this feature.

c Group: Orthographic Normalization

Controls normalization of orthographic variations and errors using the following settings:

• alphabet (default: True): Normalizes alphabetic variations. Set to False to disable this feature.
• orthographic_variation (default: True): Normalizes orthographic variations. Set to False to disable this feature.
• misspelling (default: True): Normalizes misspellings. Set to False to disable this feature.
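
A sketch that switches the c group off entirely, so spelling variants and misspellings are left untouched while the broader a and b groups still apply:

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
config = NormalizerConfig(
    alphabet=False,
    orthographic_variation=False,
    misspelling=False,
)
print(normalizer.normalize("「パーソナル・コンピュータ」と「パーソナルコンピュータ」", config))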

5. custom_synonym (Custom Dictionary)

If you want to use a custom dictionary, control its behavior with the following setting:

• custom_synonym (default: True): Enables the use of a custom dictionary. Set to False to disable it.
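
For example, with a custom dictionary loaded you can fall back to the Sudachi synonym dictionary alone for a single call by disabling the flag (a sketch; the dictionary file format is described below):

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict.json")
text = "幽白は名作です"
print(normalizer.normalize(text))  # custom dictionary entries are applied
print(normalizer.normalize(text, NormalizerConfig(custom_synonym=False)))  # custom dictionary is ignored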

This hierarchical configuration allows for flexible normalization by defining the scope and target in detail.

Specifying SudachiDict

The length of text segmentation varies depending on the type of SudachiDict. The default is "full", but you can specify "small" or "core".
To use "small" or "core", install the corresponding package and specify it in the SynonymNormalizer() arguments:

pip install sudachidict_small
# or
pip install sudachidict_core
normalizer = SynonymNormalizer(sudachi_dict="small")
# or
normalizer = SynonymNormalizer(sudachi_dict="core")

※ Please refer to SudachiDict documentation for details.

Custom Dictionary Specification

You can specify your own custom dictionary.
If the same word exists in both the custom dictionary and Sudachi synonym dictionary, the custom dictionary takes precedence.

Custom Dictionary Format

Create a JSON file with the following format for your custom dictionary:

{
    "representative_word1": ["synonym1_1", "synonym1_2", ...],
    "representative_word2": ["synonym2_1", "synonym2_2", ...],
    ...
}

Example

If you create a file like this, "幽白", "ゆうはく", and "幽☆遊☆白書" will be normalized to "幽遊白書":

{
    "幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]
}

How to Specify

normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict.json")
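
Putting it together, here is a sketch that loads both the Sudachi synonym file and the custom dictionary shown above (assuming the two constructor arguments can be combined; the paths are placeholders):

from yurenizer import SynonymNormalizer
normalizer = SynonymNormalizer(
    synonym_file_path="path/to/synonyms.txt",
    custom_synonyms_file="path/to/custom_dict.json",
)
# With the example custom_dict.json above, "幽白", "ゆうはく", and "幽☆遊☆白書"
# should all be unified to "幽遊白書".
print(normalizer.normalize("「幽白」は「幽☆遊☆白書」の略称です。"))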

License

This project is licensed under the Apache License 2.0.

Open Source Software Used

This library uses SudachiPy and its dictionary SudachiDict for morphological analysis. These are also distributed under the Apache License 2.0.

For detailed license information, please check the LICENSE file of each project.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

yurenizer-0.1.6.tar.gz (18.7 kB)

Uploaded Source

Built Distribution

yurenizer-0.1.6-py3-none-any.whl (19.8 kB)

Uploaded Python 3

File details

Details for the file yurenizer-0.1.6.tar.gz.

File metadata

  • Download URL: yurenizer-0.1.6.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.10 Linux/6.5.0-1025-azure

File hashes

Hashes for yurenizer-0.1.6.tar.gz
• SHA256: e5d488b8dd3f388826a0d9fe37acb1fe139fb111198ae35c4dae20ce7a1a8211
• MD5: fe5ad6dc4ed27ddde790dad1dad14079
• BLAKE2b-256: 702e482dad8f446fb38bcfd72c56c074c7a07f754d9a4b593e896322c0681e86

See more details on using hashes here.

File details

Details for the file yurenizer-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: yurenizer-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 19.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.10 Linux/6.5.0-1025-azure

File hashes

Hashes for yurenizer-0.1.6-py3-none-any.whl
• SHA256: 8da697f0142170a2f2fde37c2b2e9b21ffddf2117f634184fbece99e83b3d3b7
• MD5: 8818b8cdf1387f08dd8d27ad76fd7c3c
• BLAKE2b-256: 6c7ff775557f456a31d97870619359031c4c6be627b56af47d86647c8a8a9534

See more details on using hashes here.
