
A library for standardizing terms with spelling variations using a synonym dictionary.

Project description

yurenizer

This is a Japanese text normalizer that resolves spelling inconsistencies.

The Japanese README is available here:
https://github.com/sea-turt1e/yurenizer/blob/main/README_ja.md

Overview

yurenizer is a tool for detecting and unifying variations in Japanese text notation.
For example, it can unify variations like "パソコン" (pasokon), "パーソナル・コンピュータ" (personal computer), and "パーソナルコンピュータ" into "パーソナルコンピューター".
These rules follow the Sudachi Synonym Dictionary.

Installation

pip install yurenizer

Download Synonym Dictionary

curl -L -o /path/to/synonyms.txt https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt
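
If you prefer to fetch the dictionary from Python rather than curl, here is a minimal sketch using only the standard library (the destination filename is an assumption; point it wherever you keep data files):

import urllib.request

# Same synonym dictionary file as the curl command above
url = "https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt"

# "synonyms.txt" is an illustrative destination path; adjust it to your environment
urllib.request.urlretrieve(url, "synonyms.txt")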

Usage

Quick Start

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonyms.txt")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」で、「パーソナル・コンピュータ」と表記することもあります。"
print(normalizer.normalize(text))
# Output: 「パーソナルコンピューター」は「パーソナルコンピューター」の「シノニム」で、「パーソナルコンピューター」と表記することもあります。

Customizing Settings

You can control normalization by passing a NormalizerConfig object as an argument to the normalize function.

Example with Custom Settings

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonyms.txt")
text = "パソコンはパーソナルコンピュータの同義語で、パーソナル・コンピュータと言ったりパーソナル・コンピューターと言ったりします。"
config = NormalizerConfig(
    taigen=True,
    yougen=False,
    expansion="from_another",
    other_language=False,
    alphabet=False,
    alphabetic_abbreviation=False,
    non_alphabetic_abbreviation=False,
    orthographic_variation=False,
    misspelling=False,
)
print(normalizer.normalize(text, config))
# Output: パソコンはパーソナルコンピュータの同義語で、パーソナル・コンピュータと言ったりパーソナル・コンピューターと言ったりします。
# The text is unchanged because normalization of abbreviations, orthographic variations, etc. is disabled in this config.

Configuration Details

  • unify_level (default="lexeme"): Specifies the unification level. The default "lexeme" unifies based on the lexeme number; "word_form" unifies based on the word form number; "abbreviation" unifies based on the abbreviation number.
  • taigen (default=True): Flag to include nouns in unification. Default is to include. Specify False to exclude.
  • yougen (default=False): Flag to include conjugated words in unification. Default is to exclude. Specify True to include.
  • expansion (default="from_another"): Synonym expansion control flag. The default "from_another" expands only entries whose synonym expansion control flag is 0; specify "ANY" to always expand.
  • other_language (default=True): Flag to normalize non-Japanese languages to Japanese. Default is to normalize. Specify False to disable.
  • alias (default=True): Flag to normalize aliases. Default is to normalize. Specify False to disable.
  • old_name (default=True): Flag to normalize old names. Default is to normalize. Specify False to disable.
  • misuse (default=True): Flag to normalize misused terms. Default is to normalize. Specify False to disable.
  • alphabetic_abbreviation (default=True): Flag to normalize alphabetic abbreviations. Default is to normalize. Specify False to disable.
  • non_alphabetic_abbreviation (default=True): Flag to normalize Japanese abbreviations. Default is to normalize. Specify False to disable.
  • alphabet (default=True): Flag to normalize alphabet variations. Default is to normalize. Specify False to disable.
  • orthographic_variation (default=True): Flag to normalize orthographic variations. Default is to normalize. Specify False to disable.
  • misspelling (default=True): Flag to normalize misspellings. Default is to normalize. Specify False to disable.
  • custom_synonym (default=True): Flag to use user-defined custom synonyms. Default is to use. Specify False to disable.
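
As a quick illustration of these settings, the sketch below switches the unification level to "word_form" while keeping the other defaults. The input sentence is only an example, and the exact output depends on the synonym dictionary you load:

from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="path/to/synonyms.txt")

# Unify by word form number instead of the default lexeme number;
# include nouns (taigen=True) and exclude conjugated words (yougen=False), as described above
config = NormalizerConfig(unify_level="word_form", taigen=True, yougen=False)

text = "「パソコン」と「パーソナル・コンピュータ」はどちらも使われます。"  # illustrative input
print(normalizer.normalize(text, config))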

Specifying SudachiDict

The granularity of text segmentation varies depending on the SudachiDict edition. The default is "full", but you can specify "small" or "core".
To use "small" or "core", install the corresponding package and pass it to SynonymNormalizer():

pip install sudachidict_small
# or
pip install sudachidict_core

normalizer = SynonymNormalizer(sudachi_dict="small")
# or
normalizer = SynonymNormalizer(sudachi_dict="core")

※ Please refer to SudachiDict documentation for details.

Custom Dictionary Specification

You can specify your own custom dictionary.
If the same word exists in both the custom dictionary and Sudachi synonym dictionary, the custom dictionary takes precedence.

Custom Dictionary Format

Create a JSON file with the following format for your custom dictionary:

{
    "representative_word1": ["synonym1_1", "synonym1_2", ...],
    "representative_word2": ["synonym2_1", "synonym2_2", ...],
    ...
}

Example

If you create a file like this, "幽白", "ゆうはく", and "幽☆遊☆白書" will be normalized to "幽遊白書":

{
    "幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]
}

How to Specify

normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict.json")
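
Putting it together, here is a minimal sketch that combines the Sudachi synonym dictionary with the custom dictionary above (the input sentence and file paths are illustrative):

from yurenizer import SynonymNormalizer

# custom_dict.json contains the mapping shown above:
# {"幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]}
normalizer = SynonymNormalizer(
    synonym_file_path="path/to/synonyms.txt",
    custom_synonyms_file="path/to/custom_dict.json",
)

text = "「幽☆遊☆白書」と「ゆうはく」は同じ作品です。"  # illustrative input
print(normalizer.normalize(text))
# Both variants should be unified to 幽遊白書, since the custom dictionary takes precedence.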

License

This project is licensed under the Apache License 2.0.

Open Source Software Used

This library uses SudachiPy and its dictionary SudachiDict for morphological analysis. These are also distributed under the Apache License 2.0.

For detailed license information, please check the LICENSE file of each project (SudachiPy and SudachiDict).

Download files

Download the file for your platform.

Source Distribution

yurenizer-0.1.1.tar.gz (16.4 kB)


Built Distribution

yurenizer-0.1.1-py3-none-any.whl (18.5 kB)


File details

Details for the file yurenizer-0.1.1.tar.gz.

File metadata

  • Download URL: yurenizer-0.1.1.tar.gz
  • Upload date:
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.10 Linux/6.5.0-1025-azure

File hashes

Hashes for yurenizer-0.1.1.tar.gz

  • SHA256: 86a9be43f13b9c7ea6664edb42ffa8e6045c6bf0483c69345e80ddddc1811520
  • MD5: d209283fdaa6a11d73b0257f4a4deab1
  • BLAKE2b-256: 8f28856eca72aa02ac73275c7203a6c52ee60c1cc37bcd9ca8f703f47b6621a8


File details

Details for the file yurenizer-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: yurenizer-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 18.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.10 Linux/6.5.0-1025-azure

File hashes

Hashes for yurenizer-0.1.1-py3-none-any.whl

  • SHA256: 81ba1fb2475e5dbc8093a447c2593ab6a2479c303c3e182be88bdda71d24349c
  • MD5: c15bb5a662737f4cec14b6d2e9e7c74a
  • BLAKE2b-256: 5d7c2b5fabe1747094173e0b4b7ea270ff6542cee1c609ed0fba9ab80ca34273

