
Quickly preprocesses Japanese text using NLP/NER from SpaCy for Japanese translation or other NLP tasks.

Reason this release was yanked:

Does not install dependencies properly; see v1.4.0.


Kairyou

Quickly preprocesses Japanese text using NLP/NER from SpaCy for Japanese translation or other NLP tasks.


Quick Start

To get started with Kairyou, install the package via pip:

pip install kairyou

Then, you can preprocess Japanese text by importing Kairyou and/or KatakanaUtil/Indexer:

from kairyou import Kairyou, KatakanaUtil, Indexer

Follow the usage examples provided in the Usage section for detailed instructions on preprocessing text and handling katakana.


Installation

Requires Python 3.8 or higher. I haven't tested it on anything lower; 3.7 might work, but I'm not sure. Feedback is welcome.

Kairyou can be installed using pip:

pip install kairyou

This will install Kairyou along with its dependencies, including spaCy and a few other packages.

These are the dependencies that will be installed:

setuptools>=61.0

wheel

setuptools_scm>=6.0

tomli

spacy>=3.7.0,<3.8.0

ja_core_news_lg @ https://github.com/explosion/spacy-models/releases/download/ja_core_news_lg-3.7.0/ja_core_news_lg-3.7.0-py3-none-any.whl

Usage


Kairyou

Kairyou is the global preprocessor client. Here's an example of how to use it:

from kairyou import Kairyou

text = "Your Japanese text here."
replacement_json = "path/to/your/replacement_rules.json"  # or a dict of rules
preprocessed_text, preprocessing_log, error_log = Kairyou.preprocess(text, replacement_json)

print(preprocessed_text)

Kairyou is mostly just preprocess(). Other functions exist, but they are not intended for direct use. preprocess() takes a string of Japanese text and either a path to a JSON file or a dictionary of replacement rules. It returns the preprocessed text, a log of the replacements made, and a log of any errors that occurred during preprocessing (typically none).
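Since the rules can be given either as a path or as a dictionary, a caller may want to normalize the two forms before inspecting them. The following is a minimal sketch of that idea using only the standard library. `load_rules` is a hypothetical helper written for illustration; it is not part of Kairyou and does not show how Kairyou handles the argument internally.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def load_rules(source: Union[str, Path, Dict[str, Any]]) -> Dict[str, Any]:
    """Return replacement rules as a dict, whether given a dict or a JSON file path.

    Hypothetical helper for illustration only; not part of the Kairyou API.
    """
    # A dict is assumed to already hold the rules, so it is returned unchanged.
    if isinstance(source, dict):
        return source

    # Anything else is treated as a path to a UTF-8 encoded JSON file.
    with open(source, "r", encoding="utf-8") as rules_file:
        rules = json.load(rules_file)

    if not isinstance(rules, dict):
        raise ValueError("Expected the JSON file to contain an object at the top level.")
    return rules
```

Either form can still be passed directly to preprocess(); a helper like this is only useful when you want to look at or modify the rules first.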

Currently, Kairyou supports two JSON types, "Kudasai" and "Fukuin". "Kudasai" is the native type and originated from the program of the same name. "Fukuin" is the type used by the original Onegai program and by kroatoan's Fukuin program. There are no major differences in replacement behavior between the two.

Blank Kudasai Json

Example Kudasai Json

Blank Fukuin Json

Example Fukuin Json


KatakanaUtil

KatakanaUtil provides utility functions for handling katakana characters in Japanese text. Example usage:

from kairyou import KatakanaUtil

katakana_word = "カタカナ"
if KatakanaUtil.is_katakana_only(katakana_word):
    print(f"{katakana_word} is composed only of Katakana characters.")

The following functions are available in KatakanaUtil:

is_katakana_only: Returns True if the input string is composed only of katakana characters.

is_actual_word: Returns True if the input string is an actual Japanese katakana word (not just something made up or a name). List of words can be found here.

is_punctuation: Returns True if the input string is punctuation (Both Japanese and English punctuation are supported). List of punctuation can be found here.

is_repeating_sequence: Returns True if the input string is just a repeating sequence of characters. (e.g. "ジロジロ")

more_punctuation_than_japanese: Returns True if the input string has more punctuation than Japanese characters.
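To make two of these checks concrete, here is a minimal standalone sketch of how a katakana-only test and a repeating-sequence test could be written. It is not Kairyou's implementation: the Unicode range used (U+30A0 to U+30FF) and the definition of "repeating" (a shorter unit repeated two or more times) are assumptions chosen for illustration.

```python
import re

# Assumption: "katakana" means the Unicode Katakana block U+30A0-U+30FF,
# which includes the long-vowel mark "ー" and the middle dot "・".
# Kairyou's own character set may differ.
KATAKANA_RE = re.compile(r"[\u30A0-\u30FF]+")


def is_katakana_only(text: str) -> bool:
    """True if text is non-empty and every character is in the Katakana block."""
    return bool(KATAKANA_RE.fullmatch(text))


def is_repeating_sequence(text: str) -> bool:
    """True if text consists of a shorter unit repeated two or more times."""
    length = len(text)
    for unit_size in range(1, length // 2 + 1):
        # Only unit sizes that divide the length evenly can tile the string.
        if length % unit_size == 0 and text[:unit_size] * (length // unit_size) == text:
            return True
    return False
```

Under these assumptions "ジロジロ" counts as a repeating sequence (the unit "ジロ" twice), while "カタカナ" does not.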


Indexer

Indexer is for "indexing" Japanese text: given input_text, a knowledge_base, and a replacements_json, it returns a list of new "names" along with the occurrence that was flagged for each.

What counts as a name is a bit complicated, but in short, a name:

  1. Must have the "person" label when using spaCy's NER.
  2. Cannot have more punctuation than Japanese characters.
  3. Cannot be a repeating sequence of characters.
  4. Cannot be an actual Japanese Katakana word.

So, it'll return names that don't appear in the other texts.
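The criteria above can be sketched as a simple filter. This is an illustration under stated assumptions, not Kairyou's code: entities are passed in as plain (text, label) pairs standing in for spaCy's NER output, the label string "PERSON" is assumed, `known_words` stands in for the katakana word list, the punctuation set is a small made-up sample, and every non-punctuation character is counted as Japanese.

```python
from typing import Iterable, List, Set, Tuple

PERSON_LABEL = "PERSON"  # assumed label string for people
PUNCTUATION = set("。、！？「」『』・…!?,.")  # small illustrative sample only


def more_punctuation_than_japanese(text: str) -> bool:
    # Simplification: any character not in PUNCTUATION is counted as Japanese.
    punctuation_count = sum(ch in PUNCTUATION for ch in text)
    return punctuation_count > len(text) - punctuation_count


def is_repeating_sequence(text: str) -> bool:
    # True if text is a shorter unit repeated two or more times.
    n = len(text)
    return any(n % k == 0 and text[:k] * (n // k) == text for k in range(1, n // 2 + 1))


def candidate_names(
    entities: Iterable[Tuple[str, str]],
    known_words: Set[str],
    knowledge_base: Iterable[str],
) -> List[str]:
    """Apply the four criteria, then drop names already seen in the knowledge base."""
    documents = list(knowledge_base)
    names: List[str] = []
    for text, label in entities:
        if label != PERSON_LABEL:                    # 1. must be a person entity
            continue
        if more_punctuation_than_japanese(text):     # 2. not mostly punctuation
            continue
        if is_repeating_sequence(text):              # 3. not a repeating sequence
            continue
        if text in known_words:                      # 4. not an actual katakana word
            continue
        if any(text in doc for doc in documents):    # must be new to the other texts
            continue
        if text not in names:                        # report each name once
            names.append(text)
    return names
```

In this sketch the knowledge base check is a plain substring search; the real Indexer may compare texts differently.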

This can be done via index():

from kairyou import Indexer

input_text = "Your Japanese text here."  # or a path to a text file
knowledge_base = ["more Japanese text here.", "even_more_japanese_text_here"]  # or a path to a text file or directory full of text files
replacements_json = "path/to/your/replacement_rules.json"  # or a dict of rules

NamesAndOccurrences = Indexer.index(input_text, knowledge_base, replacements_json)

NamesAndOccurrences is a list of named tuples, with the following fields:

  1. name: The name that was found.
  2. occurrence: The occurrence of the name in the input_text.
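As a rough illustration of working with results shaped like this, the sketch below defines a stand-in named tuple with the same two fields and formats a list of them. The type name `NameAndOccurrence`, the helper `format_results`, and the sample values are all hypothetical; the actual type returned by Indexer.index() and the type of the occurrence field may differ.

```python
from collections import namedtuple
from typing import Iterable, List

# Stand-in for the tuples returned by Indexer.index(); the name is hypothetical.
NameAndOccurrence = namedtuple("NameAndOccurrence", ["name", "occurrence"])


def format_results(results: Iterable[NameAndOccurrence]) -> List[str]:
    """Build one 'name: occurrence' line per result."""
    lines = []
    for entry in results:
        # Fields can be read by attribute, or unpacked: name, occurrence = entry
        lines.append(f"{entry.name}: {entry.occurrence}")
    return lines


# Made-up sample data, only to show the access pattern.
sample = [NameAndOccurrence(name="アリス", occurrence=1)]
for line in format_results(sample):
    print(line)
```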

index() works with both Fukuin and Kudasai JSONs.


License

This project (Kairyou) is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

The GPL is a copyleft license that promotes the principles of open-source software. It ensures that any derivative works based on this project must also be distributed under the same GPL license. This license grants you the freedom to use, modify, and distribute the software.

Please note that this information is a brief summary of the GPL. For a detailed understanding of your rights and obligations under this license, please refer to the full license text.


Contact

If you have any questions or suggestions, feel free to reach out to me at Tetralon07@gmail.com.

Also feel free to check out the GitHub repository for this project.

Or the issue tracker here.


Contribution

Contributions are welcome! I don't have a specific format for contributions, but please feel free to submit a pull request or open an issue if you have any suggestions or improvements.


Notes

Kairyou was originally developed as a part of Kudasai, a Japanese preprocessor later turned Machine Translator. It was later split off into its own package to be used independently of Kudasai for multiple reasons.

Kairyou gets its name from the Japanese word for "reform" (改良), which is pronounced "kairyou". It was chosen for two reasons: first, the name was picked during a large Kudasai rework, and second, it is a Japanese preprocessor, so the name seemed fitting.

This package is also my first serious attempt at creating a Python package, so I'm sure there are some things that could be improved. Feedback is welcomed.


Inspirations

Kudasai, and by extension Kairyou, was originally derived from Void's Script, later Onegai.

Kairyou also took some inspiration from Fukuin and its approach to katakana.

Thanks to all of the above for the inspiration and the work they put into their projects.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kairyou-1.2.1.tar.gz (90.6 kB view details)

Uploaded Source

Built Distribution

kairyou-1.2.1-py3-none-any.whl (76.3 kB view details)

Uploaded Python 3

File details

Details for the file kairyou-1.2.1.tar.gz.

File metadata

  • Download URL: kairyou-1.2.1.tar.gz
  • Upload date:
  • Size: 90.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for kairyou-1.2.1.tar.gz
Algorithm Hash digest
SHA256 7078780a9c6cbc5747539a91fd1f3dea98c8713ef3d3f66fc55a1161543260ff
MD5 51931fe13a9bd069f95616795b2d651f
BLAKE2b-256 5b2b09fb2862d407dfd533a6b0da53acafbcacb82740b06bafd2da42832f28f5

See more details on using hashes here.

File details

Details for the file kairyou-1.2.1-py3-none-any.whl.

File metadata

  • Download URL: kairyou-1.2.1-py3-none-any.whl
  • Upload date:
  • Size: 76.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for kairyou-1.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6f3cde4c750e4c06c6256cbf1ddc43f6b34b74b2227745613d6a4525685732af
MD5 ae36adb56523a3997bf1522a737ff451
BLAKE2b-256 aa965a1649ab2a1c47756f3c0722cb07b20b6fd4e10f235b06d95711590d7d05

See more details on using hashes here.
