Allows for easy preprocessing of Japanese text using NER/NLP from spaCy

Reason this release was yanked:

Does not install dependencies properly; see v1.4.0.

Project description



Quick Start

To get started with Kairyou, install the package via pip:

pip install kairyou

Then, you can preprocess Japanese text by importing Kairyou and/or KatakanaUtil as follows:

from kairyou import Kairyou, KatakanaUtil

Follow the usage examples provided in the Usage section for detailed instructions on preprocessing text and handling katakana.


Installation

Kairyou most likely requires Python 3.8 or higher; I haven't tested it on anything lower. Python 3.7 might work, but I'm not sure. Feedback is welcome.

Kairyou can be installed using pip:

pip install kairyou

This will install Kairyou along with its dependencies, including spaCy and a few other packages.

These are the dependencies that will be installed:

setuptools>=61.0

wheel

setuptools_scm>=6.0

tomli

spacy>=3.7.0,<3.8.0

ja_core_news_lg @ https://github.com/explosion/spacy-models/releases/download/ja_core_news_lg-3.7.0/ja_core_news_lg-3.7.0-py3-none-any.whl
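
If an install misbehaves, it can help to confirm that the interpreter and spaCy versions match the constraints listed above. The following is a minimal sketch, not part of Kairyou: the helper names python_ok, spacy_ok and installed_spacy_version are hypothetical, and the spaCy check compares only the major and minor version numbers, so it does not distinguish pre-releases.

```python
import sys


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.8 or newer."""
    return tuple(version_info[:2]) >= (3, 8)


def spacy_ok(version):
    """Return True if a version string such as '3.7.2' falls in >=3.7.0,<3.8.0.

    Only the major and minor components are compared (an assumption made
    to keep the sketch short).
    """
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) == (3, 7)


def installed_spacy_version():
    """Return the installed spaCy version string, or None if spaCy is absent."""
    # importlib.metadata exists from Python 3.8, so import it lazily.
    from importlib import metadata

    try:
        return metadata.version("spacy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_spacy_version()
    print("Python OK:", python_ok())
    if found is None:
        print("spaCy is not installed")
    else:
        print("spaCy", found, "in range:", spacy_ok(found))
```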

Usage


Kairyou

Kairyou simplifies the preprocessing of Japanese text for NLP and NER tasks. Here's an example of how to use it:

from kairyou import Kairyou

text = "Your Japanese text here."
replacement_json = "path/to/your/replacement_rules.json"  # or a dict of rules
preprocessed_text, log, error_log = Kairyou.preprocess(text, replacement_json)

print(preprocessed_text)

Kairyou is mostly just preprocess(). There are other functions available, but they are not intended for direct use. The preprocess() function takes a string of Japanese text and either a path to a JSON file or a dictionary of replacement rules. It returns three values: the preprocessed text, a log of the replacements made, and a log of any errors that occurred during preprocessing (typically none).

Note that the rules must follow the format of the example JSON file, Blank Format JSON. For an example that is filled out, see COTE Replacements JSON.
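
Because preprocess() accepts either a file path or a dictionary, the rules can be loaded and checked before they are passed in. The following is a minimal sketch, assuming the rules file is ordinary UTF-8 JSON with an object at the top level. load_rules is a hypothetical helper and is not part of Kairyou; it does not validate the rule keys themselves, which must still match the Blank Format JSON.

```python
import json
from pathlib import Path


def load_rules(path):
    """Read a replacement-rules JSON file and return its contents as a dict.

    Hypothetical helper for illustration; not part of Kairyou.
    """
    with Path(path).open(encoding="utf-8") as handle:
        rules = json.load(handle)

    # preprocess() is documented to take a dict of rules, so reject
    # anything else (for example a top-level JSON list) early.
    if not isinstance(rules, dict):
        raise ValueError(f"{path} must contain a JSON object at the top level")
    return rules
```

The returned dictionary can then be passed in place of the path, for example Kairyou.preprocess(text, load_rules("path/to/your/replacement_rules.json")).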

KatakanaUtil

KatakanaUtil provides utility functions for handling katakana characters in Japanese text. Example usage:

from kairyou import KatakanaUtil

katakana_word = "カタカナ"
if KatakanaUtil.is_katakana_only(katakana_word):
    print(f"{katakana_word} is composed only of Katakana characters.")

The following functions are available in KatakanaUtil:

is_katakana_only: Returns True if the input string is composed only of katakana characters.

is_actual_word: Returns True if the input string is an actual Japanese katakana word (not just something made up or a name). The list of words can be found [here](src/kairyou/words.py).

is_punctuation: Returns True if the input string is punctuation (both Japanese and English punctuation are supported). The list of punctuation can be found [here](src/kairyou/katakana_util.py).
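
To illustrate what a katakana-only check involves, here is a standalone sketch based on the Unicode Katakana block (U+30A0 to U+30FF). It is not Kairyou's implementation: looks_like_katakana is a hypothetical name, and KatakanaUtil.is_katakana_only may treat edge cases (half-width katakana, the empty string, embedded punctuation) differently.

```python
def looks_like_katakana(text):
    """Return True if text is non-empty and every character lies in the
    Unicode Katakana block (U+30A0 to U+30FF).

    Illustrative only; not the check Kairyou itself performs.
    """
    return bool(text) and all("\u30a0" <= ch <= "\u30ff" for ch in text)


print(looks_like_katakana("カタカナ"))  # True
print(looks_like_katakana("かたかな"))  # False (hiragana)
print(looks_like_katakana("コーヒー"))  # True (the long-vowel mark is in the block)
```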

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.


Contact

If you have any questions or suggestions, feel free to reach out to me at Tetralon07@gmail.com.

Also feel free to check out the GitHub repository for this project, or its issue tracker.


Contribution

Contributions are welcome! I don't have a specific format for contributions, but please feel free to submit a pull request or open an issue if you have any suggestions or improvements.


Notes

Kairyou was originally developed as part of Kudasai, a Japanese preprocessor turned machine translator. It was later split off into its own package, for multiple reasons, so that it can be used independently of Kudasai.

Kairyou gets its name from the Japanese word for "reform" (改良), which is pronounced "kairyou". The name was chosen for two reasons: it was picked during a large Kudasai rework, and since the package is a Japanese preprocessor, "reform" seemed fitting.

This package is also my first serious attempt at creating a Python package, so I'm sure there are some things that could be improved. Feedback is welcomed.

Download files

Source Distribution

kairyou-1.0.1.tar.gz (94.4 kB)

Built Distribution

kairyou-1.0.1-py3-none-any.whl (81.0 kB)
