KanaSplit - Japanese Text Tokenizer

A powerful and efficient Japanese text tokenizer with POS tagging and Jisho.org integration.

📌 Overview

KanaSplit is a Japanese text tokenizer that splits sentences into tokens and tags each one with its part of speech (POS). It integrates with Jisho.org to fetch additional lexical data for individual words. The tool is built on MeCab, a widely used morphological analyzer for Japanese.

🚀 Features

  • 🔹 Tokenization: Splits Japanese sentences into words and morphemes.
  • 🔹 POS Tagging: Provides the grammatical category of each token.
  • 🔹 Furigana Support: Extracts readings for kanji words.
  • 🔹 Jisho.org API Integration: Retrieves word meanings and definitions.
  • 🔹 Command-Line Interface (CLI): Allows easy text tokenization from the terminal.
  • 🔹 Error Logging: Captures and logs API failures for debugging.
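
To illustrate the furigana feature: MeCab's IPADIC dictionary reports readings in katakana (for example スシ), while the Jisho.org example in the Usage section shows a hiragana reading (すし). The source does not say how KanaSplit produces its readings, so the following is only a minimal sketch of one common approach, shifting katakana code points into the hiragana block. The function name katakana_to_hiragana is hypothetical and is not part of KanaSplit.

```python
# Hypothetical helper, not part of KanaSplit's documented API.
# The katakana block sits 0x60 code points above the hiragana block.
KATA_TO_HIRA_OFFSET = 0x60


def katakana_to_hiragana(text: str) -> str:
    """Convert katakana in the range ァ..ヶ to hiragana; leave other characters unchanged."""
    return "".join(
        chr(ord(ch) - KATA_TO_HIRA_OFFSET) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


print(katakana_to_hiragana("スシ"))  # prints すし
```

The long-vowel mark ー lies outside the converted range, so it passes through unchanged.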

🔧 Installation

KanaSplit requires Python 3.6+ and a working MeCab installation.

1️⃣ Install Dependencies

pip install -r requirements.txt

2️⃣ Install MeCab (if not installed)

For Windows:

choco install mecab mecab-ipadic

For macOS:

brew install mecab mecab-ipadic

For Linux (Ubuntu/Debian):

sudo apt install mecab mecab-ipadic

3️⃣ Install KanaSplit

pip install .

📖 Usage

1️⃣ Tokenize Japanese Text

kanasplit-cli "私はお寿司を食べたいです。"

Output:

- 私 (名詞)
- は (助詞)
- お (接頭詞)
- 寿司 (名詞)
- を (助詞)
- 食べ (動詞)
- たい (助動詞)
- です (助動詞)
- 。 (記号)
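
The output above pairs each surface form with its top-level part of speech. As a rough sketch of how such pairs can be derived, the snippet below parses raw MeCab output in the IPADIC layout, where each line holds the surface form, a tab, and a comma-separated feature list whose first field is the POS. The sample text is hand-written for illustration, and parse_mecab_output is a hypothetical name; KanaSplit's actual implementation may differ.

```python
from typing import List, Tuple

# Illustrative MeCab output in the IPADIC layout:
# surface form, TAB, comma-separated features (first field is the POS).
SAMPLE_MECAB_OUTPUT = (
    "寿司\t名詞,一般,*,*,*,*,寿司,スシ,スシ\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n"
    "EOS\n"
)


def parse_mecab_output(raw: str) -> List[Tuple[str, str]]:
    """Return (surface, top-level POS) pairs, skipping blank lines and the EOS marker."""
    tokens = []
    for line in raw.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        pos = features.split(",")[0]
        tokens.append((surface, pos))
    return tokens


# Print tokens in the same "- surface (POS)" style as the CLI output.
for surface, pos in parse_mecab_output(SAMPLE_MECAB_OUTPUT):
    print(f"- {surface} ({pos})")
```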

2️⃣ Fetch Word Definitions from Jisho.org

kanasplit-cli -w "寿司"

Output:

Word: 寿司
Reading: すし
Meanings: ['sushi', 'range of dishes made with vinegared rice combined with fish, vegetables, egg, etc.']
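
For readers curious how a lookup like this might work, here is a minimal sketch that queries Jisho.org's public JSON search endpoint and reduces the first result to the three fields shown above. The endpoint URL and the field names (data, japanese, senses, english_definitions) reflect the response layout as commonly observed and should be treated as assumptions. The function names are hypothetical and do not describe how fetch_word_from_jisho is implemented.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

# Assumed endpoint of Jisho.org's JSON word search.
JISHO_SEARCH_URL = "https://jisho.org/api/v1/search/words?keyword="


def summarize_jisho_response(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Reduce a search response to word, reading, and meanings of the first entry."""
    entries = payload.get("data") or []
    if not entries:
        return None
    first = entries[0]
    japanese = (first.get("japanese") or [{}])[0]
    meanings = []
    for sense in first.get("senses", []):
        meanings.extend(sense.get("english_definitions", []))
    return {
        "word": japanese.get("word"),
        "reading": japanese.get("reading"),
        "meanings": meanings,
    }


def lookup_word(word: str, timeout: float = 10.0) -> Optional[Dict[str, Any]]:
    """Fetch a word from Jisho.org (network call; not exercised in the demo below)."""
    url = JISHO_SEARCH_URL + urllib.parse.quote(word)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return summarize_jisho_response(payload)


# Offline demo with a hand-written payload in the assumed layout.
SAMPLE_PAYLOAD = {
    "data": [
        {
            "japanese": [{"word": "寿司", "reading": "すし"}],
            "senses": [{"english_definitions": ["sushi"]}],
        }
    ]
}
print(summarize_jisho_response(SAMPLE_PAYLOAD))
```

Keeping the parsing separate from the network call makes the summary logic testable without contacting Jisho.org.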

🛠 API Usage

You can also use KanaSplit in your Python scripts:

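# Import the tokenizer and the Jisho.org lookup helper.
from tokenizer import tokenize_text_with_pos, fetch_word_from_jisho

# Tokenize a sentence and print the result.
text = "ハサミを買いたいんですが、文房具売り場は何階ですか?"
tokens = tokenize_text_with_pos(text)
print(tokens)

# Look up a single word on Jisho.org and print the result.
word_data = fetch_word_from_jisho("寿司")
print(word_data)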

📝 Configuration

KanaSplit logs errors to errors.log. You can change the logging settings in tokenizer.py.
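
The source does not show the logging setup inside tokenizer.py, so the following is only a sketch of a comparable configuration built with Python's standard logging module. The function name build_error_logger, the logger name, and the message format are all assumptions chosen for illustration.

```python
import logging


def build_error_logger(path: str = "errors.log") -> logging.Logger:
    """Return a logger that writes ERROR-level records (and above) to the given file."""
    # Hypothetical logger name; KanaSplit's own name is not documented.
    logger = logging.getLogger("kanasplit_example")
    logger.setLevel(logging.ERROR)
    # Attach the file handler only once so repeated calls do not duplicate lines.
    if not logger.handlers:
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A caller would then record an API failure with something like build_error_logger().error("Jisho request failed: %s", reason), where reason is whatever error detail is available.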

🏗️ Development & Contribution

Want to contribute? Feel free to fork this repository and submit a pull request!

git clone https://github.com/byteMe394/KanaSplit.git
cd KanaSplit

📜 License

This project is licensed under the MIT License.

👨‍💻 Author

José Trujillo
