KanaSplit - Japanese Text Tokenizer
A powerful and efficient Japanese text tokenizer with POS tagging and Jisho.org integration.
📌 Overview
KanaSplit is a Japanese text tokenizer designed to break down Japanese sentences into meaningful tokens while providing part-of-speech (POS) tagging. It integrates with Jisho.org to fetch additional lexical data for individual words. The tool is built using MeCab, a popular morphological analyzer for the Japanese language.
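To illustrate the kind of output MeCab produces, the sketch below parses MeCab's default format (one token per line, surface form and comma-separated features split by a tab, terminated by `EOS`) into (surface, POS) pairs. The `parse_mecab_output` helper and the hard-coded sample are illustrative assumptions, not KanaSplit's actual implementation; they let you see the shape of the data without MeCab installed.

```python
# Illustrative sketch: turning MeCab's raw output into (surface, POS) pairs.
# sample_output mimics what MeCab.Tagger().parse() returns for "私は寿司を食べた".

def parse_mecab_output(raw: str):
    """Parse MeCab's tab-separated output lines into (surface, pos) tuples."""
    tokens = []
    for line in raw.splitlines():
        if line == "EOS" or not line.strip():
            continue  # skip the end-of-sentence marker and blank lines
        surface, features = line.split("\t", 1)
        pos = features.split(",")[0]  # the first feature field is the POS tag
        tokens.append((surface, pos))
    return tokens

sample_output = (
    "私\t名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ハ\n"
    "寿司\t名詞,一般,*,*,*,*,寿司,スシ,スシ\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)

print(parse_mecab_output(sample_output))
# → [('私', '名詞'), ('は', '助詞'), ('寿司', '名詞'), ('を', '助詞'), ('食べ', '動詞'), ('た', '助動詞')]
```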
🚀 Features
- 🔹 Tokenization: Splits Japanese sentences into words and morphemes.
- 🔹 POS Tagging: Provides grammatical category for each token.
- 🔹 Furigana Support: Extracts readings for kanji words.
- 🔹 Jisho.org API Integration: Retrieves word meanings and definitions.
- 🔹 Command-Line Interface (CLI): Allows easy text tokenization from the terminal.
- 🔹 Error Logging: Captures and logs API failures for debugging.
🔧 Installation
KanaSplit requires Python 3.6+ and MeCab to be installed.
1️⃣ Install Dependencies
```bash
pip install -r requirements.txt
```
2️⃣ Install MeCab (if not installed)
For Windows (Chocolatey):
```bash
choco install mecab mecab-ipadic
```
For macOS (Homebrew):
```bash
brew install mecab mecab-ipadic
```
For Linux (Ubuntu/Debian):
```bash
sudo apt install mecab mecab-ipadic
```
3️⃣ Install KanaSplit
```bash
pip install .
```
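Because KanaSplit depends on the `mecab` binary being installed system-wide, it can be useful to verify it is on your PATH before running anything. The check below is a standalone snippet, not part of the package:

```python
# Sanity check that the `mecab` executable is reachable before using
# KanaSplit; purely illustrative, not part of the package itself.
import shutil

def mecab_available() -> bool:
    """Return True if the mecab executable can be found on PATH."""
    return shutil.which("mecab") is not None

if not mecab_available():
    print("MeCab not found - install it with your package manager first.")
```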
📖 Usage
1️⃣ Tokenize Japanese Text
```bash
kanasplit-cli "私はお寿司を食べたいです。"
```
Output:
- 私 (名詞)
- は (助詞)
- お (接頭詞)
- 寿司 (名詞)
- を (助詞)
- 食べ (動詞)
- たい (助動詞)
- です (助動詞)
- 。 (記号)
2️⃣ Fetch Word Definitions from Jisho.org
```bash
kanasplit-cli -w "寿司"
```
Output:
Word: 寿司
Reading: すし
Meanings: ['sushi', 'range of dishes made with vinegared rice combined with fish, vegetables, egg, etc.']
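Lookups like this can be served by Jisho.org's public search endpoint (`https://jisho.org/api/v1/search/words?keyword=...`). The sketch below shows one way such a fetch might work; `fetch_word` and `extract_entry` are illustrative helpers, not KanaSplit's actual `fetch_word_from_jisho`, and the truncated `sample` payload mirrors the endpoint's response shape.

```python
# Sketch of a Jisho.org lookup via its public search API; the helper names
# are assumptions, not KanaSplit's real implementation.
import json
from urllib.parse import quote
from urllib.request import urlopen

def fetch_word(keyword: str) -> dict:
    """Query Jisho's search endpoint and return the decoded JSON payload."""
    url = "https://jisho.org/api/v1/search/words?keyword=" + quote(keyword)
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)

def extract_entry(payload: dict) -> dict:
    """Pull word, reading, and meanings out of the first search result."""
    first = payload["data"][0]
    japanese = first["japanese"][0]
    return {
        "word": japanese.get("word", ""),
        "reading": japanese.get("reading", ""),
        "meanings": first["senses"][0]["english_definitions"],
    }

# Truncated sample mirroring the response shape for 寿司 (avoids a live call):
sample = {"data": [{"japanese": [{"word": "寿司", "reading": "すし"}],
                    "senses": [{"english_definitions": ["sushi"]}]}]}
print(extract_entry(sample))
# → {'word': '寿司', 'reading': 'すし', 'meanings': ['sushi']}
```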
🛠 API Usage
You can also use KanaSplit in your Python scripts:
```python
from tokenizer import tokenize_text_with_pos, fetch_word_from_jisho

text = "ハサミを買いたいんですが、文房具売り場は何階ですか?"
tokens = tokenize_text_with_pos(text)
print(tokens)

word_data = fetch_word_from_jisho("寿司")
print(word_data)
```
📝 Configuration
KanaSplit logs errors in errors.log. You can configure logging settings in tokenizer.py.
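The error-logging setup might look something like the following; the exact configuration inside `tokenizer.py` is an assumption, shown here only to illustrate writing API failures to `errors.log`:

```python
# One possible errors.log setup; the actual configuration in tokenizer.py
# may differ - this is an illustrative sketch.
import logging

logger = logging.getLogger("kanasplit")
handler = logging.FileHandler("errors.log", encoding="utf-8")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.ERROR)

# Example of recording an API failure:
logger.error("Jisho.org request failed: %s", "timeout")
handler.flush()
```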
🏗️ Development & Contribution
Want to contribute? Feel free to fork this repository and submit a pull request!
```bash
git clone https://github.com/byteMe394/KanaSplit.git
cd KanaSplit
```
📜 License
This project is licensed under the MIT License.
👨‍💻 Author
File details
Details for the file kanasplit-1.0.0.tar.gz.
File metadata
- Download URL: kanasplit-1.0.0.tar.gz
- Upload date:
- Size: 3.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2733a1fb09f57bfc32f78e8e60b90781f473b750ec5d39d9bca918e946057e6c |
| MD5 | 474433238403667b97d897bb12fee54f |
| BLAKE2b-256 | 818dbd8ff5c1dde12682ea40b9030f6db0eae461fb6d85b026619bccf48073fb |
File details
Details for the file KanaSplit-1.0.0-py3-none-any.whl.
File metadata
- Download URL: KanaSplit-1.0.0-py3-none-any.whl
- Upload date:
- Size: 3.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df2b885c9a92018d0e9885df02a9492d8bd3ccad3372293a80a96623b92f0291 |
| MD5 | 5ba9121aa262c83629d8af4a1a64467f |
| BLAKE2b-256 | 2d9af4a41218031de32226623231b22a2acee6cdd25b5efd6d86478478f5a3b0 |