A tiny sentence/word tokenizer for Japanese text written in Python
Project description
🌿 Konoha: Simple wrapper of Japanese Tokenizers
Konoha is a Python library that provides an easy-to-use, unified interface to various Japanese tokenizers, letting you switch between tokenizers and streamline your pre-processing.
Supported tokenizers
Konoha supports MeCab, Janome, KyTea, Sudachi, nagisa, and Sentencepiece.
In addition, konoha provides rule-based tokenizers (whitespace and character) and a rule-based sentence splitter.
Quick Start with Docker
Simply run the following on your computer:
docker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub
Or you can build image on your machine:
git clone https://github.com/himkt/konoha # download konoha
cd konoha && docker-compose up --build # build and launch container
Tokenization is performed by posting a JSON object to localhost:8000/api/v1/tokenize.
You can also batch tokenize by passing texts: ["１つ目の入力", "２つ目の入力"] to localhost:8000/api/v1/batch_tokenize.
(API documentation is available at localhost:8000/redoc; you can view it in your web browser.)
Send a request using curl from your terminal.
Note that the endpoint paths changed in v4.6.4; please check the release notes (https://github.com/himkt/konoha/releases/tag/v4.6.4).
$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "text": "ใใใฏใใณใงใ"}'
{
"tokens": [
[
{
"surface": "ใใ",
"part_of_speech": "ๅ่ฉ"
},
{
"surface": "ใฏ",
"part_of_speech": "ๅฉ่ฉ"
},
{
"surface": "ใใณ",
"part_of_speech": "ๅ่ฉ"
},
{
"surface": "ใงใ",
"part_of_speech": "ๅฉๅ่ฉ"
}
]
]
}
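Batch requests follow the same pattern; a minimal sketch (the request body for the batch endpoint is assumed here by analogy with the single-text request above):
$ curl localhost:8000/api/v1/batch_tokenize -X POST -H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "texts": ["１つ目の入力", "２つ目の入力"]}'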
Installation
We recommend installing konoha with pip install 'konoha[all]'.
- Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'.
- Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'.
If you want to use a specific tokenizer, install konoha with that tokenizer's extra (e.g. konoha[mecab], konoha[sudachi], etc.) or install the tokenizer individually.
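For example, to use MeCab together with remote file support, you would combine the extras (mecab here is only one of the available tokenizer names; substitute the one you need):
pip install 'konoha[mecab,remote]'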
Example
Word level tokenization
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, てい, ます]
For more details, please see the example/ directory.
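The rule-based tokenizers mentioned under "Supported tokenizers" use the same interface; a minimal sketch (the outputs in the comments are illustrative, and passing the names 'Character' and 'Whitespace' to WordTokenizer is assumed to mirror the 'MeCab' example above):
from konoha import WordTokenizer

# character-level tokenization (rule-based, no external tokenizer required)
tokenizer = WordTokenizer('Character')
print(tokenizer.tokenize('自然言語処理'))
# => e.g. [自, 然, 言, 語, 処, 理]

# whitespace tokenization for text that is already segmented
tokenizer = WordTokenizer('Whitespace')
print(tokenizer.tokenize('自然 言語 処理'))
# => e.g. [自然, 言語, 処理]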
Remote files
Konoha supports dictionaries and models stored on cloud storage (currently Amazon S3).
This requires installing konoha with the remote option; see Installation.
# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))
# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))
# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
Sentence level tokenization
from konoha import SentenceTokenizer
sentence = "็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใใใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ"
tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็ซใ ใ', 'ๅๅใชใใฆใใฎใฏใชใใ', 'ใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ']
You can change the symbols used for the sentence splitter and for bracket expressions.
- sentence splitter
sentence = "็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใ๏ผใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ"
tokenizer = SentenceTokenizer(period="๏ผ")
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใ๏ผ', 'ใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ']
- bracket expression
sentence = "็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใใใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ"
tokenizer = SentenceTokenizer(
patterns=SentenceTokenizer.PATTERNS + [re.compile(r"ใ.*?ใ")],
)
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็ซใ ใ', 'ๅๅใชใใฆใใฎใฏใชใใ', 'ใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ']
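Both options can be combined in a single SentenceTokenizer; a minimal sketch (passing period and patterns together is assumed to work based on the two examples above, and the input sentence is made up for illustration):
import re
from konoha import SentenceTokenizer

# custom sentence-ending symbol plus an additional bracket pattern
tokenizer = SentenceTokenizer(
    period="．",
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"『.*?』")],
)
print(tokenizer.tokenize("私は猫だ．名前なんてものはない．だが，『かわいい。それで十分だろう』．"))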
Test
python -m pytest
Article
- トークナイザをいい感じに切り替えるライブラリ konoha を作った (I built konoha, a library for switching tokenizers easily)
- 日本語解析ツール Konoha に AllenNLP 連携機能を実装した (Implemented AllenNLP integration for the Japanese NLP tool Konoha)
Acknowledgement
The Sentencepiece model used in the tests is provided by @yoheikikuta. Thanks!
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution: konoha-5.4.0.tar.gz
Built Distribution: konoha-5.4.0-py3-none-any.whl
File details
Details for the file konoha-5.4.0.tar.gz.
File metadata
- Download URL: konoha-5.4.0.tar.gz
- Upload date:
- Size: 15.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.1 CPython/3.10.0 Linux/5.15.0-1023-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | b29e000c802531165235b26e0a316080fbdd31695997a898c3e20a9c2f7c1a57
MD5 | 4babab57ca6539800f5b959e6a0793d7
BLAKE2b-256 | f3f1a83f55d8e7e824d483bccce3902335d130b572b5a17bde8b282acd13d504
File details
Details for the file konoha-5.4.0-py3-none-any.whl.
File metadata
- Download URL: konoha-5.4.0-py3-none-any.whl
- Upload date:
- Size: 17.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.1 CPython/3.10.0 Linux/5.15.0-1023-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2a25c7c38415a31cf95c9a857d0994fab0f2449003154fb3c9d4786ce18de242
MD5 | ad5791b088f8911097f5f6959480150f
BLAKE2b-256 | fa7c4b6db3c11c0bebcc483c7e9056489e978e5a2bee06091ced3f7b3e79b0e3