A tiny sentence/word tokenizer for Japanese text written in Python
Project description
🌿 Konoha: Simple wrapper of Japanese Tokenizers
Konoha is a Python library that provides an easy-to-use, integrated interface to various Japanese tokenizers, which enables you to switch tokenizers easily and streamline your pre-processing.
Supported tokenizers
Konoha supports multiple tokenizer backends, including MeCab, Janome, nagisa, Sudachi, and Sentencepiece (the backends referenced throughout this document).
It also provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter, as shown in the sketch below.
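The rule-based tokenizers and the sentence splitter work without any external dictionary or model. The snippet below is a minimal sketch, assuming the rule-based backends are selected by the names "whitespace" and "character":
from konoha import SentenceTokenizer, WordTokenizer
# Rule-based word tokenizers: no external tokenizer needs to be installed.
print(WordTokenizer("whitespace").tokenize("natural language processing"))
# => [natural, language, processing]
print(WordTokenizer("character").tokenize("自然言語処理"))
# => [自, 然, 言, 語, 処, 理]
# Rule-based sentence splitter.
print(SentenceTokenizer().tokenize("私は猫だ。名前なんてものはない。"))
# => ['私は猫だ。', '名前なんてものはない。']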
Quick Start with Docker
Simply run the following on your computer:
docker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub
Or you can build image on your machine:
git clone https://github.com/himkt/konoha # download konoha
cd konoha && docker-compose up --build # build and launch container
Tokenization is done by posting a JSON object to localhost:8000/api/tokenize.
You can also tokenize a batch of texts by passing texts: ["1つ目の入力", "2つ目の入力"] to the server (a Python sketch for batch requests follows the curl example below).
(API documentation is available at localhost:8000/redoc; you can check it in your web browser.)
Send a request using curl on your terminal:
$ curl localhost:8000/api/tokenize -X POST -H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "text": "これはペンです"}'
{
"tokens": [
[
{
"surface": "これ",
"part_of_speech": "名詞"
},
{
"surface": "は",
"part_of_speech": "助詞"
},
{
"surface": "ペン",
"part_of_speech": "名詞"
},
{
"surface": "です",
"part_of_speech": "助動詞"
}
]
]
}
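The endpoint can also be called from Python. Below is a minimal sketch for batch tokenization using the requests library, assuming the batch request reuses the same tokenizer field together with the texts key mentioned above and that the response carries one token list per input text:
import requests

# Assumes the konoha server from the Docker quick start is running on localhost:8000.
payload = {"tokenizer": "mecab", "texts": ["1つ目の入力", "2つ目の入力"]}
response = requests.post("http://localhost:8000/api/tokenize", json=payload)
response.raise_for_status()

# "tokens" holds one list of token objects per input text.
for tokens in response.json()["tokens"]:
    print([token["surface"] for token in tokens])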
Installation
I recommend installing konoha with pip install 'konoha[all]' or pip install 'konoha[all_with_integrations]' (all_with_integrations also installs AllenNLP).
- Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'.
- Install konoha with a specific tokenizer and AllenNLP integration: pip install 'konoha[(tokenizer_name),allennlp]'.
- Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'.
**Attention!!**
Currently, installing konoha with all tokenizers on Python 3.8 fails, because nagisa cannot be installed on Python 3.8 (https://github.com/taishi-i/nagisa/issues/24). This is caused by a DyNet dependency problem (https://github.com/clab/dynet/issues/1616): DyNet does not provide a wheel for Python 3.8, and building DyNet from source also fails due to its own dependency issues.
If you want to install konoha with a tokenizer, please install konoha with a specific tokenizer (e.g. konoha[mecab], konoha[sudachi], etc.) or install the tokenizers individually.
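After installing an extra, a quick sanity check can confirm the backend is usable. This is a minimal sketch, assuming konoha was installed with the janome extra (pip install 'konoha[janome]'):
from konoha import WordTokenizer

# Instantiating the tokenizer fails if the corresponding backend is missing.
tokenizer = WordTokenizer("janome")
print(tokenizer.tokenize("自然言語処理"))  # prints the tokens produced by the janome backend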
Example
Word level tokenization
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]
For more details, please see the example/ directory.
Remote files
Konoha supports dictionaries and models hosted on cloud storage (currently Amazon S3).
This requires installing konoha with the remote option; see Installation.
# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))
# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))
# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
Sentence level tokenization
from konoha import SentenceTokenizer
sentence = "私は猫だ。名前なんてものはない。だが,「かわいい。それで十分だろう」。"
tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが,「かわいい。それで十分だろう」。']
AllenNLP integration
Konoha provides an AllenNLP integration, which enables users to specify a konoha tokenizer in a Jsonnet config file.
By running allennlp train with --include-package konoha, you can train a model using a konoha tokenizer!
For example, the konoha tokenizer is specified in xxx.jsonnet as follows:
{
"dataset_reader": {
"lazy": false,
"type": "text_classification_json",
"tokenizer": {
"type": "konoha", // <-- konoha here!!!
"tokenizer_name": "janome",
},
"token_indexers": {
"tokens": {
"type": "single_id",
"lowercase_tokens": true,
},
},
},
...
"model": {
...
},
"trainer": {
...
}
}
After finishing the other sections (e.g. model config, trainer config, etc.), allennlp train config/xxx.jsonnet --include-package konoha --serialization-dir yyy works!
(Remember to include konoha with --include-package konoha.)
For more details, please refer to my blog article (in Japanese, sorry).
Test
python -m pytest
Article
- Introducing Konoha (in Japanese): トークナイザをいい感じに切り替えるライブラリ konoha を作った ("I built konoha, a library that makes it easy to switch between tokenizers")
- Implementing the AllenNLP integration (in Japanese): 日本語解析ツール Konoha に AllenNLP 連携機能を実装した ("Implemented AllenNLP integration for the Japanese analysis tool Konoha")
Acknowledgement
The Sentencepiece model used in the tests is provided by @yoheikikuta. Thanks!
File details
Details for the file konoha-4.6.2.tar.gz.
File metadata
- Download URL: konoha-4.6.2.tar.gz
- Upload date:
- Size: 16.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.0.10 CPython/3.8.5 Darwin/19.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | c5feb806be40d66a557d70440d156accb0d55851d002ce808f6e29f3d3b93de0
MD5 | 682cf6af3a47301b2ce842d5b1a13bac
BLAKE2b-256 | c9787215eef5dc7486f87ca1a75018f734f7e2dc26a3ac741dba94a64df352d9
File details
Details for the file konoha-4.6.2-py3-none-any.whl.
File metadata
- Download URL: konoha-4.6.2-py3-none-any.whl
- Upload date:
- Size: 19.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.0.10 CPython/3.8.5 Darwin/19.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2e7fd97680085f26fcec1992a6e2b76870d3b88ff5ad7eceda4b4fc691fe10e8
MD5 | 62bb4b9429d63b6900e4d8fa3535f21c
BLAKE2b-256 | ea0147358efec5396fc80f98273c42cbdfe7aab056252b07884ffcc0f118978f