Sudachi for Transformers (chiTra)

chiTra is a Japanese tokenizer for Transformers; the name stands for "Sudachi for Transformers".
>>> from transformers import BertModel
>>> from sudachitra import BertSudachipyTokenizer
>>> tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model = BertModel.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state
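Under the hood, Sudachi is a dictionary-based morphological analyzer. As a rough illustration of the dictionary-lookup idea only, here is a toy greedy longest-match segmenter over a small hypothetical vocabulary; Sudachi's real algorithm is a lattice search with connection costs, and `TOY_DICT` and `greedy_segment` are invented for this sketch:

```python
# Toy illustration: greedy longest-match segmentation over a tiny
# hypothetical dictionary. Sudachi itself performs a full lattice-based
# analysis; this sketch only conveys the dictionary-lookup idea.
TOY_DICT = {"まさに", "オールマイティー", "な", "商品", "だ", "。"}

def greedy_segment(text, vocab):
    """Split text by repeatedly taking the longest dictionary match."""
    out = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:  # no dictionary match: emit the character as-is
            out.append(text[i])
            i += 1
    return out

print(greedy_segment("まさにオールマイティーな商品だ。", TOY_DICT))
# → ['まさに', 'オールマイティー', 'な', '商品', 'だ', '。']
```

A real morphological analyzer also resolves ambiguity between competing splits and attaches part-of-speech information to each morpheme, which a greedy matcher cannot do.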
Pre-trained BERT models and the tokenizer are coming soon!
$ pip install sudachitra
The small and full Sudachi dictionaries are optional and can be installed separately:

$ pip install sudachidict_small sudachidict_full
For pretraining instructions, please refer to pretraining/bert/README.md.
- Releasing pre-trained models for BERT
- Adding tests
- Updating documents
Sudachi and SudachiTra are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.