Japanese tokenizer for Transformers.
Sudachi for Transformers (chiTra)
chiTra is a Japanese tokenizer for Transformers.
chiTra stands for Sudachi for Transformers.
Quick Tour
>>> from transformers import BertModel
>>> from sudachitra import BertSudachipyTokenizer
>>> # Load the tokenizer and the BERT model under the same model name.
>>> tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model = BertModel.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> # Tokenize a sentence into PyTorch tensors and read the final-layer hidden states.
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state
The pre-trained BERT models and tokenizer are not yet released; they are coming soon!
Installation
$ pip install sudachitra
The default Sudachi dictionary is SudachiDict-core. Other dictionaries, such as SudachiDict-small and SudachiDict-full, are also supported, but each must be installed separately.
$ pip install sudachidict_small sudachidict_full
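To confirm which dictionaries are available in the current environment, a minimal sketch along the following lines may help. It is not part of chiTra: the helper name `installed_sudachi_dicts` is hypothetical, and the sketch assumes each dictionary package is importable under the same name used with pip (`sudachidict_core`, `sudachidict_small`, `sudachidict_full`).

```python
from importlib.util import find_spec

# Assumption: each SudachiDict pip package installs a top-level module
# with the same name as the package.
SUDACHI_DICT_MODULES = {
    "core": "sudachidict_core",
    "small": "sudachidict_small",
    "full": "sudachidict_full",
}


def installed_sudachi_dicts():
    """Map each dictionary type to whether its module can be found.

    Hypothetical helper for illustration; it only looks the modules up
    and does not import or load the dictionaries themselves.
    """
    return {
        dict_type: find_spec(module) is not None
        for dict_type, module in SUDACHI_DICT_MODULES.items()
    }


if __name__ == "__main__":
    for dict_type, available in installed_sudachi_dicts().items():
        status = "installed" if available else "missing"
        print(f"SudachiDict-{dict_type}: {status}")
```

Any dictionary reported as missing can be installed with the matching `pip install` command shown above.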
Pretraining
Please refer to pretraining/bert/README.md.
Roadmap
- Releasing pre-trained models for BERT
- Adding tests
- Updating documentation
For Developers
TBD
Contact
Sudachi and SudachiTra are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get invitation here)
Enjoy tokenization!
Download files
Source Distribution
File details
Details for the file SudachiTra-0.1.3.tar.gz.
File metadata
- Download URL: SudachiTra-0.1.3.tar.gz
- Upload date:
- Size: 29.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.9.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | bd9cc2ec2615fe844737e840548b6a9c98ba386d213e8cb3612e25b6220496ea
MD5 | 87d46fe2e41bb8f388139dedbd11a6e1
BLAKE2b-256 | 2a8863aa4fa9235af5dbb0da3eec8f5f896c7e0f3118742a7cb5077802b8b643