Japanese tokenizer for Transformers.

Sudachi for Transformers (chiTra)

chiTra is a Japanese tokenizer for Transformers.

chiTra stands for Sudachi for Transformers.

Quick Tour

>>> from transformers import BertModel
>>> from sudachitra import BertSudachipyTokenizer

>>> tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model = BertModel.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state

Pre-trained BERT models and tokenizer are coming soon!

Installation

$ pip install sudachitra

The default Sudachi dictionary is SudachiDict-core. To use another dictionary, such as SudachiDict-small or SudachiDict-full, install the corresponding package:

$ pip install sudachidict_small sudachidict_full
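To confirm which dictionaries are available after installation, a small check like the following may help. It is only a sketch using the Python standard library: the helper name `installed_sudachi_dicts` and the `SUDACHI_DICT_PACKAGES` mapping are hypothetical and not part of SudachiTra. It assumes the dictionary packages are importable as `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`.

```python
from importlib.util import find_spec

# Assumed import names of the Sudachi dictionary packages.
# The mapping keys are just labels chosen for this sketch.
SUDACHI_DICT_PACKAGES = {
    "small": "sudachidict_small",
    "core": "sudachidict_core",
    "full": "sudachidict_full",
}


def installed_sudachi_dicts(packages=SUDACHI_DICT_PACKAGES):
    """Return the labels whose package can be found in this environment."""
    return [
        label
        for label, module_name in packages.items()
        if find_spec(module_name) is not None
    ]


if __name__ == "__main__":
    # Prints a list of labels; the result depends on what is installed.
    print(installed_sudachi_dicts())
```

If a label you expect is missing from the result, the corresponding `pip install` step has probably not been run in the active environment.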

Pretraining

Please refer to pretraining/bert/README.md.

Roadmap

  • Releasing pre-trained models for BERT
  • Adding tests
  • Updating documentation

For Developers

TBD

Contact

Sudachi and SudachiTra are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get invitation here)

Enjoy tokenization!

