Sudachi for Transformers (chiTra)

chiTra is a Japanese tokenizer for Transformers, built on Sudachi.

The name chiTra stands for Sudachi for Transformers: the "chi" of Sudachi plus the "Tra" of Transformers.

Quick Tour

>>> from transformers import BertModel
>>> from sudachitra import BertSudachipyTokenizer

>>> # Load the tokenizer and the BERT model by the same pre-trained name.
>>> tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model = BertModel.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> # return_tensors="pt" returns PyTorch tensors, so PyTorch must be installed.
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state

Pre-trained BERT models and tokenizer are coming soon!

Installation

$ pip install sudachitra

The default Sudachi dictionary is SudachiDict-core. To use another dictionary, such as SudachiDict-small or SudachiDict-full, install it separately:

$ pip install sudachidict_small sudachidict_full
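
Before selecting a non-default dictionary, it can help to confirm that its package is actually importable. The following is a minimal sketch using only the Python standard library. It assumes the dictionary packages expose importable modules named sudachidict_core, sudachidict_small, and sudachidict_full (matching the pip package names above). The helper installed_sudachi_dicts is a hypothetical name for illustration and is not part of SudachiTra.

```python
from importlib.util import find_spec

# Assumed module names, taken from the pip package names above.
SUDACHI_DICT_MODULES = {
    "core": "sudachidict_core",
    "small": "sudachidict_small",
    "full": "sudachidict_full",
}


def installed_sudachi_dicts():
    """Return {dictionary type: True/False} for each assumed dictionary module."""
    return {
        dict_type: find_spec(module_name) is not None
        for dict_type, module_name in SUDACHI_DICT_MODULES.items()
    }


if __name__ == "__main__":
    # Print one status line per dictionary type.
    for dict_type, available in installed_sudachi_dicts().items():
        status = "installed" if available else "missing"
        print(f"SudachiDict-{dict_type}: {status}")
```

Any dictionary reported as missing can be installed with the pip command above.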

Pretraining

Please refer to pretraining/bert/README.md.

Roadmap

  • Releasing pre-trained models for BERT
  • Adding tests
  • Updating documents

For Developers

TBD

Contact

Sudachi and SudachiTra are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get invitation here)

Enjoy tokenization!
