
Japanese tokenizer for Transformers.


Sudachi for Transformers (chiTra)


chiTra is a Japanese tokenizer for Hugging Face Transformers, built on the Sudachi morphological analyzer. The name is short for Sudachi for Transformers.

Quick Tour

>>> from transformers import BertModel
>>> from sudachitra import BertSudachipyTokenizer

>>> # Load the tokenizer and the BERT model under the same pre-trained name.
>>> tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model = BertModel.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> # Encode one sentence as PyTorch tensors and read the final-layer hidden states.
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state

Pre-trained BERT models and tokenizer are coming soon!

Installation

$ pip install sudachitra

The default Sudachi dictionary is SudachiDict-core. To use another dictionary, such as SudachiDict-small or SudachiDict-full, install it separately:

$ pip install sudachidict_small sudachidict_full
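To confirm which dictionaries are available in the current environment, a small check like the following may help. It is a minimal sketch using only the Python standard library; it assumes the importable module names match the pip package names (`sudachidict_core`, `sudachidict_small`, `sudachidict_full`), and the helper `installed_sudachi_dicts` is a hypothetical name, not part of SudachiTra.

```python
from importlib.util import find_spec

# Assumption: each dictionary is importable under the same name as its pip package.
SUDACHI_DICTS = ("sudachidict_core", "sudachidict_small", "sudachidict_full")


def installed_sudachi_dicts(candidates=SUDACHI_DICTS):
    """Return the candidate module names that can be found in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = installed_sudachi_dicts()
    missing = [name for name in SUDACHI_DICTS if name not in found]
    print("installed:", found)
    print("missing:", missing)
```

Any name listed under `missing` can be installed with `pip install <name>`.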

Pretraining

Please refer to pretraining/bert/README.md.

Roadmap

  • Releasing pre-trained models for BERT
  • Adding tests
  • Updating documentation

For Developers

TBD

Contact

Sudachi and SudachiTra are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get invitation here)

Enjoy tokenization!
