
Sudachi for Transformers (chiTra)

chiTra is a Japanese tokenizer for Transformers; the name stands for Sudachi for Transformers.

Quick Tour

>>> from transformers import BertModel
>>> from sudachitra import BertSudachipyTokenizer

>>> tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model = BertModel.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state

Pre-trained BERT models and tokenizer are coming soon!
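
Once a checkpoint is available, the tokenizer can also be inspected on its own. A minimal sketch, assuming BertSudachipyTokenizer exposes the standard Hugging Face tokenizer interface (tokenize and convert_tokens_to_ids are inherited from the base tokenizer class, not chiTra-specific):

>>> # Split the sentence into subword tokens, then map them to vocabulary ids
>>> tokens = tokenizer.tokenize("まさにオールマイティーな商品だ。")
>>> ids = tokenizer.convert_tokens_to_ids(tokens)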

Installation

$ pip install sudachitra

The default Sudachi dictionary is SudachiDict-core. You can also use other dictionaries, such as SudachiDict-small and SudachiDict-full; to do so, install the corresponding packages:

$ pip install sudachidict_small sudachidict_full
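
To verify that an alternative dictionary is installed and loads correctly, you can call SudachiPy directly, independent of chiTra. A minimal sketch, assuming SudachiPy >= 0.6 (Dictionary, SplitMode, and the dict keyword are SudachiPy names, not part of chiTra):

>>> # Build a tokenizer backed by the small dictionary and split in mode C
>>> from sudachipy import Dictionary, SplitMode
>>> tokenizer_obj = Dictionary(dict="small").create()  # dict="full" for SudachiDict-full
>>> [m.surface() for m in tokenizer_obj.tokenize("まさにオールマイティーな商品だ。", SplitMode.C)]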

Pretraining

Please refer to pretraining/bert/README.md.

Roadmap

  • Releasing pre-trained models for BERT
  • Adding tests
  • Updating documentation

For Developers

TBD

Contact

Sudachi and SudachiTra are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get invitation here)

Enjoy tokenization!
