Japanese tokenizer for Transformers.

Sudachi Transformers (chiTra)

chiTra provides pre-trained large-scale language models and a Japanese tokenizer for Transformers.

chiTra stands for Sudachi Transformers.

Pretrained Model

The data is generously hosted by AWS through its Open Data Sponsorship Program.

Version	Normalized	SudachiTra	Sudachi	SudachiDict	Text	Pretrained Model
v1.0	normalized_and_surface	v0.1.7	0.6.2	20211220-core	NWJC (109GB)	395 MB (tar.gz)
v1.1	normalized_nouns	v0.1.8	0.6.6	20220729-core	NWJC with additional cleaning (79GB)	396 MB (tar.gz)
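
To compare a local environment with the versions listed in the table above, the installed package versions can be read with the standard library. This is a minimal sketch, not part of chiTra: the helper names `installed_version` and `check_versions` are hypothetical, and mapping the table columns to the distribution names `SudachiTra`, `SudachiPy`, and `SudachiDict-core` is an assumption. A mismatch only means the environment differs from the one used for training, not necessarily that the model will fail to load.

```python
from importlib import metadata

# Versions copied from the table above (leading "v" and "-core" suffix dropped).
# Mapping the "Sudachi" column to the SudachiPy distribution is an assumption.
EXPECTED = {
    "v1.0": {"SudachiTra": "0.1.7", "SudachiPy": "0.6.2", "SudachiDict-core": "20211220"},
    "v1.1": {"SudachiTra": "0.1.8", "SudachiPy": "0.6.6", "SudachiDict-core": "20220729"},
}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(model_version):
    """Map each package name to an (expected, installed) pair for one model version."""
    return {
        name: (expected, installed_version(name))
        for name, expected in EXPECTED[model_version].items()
    }


if __name__ == "__main__":
    for name, (expected, installed) in check_versions("v1.1").items():
        status = "ok" if installed == expected else "differs"
        print(f"{name}: expected {expected}, installed {installed} ({status})")
```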

Features

Training on large texts
- Models are trained on the NINJAL Web Japanese Corpus (NWJC) to support a wide variety of expressions and domains.
Using Sudachi
- Using the morphological analyzer Sudachi reduces the negative effects of orthographic variation.

How to use chiTra

Quick Tour

Requirements

$ pip install sudachitra
$ wget https://sudachi.s3.ap-northeast-1.amazonaws.com/chitra/chiTra-1.1.tar.gz
$ tar -zxvf chiTra-1.1.tar.gz

Load the model

>>> from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer
>>> from transformers import BertModel

>>> tokenizer = BertSudachipyTokenizer.from_pretrained('chiTra-1.1')
>>> tokenizer.tokenize("選挙管理委員会とすだち")
['選挙', '##管理', '##委員会', 'と', '酢', '##橘']

>>> model = BertModel.from_pretrained('chiTra-1.1')
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state
tensor([[[ 0.8583, -1.1752, -0.7987,  ..., -1.1691, -0.8355,  3.4678],
         [ 0.0220,  1.1702, -2.3334,  ...,  0.6673, -2.0774,  2.7731],
         [ 0.0894, -1.3009,  3.4650,  ..., -0.1140,  0.1767,  1.9859],
         ...,
         [-0.4429, -1.6267, -2.1493,  ..., -1.7801, -1.8009,  2.5343],
         [ 1.7204, -1.0540, -0.4362,  ..., -0.0228,  0.5622,  2.5800],
         [ 1.1125, -0.3986,  1.8532,  ..., -0.8021, -1.5888,  2.9520]]],
       grad_fn=<NativeLayerNormBackward0>)
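
In the tokenizer output above, pieces that continue a word carry a `##` prefix. For downstream inspection it can be handy to join them back into whole words. The helper below is a hypothetical sketch that assumes only what the example output shows (a `##` continuation prefix); it is not a chiTra API. Because this model normalizes words, the joined result is the normalized form (すだち becomes 酢橘), not the original surface text.

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join continuation pieces (those starting with `prefix`) onto the previous token."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: strip the prefix and append to the current word.
            words[-1] += token[len(prefix):]
        else:
            # Start of a new word.
            words.append(token)
    return words


tokens = ['選挙', '##管理', '##委員会', 'と', '酢', '##橘']
print(merge_wordpieces(tokens))
# ['選挙管理委員会', 'と', '酢橘']
```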

Installation

$ pip install sudachitra

The default Sudachi dictionary is SudachiDict-core.

You can also install and use other dictionaries, such as SudachiDict-small and SudachiDict-full.
In that case, install the dictionaries you want to use as shown below.
If you want to use a pre-trained model, note that it was trained with SudachiDict-core.

$ pip install sudachidict_small sudachidict_full
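
To see which dictionaries are present in the current environment, the standard library can look up the dictionary modules without importing them. This is a rough sketch: the function name `installed_sudachi_dicts` is hypothetical, and the importable module names are assumed to match the package names used in the pip command above (with `sudachidict_core` assumed by analogy).

```python
from importlib.util import find_spec

# Assumed module names, mirroring the pip package names above.
DICT_MODULES = {
    "core": "sudachidict_core",
    "small": "sudachidict_small",
    "full": "sudachidict_full",
}


def installed_sudachi_dicts():
    """Return the dictionary types whose modules can be found in this environment."""
    return [
        dict_type
        for dict_type, module_name in DICT_MODULES.items()
        if find_spec(module_name) is not None
    ]


if __name__ == "__main__":
    print("Installed Sudachi dictionaries:", installed_sudachi_dicts())
```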

Pretraining

For details on the pretraining procedure, please refer to pretraining/bert/README.md.

For Developers

TBD

License

"chiTra" is distributed by the National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. under the Apache License, Version 2.0.

Contact us

If you have questions, open an issue or come to our Slack workspace.

We have a Slack workspace where developers and users can ask questions and hold discussions: https://sudachi-dev.slack.com/ (an invitation is required to join).

Citing chiTra

We have published the following paper about chiTra:

勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸, 単語正規化による表記ゆれに頑健な BERT モデルの構築. 言語処理学会第28回年次大会, 2022.

When citing chiTra in papers, books, or services, please use the following BibTeX entry:

@INPROCEEDINGS{katsuta2022chitra,
    author    = {勝田哲弘 and 林政義 and 山村崇 and Tolmachev Arseny and 高岡一馬 and 内田佳孝 and 浅原正幸},
    title     = {単語正規化による表記ゆれに頑健な BERT モデルの構築},
    booktitle = "言語処理学会第28回年次大会(NLP2022)",
    year      = "2022",
    pages     = "",
    publisher = "言語処理学会",
}

Models used in the experiments

The models used in the experiments of "単語正規化による表記ゆれに頑健なBERTモデルの構築" are available below.

Normalized	Text	Pretrained Model
surface	Wiki-40B	tar.gz
normalized_and_surface	Wiki-40B	tar.gz
normalized_conjugation	Wiki-40B	tar.gz
normalized	Wiki-40B	tar.gz

Enjoy chiTra!
