A BERT model for nagisa
Project description
nagisa_bert
This library provides a tokenizer for using a Japanese BERT model with nagisa. The model is available in Transformers 🤗.
You can try fill-mask with nagisa_bert on Hugging Face Spaces.
Install
Python 3.7+ on Linux or macOS is required. You can install nagisa_bert with pip.
$ pip install nagisa_bert
Usage
This model can be used with the Transformers pipeline function.
from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるモデルです"

# Load the nagisa-based tokenizer and build a fill-mask pipeline that uses it.
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
fill_mask = pipeline("fill-mask", model="taishi-i/nagisa_bert", tokenizer=tokenizer)
print(fill_mask(text))
[{'score': 0.1385931372642517,
  'sequence': 'nagisa で 使用 できる モデル です',
  'token': 8092,
  'token_str': '使 用'},
 {'score': 0.11947669088840485,
  'sequence': 'nagisa で 利用 できる モデル です',
  'token': 8252,
  'token_str': '利 用'},
 {'score': 0.04910655692219734,
  'sequence': 'nagisa で 作成 できる モデル です',
  'token': 9559,
  'token_str': '作 成'},
 {'score': 0.03792576864361763,
  'sequence': 'nagisa で 購入 できる モデル です',
  'token': 9430,
  'token_str': '購 入'},
 {'score': 0.026893319562077522,
  'sequence': 'nagisa で 入手 できる モデル です',
  'token': 11273,
  'token_str': '入 手'}]
Tokenization and vectorization
from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるモデルです"

# Tokenize with the nagisa-based tokenizer.
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
tokens = tokenizer.tokenize(text)
print(tokens)
# ['na', '##g', '##is', '##a', 'で', '[MASK]', 'できる', 'モデル', 'です']

# Encode the text and get the final hidden states from the BERT encoder.
model = BertModel.from_pretrained("taishi-i/nagisa_bert")
h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
print(h)
tensor([[[-0.2912, -0.6818, -0.4097,  ...,  0.0262, -0.3845,  0.5816],
         [ 0.2504,  0.2143,  0.5809,  ..., -0.5428,  1.1805,  1.8701],
         [ 0.1890, -0.5816, -0.5469,  ..., -1.2081, -0.2341,  1.0215],
         ...,
         [-0.4360, -0.2546, -0.2824,  ...,  0.7420, -0.2904,  0.3070],
         [-0.6598, -0.7607,  0.0034,  ...,  0.2982,  0.5126,  1.1403],
         [-0.2505, -0.6574, -0.0523,  ...,  0.9082,  0.5851,  1.2625]]],
       grad_fn=<NativeLayerNormBackward0>)
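If you need a single vector per sentence rather than per-token states, one common approach is attention-mask-aware mean pooling over last_hidden_state. This is a minimal sketch, not part of the original example:

import torch
from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
model = BertModel.from_pretrained("taishi-i/nagisa_bert")

inputs = tokenizer("nagisaで利用できるモデルです", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

# Average only over real tokens; padding positions are masked out.
mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
sentence_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])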
Tutorial
Here is a list of notebooks on Japanese NLP using pre-trained models and transformers.
Notebook | Description
---|---
Fill-mask | How to use the pipeline function in transformers to fill in Japanese text.
Feature-extraction | How to use the pipeline function in transformers to extract features from Japanese text (see the sketch after this table).
Embedding visualization | Show how to visualize embeddings from Japanese pre-trained models.
How to fine-tune a model on text classification | Show how to fine-tune a pretrained model on a Japanese text classification task.
How to fine-tune a model on text classification with csv files | Show how to preprocess the data and fine-tune a pretrained model on a Japanese text classification task.
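For reference, the Feature-extraction notebook uses the pipeline API in the same way as the fill-mask example above. A minimal sketch (not taken from the notebook itself), assuming the same model and tokenizer:

from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer

tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
feature_extraction = pipeline(
    "feature-extraction", model="taishi-i/nagisa_bert", tokenizer=tokenizer
)

# Returns one 768-dimensional vector per token, including [CLS] and [SEP].
features = feature_extraction("nagisaで利用できるモデルです")
print(len(features[0]), len(features[0][0]))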
Model description
Architecture
The model architecture is the same as bert-base-uncased (12 layers, 768-dimensional hidden states, and 12 attention heads).
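These values can be checked against the configuration published on the Hub; a small sketch, assuming the hosted config matches the description above:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("taishi-i/nagisa_bert")
# Expected to print 12 768 12 per the architecture described above.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)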
Training Data
The model is trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 8, 2022 with make_corpus_wiki.py and create_pretraining_data.py.
Training
The model is trained with the default parameters of transformers.BertConfig. Due to GPU memory limitations, the batch size is kept small: 16 instances per batch, for 2M training steps.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file nagisa_bert-0.0.4.tar.gz.
File metadata
- Download URL: nagisa_bert-0.0.4.tar.gz
- Upload date:
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3285de369bbad15a622a580903f474c29cd72d79d2ecce83c5dc9bb4f78538ee
MD5 | 984539f10642ec532489d567ef81837e
BLAKE2b-256 | 3cee4d9590daf6d4410fe4db8333127ea12e017d1a2dd73724b18ef67d804249
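If you want to verify a downloaded file against the digests above, a minimal sketch (assuming the sdist sits in the current directory):

import hashlib

expected_sha256 = "3285de369bbad15a622a580903f474c29cd72d79d2ecce83c5dc9bb4f78538ee"
with open("nagisa_bert-0.0.4.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected_sha256 else "hash mismatch")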
File details
Details for the file nagisa_bert-0.0.4-py3-none-any.whl.
File metadata
- Download URL: nagisa_bert-0.0.4-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5d18fa156c73bd75b7371e25cfe1ec46dc4ffc45492cb0ab37d833203a41a82f
MD5 | b47efd2b48904a0d594b716d87484ad5
BLAKE2b-256 | ade3dc9ff6d40b8430d1c453bfa8d2717845f1049f01cb828e781ee9490b9970