
A BERT model for nagisa


nagisa_bert


This library provides a tokenizer to use a Japanese BERT model for nagisa. The model is available in Transformers 🤗.

You can try fill-mask with nagisa_bert on Hugging Face Spaces.

Install

You can install nagisa_bert by using the pip command.

pip install nagisa_bert

Supported Platforms:

  • 🐧 Linux, 🍎 macOS, 🪟 Windows: Python 3.10 - 3.14
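To confirm that an interpreter falls inside the supported range before installing, a small check along these lines may help. This is only a sketch: the helper name `is_supported_python` is hypothetical and not part of nagisa_bert, and the bounds are copied from the list above.

```python
import sys

# Bounds taken from the "Supported Platforms" list above.
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 14)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is within the supported range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


print(is_supported_python())
```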

Usage

This model can be used with the pipeline function in Transformers.

from transformers import pipeline
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるモデルです"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
print(fill_mask(text))
[{'score': 0.1385931372642517,
  'sequence': 'nagisa で 使用 できる モデル です',
  'token': 8092,
  'token_str': '使 用'},
 {'score': 0.11947669088840485,
  'sequence': 'nagisa で 利用 できる モデル です',
  'token': 8252,
  'token_str': '利 用'},
 {'score': 0.04910655692219734,
  'sequence': 'nagisa で 作成 できる モデル です',
  'token': 9559,
  'token_str': '作 成'},
 {'score': 0.03792576864361763,
  'sequence': 'nagisa で 購入 できる モデル です',
  'token': 9430,
  'token_str': '購 入'},
 {'score': 0.026893319562077522,
  'sequence': 'nagisa で 入手 できる モデル です',
  'token': 11273,
  'token_str': '入 手'}]
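The pipeline returns a list of dictionaries, and the `token_str` values in the output above contain a space between characters. As a minimal sketch, the predictions could be reduced to compact (token, score) pairs as follows. The helper name `summarize_predictions` is hypothetical, and removing the spaces is an assumption about how you may want to display the tokens. The sample data is copied from the first two entries of the output above, so the snippet runs without downloading the model.

```python
def summarize_predictions(predictions, top_k=3):
    """Return (token, rounded score) pairs for the highest-scoring predictions.

    Hypothetical helper; expects dictionaries shaped like the fill-mask
    pipeline output shown above.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [
        (p["token_str"].replace(" ", ""), round(p["score"], 4))
        for p in ranked[:top_k]
    ]


# First two entries copied from the output above.
sample = [
    {
        "score": 0.1385931372642517,
        "sequence": "nagisa で 使用 できる モデル です",
        "token": 8092,
        "token_str": "使 用",
    },
    {
        "score": 0.11947669088840485,
        "sequence": "nagisa で 利用 できる モデル です",
        "token": 8252,
        "token_str": "利 用",
    },
]

print(summarize_predictions(sample))
# [('使用', 0.1386), ('利用', 0.1195)]
```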

Tokenization and vectorization.

from transformers import BertModel
from nagisa_bert import NagisaBertTokenizer

text = "nagisaで[MASK]できるモデルです"
tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
tokens = tokenizer.tokenize(text)
print(tokens)
# ['na', '##g', '##is', '##a', 'で', '[MASK]', 'できる', 'モデル', 'です']

model = BertModel.from_pretrained("taishi-i/nagisa_bert")
h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
print(h)
tensor([[[-0.2912, -0.6818, -0.4097,  ...,  0.0262, -0.3845,  0.5816],
         [ 0.2504,  0.2143,  0.5809,  ..., -0.5428,  1.1805,  1.8701],
         [ 0.1890, -0.5816, -0.5469,  ..., -1.2081, -0.2341,  1.0215],
         ...,
         [-0.4360, -0.2546, -0.2824,  ...,  0.7420, -0.2904,  0.3070],
         [-0.6598, -0.7607,  0.0034,  ...,  0.2982,  0.5126,  1.1403],
         [-0.2505, -0.6574, -0.0523,  ...,  0.9082,  0.5851,  1.2625]]],
       grad_fn=<NativeLayerNormBackward0>)
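In the token list printed above, pieces that continue a word carry a `##` prefix (for example `'na', '##g', '##is', '##a'`). If you need whole words back, one possible approach is to glue each prefixed piece onto the preceding token. This is a sketch in plain Python; `merge_wordpieces` is a hypothetical helper, not a function provided by nagisa_bert or transformers.

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join subword pieces marked with `prefix` onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: append it to the previous word without the prefix.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return words


# Token list copied from the tokenizer output above.
tokens = ['na', '##g', '##is', '##a', 'で', '[MASK]', 'できる', 'モデル', 'です']
words = merge_wordpieces(tokens)
print(words)
# ['nagisa', 'で', '[MASK]', 'できる', 'モデル', 'です']
```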

Tutorial

The following notebooks cover Japanese NLP with pre-trained models and transformers.

Notebooks and their descriptions:

  • Fill-mask: How to use the pipeline function in transformers to fill in Japanese text.
  • Feature-extraction: How to use the pipeline function in transformers to extract features from Japanese text.
  • Embedding visualization: Show how to visualize embeddings from Japanese pre-trained models.
  • How to fine-tune a model on text classification: Show how to fine-tune a pretrained model on a Japanese text classification task.
  • How to fine-tune a model on text classification with csv files: Show how to preprocess the data and fine-tune a pretrained model on a Japanese text classification task.

Model description

Architecture

The model uses the same architecture as BERT bert-base-uncased: 12 layers, 768-dimensional hidden states, and 12 attention heads.
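To get a feel for the size of this architecture, the encoder's parameter count can be worked out by hand. The sketch below assumes the standard BERT layout with the transformers.BertConfig defaults for intermediate size (3072), maximum positions (512), and token types (2). The vocabulary size of nagisa_bert is not stated here, so the example uses 30522, the vocabulary size of bert-base-uncased, purely for illustration. The function name `count_bert_parameters` is hypothetical.

```python
def count_bert_parameters(
    vocab_size,
    hidden_size=768,
    num_layers=12,
    intermediate_size=3072,
    max_positions=512,
    type_vocab_size=2,
):
    """Count the parameters of a BERT encoder with pooler (no task head)."""
    layer_norm = 2 * hidden_size  # weight and bias

    # Word, position, and token-type embeddings plus one LayerNorm.
    embeddings = (
        (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm
    )

    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two dense layers of the feed-forward block, each with a bias.
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
    )
    per_layer = attention + layer_norm + feed_forward + layer_norm

    # Dense layer applied to the first token's hidden state.
    pooler = hidden_size * hidden_size + hidden_size

    return embeddings + num_layers * per_layer + pooler


# vocab_size=30522 is an assumption borrowed from bert-base-uncased.
print(count_bert_parameters(vocab_size=30522))
# 109482240
```

The number of attention heads does not appear in the calculation because the heads split the same 768-dimensional projections rather than adding parameters.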

Training Data

The model is trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia CirrusSearch dump file as of August 8, 2022, using make_corpus_wiki.py and create_pretraining_data.py.

Training

The model is trained with the default parameters of transformers.BertConfig. Due to GPU memory limitations, the batch size is kept small: 16 instances per batch, for 2M training steps.
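As a quick back-of-the-envelope check, the batch size and step count above imply the total number of training instances processed. The variable names are illustrative, and the figure counts instances processed, not unique examples, since the number of passes over the corpus is not stated.

```python
batch_size = 16               # instances per batch, as stated above
training_steps = 2_000_000    # 2M training steps, as stated above

# Total instances processed over the whole run.
instances_seen = batch_size * training_steps
print(f"{instances_seen:,}")
# 32,000,000
```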

