A BERT model for nagisa: It is created to be robust against typos and colloquial expressions for Japanese.

Project description

nagisa_bert

This library provides a tokenizer for the Japanese BERT model for nagisa. The nagisa BERT model is designed to be robust against typos and colloquial expressions in Japanese.

It is trained with Hugging Face's Transformers using both word and character units; words that are not in the vocabulary are split into characters. The model is available in Transformers 🤗.
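The word-or-characters fallback can be illustrated with a minimal sketch. This is a conceptual toy, not the actual NagisaBertTokenizer implementation (the `tokenize_with_char_fallback` function and the tiny vocabulary are hypothetical):

```python
# Conceptual sketch: a word-level vocabulary with a character-level
# fallback for unknown words (NOT the real NagisaBertTokenizer).
def tokenize_with_char_fallback(words, vocab):
    tokens = []
    for word in words:
        if word in vocab:
            tokens.append(word)   # known word: keep as a single token
        else:
            tokens.extend(word)   # unknown word: fall back to characters
    return tokens

vocab = {"で", "できる", "モデル", "です"}
print(tokenize_with_char_fallback(["nagisa", "で", "できる", "モデル", "です"], vocab))
# → ['n', 'a', 'g', 'i', 's', 'a', 'で', 'できる', 'モデル', 'です']
```

Because "nagisa" is not in the toy vocabulary, it is split into single characters, mirroring the tokenizer output shown in the Usage section below.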

Install

Python 3.7+ on Linux or macOS is required. You can install nagisa_bert with pip:

$ pip install nagisa_bert

Usage

This model is available via the Transformers pipeline method.

>>> from transformers import pipeline
>>> from nagisa_bert import NagisaBertTokenizer

>>> text = "nagisaで[MASK]できるモデルです"
>>> tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
>>> fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
>>> print(fill_mask(text))
[{'score': 0.1437765508890152,
  'sequence': 'n a g i s a で 使用 できる モデル です',
  'token': 1104,
  'token_str': '使 用'},
 {'score': 0.08369122445583344,
  'sequence': 'n a g i s a で 購入 できる モデル です',
  'token': 1821,
  'token_str': '購 入'},
 {'score': 0.07685843855142593,
  'sequence': 'n a g i s a で 利用 できる モデル です',
  'token': 548,
  'token_str': '利 用'},
 {'score': 0.07316956669092178,
  'sequence': 'n a g i s a で 閲覧 できる モデル です',
  'token': 13270,
  'token_str': '閲 覧'},
 {'score': 0.05647417902946472,
  'sequence': 'n a g i s a で 確認 できる モデル です',
  'token': 1368,
  'token_str': '確 認'}]

Tokenization and vectorization:

>>> from transformers import BertModel
>>> from nagisa_bert import NagisaBertTokenizer

>>> text = "nagisaで[MASK]できるモデルです"
>>> tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
>>> tokens = tokenizer.tokenize(text)
>>> print(tokens)
['n', 'a', 'g', 'i', 's', 'a', 'で', '[MASK]', 'できる', 'モデル', 'です']

>>> model = BertModel.from_pretrained("taishi-i/nagisa_bert")
>>> h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
>>> print(h)
tensor([[[-1.1636, -0.5645,  0.4484,  ..., -0.2207, -0.1540,  0.1051],
         [-1.0394,  0.8815, -0.8070,  ...,  1.0930,  0.2069,  0.9613],
         [-0.2068, -0.1445, -0.6113,  ..., -1.2920,  0.0725, -0.2164],
         ...,
         [-1.2590,  0.0118,  0.4998,  ..., -0.5212, -0.8015, -0.1050],
         [ 0.7925, -0.7628,  0.1016,  ...,  0.2233,  0.0164,  0.0102],
         [-0.7847, -0.1375,  0.4475,  ..., -0.4014,  0.0346,  0.3157]]],
       grad_fn=<NativeLayerNormBackward0>)
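The last hidden state above is one vector per token. To get a single sentence vector, a common recipe is masked mean pooling: average the token vectors while ignoring padding positions. This is not part of nagisa_bert itself; the sketch below uses NumPy on a toy array, and in real use you would pass in `last_hidden_state` and `attention_mask` converted from the model's tensors:

```python
import numpy as np

# Masked mean pooling: average token vectors, ignoring padding positions.
# (A common sentence-embedding recipe; not part of nagisa_bert itself.)
def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

h = np.arange(24, dtype=np.float32).reshape(1, 4, 6)  # toy (batch=1, seq=4, hidden=6)
mask = np.array([[1, 1, 1, 0]])                       # last position is padding
print(mean_pool(h, mask).shape)  # → (1, 6)
```

With the real model you would call something like `mean_pool(h.detach().numpy(), inputs["attention_mask"].numpy())`, where `inputs` is the dict returned by the tokenizer.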

Project details


Download files


Source Distribution

nagisa_bert-0.0.1.tar.gz (6.2 kB view details)

Uploaded Source

Built Distribution

nagisa_bert-0.0.1-py3-none-any.whl (6.7 kB view details)

Uploaded Python 3

File details

Details for the file nagisa_bert-0.0.1.tar.gz.

File metadata

  • Download URL: nagisa_bert-0.0.1.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.5

File hashes

Hashes for nagisa_bert-0.0.1.tar.gz
  • SHA256: 5441cccc0d134ec85aaccbb48bc913bbf096b38aff778beb33870198b18c7b2e
  • MD5: b3a2c34221bc3fa898397edc6bd5e030
  • BLAKE2b-256: 7ce8c5fce470c44e1f45678f37ddb2f7f08b300869f22b78d76a0529b552e77f


File details

Details for the file nagisa_bert-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: nagisa_bert-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 6.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.5

File hashes

Hashes for nagisa_bert-0.0.1-py3-none-any.whl
  • SHA256: dd668c737d0e059e883b79fc65fb5953ed57a34c1c4d2ee6305eb9e7a0c1203a
  • MD5: 4cef6d473e02d5c2a0d5274ad88e8461
  • BLAKE2b-256: a0dd3b76d0ea04aafd1a04317540ff707b3e69ed45ff76d460c4dd89a8e6cd19

