
A BERT model for nagisa, designed to be robust against typos and colloquial expressions in Japanese.

Project description

nagisa_bert


This library provides a tokenizer for using the Japanese BERT model for nagisa. The nagisa BERT model is designed to be robust against typos and colloquial expressions in Japanese.

It is trained on both word-level and character-level units with Hugging Face's Transformers, and unknown words fall back to character-level units (the tokenization example below shows "nagisa" being split into single characters). The model is available in Transformers 🤗.

Install

Python 3.7+ on Linux or macOS is required. You can install nagisa_bert with pip:

$ pip install nagisa_bert

Usage

This model is available through the Transformers pipeline method.

>>> from transformers import pipeline
>>> from nagisa_bert import NagisaBertTokenizer

>>> text = "nagisaで[MASK]できるモデルです"
>>> tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
>>> fill_mask = pipeline("fill-mask", model='taishi-i/nagisa_bert', tokenizer=tokenizer)
>>> print(fill_mask(text))
[{'score': 0.1437765508890152,
  'sequence': 'n a g i s a で 使用 できる モデル です',
  'token': 1104,
  'token_str': '使 用'},
 {'score': 0.08369122445583344,
  'sequence': 'n a g i s a で 購入 できる モデル です',
  'token': 1821,
  'token_str': '購 入'},
 {'score': 0.07685843855142593,
  'sequence': 'n a g i s a で 利用 できる モデル です',
  'token': 548,
  'token_str': '利 用'},
 {'score': 0.07316956669092178,
  'sequence': 'n a g i s a で 閲覧 できる モデル です',
  'token': 13270,
  'token_str': '閲 覧'},
 {'score': 0.05647417902946472,
  'sequence': 'n a g i s a で 確認 できる モデル です',
  'token': 1368,
  'token_str': '確 認'}]
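
Each item in the output is a dict with score, sequence, token, and token_str keys, so the top candidate can be selected directly. As a minimal sketch (not part of the original docs), continuing from the fill_mask pipeline above:

>>> results = fill_mask(text)
>>> best = max(results, key=lambda r: r["score"])  # highest-scoring candidate
>>> print(best["token_str"], round(best["score"], 4))
使 用 0.1438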

Tokenization and vectorization

>>> from transformers import BertModel
>>> from nagisa_bert import NagisaBertTokenizer

>>> text = "nagisaで[MASK]できるモデルです"
>>> tokenizer = NagisaBertTokenizer.from_pretrained("taishi-i/nagisa_bert")
>>> tokens = tokenizer.tokenize(text)
>>> print(tokens)
['n', 'a', 'g', 'i', 's', 'a', 'で', '[MASK]', 'できる', 'モデル', 'です']

>>> model = BertModel.from_pretrained("taishi-i/nagisa_bert")
>>> h = model(**tokenizer(text, return_tensors="pt")).last_hidden_state
>>> print(h)
tensor([[[-1.1636, -0.5645,  0.4484,  ..., -0.2207, -0.1540,  0.1051],
         [-1.0394,  0.8815, -0.8070,  ...,  1.0930,  0.2069,  0.9613],
         [-0.2068, -0.1445, -0.6113,  ..., -1.2920,  0.0725, -0.2164],
         ...,
         [-1.2590,  0.0118,  0.4998,  ..., -0.5212, -0.8015, -0.1050],
         [ 0.7925, -0.7628,  0.1016,  ...,  0.2233,  0.0164,  0.0102],
         [-0.7847, -0.1375,  0.4475,  ..., -0.4014,  0.0346,  0.3157]]],
       grad_fn=<NativeLayerNormBackward0>)
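
last_hidden_state has shape (batch_size, sequence_length, hidden_size). As a minimal sketch (an illustration, not an official nagisa_bert API), a single sentence vector can be derived by mean-pooling over the token axis; the 768 below assumes the standard BERT-base hidden size:

>>> sentence_vec = h.mean(dim=1)  # average over the token axis
>>> print(sentence_vec.shape)     # (batch_size, hidden_size); 768 assumes a BERT-base configuration
torch.Size([1, 768])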

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nagisa_bert-0.0.2.tar.gz (6.7 kB)

Uploaded Source

Built Distribution

nagisa_bert-0.0.2-py3-none-any.whl (7.2 kB)

Uploaded Python 3

File details

Details for the file nagisa_bert-0.0.2.tar.gz.

File metadata

  • Download URL: nagisa_bert-0.0.2.tar.gz
  • Upload date:
  • Size: 6.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.5

File hashes

Hashes for nagisa_bert-0.0.2.tar.gz

  • SHA256: df46a6e84b360e4a966634e7033b79c3c5a5e04aa0b73cca37eff020cb788228
  • MD5: 0fdabdfb3b6e18261a19ca4731b437cf
  • BLAKE2b-256: ea63926e2eaf4172a9573b8e3dcf57d7836cb7833c65b89cb18f8c9eea1bcfdc

See more details on using hashes here.
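
As a minimal sketch, a downloaded sdist can be checked against the SHA256 digest listed above with Python's hashlib (the filename and digest are the ones shown on this page):

>>> import hashlib
>>> with open("nagisa_bert-0.0.2.tar.gz", "rb") as f:
...     digest = hashlib.sha256(f.read()).hexdigest()
...
>>> digest == "df46a6e84b360e4a966634e7033b79c3c5a5e04aa0b73cca37eff020cb788228"
True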

File details

Details for the file nagisa_bert-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: nagisa_bert-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 7.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.5

File hashes

Hashes for nagisa_bert-0.0.2-py3-none-any.whl

  • SHA256: 807046f2e65cbfdad5c8ce60950dce52c32a1f2ad856d3f644e76e7b24d1f801
  • MD5: 6616f36cf0406d289b4b8c9fd6768b7f
  • BLAKE2b-256: e8e16c0413bafb79edff2a1f5937bc985de5a6e4eabdec1e0a876cd854573382

See more details on using hashes here.
