Japanese tokenizer with transformers library

Project description

jptranstokenizer: Japanese Tokenizer for transformers

This repository provides a Japanese tokenizer for the HuggingFace transformers library.
You can use JapaneseTransformerTokenizer in the same way as transformers.BertJapaneseTokenizer.
Issues may also be written in Japanese.

Documentation

Documentation is available on Read the Docs.

Install

pip install jptranstokenizer

Quickstart

This example uses jptranstokenizer.JapaneseTransformerTokenizer with the sentencepiece model of nlp-waseda/roberta-base-japanese and Juman++.
Before running the following steps, install pyknp and Juman++.

>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
>>> tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']

Note that different dependencies are required depending on the type of tokenizer you use.
See also the Quickstart on Read the Docs.
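Because the required dependencies differ by tokenizer type, it can help to confirm they are present before loading a tokenizer. The following is a minimal sketch for the Juman++ case from the Quickstart, using only the standard library. The helper name check_juman_dependencies is hypothetical and not part of jptranstokenizer, and the executable name jumanpp is an assumption about how Juman++ is installed on your system.

```python
import importlib.util
import shutil


def check_juman_dependencies(module_name="pyknp", binary_name="jumanpp"):
    """Report whether the Python wrapper and the Juman++ executable can be found.

    Hypothetical helper: the default names are assumptions, adjust them
    to match your own installation.
    """
    return {
        # True if the module can be imported in the current environment
        "python_module": importlib.util.find_spec(module_name) is not None,
        # True if an executable with this name is on PATH
        "binary_on_path": shutil.which(binary_name) is not None,
    }


if __name__ == "__main__":
    for name, found in check_juman_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, install that dependency before calling JapaneseTransformerTokenizer.from_pretrained.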

Citation

Another paper is forthcoming, so check here again before you cite.

This Implementation

@inproceedings{Suzuki-2023-nlp,
  jtitle = {{異なる単語分割システムによる日本語事前学習言語モデルの性能評価}},
  title = {{Performance Evaluation of Japanese Pre-trained Language Models with Different Word Segmentation Systems}},
  jauthor = {鈴木, 雅弘 and 坂地, 泰紀 and 和泉, 潔},
  author = {Suzuki, Masahiro and Sakaji, Hiroki and Izumi, Kiyoshi},
  jbooktitle = {言語処理学会 第29回年次大会 (NLP2023)},
  booktitle = {29th Annual Meeting of the Association for Natural Language Processing (NLP)},
  year = {2023},
  pages = {894-898}
}

Related Work

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jptranstokenizer-0.4.0.tar.gz (12.3 kB view details)

Uploaded Source

Built Distribution

jptranstokenizer-0.4.0-py3-none-any.whl (15.2 kB view details)

Uploaded Python 3

File details

Details for the file jptranstokenizer-0.4.0.tar.gz.

File metadata

  • Download URL: jptranstokenizer-0.4.0.tar.gz
  • Upload date:
  • Size: 12.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.9.18 Linux/6.2.0-1019-azure

File hashes

Hashes for jptranstokenizer-0.4.0.tar.gz

Algorithm    Hash digest
SHA256       937e939a466abbfacc351c2e39b80d1b835d5d7bd503d060c7d5d611112493b8
MD5          90c3c302580b294c5ddf283971ec1c73
BLAKE2b-256  1467274690f833a8f044e1c24b3e32017bea623c51bc0f6ee1dde480ac9863e5
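
A downloaded file can be checked against the published digests with Python's hashlib. The sketch below is one way to do it. The function names file_digest and matches_published_digest are hypothetical, and the "blake2b-256" label is a convention of this sketch that maps to hashlib.blake2b with a 32-byte digest.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    if algorithm == "blake2b-256":
        # BLAKE2b truncated to 256 bits (32 bytes)
        hasher = hashlib.blake2b(digest_size=32)
    else:
        # e.g. "sha256" or "md5"
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_digest(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published hex string, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Example call, assuming the sdist was downloaded to the current directory:
# matches_published_digest(
#     "jptranstokenizer-0.4.0.tar.gz",
#     "937e939a466abbfacc351c2e39b80d1b835d5d7bd503d060c7d5d611112493b8",
# )
```

SHA256 is the digest to prefer for verification. MD5 is listed for completeness but is not collision-resistant.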

File details

Details for the file jptranstokenizer-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: jptranstokenizer-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 15.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.9.18 Linux/6.2.0-1019-azure

File hashes

Hashes for jptranstokenizer-0.4.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       ce8db27202302431e6065b082231e76880198a2ee67fd211d1415815321ab4ca
MD5          9f4bade64d952165ceae0d45a1115624
BLAKE2b-256  915ee70a9ce54fbe5c8aa1b814b928e037a864b33b1552ee98daa4f34d8ae5b0
