
Japanese tokenizer with transformers library


jptranstokenizer: Japanese Tokenizer for transformers


This repository provides a Japanese tokenizer for use with the HuggingFace transformers library.
You can use JapaneseTransformerTokenizer in the same way as transformers.BertJapaneseTokenizer.
Issues may also be written in Japanese.

Documentation

Documentation is available on Read the Docs.

Install

pip install jptranstokenizer

Quickstart

This example uses jptranstokenizer.JapaneseTransformerTokenizer with the SentencePiece model of nlp-waseda/roberta-base-japanese together with Juman++.
Before running it, install pyknp and Juman++.

>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
>>> tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']
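Each token in the output above starts with "▁" (U+2581), the character SentencePiece uses to mark a word boundary. The sketch below shows one way to post-process such a token list in plain Python. The helper names split_into_words and detokenize are hypothetical and are not part of jptranstokenizer. It assumes that every word begins with a "▁"-prefixed token, as in the example output.

```python
SPIECE_UNDERLINE = "\u2581"  # "▁", SentencePiece's word-boundary marker


def split_into_words(tokens):
    """Group subword tokens into words, starting a new word at each "▁"."""
    words = []
    for token in tokens:
        if token.startswith(SPIECE_UNDERLINE) or not words:
            # A leading "▁" (or the very first token) opens a new word.
            words.append(token.lstrip(SPIECE_UNDERLINE))
        else:
            # A token without the marker continues the current word.
            words[-1] += token
    return words


def detokenize(tokens):
    """Join tokens into a string without spaces, as is usual for Japanese."""
    return "".join(tokens).replace(SPIECE_UNDERLINE, "")


# Token list copied from the example output above.
tokens = ["▁外国", "▁人", "▁参政", "▁権"]
print(split_into_words(tokens))  # ['外国', '人', '参政', '権']
print(detokenize(tokens))        # 外国人参政権
```

A token without the leading marker is treated as a continuation of the previous word, so a list such as ["▁参", "政"] is grouped into the single word "参政".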

Note that different dependencies are required depending on the type of tokenizer you use.
See also Quickstart on Read the Docs
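Because the required dependencies vary by tokenizer type, it can help to check what is available before loading a tokenizer. The following is a minimal sketch using only the standard library. The helper find_missing is hypothetical, and it assumes that the Juman++ example needs the pyknp package and an executable named jumanpp on PATH. Adjust both lists for other tokenizer types.

```python
import importlib.util
import shutil


def find_missing(modules, executables):
    """Return the module and executable names that cannot be found locally."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


# Assumed requirements for the Juman++ example: the pyknp package
# and a jumanpp binary on PATH.
missing_modules, missing_executables = find_missing(["pyknp"], ["jumanpp"])
if missing_modules or missing_executables:
    print("Missing Python packages:", missing_modules)
    print("Missing executables:", missing_executables)
else:
    print("pyknp and jumanpp were found.")
```

The check only confirms that the names can be located. It does not verify versions or that Juman++ runs correctly.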

Citation

Another paper is forthcoming, so check this section again before citing.

This Implementation

@inproceedings{Suzuki-2023-nlp,
  jtitle = {{異なる単語分割システムによる日本語事前学習言語モデルの性能評価}},
  title = {{Performance Evaluation of Japanese Pre-trained Language Models with Different Word Segmentation Systems}},
  jauthor = {鈴木, 雅弘 and 坂地, 泰紀 and 和泉, 潔},
  author = {Suzuki, Masahiro and Sakaji, Hiroki and Izumi, Kiyoshi},
  jbooktitle = {言語処理学会 第29回年次大会 (NLP2023)},
  booktitle = {29th Annual Meeting of the Association for Natural Language Processing (NLP)},
  year = {2023},
  pages = {894-898}
}

Related Work
