Japanese tokenizer with transformers library

jptranstokenizer: Japanese Tokenizer for transformers

This repository provides a Japanese tokenizer for the Hugging Face transformers library.
You can use JapaneseTransformerTokenizer in the same way as transformers.BertJapaneseTokenizer.
Issues may also be written in Japanese.

Documentation

Documentation is available on Read the Docs.

Install

pip install jptranstokenizer

Quickstart

This example uses jptranstokenizer.JapaneseTransformerTokenizer with the SentencePiece model of nlp-waseda/roberta-base-japanese and Juman++.
Before running it, install pyknp and Juman++.

>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
>>> tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']

Note that different dependencies are required depending on the type of tokenizer you use.
See also the Quickstart on Read the Docs.
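
Because the required dependencies vary by tokenizer type, it can help to check what is installed before loading a tokenizer. The following is a minimal sketch using only the Python standard library, not part of jptranstokenizer itself. The helper name missing_quickstart_dependencies is hypothetical, and the executable name jumanpp is an assumption about how Juman++ is installed on your system.

```python
import importlib.util
import shutil


def missing_quickstart_dependencies(
    modules=("jptranstokenizer", "pyknp"),
    executables=("jumanpp",),  # assumed command name for Juman++
):
    """Return the names of modules/executables that cannot be found.

    Hypothetical helper for illustration; the defaults mirror the
    Quickstart requirements (pyknp and Juman++).
    """
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for exe in executables:
        # shutil.which returns None when the command is not on PATH.
        if shutil.which(exe) is None:
            missing.append(exe)
    return missing


if __name__ == "__main__":
    missing = missing_quickstart_dependencies()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All Quickstart dependencies found.")
```

If you use a different tokenizer type, pass the module and command names it needs through the modules and executables arguments.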

Citation

Another paper is forthcoming, so please check here again before citing.

This Implementation

@misc{suzuki-2022-github,
  author = {Masahiro Suzuki},
  title = {jptranstokenizer: Japanese Tokenzier for transformers},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/retarfi/jptranstokenizer}}
}

Related Work
