Japanese tokenizer with transformers library

jptranstokenizer: Japanese Tokenizer for transformers

This repository provides a Japanese tokenizer for use with the HuggingFace transformers library.
You can use JapaneseTransformerTokenizer in the same way as transformers.BertJapaneseTokenizer.
Issues may also be written in Japanese.

Documentation

Documentation is available on Read the Docs.

Install

pip install jptranstokenizer

Quickstart

This example uses jptranstokenizer.JapaneseTransformerTokenizer with the sentencepiece model of nlp-waseda/roberta-base-japanese and Juman++.
Before running it, you need to install pyknp and Juman++.

>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
>>> tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']
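Because the tokenizer is meant to be used like transformers.BertJapaneseTokenizer, the tokens can presumably be mapped to vocabulary ids in the usual way. The following is a minimal sketch, not the library's documented workflow: it assumes the tokenizer object exposes the standard transformers methods tokenize and convert_tokens_to_ids, and the helper names tokenize_to_ids and demo are hypothetical.

```python
from typing import List, Tuple


def tokenize_to_ids(tokenizer, text: str) -> Tuple[List[str], List[int]]:
    """Split text into tokens and map each token to its vocabulary id.

    `tokenizer` is assumed to offer `tokenize` and `convert_tokens_to_ids`,
    as transformers tokenizers such as BertJapaneseTokenizer do.
    """
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    return tokens, ids


def demo() -> None:
    """Run the Quickstart model end to end.

    Needs jptranstokenizer, pyknp and Juman++ installed, and access to the
    nlp-waseda/roberta-base-japanese model files.
    """
    from jptranstokenizer import JapaneseTransformerTokenizer

    tokenizer = JapaneseTransformerTokenizer.from_pretrained(
        "nlp-waseda/roberta-base-japanese"
    )
    tokens, ids = tokenize_to_ids(tokenizer, "外国人参政権")
    print(tokens)
    print(ids)
```

Calling demo() once the dependencies are in place prints the tokens from the Quickstart followed by their ids. The ids depend on the model's vocabulary, so none are shown here.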

Note that the required dependencies differ depending on the type of tokenizer you use.
See also the Quickstart on Read the Docs.
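For the Juman++ example above, the prerequisites can be checked before loading the tokenizer. This is a rough sketch using only the standard library. It assumes that Juman++ installs an executable named jumanpp on the PATH, and the function name missing_jumanpp_dependencies is hypothetical. Other tokenizer types need other dependencies that this check does not cover.

```python
import importlib.util
import shutil
from typing import List


def missing_jumanpp_dependencies() -> List[str]:
    """Return the prerequisites of the Juman++ Quickstart that were not found."""
    missing = []
    # Python packages: look them up without importing them.
    for package in ("jptranstokenizer", "pyknp"):
        if importlib.util.find_spec(package) is None:
            missing.append(package)
    # Assumption: the Juman++ binary is called "jumanpp" and is on the PATH.
    if shutil.which("jumanpp") is None:
        missing.append("Juman++")
    return missing


if __name__ == "__main__":
    missing = missing_jumanpp_dependencies()
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("All dependencies for the Juman++ example were found.")
```

An empty list means the Quickstart should be able to import the packages and locate Juman++. It does not verify that the installed versions are compatible.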

Citation

Another paper is forthcoming, so be sure to check here again before you cite.

This Implementation

@misc{suzuki-2022-github,
  author = {Masahiro Suzuki},
  title = {jptranstokenizer: Japanese Tokenzier for transformers},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/retarfi/jptranstokenizer}}}

Related Work
