Skip to main content

Japanese tokenizer with transformers library

Project description

jptranstokenizer: Japanese Tokenzier for transformers

Python pypi GitHub release License Test codecov

This is a repository for japanese tokenizer with HuggingFace library.
You can use JapaneseTransformerTokenizer like transformers.BertJapaneseTokenizer.
issue は日本語でも大丈夫です。

Documentations

Documentations are available on readthedoc.

Install

pip install jptranstokenizer

Quickstart

This is the example to use jptranstokenizer.JapaneseTransformerTokenizer with sentencepiece model of nlp-waseda/roberta-base-japanese and Juman++.
Before the following steps, you need to install pyknp and Juman++.

>>> from jptranstokenizer import JapaneseTransformerTokenizer
>>> tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
>>> tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']

Note that different dependencies are required depending on the type of tokenizer you use.
See also Quickstart on Read the Docs

Citation

There will be another paper. Be sure to check here again when you cite.

This Implementation

@misc{suzuki-2022-github,
  author = {Masahiro Suzuki},
  title = {jptranstokenizer: Japanese Tokenzier for transformers},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/retarfi/jptranstokenizer}}
}

Related Work

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

jptranstokenizer-0.3.1.tar.gz (11.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

jptranstokenizer-0.3.1-py3-none-any.whl (15.8 kB view details)

Uploaded Python 3

File details

Details for the file jptranstokenizer-0.3.1.tar.gz.

File metadata

  • Download URL: jptranstokenizer-0.3.1.tar.gz
  • Upload date:
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.9.16 Linux/5.15.0-1033-azure

File hashes

Hashes for jptranstokenizer-0.3.1.tar.gz
Algorithm Hash digest
SHA256 47b8b07ddcfe332e7ac5918d0442bc2174d1e6e248eac64e6b661c83b360a483
MD5 bf0e757dc915daa029cfd2de35e38e70
BLAKE2b-256 8b0a57d34d4506051279b58328f5f560938b1b43ab013f539d2e2cbc1b4f1b08

See more details on using hashes here.

File details

Details for the file jptranstokenizer-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: jptranstokenizer-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 15.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.9.16 Linux/5.15.0-1033-azure

File hashes

Hashes for jptranstokenizer-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8815fc969ca0b7d6e0e74803f73525f6f93fd9fde730660dfa2af88127d4ee29
MD5 56d4b611ab9f99b0654d1409b1bbcf46
BLAKE2b-256 76efc803c17eb5ab82a735240c7cb9455c07b02588fb69e85724e34be15f78d3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page