kodoc-tokenizer
Tokenizer for kodoc
Requirements
transformers>=4.0
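To confirm that the requirement above is met before installing, a small check along these lines may help. It is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper names `installed_version`, `release_tuple`, and `meets_minimum` are hypothetical and not part of kodoc-tokenizer. The comparison looks only at the leading dotted numbers, so a pre-release such as `4.0.0rc1` is treated as satisfying `>=4.0`; the `packaging` library handles such cases more strictly.

```python
import re
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Keep only the leading dotted-number part, e.g. '4.12.0.dev0' -> (4, 12, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(version, minimum=(4, 0)):
    """Rough check that `version` is at least `minimum` (assumption: release numbers only)."""
    release = release_tuple(version)
    # Pad short versions such as "4" so they compare as (4, 0).
    release = release + (0,) * (len(minimum) - len(release))
    return release >= minimum


if __name__ == "__main__":
    found = installed_version("transformers")
    if found is None:
        print("transformers is not installed")
    else:
        print(found, "OK" if meets_minimum(found) else "older than 4.0")
```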
Installation
pip3 install kodoc-tokenizer
How to Use
Version
import kodoc_tokenizer
print(kodoc_tokenizer.__version__)  # 0.1.0rc1
clean_text
from kodoc_tokenizer import clean_text
text = "Today a::: : \t\t \x00I \x00a 朝 三暮四 [MASK] m \na fool \n\nbecause I am a fool. \n [SEP][CLS] "
assert clean_text(text) == "Today a::: : I a 朝 三暮四 [MASK] m a fool because I am a fool. [SEP][CLS]"
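Judging from the example above, clean_text appears to drop NUL characters, turn tabs and newlines into spaces, and collapse runs of whitespace. The following is a minimal sketch of that behaviour under those assumptions; `clean_text_sketch` is a hypothetical stand-in written for illustration and is not the library's implementation, which may handle more cases.

```python
import re
import unicodedata


def clean_text_sketch(text):
    """Illustrative approximation of clean_text (assumed behaviour, not the real code)."""
    kept = []
    for ch in text:
        # Treat tab, newline and carriage return as plain spaces.
        if ch in "\t\n\r":
            kept.append(" ")
            continue
        # Drop NUL, the Unicode replacement character, and other control/format characters.
        if ch in ("\x00", "\ufffd") or unicodedata.category(ch) in ("Cc", "Cf"):
            continue
        kept.append(ch)
    # Collapse whitespace runs into single spaces and trim both ends.
    return re.sub(r"\s+", " ", "".join(kept)).strip()
```

Special tokens such as [MASK], [SEP], and [CLS] pass through unchanged in this sketch because it only touches whitespace and control characters.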
Basic Function
from kodoc_tokenizer import KodocTokenizer
tokenizer = KodocTokenizer.from_pretrained("kodoc/kodoc-bert-base")
tokens = tokenizer.tokenize("다이어트마침표_1부 2013.7.25 02:24 PM 페이지1 제1부 다이어트 핵심 바이블 A`2`Z 다이어트에 실패하는 원인 중 하나는 잘못된 상식도 크게 한몫을 한다.")
File details
Details for the file kodoc-tokenizer-0.1.0rc1.tar.gz.
File metadata
- Download URL: kodoc-tokenizer-0.1.0rc1.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.6.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | c4bd2f8dc8d904d96b477d07b9e33f867fe3c26a87ed59ee9b9bb4d954c50569
MD5 | 00fb14e4c2835086eb734980a64560ab
BLAKE2b-256 | 4f3cfe8b7900d8d5efffbe9f770f37ca4b1734755035085ccd0c444faa541c97
File details
Details for the file kodoc_tokenizer-0.1.0rc1-py3-none-any.whl.
File metadata
- Download URL: kodoc_tokenizer-0.1.0rc1-py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.6.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5d62b0ef08bfba522a7dd84254f90dc45d44ef63f60254e1cac3427faa6e3db5
MD5 | 24a427d590588a1f6cd0ec455084721f
BLAKE2b-256 | 70b6a5fd086d5f7b0a6b0f8ce09d891b0b1a884b01601316cefec5bf7fbe60e9