Tokenizer for kodoc
kodoc-tokenizer
- Tokenizer for kodoc
- Based on transformers==4.7.0
Installation
pip3 install kodoc-tokenizer
How to Use
Version
import kodoc_tokenizer
kodoc_tokenizer.__version__ # 0.2.0rc1
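The installed version can also be checked without importing the package, which is handy in setup scripts. The following is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is hypothetical and not part of kodoc-tokenizer.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of kodoc-tokenizer.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("kodoc-tokenizer"))
```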
clean_text
from kodoc_tokenizer import clean_text
text = "Today a::: : \t\t \x00I \x00a 朝 三暮四 [MASK] m \na fool \n\nbecause I am a fool. \n [SEP][CLS] "
assert clean_text(text) == "Today a::: : I a 朝 三暮四 [MASK] m a fool because I am a fool. [SEP][CLS]"
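To make the behaviour in the assertion above concrete, here is a rough sketch of a cleaner that reproduces that one example. It is an assumption about the general technique (drop control characters, collapse whitespace), not the actual implementation of `clean_text`; the name `sketch_clean_text` is hypothetical.

```python
import re
import unicodedata


def sketch_clean_text(text: str) -> str:
    """Hypothetical stand-in; NOT kodoc_tokenizer's real clean_text."""
    kept = []
    for ch in text:
        if ch in "\t\n\r":
            # Treat tabs and line breaks as ordinary spaces.
            kept.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Drop remaining control/format characters such as \x00.
            continue
        else:
            kept.append(ch)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", "".join(kept)).strip()


sample = "Today a::: : \t\t \x00I \x00a 朝 三暮四 [MASK] m \na fool \n\nbecause I am a fool. \n [SEP][CLS] "
cleaned = sketch_clean_text(sample)
```

Special tokens such as `[MASK]`, `[SEP]`, and `[CLS]` pass through untouched in this sketch, matching the example above. The real function may handle additional cases not covered here.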
Basic Function
from kodoc_tokenizer import get_kodoc_tokenizer
tokenizer = get_kodoc_tokenizer()
tokens = tokenizer.tokenize("다이어트마침표_1부 2013.7.25 02:24 PM 페이지1 제1부 다이어트 핵심 바이블 A`2`Z 다이어트에 실패하는 원인 중 하나는 잘못된 상식도 크게 한몫을 한다.")
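The two documented entry points can be chained so that text is cleaned before it is tokenized. The sketch below takes both steps as plain callables so it runs without the package installed; `clean_then_tokenize`, `demo_clean`, and `demo_tokenize` are hypothetical names used only for illustration, and whether the real tokenizer already cleans its input internally is not stated in this README.

```python
from typing import Callable, List


def clean_then_tokenize(
    text: str,
    clean: Callable[[str], str],
    tokenize: Callable[[str], List[str]],
) -> List[str]:
    """Apply a cleaning function, then a tokenizing function."""
    return tokenize(clean(text))


# Stand-ins so the example runs on its own (assumptions, not the real API).
def demo_clean(text: str) -> str:
    return " ".join(text.split())


def demo_tokenize(text: str) -> List[str]:
    return text.split(" ")


demo_tokens = clean_then_tokenize("  a \n fool  ", demo_clean, demo_tokenize)
```

With kodoc-tokenizer installed, the same helper could presumably be called as `clean_then_tokenize(raw, clean_text, tokenizer.tokenize)`, using the `clean_text` function and the `tokenizer` object shown above.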
Source Distribution
kodoc-tokenizer-0.2.0rc1.tar.gz (90.4 kB)
Built Distribution
kodoc_tokenizer-0.2.0rc1-py3-none-any.whl (92.3 kB)
File details
Details for the file kodoc-tokenizer-0.2.0rc1.tar.gz.
File metadata
- Download URL: kodoc-tokenizer-0.2.0rc1.tar.gz
- Upload date:
- Size: 90.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | a8eb8bf2ff82e40618c72644d03260eb25b1734cce947b020bad47c16a1f55c2
MD5 | 9a250358783c2976dddc5e10be024dee
BLAKE2b-256 | 27991fff4397ea3249b1511d058b6a9f85999b08501ade46e8755ebe13a40080
File details
Details for the file kodoc_tokenizer-0.2.0rc1-py3-none-any.whl.
File metadata
- Download URL: kodoc_tokenizer-0.2.0rc1-py3-none-any.whl
- Upload date:
- Size: 92.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.9.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | db32603bc9be2632979c88489c1caeda5496800a423c60619532388fc5a53377
MD5 | 7de6fd4f4426b367ca35b147ceea4fd7
BLAKE2b-256 | 908358bfc873767f665b481069c775f2785e79c435805c6edfcada8e3f48f780