Tokenizer for kodoc

kodoc-tokenizer

  • Tokenizer for kodoc
  • Based on transformers==4.7.0

Installation

pip3 install kodoc-tokenizer

How to Use

Version

import kodoc_tokenizer

print(kodoc_tokenizer.__version__)  # 0.2.0rc1
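Since the package is described as based on transformers==4.7.0, it can help to check which versions are actually installed before importing anything. The following is a minimal sketch using only the standard library (Python 3.8+). The helper name `installed_version` is hypothetical and is not part of kodoc-tokenizer.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report both distributions; None means the package is not installed.
for name in ("kodoc-tokenizer", "transformers"):
    print(name, installed_version(name))
```

If the reported transformers version differs from 4.7.0, tokenizer behavior may not match what this release was built against.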

clean_text

from kodoc_tokenizer import clean_text

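# As the assertion shows, clean_text drops control characters such as \x00,
# collapses runs of whitespace (tabs, newlines, repeated spaces) into single
# spaces, and strips leading and trailing whitespace.
text = "Today a::: : \t\t \x00I \x00a  朝 三暮四 [MASK] m \na fool \n\nbecause I am a fool. \n [SEP][CLS]  "
assert clean_text(text) == "Today a::: : I a 朝 三暮四 [MASK] m a fool because I am a fool. [SEP][CLS]"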
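The behavior in the assertion above can be approximated in plain Python. This is a sketch only: it assumes BERT-style cleaning (drop control characters, normalize whitespace), which the package does not document, and `clean_text_sketch` is a hypothetical name, not the library's implementation. It reproduces the example above but may differ from `clean_text` on other inputs.

```python
import unicodedata


def clean_text_sketch(text: str) -> str:
    """Approximate cleaning: drop control characters, collapse whitespace."""
    kept = []
    for ch in text:
        # Tab, newline and carriage return are category Cc, but are treated
        # as whitespace here rather than dropped.
        if ch in "\t\n\r":
            kept.append(" ")
            continue
        # Assumption: drop NUL, the replacement character, and every other
        # character in a Unicode "C*" (control/format/unassigned) category.
        if ch in ("\x00", "\ufffd") or unicodedata.category(ch).startswith("C"):
            continue
        kept.append(ch)
    # split() with no arguments splits on any whitespace run and discards
    # leading/trailing whitespace, so joining yields single spaces.
    return " ".join("".join(kept).split())


print(clean_text_sketch("a \t\x00b\n\nc  "))  # a b c
```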

Basic Function

from kodoc_tokenizer import get_kodoc_tokenizer

# Build the tokenizer shipped with the package.
tokenizer = get_kodoc_tokenizer()
# Tokenize a sample Korean document string.
tokens = tokenizer.tokenize("다이어트마침표_1부 2013.7.25 02:24 PM 페이지1 제1부 다이어트 핵심 바이블 A`2`Z 다이어트에 실패하는 원인 중 하나는 잘못된 상식도 크게 한몫을 한다.")
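To run the tokenizer over several documents, a small wrapper can be sketched as follows. It assumes only that the tokenizer object exposes the `tokenize` method used above. `tokenize_documents`, `SupportsTokenize`, and `WhitespaceStub` are hypothetical names; the stub stands in for `get_kodoc_tokenizer()` so the snippet runs without the package installed, and its whitespace splitting is not how the real tokenizer segments text.

```python
from typing import Iterable, List, Protocol


class SupportsTokenize(Protocol):
    """Anything with a tokenize(text) method that returns a list of tokens."""

    def tokenize(self, text: str) -> List[str]: ...


def tokenize_documents(
    tokenizer: SupportsTokenize, texts: Iterable[str]
) -> List[List[str]]:
    """Tokenize each non-blank document; return one token list per document."""
    return [tokenizer.tokenize(t) for t in texts if t.strip()]


class WhitespaceStub:
    """Hypothetical stand-in tokenizer that simply splits on whitespace."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


docs = ["제1부 다이어트 핵심 바이블", "   ", "2013.7.25 02:24 PM"]
print(tokenize_documents(WhitespaceStub(), docs))
```

With the package installed, passing `get_kodoc_tokenizer()` in place of `WhitespaceStub()` should work the same way, provided the returned object has a compatible `tokenize` method.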

