
Some utilities for tokenization in PyTorch


tkzs

Installation

pip install tkzs

Use a simple tokenizer

from tkzs.tokenizers import re_tokenizer

txt = "Contrastive Fine-tuning Improves Robustness for Neural Rankers"

re_tokenizer(txt)
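
The pattern behind `re_tokenizer` is not documented here. As a rough illustration of what a regex-based tokenizer typically does, here is a minimal sketch using only the standard library. The pattern and the name `simple_re_tokenize` are assumptions for illustration, not the tkzs implementation, and the real function may split differently (for example, it may keep "Fine-tuning" as one token).

```python
import re

# Assumed pattern: runs of word characters, or single punctuation marks.
# The actual pattern used by tkzs' re_tokenizer may differ.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_re_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_re_tokenize(
    "Contrastive Fine-tuning Improves Robustness for Neural Rankers"
)
print(tokens)
# ['Contrastive', 'Fine', '-', 'tuning', 'Improves', 'Robustness', 'for', 'Neural', 'Rankers']
```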

Use a spaCy word tokenizer

from tkzs.tokenizers import SpacyTokenizer

txt = "Contrastive Fine-tuning Improves Robustness for Neural Rankers"

# assumes the spaCy model is installed: python -m spacy download en_core_web_sm
tokenizer = SpacyTokenizer(name='en_core_web_sm')

tokenizer.tokenize(txt)

Use a word encoder

from tkzs.encoders import WordEncoder
from tkzs.tokenizers import re_tokenizer

docs = [
    "Contrastive Fine-tuning Improves Robustness for Neural Rankers",
    "Unsupervised Neural Machine Translation for Low-Resource Domains via Meta-Learning",
    "Spatial Dependency Parsing for Semi-Structured Document Information Extraction"
    ]

encoder = WordEncoder(tokenizer=re_tokenizer)

encoder.fit(docs)

encoder.batch_tokenize(docs) # returns a list of tokenized sequences

encoder.encode_batch(docs) # returns a tensor of size [batch_size, max_length]
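
To make the fit/encode flow concrete, here is a minimal sketch of a vocabulary-based encoder in plain Python. Everything in it is an assumption for illustration: the class name `TinyWordEncoder`, the `<pad>` and `<unk>` special tokens and their ids, the frequency-based id ordering, and right-padding. The internals of `WordEncoder` may differ on all of these.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


class TinyWordEncoder:
    """Illustrative vocabulary-based encoder; not the tkzs implementation."""

    def __init__(self, tokenizer=str.split):
        self.tokenizer = tokenizer
        self.vocab = {PAD: 0, UNK: 1}

    def fit(self, docs):
        # Assign ids by descending frequency, ties broken alphabetically.
        counts = Counter(tok for doc in docs for tok in self.tokenizer(doc))
        for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            self.vocab.setdefault(tok, len(self.vocab))
        return self

    def encode_batch(self, docs):
        # Map tokens to ids (unknown tokens get the UNK id),
        # then right-pad every row to the longest sequence in the batch.
        unk = self.vocab[UNK]
        rows = [[self.vocab.get(t, unk) for t in self.tokenizer(d)] for d in docs]
        max_length = max(len(r) for r in rows)
        return [r + [self.vocab[PAD]] * (max_length - len(r)) for r in rows]


docs = ["neural rankers", "neural machine translation"]
enc = TinyWordEncoder().fit(docs)
batch = enc.encode_batch(docs + ["unseen words here too"])
print(batch)
# [[2, 4, 0, 0], [2, 3, 5, 0], [1, 1, 1, 1]]
```

The result is a rectangular list of lists, so `torch.tensor(batch)` would give a tensor of shape [batch_size, max_length], matching the shape described above.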

Use a byte encoder

from tkzs.encoders import ByteEncoder
from tkzs.tokenizers import re_tokenizer

docs = [
    "Contrastive Fine-tuning Improves Robustness for Neural Rankers",
    "Unsupervised Neural Machine Translation for Low-Resource Domains via Meta-Learning",
    "Spatial Dependency Parsing for Semi-Structured Document Information Extraction"
    ]

encoder = ByteEncoder()

# returns a tensor of shape [batch, word, char]
encoder.encode_batch(docs, char_padding='center', word_length=None, tokenizer=re_tokenizer)
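
The three-dimensional output can be pictured with a small plain-Python sketch. It rests on several assumptions that this README does not confirm: that each word is encoded as its UTF-8 bytes, that byte values are shifted by 1 so that 0 can serve as padding, that `char_padding='center'` means padding on both sides of each word, and that `word_length=None` means "pad to the longest word in the batch". The helper names `center_pad` and `encode_bytes_batch` are hypothetical.

```python
def center_pad(seq, length, pad=0):
    """Pad seq to length on both sides (any odd leftover goes to the right)."""
    total = length - len(seq)
    left = total // 2
    return [pad] * left + seq + [pad] * (total - left)


def encode_bytes_batch(docs, tokenizer=str.split, pad=0):
    """Return nested lists shaped [batch, word, char] (illustrative only)."""
    # Each word becomes its UTF-8 byte values, shifted by 1 so 0 can mean padding.
    batch = [[[b + 1 for b in w.encode("utf-8")] for w in tokenizer(d)] for d in docs]
    max_words = max(len(d) for d in batch)
    max_chars = max(len(w) for d in batch for w in d)
    out = []
    for doc in batch:
        words = [center_pad(w, max_chars, pad) for w in doc]
        # Pad short documents with all-padding words on the right.
        words += [[pad] * max_chars for _ in range(max_words - len(doc))]
        out.append(words)
    return out


encoded = encode_bytes_batch(["ab cde", "f"])
print(encoded)
# [[[98, 99, 0], [100, 101, 102]], [[0, 103, 0], [0, 0, 0]]]
```

Every document ends up with the same number of words and every word with the same number of byte slots, which is what allows the result to be stacked into a single [batch, word, char] tensor.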

Download files


Source Distribution

tkzs-0.0.7.tar.gz (3.8 kB)

Uploaded Source

Built Distribution

tkzs-0.0.7-py3-none-any.whl (4.5 kB)

Uploaded Python 3

File details

Details for the file tkzs-0.0.7.tar.gz.

File metadata

  • Download URL: tkzs-0.0.7.tar.gz
  • Size: 3.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.7.9

File hashes

Hashes for tkzs-0.0.7.tar.gz
Algorithm Hash digest
SHA256 adc675ce823291dd51798b614a635c3abb74b5d1874d1b3762a4ef4066a9351a
MD5 cf37deeab66530904525b48ea01d53a7
BLAKE2b-256 82095b21c022753826f3ad606879957c828487a5bc7ba43824396144614d6979


File details

Details for the file tkzs-0.0.7-py3-none-any.whl.

File metadata

  • Download URL: tkzs-0.0.7-py3-none-any.whl
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.7.9

File hashes

Hashes for tkzs-0.0.7-py3-none-any.whl
Algorithm Hash digest
SHA256 52c9ec2c35dc4102a08789d4682499e9deddf6180707454bcec37523955990b3
MD5 e7c87253060a9f567adacd25acdb8fd7
BLAKE2b-256 02dd48fb191e9ec83d1f1a8f3e758b6d7c2e385f8f47b5a8d00442bb14b9eba1

