Utilities for tokenization in PyTorch
tkzs
Installation
pip install tkzs
Use a simple tokenizer
from tkzs.tokenizers import re_tokenizer
txt = "Contrastive Fine-tuning Improves Robustness for Neural Rankers"
re_tokenizer(txt)
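For readers curious what a regex-based tokenizer does, here is a minimal sketch. It is not the tkzs implementation: the function name `simple_re_tokenize` and the pattern are assumptions chosen for illustration, and the pattern `re_tokenizer` actually uses may split text differently.

```python
import re

# Hypothetical pattern: runs of word characters, or single punctuation marks.
# This is an assumption for illustration, not the pattern tkzs uses.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_re_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return _TOKEN_PATTERN.findall(text)


tokens = simple_re_tokenize(
    "Contrastive Fine-tuning Improves Robustness for Neural Rankers"
)
print(tokens)
# ['Contrastive', 'Fine', '-', 'tuning', 'Improves', 'Robustness', 'for', 'Neural', 'Rankers']
```

With this particular pattern the hyphen becomes its own token; a different pattern could keep "Fine-tuning" together.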
Use a spaCy word tokenizer
from tkzs.tokenizers import SpacyTokenizer
txt = "Contrastive Fine-tuning Improves Robustness for Neural Rankers"
# Assumes the spaCy model is already installed, e.g. via: python -m spacy download en_core_web_sm
tokenizer = SpacyTokenizer(name='en_core_web_sm')
tokenizer.tokenize(txt)
Use a word encoder
from tkzs.encoders import WordEncoder
from tkzs.tokenizers import re_tokenizer
docs = [
"Contrastive Fine-tuning Improves Robustness for Neural Rankers",
"Unsupervised Neural Machine Translation for Low-Resource Domains via Meta-Learning",
"Spatial Dependency Parsing for Semi-Structured Document Information Extraction"
]
encoder = WordEncoder(tokenizer=re_tokenizer)
encoder.fit(docs)
encoder.batch_tokenize(docs) # returns a list of tokenized sequences
encoder.encode_batch(docs) # returns a tensor of shape [batch_size, max_length]
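The fit-then-encode pattern above can be sketched in plain Python. This is an illustration under stated assumptions, not the `WordEncoder` source: the helper names `fit_vocab` and `encode_batch`, the reserved ids `PAD = 0` and `UNK = 1`, and right-side padding are all hypothetical choices. Nested lists stand in for the tensor so the sketch runs without PyTorch.

```python
PAD, UNK = 0, 1  # assumed reserved ids; tkzs may use different values


def fit_vocab(docs, tokenizer):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for doc in docs:
        for tok in tokenizer(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode_batch(docs, vocab, tokenizer):
    """Map tokens to ids and right-pad each row to the longest sequence."""
    seqs = [[vocab.get(tok, UNK) for tok in tokenizer(doc)] for doc in docs]
    max_len = max((len(seq) for seq in seqs), default=0)
    return [seq + [PAD] * (max_len - len(seq)) for seq in seqs]


docs = ["a b c", "a d"]
vocab = fit_vocab(docs, str.split)          # whitespace tokenizer for brevity
batch = encode_batch(docs, vocab, str.split)
print(batch)                                # [[2, 3, 4], [2, 5, 0]]

# Tokens not seen during fitting fall back to the UNK id.
print(encode_batch(["a z"], vocab, str.split))  # [[2, 1]]
```

Every row has the same length, so the result could be wrapped with `torch.tensor(batch)` to obtain a `[batch_size, max_length]` tensor.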
Use a byte encoder
from tkzs.encoders import ByteEncoder
from tkzs.tokenizers import re_tokenizer
docs = [
"Contrastive Fine-tuning Improves Robustness for Neural Rankers",
"Unsupervised Neural Machine Translation for Low-Resource Domains via Meta-Learning",
"Spatial Dependency Parsing for Semi-Structured Document Information Extraction"
]
encoder = ByteEncoder()
# returns a tensor of shape [Batch, Word, Char]
encoder.encode_batch(docs, char_padding='center', word_length=None, tokenizer=re_tokenizer)
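To make the `[Batch, Word, Char]` layout concrete, here is a plain-Python sketch of byte-level encoding. It rests on assumptions rather than the `ByteEncoder` source: the helper name `encode_bytes` is hypothetical, "center" padding is taken to mean padding split across both sides of each word, `word_length=None` is taken to mean "use the longest word in the batch", and `0` is used as the padding value.

```python
def encode_bytes(docs, tokenizer, word_length=None, pad=0):
    """Encode docs as nested lists shaped [batch, word, char] of UTF-8 bytes."""
    tokenized = [
        [list(tok.encode("utf-8")) for tok in tokenizer(doc)] for doc in docs
    ]
    max_words = max((len(words) for words in tokenized), default=0)
    if word_length is None:
        # Assumed meaning of None: fit the longest word in the batch.
        word_length = max(
            (len(w) for words in tokenized for w in words), default=0
        )
    batch = []
    for words in tokenized:
        rows = []
        for w in words:
            w = w[:word_length]                   # truncate overlong words
            left = (word_length - len(w)) // 2    # centre the bytes in the row
            right = word_length - len(w) - left
            rows.append([pad] * left + w + [pad] * right)
        # Pad short documents with all-padding word rows.
        rows += [[pad] * word_length for _ in range(max_words - len(rows))]
        batch.append(rows)
    return batch


batch = encode_bytes(["ab c", "abcd"], str.split)
print(batch)
# [[[0, 97, 98, 0], [0, 99, 0, 0]], [[97, 98, 99, 100], [0, 0, 0, 0]]]
```

When the padding cannot be split evenly, this sketch puts the extra slot on the right; the library may make a different choice.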