
Vietnamese tokenization, preprocessing, and NLP models

Project description

Genz Tokenize

Installation:

pip install genz-tokenize

Basic tokenization

    >>> from genz_tokenize import Tokenize
    # using the vocab bundled with the library
    >>> tokenize = Tokenize()
    >>> print(tokenize('sinh_viên công_nghệ', 'hello', max_len = 10, padding = True, truncation = True))
    # {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}

    >>> print(tokenize.decode([1, 770, 2]))
    # <s> sinh_viên </s>

    # or build the tokenizer from your own vocab and BPE codes files
    >>> tokenize = Tokenize.fromFile('vocab.txt','bpe.codes')
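
The returned fields are plain Python lists, so a batch can be stacked into tensors directly. A minimal sketch, assuming (as in the output above) that padding=True and truncation=True always yield max_len-length lists; the second sentence pair here is only illustrative:

    >>> import tensorflow as tf
    >>> from genz_tokenize import Tokenize
    >>> tokenize = Tokenize()
    # encode each sentence pair, then stack ids and masks into a batch
    >>> pairs = [('sinh_viên công_nghệ', 'hello'), ('xin chào', 'hello')]
    >>> encoded = [tokenize(a, b, max_len=10, padding=True, truncation=True) for a, b in pairs]
    >>> input_ids = tf.constant([e['input_ids'] for e in encoded])        # shape (2, 10)
    >>> attention_mask = tf.constant([e['attention_mask'] for e in encoded])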

BERT-style tokenizer, inheriting from transformers' PreTrainedTokenizer

    >>> from genz_tokenize import TokenizeForBert
    # using the vocab bundled with the library
    >>> tokenize = TokenizeForBert()
    >>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length',truncation=True))
    # {'input_ids': [[1, 770, 1444, 2, 0], [1, 30469, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}

    # or build the tokenizer from your own vocab and BPE codes files
    >>> tokenize = TokenizeForBert.fromFile('vocab.txt','bpe.codes')
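
Since TokenizeForBert subclasses transformers' PreTrainedTokenizer, the usual keyword arguments from that API should also apply; for example, return_tensors='tf' (a transformers argument, not something documented here) would give ready-to-use tensors:

    >>> from genz_tokenize import TokenizeForBert
    >>> tokenize = TokenizeForBert()
    # return_tensors is inherited from the transformers PreTrainedTokenizer API
    >>> batch = tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length', truncation=True, return_tensors='tf')
    >>> batch['input_ids'].shape
    # TensorShape([2, 5])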

Embedding matrix from fastText

    >>> from genz_tokenize import get_embedding_matrix
    >>> embedding_matrix = get_embedding_matrix()
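
A typical use of this matrix is to initialise a frozen tf.keras Embedding layer. A minimal sketch, assuming embedding_matrix is a 2-D array of shape (vocab_size, embedding_dim):

    >>> import tensorflow as tf
    >>> from genz_tokenize import get_embedding_matrix
    >>> embedding_matrix = get_embedding_matrix()
    # freeze the pretrained fastText vectors inside an Embedding layer
    >>> vocab_size, dim = embedding_matrix.shape
    >>> embedding = tf.keras.layers.Embedding(
    ...     vocab_size, dim,
    ...     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    ...     trainable=False)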

Models

1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer

Trainer

    >>> import tensorflow as tf
    >>> from genz_tokenize.models.utils import Config
    >>> from genz_tokenize.models import Seq2Seq
    >>> from genz_tokenize.models.training import TrainArgument, Trainer
    # create the config and set the hyperparameters
    >>> config = Config()
    >>> config.vocab_size = 100
    >>> config.target_vocab_size = 120
    >>> config.units = 16
    >>> config.maxlen = 20
    # initialize the model
    >>> model = Seq2Seq(config)
    >>> x = tf.zeros(shape=(10, config.maxlen))
    >>> y = tf.zeros(shape=(10, config.maxlen))
    # create dataset
    >>> BUFFER_SIZE = len(x)
    >>> dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
    >>> dataset_train = dataset_train.shuffle(BUFFER_SIZE)
    >>> dataset_train = dataset_train.batch(2)
    >>> dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)

    >>> args = TrainArgument(batch_size=2, epochs=2)
    >>> trainer = Trainer(model=model, args=args, data_train=dataset_train)
    >>> trainer.train()
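
The zero tensors above are only placeholders; with real data the ids would come from the tokenizer. A rough sketch of that substitution, with two assumptions of mine rather than documented behaviour: that Tokenize also accepts a single sentence per call, and that config.vocab_size / config.target_vocab_size are set to the real vocabulary sizes instead of the toy values above:

    >>> from genz_tokenize import Tokenize
    >>> tokenize = Tokenize()
    # encode source and target sentences to fixed-length id lists (single-sentence call is assumed)
    >>> src = [tokenize(s, max_len=config.maxlen, padding=True, truncation=True)['input_ids'] for s in ['sinh_viên công_nghệ', 'xin chào']]
    >>> tgt = [tokenize(s, max_len=config.maxlen, padding=True, truncation=True)['input_ids'] for s in ['hello', 'hi']]
    >>> dataset_train = tf.data.Dataset.from_tensor_slices((tf.constant(src), tf.constant(tgt))).batch(2)
    >>> trainer = Trainer(model=model, args=args, data_train=dataset_train)
    >>> trainer.train()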

Create your vocab



Download files


Source Distribution

genz-tokenize-1.1.5.tar.gz (62.4 MB)

Built Distribution

genz_tokenize-1.1.5-py3-none-any.whl (63.9 MB)

File details

Details for the file genz-tokenize-1.1.5.tar.gz.

File metadata

  • Download URL: genz-tokenize-1.1.5.tar.gz
  • Upload date:
  • Size: 62.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10

File hashes

Hashes for genz-tokenize-1.1.5.tar.gz

  • SHA256: 57298087ba445e945b401c89aa1150ed8d30b32e836fc94c2e89e0b056daffe8
  • MD5: 2a678a117f0110a32295213279e078ac
  • BLAKE2b-256: 97e85b84c4c6d6ee7e7777804270dca1c3bb0629a89f74e2e8e328182db814d2


File details

Details for the file genz_tokenize-1.1.5-py3-none-any.whl.

File metadata

  • Download URL: genz_tokenize-1.1.5-py3-none-any.whl
  • Upload date:
  • Size: 63.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10

File hashes

Hashes for genz_tokenize-1.1.5-py3-none-any.whl

  • SHA256: 9e02451f1bf5f4a51234d5b501be6a7aa9a287f44523e07234af444f2c43e2e2
  • MD5: bb6c84144b5166e6ea7ac7355f54f0c5
  • BLAKE2b-256: 40bdc07f569170da695afae916e52b629eac89451f37298e0f9b101f0b6b5a06

