
Vietnamese tokenization, preprocessing, and NLP models

Project description

Genz Tokenize

Installation:

pip install genz-tokenize
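
A quick way to verify the install (assuming a standard Python environment) is to import the tokenizer from the command line:

    python -c "from genz_tokenize import Tokenize; print(Tokenize)"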

Basic tokenization

    >>> from genz_tokenize import Tokenize
    # using the vocab bundled with the library
    >>> tokenize = Tokenize()
    >>> print(tokenize('sinh_viên công_nghệ', 'hello', max_len = 10, padding = True, truncation = True))
    # {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}

    >>> print(tokenize.decode([1, 770, 2]))
    # <s> sinh_viên </s>

    # load your own vocab files
    >>> tokenize = Tokenize.fromFile('vocab.txt', 'bpe.codes')

BERT tokenizer (inherits from transformers' PreTrainedTokenizer)

    >>> from genz_tokenize import TokenizeForBert
    # using the vocab bundled with the library
    >>> tokenize = TokenizeForBert()
    >>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length', truncation=True))
    # {'input_ids': [[1, 770, 1444, 2, 0], [1, 30469, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}

    # load your own vocab files
    >>> tokenize = TokenizeForBert.fromFile('vocab.txt', 'bpe.codes')
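
Since TokenizeForBert inherits from transformers' PreTrainedTokenizer, the standard return_tensors argument should be available as well; a minimal sketch, assuming TensorFlow is installed:

    >>> batch = tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5,
    ...                  padding='max_length', truncation=True, return_tensors='tf')
    # batch['input_ids'] is now a tf.Tensor of shape (2, 5)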

Embedding matrix from FastText

    >>> from genz_tokenize import get_embedding_matrix
    >>> embedding_matrix = get_embedding_matrix()
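
The matrix can then seed a Keras Embedding layer. A minimal sketch, assuming embedding_matrix is a NumPy array of shape (vocab_size, embedding_dim):

    >>> import tensorflow as tf
    >>> vocab_size, embedding_dim = embedding_matrix.shape
    >>> embedding_layer = tf.keras.layers.Embedding(
    ...     vocab_size, embedding_dim,
    ...     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    ...     trainable=False)  # keep the pretrained FastText vectors frozen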

Models

1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer

Trainer

    >>> import tensorflow as tf
    >>> from genz_tokenize.utils import Config
    >>> from genz_tokenize.models import Seq2Seq
    >>> from genz_tokenize.training import TrainArgument, Trainer
    # create a config holding the model hyperparameters
    >>> config = Config()
    >>> config.vocab_size = 100
    >>> config.target_vocab_size = 120
    >>> config.units = 16
    >>> config.maxlen = 20
    # initialize the model
    >>> model = Seq2Seq(config)
    >>> x = tf.zeros(shape=(10, config.maxlen))
    >>> y = tf.zeros(shape=(10, config.maxlen))
    # create dataset
    >>> BUFFER_SIZE = len(x)
    >>> dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
    >>> dataset_train = dataset_train.shuffle(BUFFER_SIZE)
    >>> dataset_train = dataset_train.batch(2)
    >>> dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)

    >>> args = TrainArgument(batch_size=2, epochs=2)
    >>> trainer = Trainer(model=model, args=args, data_train=dataset_train)
    >>> trainer.train()
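
Assuming Seq2Seq subclasses tf.keras.Model (not confirmed here), the trained weights could then be persisted with the standard Keras API:

    >>> model.save_weights('seq2seq_ckpt')   # assumption: standard Keras checkpointing
    # a fresh Seq2Seq(config) could later restore them with load_weights('seq2seq_ckpt')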

Create your vocab



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genz-tokenize-1.1.7.tar.gz (62.5 MB)

Uploaded Source

Built Distribution

genz_tokenize-1.1.7-py3-none-any.whl (63.9 MB)

Uploaded Python 3

File details

Details for the file genz-tokenize-1.1.7.tar.gz.

File metadata

  • Download URL: genz-tokenize-1.1.7.tar.gz
  • Upload date:
  • Size: 62.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.22.0 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/18.0.1 rfc3986/2.0.0 colorama/0.4.3 CPython/3.8.10

File hashes

Hashes for genz-tokenize-1.1.7.tar.gz
  • SHA256: c782646b37101a0285faedbcdbb8b0d9bbc6703f66bfea4be38e5dab44126501
  • MD5: 61706bd4ad957b55466d4e572cb56bf9
  • BLAKE2b-256: b8f5443c1dec4168a1f1b52182a2b7be7c67719c20eab82c1a163eb27738f851

See more details on using hashes here.

File details

Details for the file genz_tokenize-1.1.7-py3-none-any.whl.

File metadata

  • Download URL: genz_tokenize-1.1.7-py3-none-any.whl
  • Upload date:
  • Size: 63.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.22.0 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/18.0.1 rfc3986/2.0.0 colorama/0.4.3 CPython/3.8.10

File hashes

Hashes for genz_tokenize-1.1.7-py3-none-any.whl
  • SHA256: e7548b9f155c79b8a01c6689a6a8dd8d5bbc0c40ce56cae62ac27b9fd0571aa8
  • MD5: e849542b4d09075d13a72df405be40f0
  • BLAKE2b-256: 9da8a87238a84b8d6d7e7cece0f61e75af3db650c8bdd0b1f7bd0d2ffe5e4172

See more details on using hashes here.
