
Vietnamese tokenization, preprocessing, and NLP models


Genz Tokenize

Installation:

pip install genz-tokenize

Using the tokenizer

    >>> from genz_tokenize import Tokenize
    # using the vocabulary bundled with the library
    >>> tokenize = Tokenize()
    >>> print(tokenize('sinh_viên công_nghệ', 'hello', max_len = 10, padding = True, truncation = True))
    # {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}

    >>> print(tokenize.decode([1, 770, 2]))
    # <s> sinh_viên </s>

    # or load your own vocabulary
    >>> tokenize = Tokenize.fromFile('vocab.txt', 'bpe.codes')

Using the BERT tokenizer, which inherits from Transformers' PreTrainedTokenizer

    >>> from genz_tokenize import TokenizeForBert
    # Using the vocabulary bundled with the library
    >>> tokenize = TokenizeForBert()
    >>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length', truncation=True))
    # {'input_ids': [[1, 770, 1444, 2, 0], [1, 30469, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}

    # Using your own vocabulary
    >>> tokenize = TokenizeForBert.fromFile('vocab.txt', 'bpe.codes')
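
Since TokenizeForBert inherits from Transformers' PreTrainedTokenizer, the standard keyword arguments of that API should also be available. As a sketch assuming stock PreTrainedTokenizer behaviour (not shown in the original docs), return_tensors='tf' returns TensorFlow tensors directly:

    >>> batch = tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5,
    ...                   padding='max_length', truncation=True, return_tensors='tf')
    >>> batch['input_ids'].shape
    # TensorShape([2, 5])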

Embedding matrix from FastText

    >>> from genz_tokenize import get_embedding_matrix
    >>> embedding_matrix = get_embedding_matrix()
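
The matrix can seed a frozen Keras embedding layer. A minimal sketch, assuming the matrix is a NumPy array of shape (vocab_size, embedding_dim) aligned with the tokenizer's ids:

    >>> import tensorflow as tf
    # freeze the pretrained FastText vectors inside an Embedding layer
    >>> embedding_layer = tf.keras.layers.Embedding(
    ...     input_dim=embedding_matrix.shape[0],
    ...     output_dim=embedding_matrix.shape[1],
    ...     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    ...     trainable=False)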

Preprocessing data

    >>> from genz_tokenize.preprocess import remove_punctuations, convert_unicode, remove_emoji, vncore_tokenize
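
A hypothetical cleaning pipeline, assuming each helper takes and returns a string (the exact signatures are not documented here):

    >>> text = convert_unicode('Sinh viên công nghệ!!! 😀')  # normalize unicode composition
    >>> text = remove_emoji(text)                            # strip emoji
    >>> text = remove_punctuations(text)                     # strip punctuation
    >>> text = vncore_tokenize(text)                         # word segmentation, e.g. 'sinh_viên công_nghệ'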

Models

1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer
4. BERT

Trainer

    >>> import tensorflow as tf
    >>> from genz_tokenize.base_model.utils import Config
    >>> from genz_tokenize.base_model.models import Seq2Seq, Transformer, TransformerClassification
    >>> from genz_tokenize.base_model.training import TrainArgument, Trainer
    # create the hyperparameter config
    >>> config = Config()
    >>> config.vocab_size = 100
    >>> config.target_vocab_size = 120
    >>> config.units = 16
    >>> config.maxlen = 20
    # initialize the model
    >>> model = Seq2Seq(config)
    >>> x = tf.zeros(shape=(10, config.maxlen))
    >>> y = tf.zeros(shape=(10, config.maxlen))
    # create dataset
    >>> BUFFER_SIZE = len(x)
    >>> dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
    >>> dataset_train = dataset_train.shuffle(BUFFER_SIZE)
    >>> dataset_train = dataset_train.batch(2)
    >>> dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)

    >>> args = TrainArgument(batch_size=2, epochs=2)
    >>> trainer = Trainer(model=model, args=args, data_train=dataset_train)
    >>> trainer.train()
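
The docs do not show how to persist a trained model; assuming the models are tf.keras.Model subclasses (they are built on TensorFlow), weights can presumably be saved and restored the usual way:

    # assumption: model follows the standard tf.keras save/load API
    >>> model.save_weights('seq2seq_ckpt')
    >>> model.load_weights('seq2seq_ckpt')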
Trainer for BERT models

    >>> import tensorflow as tf
    >>> from genz_tokenize.models.bert import DataCollection
    >>> from genz_tokenize.models.bert.training import TrainArg, Trainner
    >>> from genz_tokenize.models.bert.roberta import RoBertaClassification, RobertaConfig

    >>> x = tf.zeros(shape=(10, 10), dtype=tf.int32)
    >>> mask = tf.zeros(shape=(10, 10), dtype=tf.int32)
    >>> y = tf.zeros(shape=(10, 2), dtype=tf.int32)

    >>> dataset = DataCollection(
    ...     input_ids=x,
    ...     attention_mask=mask,
    ...     token_type_ids=None,
    ...     dec_input_ids=None,
    ...     dec_attention_mask=None,
    ...     dec_token_type_ids=None,
    ...     y=y)
    >>> tf_dataset = dataset.to_tf_dataset(batch_size=2)

    >>> config = RobertaConfig()
    >>> config.num_class = 2
    >>> model = RoBertaClassification(config)
    >>> arg = TrainArg(epochs=2, batch_size=2, learning_rate=1e-2)
    >>> trainer = Trainner(model, arg, tf_dataset)
    >>> trainer.train()
