Vietnamese tokenization, preprocessing, and NLP models

Genz Tokenize

Using the tokenizer

    from genz_tokenize import Tokenize
    # use the vocabulary bundled with the library
    tokenize = Tokenize()
    print(tokenize('sinh_viên công_nghệ', 'hello', max_len=10, padding=True, truncation=True))
    # {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}

    print(tokenize.decode([1, 770, 2]))
    # <s> sinh_viên </s>

    # or load your own vocabulary and BPE codes
    tokenize = Tokenize.fromFile('vocab.txt', 'bpe.codes')
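The encoder returns plain Python lists. Below is a minimal sketch of batching them into TensorFlow tensors for the models further down, assuming only the keys shown in the example output above:

    from genz_tokenize import Tokenize
    import tensorflow as tf

    tokenize = Tokenize()
    enc = tokenize('sinh_viên công_nghệ', 'hello', max_len=10, padding=True, truncation=True)
    # wrap each list in an outer list to add a batch dimension of 1
    input_ids = tf.constant([enc['input_ids']])            # shape (1, 10)
    attention_mask = tf.constant([enc['attention_mask']])  # shape (1, 10)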

Embedding matrix from fastText

    from genz_tokenize import get_embedding_matrix
    embedding_matrix = get_embedding_matrix()
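A sketch of plugging the matrix into a frozen Keras Embedding layer, assuming get_embedding_matrix() returns a 2-D NumPy array of shape (vocab_size, embedding_dim):

    import tensorflow as tf

    vocab_size, embedding_dim = embedding_matrix.shape
    # embedding layer initialized with the fastText vectors and kept frozen
    embedding_layer = tf.keras.layers.Embedding(
        vocab_size,
        embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,
    )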

Preprocessing data

    from genz_tokenize.preprocess import remove_punctuations, convert_unicode, remove_emoji, vncore_tokenize
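A hedged usage sketch, assuming each helper takes a raw string and returns a cleaned string, and that vncore_tokenize performs word segmentation (it may require a VnCoreNLP backend):

    text = 'Sinh viên công nghệ 🎓 !!!'
    text = convert_unicode(text)      # normalize Unicode composition
    text = remove_emoji(text)         # strip emoji characters
    text = remove_punctuations(text)  # drop punctuation
    text = vncore_tokenize(text)      # word segmentation, e.g. 'sinh_viên công_nghệ'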

Models

1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer
4. BERT

Trainer

    import tensorflow as tf

    from genz_tokenize.base_model.utils import Config
    from genz_tokenize.base_model.models import Seq2Seq, Transformer, TransformerClassification
    from genz_tokenize.base_model.training import TrainArgument, Trainer

    # create the config holding the model hyperparameters
    config = Config()
    config.vocab_size = 100
    config.target_vocab_size = 120
    config.units = 16
    config.maxlen = 20
    # initialize the model
    model = Seq2Seq(config)
    x = tf.zeros(shape=(10, config.maxlen))
    y = tf.zeros(shape=(10, config.maxlen))
    # create dataset
    BUFFER_SIZE = len(x)
    dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
    dataset_train = dataset_train.shuffle(BUFFER_SIZE)
    dataset_train = dataset_train.batch(2)
    dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)

    args = TrainArgument(batch_size=2, epochs=2)
    trainer = Trainer(model=model, args=args, data_train=dataset_train)
    trainer.train()
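
    # ---- Training a RoBERTa classification model with the BERT trainer ----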
    from genz_tokenize.models.bert import DataCollection
    from genz_tokenize.models.bert.training import TrainArg, Trainner
    from genz_tokenize.models.bert.roberta import RoBertaClassification, RobertaConfig
    import tensorflow as tf

    x = tf.zeros(shape=(10, 10), dtype=tf.int32)
    mask = tf.zeros(shape=(10, 10), dtype=tf.int32)
    y = tf.zeros(shape=(10, 2), dtype=tf.int32)

    dataset = DataCollection(
                    input_ids=x,
                    attention_mask=mask,
                    token_type_ids=None,
                    dec_input_ids=None,
                    dec_attention_mask=None,
                    dec_token_type_ids=None,
                    y=y
                )
    tf_dataset = dataset.to_tf_dataset(batch_size=2)

    config = RobertaConfig()
    config.num_class = 2
    model = RoBertaClassification(config)
    arg = TrainArg(epochs=2, batch_size=2, learning_rate=1e-2)
    trainer = Trainner(model, arg, tf_dataset)
    trainer.train()

