Vietnamese tokenization, preprocessing, and NLP models

Project description

Genz Tokenize


Using the tokenizer

    from genz_tokenize import Tokenize
    # using vocab from lib
    tokenize = Tokenize()
    print(tokenize('sinh_viên công_nghệ', 'hello', max_len = 10, padding = True, truncation = True))
    # {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}

    print(tokenize.decode([1, 770, 2]))
    # <s> sinh_viên </s>

    # or load your own vocabulary and BPE codes from files
    tokenize = Tokenize.fromFile('vocab.txt', 'bpe.codes')

Preprocessing data

    from genz_tokenize.preprocess import remove_punctuations, convert_unicode, remove_emoji, vncore_tokenize
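
The helpers above can be chained into a small cleaning pipeline. A minimal sketch, assuming each function takes a raw string and returns the processed string, and that vncore_tokenize performs VnCoreNLP-style word segmentation (which may require the VnCoreNLP backend to be available):

    from genz_tokenize.preprocess import remove_punctuations, convert_unicode, remove_emoji, vncore_tokenize

    raw = 'Sinh viên công nghệ thông tin!!! 😍'
    text = convert_unicode(raw)        # normalize the Unicode composition of Vietnamese diacritics
    text = remove_emoji(text)          # strip emoji characters
    text = remove_punctuations(text)   # drop punctuation marks
    text = vncore_tokenize(text)       # word segmentation, e.g. 'Sinh_viên công_nghệ_thông_tin'
    print(text)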

Models

1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer
4. BERT

Trainer

    from genz_tokenize.base_model.utils import Config
    from genz_tokenize.base_model.models import Seq2Seq, Transformer, TransformerClassification
    from genz_tokenize.base_model.training import TrainArgument, Trainer
    import tensorflow as tf
    # create a config with the model hyperparameters
    config = Config()
    config.vocab_size = 100
    config.target_vocab_size = 120
    config.units = 16
    config.maxlen = 20
    # initialize the model
    model = Seq2Seq(config)
    x = tf.zeros(shape=(10, config.maxlen))
    y = tf.zeros(shape=(10, config.maxlen))
    # create dataset
    BUFFER_SIZE = len(x)
    dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
    dataset_train = dataset_train.shuffle(BUFFER_SIZE)
    dataset_train = dataset_train.batch(2)
    dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)

    args = TrainArgument(batch_size=2, epochs=2)
    trainer = Trainer(model=model, args=args, data_train=dataset_train)
    trainer.train()

    # BERT (RoBERTa) models come with their own data collator and trainer
    from genz_tokenize.models.bert import DataCollection
    from genz_tokenize.models.bert.training import TrainArg, Trainner
    from genz_tokenize.models.bert.roberta import RoBertaClassification, RobertaConfig
    import tensorflow as tf

    x = tf.zeros(shape=(10, 10), dtype=tf.int32)
    mask = tf.zeros(shape=(10, 10), dtype=tf.int32)
    y = tf.zeros(shape=(10, 2), dtype=tf.int32)

    dataset = DataCollection(
                    input_ids=x,
                    attention_mask=mask,
                    token_type_ids=None,
                    dec_input_ids=None,
                    dec_attention_mask=None,
                    dec_token_type_ids=None,
                    y=y
                )
    tf_dataset = dataset.to_tf_dataset(batch_size=2)

    config = RobertaConfig()
    config.num_class = 2
    model = RoBertaClassification(config)
    arg = TrainArg(epochs=2, batch_size=2, learning_rate=1e-2)
    trainer = Trainner(model, arg, tf_dataset)
    trainer.train()
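
The tokenizer from the first section can feed this BERT data pipeline directly. A sketch that reuses only the calls shown above, under the assumption that the tokenizer's input_ids and attention_mask lists can be wrapped as int32 tensors for DataCollection:

    from genz_tokenize import Tokenize
    from genz_tokenize.models.bert import DataCollection
    import tensorflow as tf

    tokenize = Tokenize()
    enc = tokenize('sinh_viên công_nghệ', 'hello', max_len=10, padding=True, truncation=True)

    input_ids = tf.constant([enc['input_ids']], dtype=tf.int32)            # shape (1, 10)
    attention_mask = tf.constant([enc['attention_mask']], dtype=tf.int32)  # shape (1, 10)
    y = tf.constant([[0, 1]], dtype=tf.int32)                              # one-hot label for 2 classes

    dataset = DataCollection(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=None,
                    dec_input_ids=None,
                    dec_attention_mask=None,
                    dec_token_type_ids=None,
                    y=y
                ).to_tf_dataset(batch_size=1)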

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genz-tokenize-1.2.7.tar.gz (544.8 kB)

Uploaded Source

Built Distribution

genz_tokenize-1.2.7-py3-none-any.whl (552.5 kB)

Uploaded Python 3

File details

Details for the file genz-tokenize-1.2.7.tar.gz.

File metadata

  • Download URL: genz-tokenize-1.2.7.tar.gz
  • Upload date:
  • Size: 544.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for genz-tokenize-1.2.7.tar.gz
Algorithm Hash digest
SHA256 26c79eda6a71b382b7ceae23f226a0c5a7bdde4f1aed6e913e8da534a42dbdd0
MD5 293e211019372e2d0c1dc92bdd19ab74
BLAKE2b-256 9e761ee0f4c7eafcda8327c448c92f661de6131f97104c67dfe85c6616a6668b

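These digests can be used to verify a download before installing. A minimal check with Python's standard hashlib, assuming the sdist has been saved to the current directory under its published name:

    import hashlib

    expected = '26c79eda6a71b382b7ceae23f226a0c5a7bdde4f1aed6e913e8da534a42dbdd0'
    with open('genz-tokenize-1.2.7.tar.gz', 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    assert digest == expected, 'SHA256 mismatch: the file may be corrupted or tampered with'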

File details

Details for the file genz_tokenize-1.2.7-py3-none-any.whl.

File metadata

File hashes

Hashes for genz_tokenize-1.2.7-py3-none-any.whl
Algorithm Hash digest
SHA256 f1e97a1b87ae05e2957205f3cab45d1737cdcbc7bf9ad2f98228c06cab3c5135
MD5 6908bc8ddbd4c44f1378b06c3e177025
BLAKE2b-256 7eba1d4fb3044e49fb385eba5a1263e960d2231f67afd50133db4d8e7f0d78d8

