
Vietnamese tokenization, preprocessing and NLP models


Genz Tokenize

Installation:

pip install genz-tokenize

Basic tokenization

    >>> from genz_tokenize import Tokenize
    # using vocab from lib
    >>> tokenize = Tokenize()
    >>> print(tokenize('sinh_viên công_nghệ', 'hello', max_len = 10, padding = True, truncation = True))
    # {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}

    >>> print(tokenize.decode([1, 770, 2]))
    # <s> sinh_viên </s>

    # from your vocab
    >>> tokenize = Tokenize.fromFile('vocab.txt','bpe.codes')
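
The fields returned above are plain Python lists (as the printed output suggests), so they can be converted to tensors before being fed to a model. A minimal sketch, assuming a TensorFlow workflow like the one used in the Trainer section below:

```python
import tensorflow as tf
from genz_tokenize import Tokenize

tokenize = Tokenize()
encoded = tokenize('sinh_viên công_nghệ', 'hello',
                   max_len=10, padding=True, truncation=True)

# Wrap each field in a batch dimension and convert to tensors
input_ids = tf.constant([encoded['input_ids']])            # shape (1, 10)
attention_mask = tf.constant([encoded['attention_mask']])  # shape (1, 10)
```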

BERT tokenizer, inheriting from transformers' PreTrainedTokenizer

    >>> from genz_tokenize import TokenizeForBert
    # Using vocab from lib
    >>> tokenize = TokenizeForBert()
    >>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length',truncation=True))
    # {'input_ids': [[1, 770, 1444, 2, 0], [1, 30469, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}

    # Using your vocab
    >>> tokenize = TokenizeForBert.fromFile('vocab.txt','bpe.codes')
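
Because TokenizeForBert inherits from transformers' PreTrainedTokenizer, the usual keyword arguments of that API should also be available. A short sketch, assuming the standard return_tensors behaviour is supported by this subclass:

```python
from genz_tokenize import TokenizeForBert

tokenize = TokenizeForBert()

# return_tensors is part of the PreTrainedTokenizer interface
# ('np' -> NumPy arrays, 'tf' -> TensorFlow, 'pt' -> PyTorch)
batch = tokenize(['sinh_viên công_nghệ', 'hello'],
                 max_length=5, padding='max_length', truncation=True,
                 return_tensors='np')
print(batch['input_ids'].shape)  # (2, 5)
```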

Embedding matrix from fastText

    >>> from genz_tokenize import get_embedding_matrix
    >>> embedding_matrix = get_embedding_matrix()
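
The matrix can be plugged into a Keras embedding layer; a sketch assuming get_embedding_matrix() returns a 2-D array of shape (vocab_size, embedding_dim) aligned with the library's vocabulary:

```python
import tensorflow as tf
from genz_tokenize import get_embedding_matrix

embedding_matrix = get_embedding_matrix()

embedding_layer = tf.keras.layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the pretrained fastText vectors frozen
)
```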

Models

1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer

Trainer

```python
>>> import tensorflow as tf
>>> from genz_tokenize.models.utils import Config
>>> from genz_tokenize.models import Seq2Seq
>>> from genz_tokenize.models.training import TrainArgument, Trainer
# create the config of hyperparameters
>>> config = Config()
>>> config.vocab_size = 100
>>> config.target_vocab_size = 120
>>> config.units = 16
>>> config.maxlen = 20
# initialize the model
>>> model = Seq2Seq(config)
>>> x = tf.zeros(shape=(10, config.maxlen))
>>> y = tf.zeros(shape=(10, config.maxlen))
# create dataset
>>> BUFFER_SIZE = len(x)
>>> dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
>>> dataset_train = dataset_train.shuffle(BUFFER_SIZE)
>>> dataset_train = dataset_train.batch(2)
>>> dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)

>>> args = TrainArgument(batch_size=2, epochs=2)
>>> trainer = Trainer(model=model, args=args, data_train=dataset_train)
>>> trainer.train()
```
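
After training, the weights can be saved and restored; a sketch assuming Seq2Seq is a tf.keras.Model subclass (suggested by the TensorFlow tensors and tf.data pipeline above):

```python
# Persist the trained weights (Keras checkpoint format)
model.save_weights('seq2seq_ckpt')

# Rebuild the model with the same config and restore the weights
restored = Seq2Seq(config)
restored.load_weights('seq2seq_ckpt')
```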

Create your vocab
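
fromFile expects a vocab.txt and a bpe.codes file. Below is a minimal sketch of one way to build vocab.txt from your own corpus, assuming it is a whitespace-token frequency list; the bpe.codes merge rules are typically learned separately with a BPE tool such as subword-nmt or fastBPE. Both file formats are assumptions here, so check the project repository for the exact layout.

```python
from collections import Counter

# corpus.txt: one word-segmented Vietnamese sentence per line (hypothetical path)
counter = Counter()
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        counter.update(line.split())

# Assumed vocab.txt format: "<token> <count>" per line, most frequent first
with open('vocab.txt', 'w', encoding='utf-8') as f:
    for token, count in counter.most_common():
        f.write(f'{token} {count}\n')
```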
