OpenNMT Tokenizer as TensorFlow Operations
DISCLAIMER: This package is not published by the OpenNMT authors.
Full credit for OpenNMT Tokenizer and OpenNMT-tf goes to their respective authors.
This project aims to wrap OpenNMT Tokenizer into TensorFlow Ops.
It is primarily intended as an addition to the OpenNMT-tf framework, removing the need to apply tokenization and/or detokenization outside of a serving environment (e.g. TensorFlow Serving).
Compatibility
- TensorFlow 2.1, 2.2
- OpenNMT-tf >= 2.6.0, for usage in conjunction with OpenNMT-tf
Installation
Prerequisites:
- A Linux environment (manylinux2014 eligible)
- Python 3.5, 3.6, 3.7 or 3.8
Install the package with pip:
pip install tensorflow-onmttok-ops
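Before running pip, it can help to confirm that the interpreter matches the prerequisites. The snippet below is a minimal sketch of such a check; the helper name `is_supported` is hypothetical and not part of this package, and the check does not verify manylinux2014 eligibility (it only looks at the operating system and the Python version).

```python
import platform
import sys

# Interpreter versions listed in the prerequisites above.
SUPPORTED_PYTHONS = {(3, 5), (3, 6), (3, 7), (3, 8)}


def is_supported(system, python_version):
    """Return True if the OS name and (major, minor) version match the prerequisites.

    `system` is a value such as the one returned by platform.system();
    `python_version` is a tuple whose first two items are major and minor.
    This is a hypothetical helper, not an API of tensorflow-onmttok-ops.
    """
    return system == "Linux" and tuple(python_version[:2]) in SUPPORTED_PYTHONS


# Check the interpreter that is running this script.
current_ok = is_supported(platform.system(), sys.version_info[:2])
print("prerequisites met" if current_ok else "prerequisites not met")
```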
Usage
Available Tokenizer options
The majority of the OpenNMT Tokenizer options are available.
However, providing BPE or SentencePiece models is not supported and, by extension, setting the tokenizer mode to none is not supported.
You therefore cannot use the following options:
- bpe_model_path
- sp_model_path
- sp_nbest_size
- sp_alpha
- vocabulary_path
- vocabulary_threshold
Note: Tokenizer options are defined at graph construction time and are constants.
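Because the options are fixed when the graph is built, it may be convenient to reject an unsupported configuration before constructing any op. The following is a rough sketch of such a guard; `check_tokenizer_options` is a hypothetical helper written for illustration, not a function provided by this package, and it only encodes the restrictions listed above.

```python
# Options listed above as unavailable in the TensorFlow ops.
UNSUPPORTED_OPTIONS = frozenset({
    "bpe_model_path",
    "sp_model_path",
    "sp_nbest_size",
    "sp_alpha",
    "vocabulary_path",
    "vocabulary_threshold",
})


def check_tokenizer_options(options):
    """Validate a dict of tokenizer options against the restrictions above.

    Raises ValueError on an unsupported option or on mode "none";
    returns the dict unchanged otherwise. Hypothetical helper.
    """
    rejected = sorted(UNSUPPORTED_OPTIONS.intersection(options))
    if rejected:
        raise ValueError("unsupported tokenizer options: " + ", ".join(rejected))
    if options.get("mode") == "none":
        raise ValueError("tokenizer mode 'none' is not supported")
    return options


# A configuration that passes the check.
check_tokenizer_options({"mode": "conservative", "case_feature": True})
```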
Tokenization
import tensorflow_onmttok as tf_onmttok
tokens = tf_onmttok.tokenize(["Hello, how are you?"], mode="conservative")
Detokenization
import tensorflow_onmttok as tf_onmttok
text = tf_onmttok.detokenize(["How", "are", "you", "?"], mode="space")
With OpenNMT-tf
Using the ops with OpenNMT-tf is straightforward.
This package comes with a built-in tokenizer that makes use of the ops.
- Before training your model, register the tokenizer as follows:
  from tensorflow_onmttok import register_opennmt_in_graph_tokenizer
  register_opennmt_in_graph_tokenizer()
  See the complete example
- Now that the tokenizer is registered, you can use the OpenNMTInGraphTokenizer class instead of OpenNMTTokenizer in your tokenization configuration files, e.g.:
  type: OpenNMTInGraphTokenizer
  params:
    mode: conservative
    case_feature: true
- That's it! You can now train your model as usual. Your ExportedModel will now expect a text input instead of tokens and length.
Note: Tokenization resources will not be exported to the assets.extra directory.
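Since the exported model takes a text input, a client only has to send raw sentences. The sketch below builds a request for the TensorFlow Serving REST predict endpoint in its columnar ("inputs") format. Several things here are assumptions for illustration: the model name `ende`, the host and port, the helper names, and the idea that the `text` input accepts a batch of raw strings under the default serving signature. Check the signature of your own export before relying on it.

```python
import json


def build_predict_request(sentences):
    """Return a JSON body for a TF Serving REST predict call (columnar format).

    Assumes the exported model exposes a single `text` input that accepts
    a batch of raw, untokenized strings.
    """
    return json.dumps({"inputs": {"text": list(sentences)}})


def predict_url(host, port, model_name):
    """Return the REST predict URL for a model served by TF Serving."""
    return "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)


# Hypothetical deployment: host, port and model name are placeholders.
url = predict_url("localhost", 8501, "ende")
body = build_predict_request(["Hello, how are you?"])
print(url)
print(body)
```

The body can then be sent with any HTTP client as a POST request to the printed URL.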
Build TF Serving with these Ops
This guide shows how to build TensorFlow Serving with these ops.
Prerequisites
- You have already cloned the TF Serving >= 2.1.0 repository, and have all the tools installed for building it
- You have installed CMake 3.1.0 or newer
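The CMake requirement can be checked from the text printed by `cmake --version`, whose first line has the form `cmake version X.Y.Z`. The snippet below is a small sketch of that comparison; the function names are hypothetical, and it only parses a string, so obtaining the output (for example with `subprocess`) is left to the caller.

```python
import re

# Minimum CMake version stated in the prerequisites above.
MINIMUM_CMAKE = (3, 1, 0)


def parse_cmake_version(output):
    """Extract (major, minor, patch) from the text printed by `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("could not find a CMake version in: {!r}".format(output))
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return (int(major), int(minor), int(patch or 0))


def cmake_is_recent_enough(output):
    """Return True if the reported version is at least MINIMUM_CMAKE."""
    return parse_cmake_version(output) >= MINIMUM_CMAKE


print(cmake_is_recent_enough("cmake version 3.16.3"))
```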
Building
Add the Ops sources
First, download the release of your choice.
Inside the TF Serving sources folder, create a directory named custom_ops under tensorflow_serving and copy the tensorflow_onmttok directory into it.
$ cd <tf_serving_sources>
$ mkdir tensorflow_serving/custom_ops
$ cp -r <op_sources>/tensorflow_onmttok tensorflow_serving/custom_ops
Reference the Ops
Edit tensorflow_serving/model_servers/BUILD to reference the Ops build target:
SUPPORTED_TENSORFLOW_OPS = [
...
"//tensorflow_serving/custom_ops/tensorflow_onmttok:onmttok_ops"
]
Build OpenNMT Tokenizer from sources
The last step is to build a static version of the
OpenNMT Tokenizer library.
This repository provides a shell script
that will build it with CMake.
$ cd <op_sources>
$ chmod +x build_tokenizer.sh && ./build_tokenizer.sh
Note: Pass the sudo argument to the build_tokenizer.sh script to execute the make install command with sudo.
Build TensorFlow Serving
You can now build TensorFlow Serving as usual.
Built Distributions
Hashes for tensorflow_onmttok_ops-0.4.0-cp38-cp38-manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 75eb8962f0af155244724c64e1dd48e985abe27dd773fd2762d782bcccdfdde8
MD5 | ecedc11d1438f9799a6205b76fe4c427
BLAKE2b-256 | 19685818031172da3dce2558be7dbca0863b92de4c3151edf8a8f3dc81df4836
Hashes for tensorflow_onmttok_ops-0.4.0-cp37-cp37m-manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | fc9dc0a31d9a9786bd869246c87d5efdfe0abd84ef5afacd395d97a38d566fd8
MD5 | 93db0ecdc824b2b9cae71f2a4651661b
BLAKE2b-256 | 708608c8768f449aed80983641d30e3dd93d7e220dab315b1a2b6ce17a870bbf
Hashes for tensorflow_onmttok_ops-0.4.0-cp36-cp36m-manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 1c32c2d23e17a48fb7338359a42919da80f201c4dd305a6e804a4563d0457012
MD5 | 8bca973f4bc33264c15aa81b7d945000
BLAKE2b-256 | ee263c07030c7adb4cd33d1162c5f7a18b0ff431f409cbb81c6c218d29d3f1a8
Hashes for tensorflow_onmttok_ops-0.4.0-cp35-cp35m-manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3aa028aab720c7dde021e89394a7c15be9655e7992ce38122a0eb3a750aeea37
MD5 | 5ebdb34ad6469b967b0b168098305d9f
BLAKE2b-256 | eaed8fed6a5c4ed31c1dd32fe8a70b5814a7ad70ff6eeb078c3e633f27a9bbfd
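A downloaded wheel can be compared against the SHA256 digests listed above using Python's standard `hashlib` module. The following is a minimal sketch; the helper names are hypothetical, and the path you pass is wherever you saved the wheel.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 equals the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash(wheel_path, "75eb8962f0af155244724c64e1dd48e985abe27dd773fd2762d782bcccdfdde8")` should hold for an intact copy of the cp38 wheel, where `wheel_path` is the location of your download.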