Escape unknown symbols in SentencePiece vocabularies
escape-unk
Escape unknown symbols in SentencePiece vocabularies. This is particularly useful for the MarianNMT toolkit, which does not support replacing unknown tokens with the most attended word in the source (see here, thanks to @emjotde for the idea).
IMPORTANT NOTE: this solution is far from ideal. The model may fail to copy the escaped unknown characters, especially if it has not been trained with escaped characters. Ideally, you should train your SentencePiece vocabulary with the --byte_fallback option. This is just a workaround for scenarios where the model does not have byte fallback or cannot be re-trained.
Install
Just install it from PyPI:
pip install escape-unk
Background
There are some scenarios where your machine translation model has to translate sentences containing characters that are unknown to the SentencePiece vocabulary. Neural models usually start to hallucinate, produce garbage, or simply don't know how to translate when an unknown character appears in the input. In cases where those characters simply need to be copied, escaping them to their hexadecimal representation can be useful, provided the model manages to copy the escaped symbols.
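To make the escaping idea concrete, here is a minimal Python sketch. It is not the package's actual implementation: the function name `escape_unknown` and the `is_known` predicate are hypothetical, and the real tool decides what is unknown by consulting the SentencePiece model rather than a simple predicate. The output format (hex of the UTF-8 bytes wrapped in `[[` and `]]`, with consecutive unknown characters grouped together) is inferred from the examples in this README.

```python
def escape_unknown(text, is_known):
    """Replace runs of unknown characters with [[<hex of their UTF-8 bytes>]].

    `is_known` is a hypothetical stand-in for a vocabulary lookup: it takes
    a single character and returns True if the model can represent it.
    """
    out = []
    run = []  # consecutive unknown characters waiting to be escaped

    def flush():
        if run:
            out.append("[[" + "".join(run).encode("utf-8").hex() + "]]")
            run.clear()

    for ch in text:
        if is_known(ch):
            flush()
            out.append(ch)
        else:
            run.append(ch)
    flush()
    return "".join(out)


if __name__ == "__main__":
    # Assumption for illustration only: treat every ASCII character as known.
    print(escape_unknown("Beijing (Chinese: 北京)", str.isascii))
```

Grouping a run of unknown characters into one escaped token keeps the sequence short, which should make it easier for the model to copy; whether the real tool groups in exactly this way is an assumption.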
Escaping Chinese characters with an English-German vocabulary looks like this:
echo "Beijing (Chinese: 北京) is the capital of the People's Republic of China" | escape-unk -m vocab.deen.spm
Beijing (Chinese: [[e58c97e4baac]]) is the capital of the People's Republic of China
or escaping emojis:
echo "I ❤️ you" | escape-unk -m vocab.deen.spm
I [[e29da4efb88f]] you
So instead of:
echo "Beijing (Chinese: 北京) is the capital of the People's Republic of China" | marian-decoder -c model.config.yml
Peking (chinesisch: ) ist die Hauptstadt der Volksrepublik China
we will have:
echo "Beijing (Chinese: 北京) is the capital of the People's Republic of China" | escape-unk -m vocab.deen.spm | marian-decoder -c model.config.yml
Beijing (chinesisch: [[e58c97e4baac]]) ist die Hauptstadt der Volksrepublik China
and the full pipeline with unescape-unk:
echo "Beijing ..." | escape-unk -m vocab.deen.spm | marian-decoder -c config.yml | unescape-unk
Beijing (chinesisch: 北京) ist die Hauptstadt der Volksrepublik China
WARNING: if the translator does not copy an escaped sequence correctly and produces an invalid sequence, the character is omitted and replaced by an empty string. If you want it to fail when this happens, use the --strict/-s mode of the unescape-unk command.
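The unescaping behaviour described in the warning can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's code: the function name `unescape` and the `strict` parameter are hypothetical and only mirror the described --strict/-s behaviour, and the `[[...]]` pattern is inferred from the examples in this README.

```python
import re

# Anything between double square brackets is treated as a candidate escape.
ESCAPED = re.compile(r"\[\[([^\[\]]*)\]\]")


def unescape(text, strict=False):
    """Turn [[<hex>]] sequences back into the characters they encode.

    Invalid sequences (bad hex or bytes that are not valid UTF-8) become an
    empty string, or raise ValueError when `strict` is True.
    """
    def decode(match):
        try:
            return bytes.fromhex(match.group(1)).decode("utf-8")
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            if strict:
                raise
            return ""

    return ESCAPED.sub(decode, text)


if __name__ == "__main__":
    print(unescape("Beijing (chinesisch: [[e58c97e4baac]])"))
    # A truncated sequence is silently dropped in the default mode.
    print(repr(unescape("a [[e58c97e4ba]] b")))
```

Failing loudly in strict mode is useful in batch pipelines, where a silently dropped character would otherwise be hard to notice.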