A Romanian WordPiece tokenizer
A Romanian WordPiece Tokenizer for use with HuggingFace models
This is a 'proper' Romanian WordPiece tokenizer, to be used with the HuggingFace tokenizers library.
It does the following:
- replaces the improper Romanian diacritics 'ş' and 'ţ' (written with a cedilla) with their correct versions 'ș' and 'ț' (written with a comma below).
- properly splits the Romanian clitics glued to nouns, prepositions, verbs, etc.
- automatically enforces the current Romanian Academy writing rules: the use of 'â' and of the 'sunt/suntem/sunteți' forms of the verb 'a fi'.
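The first and third normalization steps can be illustrated with a minimal sketch. This is not the rwpt implementation: `normalize_romanian` is a hypothetical helper, it only covers the 'sînt/sîntem/sînteți' forms (not the general 'î' to 'â' rule), and it does not split clitics.

```python
import re

# Cedilla forms (keys) mapped to the correct comma-below forms (values).
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u0162": "\u021a",  # Ţ -> Ț
})

# Old 'sînt', 'sîntem', 'sînteți' forms of 'a fi' (\u00ee is î, \u021b is ț).
# Group 1 keeps the case of the initial 's'; group 2 is the optional ending.
_SINT_RE = re.compile(r"\b([Ss])\u00eent(em|e\u021bi)?\b")


def normalize_romanian(text: str) -> str:
    """Hypothetical helper (assumption): not part of the rwpt API."""
    # Fix the diacritics first so that 'sînteţi' becomes 'sînteți'
    # before the verb pattern is applied.
    text = text.translate(_CEDILLA_TO_COMMA)
    return _SINT_RE.sub(
        lambda m: m.group(1) + "unt" + (m.group(2) or ""), text
    )
```

For example, `normalize_romanian("Sîntem OK şi")` returns `"Suntem OK și"`.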
The tokenizer will be trained on a cleaned version of the CoRoLa corpus.
The corpus has 35,999,401 sentences and 763,531,321 words (counted with the Linux wc -w utility).
PyPI package
The tokenizer is now available on PyPI, and can be installed with the command pip install rwpt.
Usage example
from rwpt import RoBertWordPieceTokenizer, get_bundled_vocab_file_path
corola_vocab_file = get_bundled_vocab_file_path()
tokenizer = RoBertWordPieceTokenizer.from_file(vocab=corola_vocab_file)
input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer.encode(sequence=input_text)
# We have Romanian tokens such as the clitic pronoun '-mi' or
# the MWE 'în principiu'. Also, the incorrect form of the verb 'Sîntem'
# is normalized as 'Suntem'.
assert result_encoded.tokens[0] == 'Suntem'
assert result_encoded.tokens[6] == '-mi'
assert result_encoded.tokens[9] == 'în principiu'
result_decoded = tokenizer.decode(ids=result_encoded.ids)
assert result_decoded == 'Suntem OK și ar trebui să -mi meargă, în principiu.'
Full Romanian decoding does not work yet (note the space between 'să' and '-mi') because decoders.Decoder.custom() is not yet implemented in the tokenizers library.
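Until a custom decoder is possible, the extra space can be removed in a post-processing step. The sketch below is only an illustration under stated assumptions: `rejoin_enclitics` and the `ENCLITICS` list are hypothetical (not part of rwpt), the list is incomplete, and only hyphen-initial clitics such as '-mi' are handled.

```python
import re

# Assumed, incomplete list of hyphen-initial clitics; extend as needed.
ENCLITICS = ("-mi", "-ți", "-și", "-i", "-l", "-o", "-le", "-ne", "-vă")

# Longest alternatives first, so that '-i' never shadows a longer clitic.
_ENCLITIC_RE = re.compile(
    r"\s+("
    + "|".join(re.escape(c) for c in sorted(ENCLITICS, key=len, reverse=True))
    + r")\b"
)


def rejoin_enclitics(decoded: str) -> str:
    """Hypothetical helper: glue a clitic back onto the preceding word."""
    # Drop the whitespace that the decoder inserted before the clitic.
    return _ENCLITIC_RE.sub(r"\1", decoded)
```

Applied to the decoded string from the example above, this turns 'să -mi' back into 'să-mi' and leaves the rest of the sentence unchanged. A hyphen that is used as punctuation before one of the listed forms would also be glued, so this is a heuristic rather than a full decoder.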
Transformers usage example
In order to use the tokenizer with the __call__ method (as preferred in the Transformers documentation), do the following:
from rwpt import RoBertPreTrainedTokenizer, get_bundled_vocab_file_path
corola_vocab_file = get_bundled_vocab_file_path()
tokenizer = RoBertPreTrainedTokenizer.from_pretrained(
    corola_vocab_file, model_max_length=256)
input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer(text=input_text, padding='max_length')
The example above can be simplified as:
from rwpt import load_ro_pretrained_tokenizer
tokenizer = load_ro_pretrained_tokenizer(max_sequence_len=256)
input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer(text=input_text, padding='max_length')
Source Distribution
File details
Details for the file rwpt-1.0.2.tar.gz.
File metadata
- Download URL: rwpt-1.0.2.tar.gz
- Upload date:
- Size: 4.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 20ad530a65a31c673b3ba1e9b7063e72309351c589b0a2e3a8d7b2373985c91f |
| MD5 | 683a67eb4a214f967052b49e289c9354 |
| BLAKE2b-256 | 5673e01f98fdf792af3fac3734e79ef7ae6a5dcabc5ac8385fa2c144808935d1 |