A Romanian WordPiece tokenizer

A Romanian WordPiece Tokenizer for use with HuggingFace models

This is a 'proper' Romanian WordPiece tokenizer, to be used with the HuggingFace tokenizers library.

It will do the following:

  1. replace all improper Romanian diacritics 'ş' and 'ţ' with their correct versions 'ș' and 'ț'.
  2. properly split the Romanian clitics glued to nouns, prepositions, verbs, etc.
  3. automatically enforce the current Romanian Academy spelling rules, i.e. the use of 'â' and of the 'sunt/suntem/sunteți' forms of the verb 'a fi'.
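Steps 1 and 3 can be illustrated with a minimal standalone sketch. It is not the code that rwpt ships: the names normalize_ro, CEDILLA_TO_COMMA and SINT_FORMS are hypothetical, and only the cedilla characters and the 'sînt/sîntem/sînteți' forms are handled (no general 'î' to 'â' rewriting, no clitic splitting).

```python
import re

# Legacy cedilla characters mapped to the correct comma-below ones.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})

# Old-orthography forms of 'a fi': sînt, sîntem, sînteți
# (first letter in either case; all-caps words are not handled).
SINT_FORMS = re.compile(r"\b([sS])\u00eent(em|e\u021bi)?\b")


def normalize_ro(text):
    """Hypothetical helper: fix diacritics, then the 'sînt*' verb forms."""
    # Diacritics first, so that 'sînteţi' becomes 'sînteți' before matching.
    text = text.translate(CEDILLA_TO_COMMA)
    return SINT_FORMS.sub(
        lambda m: m.group(1) + "unt" + (m.group(2) or ""), text)
```

Calling normalize_ro("Sîntem OK şi sînteţi aici.") yields "Suntem OK și sunteți aici.".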

The tokenizer will be trained on a cleaned version of the CoRoLa corpus. The corpus has 35,999,401 sentences and 763,531,321 words (counted with the Linux wc -w utility).

PyPI package

The tokenizer is now available on PyPI, and can be installed with the command pip install rwpt.

Usage example

from rwpt import RoBertWordPieceTokenizer, get_bundled_vocab_file_path

corola_vocab_file = get_bundled_vocab_file_path()
tokenizer = RoBertWordPieceTokenizer.from_file(vocab=corola_vocab_file)

input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer.encode(sequence=input_text)
# We have Romanian tokens such as the clitic pronoun '-mi' or
# the MWE 'în principiu'. Also, the incorrect form of the verb 'Sîntem'
# is normalized as 'Suntem'.
assert result_encoded.tokens[0] == 'Suntem'
assert result_encoded.tokens[6] == '-mi'
assert result_encoded.tokens[9] == 'în principiu'

result_decoded = tokenizer.decode(ids=result_encoded.ids)

assert result_decoded == 'Suntem OK și ar trebui să -mi meargă, în principiu.'

Full Romanian decoding does not work yet (note the space between 'să' and '-mi') because decoders.Decoder.custom() is not yet implemented in the tokenizers library.
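Until a custom decoder is possible, the stray space could be removed in a post-processing step. The sketch below is an assumption, not part of rwpt: reattach_enclitics is a hypothetical helper that only handles a single space before a hyphen-initial token, and it would also join unrelated text such as "x -5".

```python
import re

# A single space that sits between a word character and a token
# starting with '-' followed by a word character, e.g. 'să -mi'.
_SPACE_BEFORE_ENCLITIC = re.compile(r"(?<=\w) (?=-\w)")


def reattach_enclitics(decoded):
    """Hypothetical helper: glue '-mi'-style tokens back to the previous word."""
    return _SPACE_BEFORE_ENCLITIC.sub("", decoded)
```

Applied to the decoded string from the example above, it returns 'Suntem OK și ar trebui să-mi meargă, în principiu.'.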

Transformers usage example

In order to use the tokenizer with the __call__ method (as preferred in the Transformers documentation), do the following:

from rwpt import RoBertPreTrainedTokenizer, get_bundled_vocab_file_path

corola_vocab_file = get_bundled_vocab_file_path()
tokenizer = RoBertPreTrainedTokenizer.from_pretrained(
    corola_vocab_file, model_max_length=256)
input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer(text=input_text, padding='max_length')

The example above can be simplified as:

from rwpt import load_ro_pretrained_tokenizer

tokenizer = load_ro_pretrained_tokenizer(max_sequence_len=256)
input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer(text=input_text, padding='max_length')
