A Romanian WordPiece tokenizer

A Romanian WordPiece Tokenizer for use with HuggingFace models

This is a 'proper' Romanian WordPiece tokenizer, to be used with the HuggingFace tokenizers library.

It will do the following:

replace the improper cedilla diacritics 'ş' and 'ţ' with their correct comma-below versions 'ș' and 'ț'.
split the Romanian clitics attached to nouns, prepositions, verbs, etc.
enforce the current Romanian Academy spelling rules: writing with 'â' and using the 'sunt/suntem/sunteți' forms of the verb 'a fi'.
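
The first and third normalization steps above can be illustrated with a minimal sketch. This is not the package's implementation: the names normalize_ro, _CEDILLA_TO_COMMA and _A_FI_FORMS are hypothetical, and only three old forms of 'a fi' are covered. The general 'î' to 'â' rewrite and the clitic splitting are left out.

```python
import re

# Cedilla forms (keys) mapped to the comma-below forms (values).
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u0162": "\u021a",  # Ţ -> Ț
})

# Old spellings of 'a fi' and their current forms (illustrative subset).
_A_FI_FORMS = {"sînt": "sunt", "sîntem": "suntem", "sînteți": "sunteți"}
_A_FI_RE = re.compile(r"\b(sînteți|sîntem|sînt)\b", re.IGNORECASE)


def _fix_a_fi(match):
    word = match.group(0)
    fixed = _A_FI_FORMS[word.lower()]
    # Keep an initial capital, as in 'Sîntem' -> 'Suntem'.
    return fixed.capitalize() if word[0].isupper() else fixed


def normalize_ro(text):
    # Fix the diacritics first so that 'sînteţi' also matches below.
    text = text.translate(_CEDILLA_TO_COMMA)
    return _A_FI_RE.sub(_fix_a_fi, text)
```

Running the diacritic replacement before the verb-form lookup means the lookup table only needs the comma-below spellings.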

The tokenizer will be trained on a cleaned version of the CoRoLa corpus. The corpus has 35,999,401 sentences and 763,531,321 words (counted with the Linux wc -w utility).
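
To make the word count concrete: wc -w counts whitespace-delimited strings, which is not the same unit as the tokens the tokenizer produces. The sketch below approximates that counting in Python. The helper count_words is hypothetical, and str.split() may treat some Unicode whitespace differently from wc in a given locale.

```python
def count_words(lines):
    """Count whitespace-delimited words, similar in spirit to `wc -w`."""
    return sum(len(line.split()) for line in lines)


sample = ["Suntem OK și ar trebui să-mi meargă, în principiu.\n"]
print(count_words(sample))
```

Under this counting, 'să-mi' is a single word and punctuation stays attached to the preceding word.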

PyPI package

The tokenizer is now available on PyPI, and can be installed with the command pip install rwpt.

Usage example

from rwpt import RoBertWordPieceTokenizer, get_bundled_vocab_file_path

corola_vocab_file = get_bundled_vocab_file_path()
tokenizer = RoBertWordPieceTokenizer.from_file(vocab=corola_vocab_file)

input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer.encode(sequence=input_text)
# We have Romanian tokens such as the clitic pronoun '-mi' or
# the MWE 'în principiu'. Also, the incorrect form of the verb 'Sîntem'
# is normalized as 'Suntem'.
assert result_encoded.tokens[0] == 'Suntem'
assert result_encoded.tokens[6] == '-mi'
assert result_encoded.tokens[9] == 'în principiu'

result_decoded = tokenizer.decode(ids=result_encoded.ids)

assert result_decoded == 'Suntem OK și ar trebui să -mi meargă, în principiu.'

Full Romanian decoding does not work yet (note the space between 'să' and '-mi'), because decoders.Decoder.custom() is not yet implemented in the tokenizers library.
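
Until a custom decoder is available, one possible workaround is to post-process the decoded string. The sketch below is an assumption, not part of rwpt: the helper reattach_clitics is hypothetical, and it only handles the hyphen-initial case shown in the example above. It may also glue a hyphen-initial word that was legitimately preceded by a space in the input.

```python
import re

# A single space between a word character and a hyphen-initial token,
# e.g. the space in 'să -mi'.
_DETACHED_CLITIC = re.compile(r"(?<=\w) (?=-\w)")


def reattach_clitics(decoded):
    """Remove the space that decoding leaves before a hyphen-initial clitic."""
    return _DETACHED_CLITIC.sub("", decoded)
```

A spaced dash used as punctuation (' - ') is left untouched, because the pattern requires a word character right after the hyphen.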

Transformers usage example

In order to use the tokenizer with the __call__ method (as preferred in the Transformers documentation), do the following:

from rwpt import RoBertPreTrainedTokenizer, get_bundled_vocab_file_path

corola_vocab_file = get_bundled_vocab_file_path()
tokenizer = RoBertPreTrainedTokenizer.from_pretrained(
    corola_vocab_file, model_max_length=256)
input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer(text=input_text, padding='max_length')

The example above can be simplified as:

from rwpt import load_ro_pretrained_tokenizer

tokenizer = load_ro_pretrained_tokenizer(max_sequence_len=256)
input_text = "\t\tSîntem OK şi ar trebui să-mi meargă, în principiu.\n\n"
result_encoded = tokenizer(text=input_text, padding='max_length')

Project details

This version

1.0.2

Jan 15, 2024

Download files

Source Distribution

rwpt-1.0.2.tar.gz (4.9 MB)

Uploaded Jan 15, 2024 Source

File details

Details for the file rwpt-1.0.2.tar.gz.

File metadata

Download URL: rwpt-1.0.2.tar.gz
Upload date: Jan 15, 2024
Size: 4.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for rwpt-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`20ad530a65a31c673b3ba1e9b7063e72309351c589b0a2e3a8d7b2373985c91f`
MD5	`683a67eb4a214f967052b49e289c9354`
BLAKE2b-256	`5673e01f98fdf792af3fac3734e79ef7ae6a5dcabc5ac8385fa2c144808935d1`
