
Python package that provides tokenization of multilingual texts using language-specific tokenizers

Project description

Multi-Tokenizer

Tokenization of Multilingual Texts using Language-Specific Tokenizers


Overview

Multi-Tokenizer is a Python package that provides tokenization of multilingual texts using language-specific tokenizers. The package is designed for a variety of applications, including natural language processing, machine learning, and data analysis. Behind the scenes, it uses the lingua library to detect the language of each text segment, builds language-specific tokenizers with the tokenizers library, and then tokenizes each segment with the appropriate tokenizer. Multi-Tokenizer introduces additional special tokens that mark the language-specific segments; these tokens make it possible to reconstruct the original text after tokenization and allow models to differentiate between the languages present in the text.
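
As a rough sketch of that detect-then-route idea (not the package's actual internals), the snippet below uses the lingua detector to split a mixed sentence into language segments and hands each segment to a placeholder per-language tokenizer; the marker tokens and helper names here are made up for illustration.

from lingua import Language, LanguageDetectorBuilder

# Build a detector for the languages expected in the text.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.HINDI
).build()

# Placeholder per-language "tokenizers": whitespace split wrapped in
# illustrative language-marker tokens (not the package's real tokenizers).
tokenize_by_language = {
    Language.ENGLISH: lambda s: ["<EN>", *s.split(), "</EN>"],
    Language.HINDI: lambda s: ["<HI>", *s.split(), "</HI>"],
}

text = "Translate this hindi sentence to english - बिल्ली बहुत प्यारी है."
tokens = []
for segment in detector.detect_multiple_languages_of(text):
    piece = text[segment.start_index:segment.end_index]
    tokens.extend(tokenize_by_language[segment.language](piece))
print(tokens)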

Installation

Using pip

pip install multi-tokenizer

From Source

git clone https://github.com/chandralegend/multi-tokenizer.git
cd multi-tokenizer
pip install .

Usage

from multi_tokenizer import MultiTokenizer, PretrainedTokenizers

# specify the language tokenizers to be used
lang_tokenizers = [
    PretrainedTokenizers.ENGLISH,
    PretrainedTokenizers.CHINESE,
    PretrainedTokenizers.HINDI,
]

# create a multi-tokenizer object (split_text=True splits the text into segments for better language detection)
tokenizer = MultiTokenizer(lang_tokenizers, split_text=True)

sentence = "Translate this hindi sentence to english - बिल्ली बहुत प्यारी है."

# Pretokenize the text
pretokenized_text = tokenizer.pre_tokenize(sentence) # [('<EN>', (0, 1)), ('Translate', (1, 10)), ('Ġthis', (10, 15)), ('Ġhindi', (15, 21)), ...]

# Encode the text
ids, tokens = tokenizer.encode(sentence) # [3, 7235, 6614, 86, 755, 775, 10763, 83, 19412, 276, ...], ['<EN>', 'Tr', 'ans', 'l', 'ate', 'Ġthis', 'Ġhind', ...]

# Decode the tokens
decoded_text = tokenizer.decode(ids) # Translate this hindi sentence to english - बिल्ली बहुत प्यारी है.
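
Because the language markers are carried in the token stream, the overview's claim that the original text can be reconstructed amounts to a simple round trip. A quick sanity check using only the encode/decode calls shown above:

# Round-trip check with the calls shown above; exact whitespace handling
# may vary, so compare rather than assume.
ids, _ = tokenizer.encode(sentence)
print(tokenizer.decode(ids) == sentence)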

Development Setup

Prerequisites

  • Use the VSCode Dev Containers for easy setup (Recommended)
  • Install dev dependencies
    pip install poetry
    poetry install
    

Linting, Formatting and Type Checking

  • Add the directory to safe.directory
    git config --global --add safe.directory /workspaces/multi-tokenizer
    
  • Run the following command to lint and format the code
    pre-commit run --all-files
    
  • To install pre-commit hooks, run the following command (Recommended)
    pre-commit install
    

Running the tests

Run the tests using the following command

pytest -n "auto"

Approaches

  1. Approach 1: Individual tokenizers for each language
  2. Approach 2: Unified tokenization approach across languages using UTF-8 encodings (see the byte-level sketch below)
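
For context on Approach 2, a byte-level scheme operates on the UTF-8 byte sequence so every language is handled uniformly. The plain-Python snippet below (not multi-tokenizer API) shows what that raw byte view looks like for a short mixed English/Hindi string.

# Plain Python, unrelated to the multi-tokenizer API: the UTF-8 byte view
# that a unified byte-level approach (Approach 2) would operate on.
text = "cat - बिल्ली"
raw = text.encode("utf-8")
print(list(raw))            # one integer (0-255) per byte; each Devanagari character takes 3 bytes
print(raw.decode("utf-8"))  # lossless round trip back to the original string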

Evaluation

Contributors

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

multi_tokenizer-0.1.4.tar.gz (936.6 kB)

Uploaded Source

Built Distribution

multi_tokenizer-0.1.4-py3-none-any.whl (958.1 kB)

Uploaded Python 3

File details

Details for the file multi_tokenizer-0.1.4.tar.gz.

File metadata

  • Download URL: multi_tokenizer-0.1.4.tar.gz
  • Upload date:
  • Size: 936.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Linux/6.5.0-1025-azure

File hashes

Hashes for multi_tokenizer-0.1.4.tar.gz
Algorithm Hash digest
SHA256 cf86d1e6903b10111352016f1f421fc82b6cb5b4f2a53a382c647a3224bf0fb1
MD5 367c6e6fba0d7984c73fbaf98ede4ee3
BLAKE2b-256 18c5734596ae2a84f8493317d01f40a034bb78a48ba7b217828684cd178b76dd

See more details on using hashes here.

File details

Details for the file multi_tokenizer-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: multi_tokenizer-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 958.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Linux/6.5.0-1025-azure

File hashes

Hashes for multi_tokenizer-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 8a972a438826d77caad0a28cf1d961028c6f4b4a7a85e27d19694ddbca9dc859
MD5 6bf499e19f7af8506344dd0543cf1cc4
BLAKE2b-256 50d99637fe6da732657c9fde6461880c48469d408fae679456aa8d7301779fa8

See more details on using hashes here.
