Multi-Tokenizer
Tokenization of Multilingual Texts using Language-Specific Tokenizers
Overview
Multi-Tokenizer is a Python package that provides tokenization of multilingual texts using language-specific tokenizers. The package is designed for a variety of applications, including natural language processing, machine learning, and data analysis. Behind the scenes, it uses the lingua library to detect the language of each text segment and the tokenizers library to build language-specific tokenizers, then tokenizes each segment with the appropriate tokenizer. Multi-Tokenizer introduces additional special tokens to mark language-specific segments; these tokens make it possible to reconstruct the original text after tokenization and let models differentiate between the languages present in the input.
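The detect-then-tokenize pipeline described above can be sketched in plain Python. The detector and per-language tokenizers below are toy stand-ins (a Unicode-range check and a whitespace split), not the actual lingua or tokenizers implementations used by the package; the point is to show how language runs get wrapped in special tokens.

```python
# Illustrative sketch of per-language tokenization with special tokens.
# detect_language and tokenize_segment are toy stand-ins, not the real
# lingua / tokenizers machinery used by multi-tokenizer.

def detect_language(segment: str) -> str:
    """Toy detector: any Devanagari character -> Hindi, else English."""
    if any("\u0900" <= ch <= "\u097F" for ch in segment):
        return "HI"
    return "EN"

def tokenize_segment(segment: str) -> list[str]:
    """Toy language-specific tokenizer: whitespace split."""
    return segment.split()

def multi_tokenize(text: str) -> list[str]:
    """Group words into language runs and wrap each run in special tokens."""
    tokens: list[str] = []
    current_lang: str | None = None
    buffer: list[str] = []
    for word in text.split():
        lang = detect_language(word)
        if lang != current_lang:
            if buffer:  # flush the previous language run
                tokens += [f"<{current_lang}>",
                           *tokenize_segment(" ".join(buffer)),
                           f"</{current_lang}>"]
            current_lang, buffer = lang, []
        buffer.append(word)
    if buffer:  # flush the final run
        tokens += [f"<{current_lang}>",
                   *tokenize_segment(" ".join(buffer)),
                   f"</{current_lang}>"]
    return tokens

print(multi_tokenize("hello world नमस्ते दुनिया"))
# ['<EN>', 'hello', 'world', '</EN>', '<HI>', 'नमस्ते', 'दुनिया', '</HI>']
```

Because each run is bracketed by language markers, the original segments can be recovered after tokenization, which is the property the package's special tokens provide.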
Installation
Using pip
pip install multi-tokenizer
from Source
git clone https://github.com/chandralegend/multi-tokenizer.git
cd multi-tokenizer
pip install .
Usage
from multi_tokenizer import MultiTokenizer, PretrainedTokenizers
# specify the language tokenizers to be used
lang_tokenizers = [
PretrainedTokenizers.ENGLISH,
PretrainedTokenizers.CHINESE,
PretrainedTokenizers.HINDI,
]
# create a multi-tokenizer object (split_text=True splits the text into segments for better language detection)
tokenizer = MultiTokenizer(lang_tokenizers, split_text=True)
sentence = "Translate this hindi sentence to english - बिल्ली बहुत प्यारी है."
# Pretokenize the text
pretokenized_text = tokenizer.pre_tokenize(sentence) # [('<EN>', (0, 1)), ('Translate', (1, 10)), ('Ġthis', (10, 15)), ('Ġhindi', (15, 21)), ...]
# Encode the text
ids, tokens = tokenizer.encode(sentence) # [3, 7235, 6614, 86, 755, 775, 10763, 83, 19412, 276, ...], ['<EN>', 'Tr', 'ans', 'l', 'ate', 'Ġthis', 'Ġhind', ...]
# Decode the tokens
decoded_text = tokenizer.decode(ids) # Translate this hindi sentence to english - बिल्ली बहुत प्यारी है.
Development Setup
Prerequisites
- Use the VSCode Dev Containers for easy setup (Recommended)
- Install dev dependencies
pip install poetry
poetry install
Linting, Formatting and Type Checking
- Add the directory to safe.directory
git config --global --add safe.directory /workspaces/multi-tokenizer
- Run the following command to lint and format the code
pre-commit run --all-files
- To install pre-commit hooks, run the following command (Recommended)
pre-commit install
Running the tests
Run the tests using the following command
pytest -n "auto"
Approaches
- Approach 1: Individual tokenizers for each language
- Approach 2: Unified tokenization approach across languages using UTF-8 encodings
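The unified scheme of Approach 2 can be illustrated with plain byte-level tokenization: every string, regardless of language, maps onto UTF-8 byte ids drawn from a single shared 256-entry base vocabulary. This is an illustrative sketch under that reading of "UTF-8 encodings", not the package's actual implementation.

```python
# Sketch of a unified byte-level scheme: all languages share the same
# 256-id base vocabulary of UTF-8 byte values. Illustrative only.

def byte_encode(text: str) -> list[int]:
    """Encode text as a sequence of UTF-8 byte ids (0-255)."""
    return list(text.encode("utf-8"))

def byte_decode(ids: list[int]) -> str:
    """Decode byte ids back to the original string."""
    return bytes(ids).decode("utf-8")

sample = "hi नमस्ते"
ids = byte_encode(sample)
assert byte_decode(ids) == sample          # lossless round trip
assert all(0 <= i < 256 for i in ids)      # shared 256-id vocabulary
print(len(sample), len(ids))               # Devanagari chars take 3 bytes each
```

The trade-off relative to Approach 1 is that no language detection or per-language vocabulary is needed, but non-ASCII scripts expand to several tokens per character.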
Evaluation
- Evaluation Methodologies
- Data Collection and Analysis
- Comparative Analysis
- Implementation Plan
- Future Expansion
Contributors