
Project description

Tokenization of Multilingual Texts using Language-Specific Tokenizers

Approaches

  1. Approach 1: Individual tokenizers for each language
  2. Approach 2: Unified tokenization across languages using UTF-8 encodings
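The two approaches above can be sketched as follows. This is a minimal illustration of the ideas, not multi-tokenizer's actual API; the `tokenize_by_language` and `tokenize_utf8_bytes` names and the per-language tokenizer registry are hypothetical.

```python
from typing import Callable

def tokenize_by_language(text: str, lang: str,
                         tokenizers: dict[str, Callable[[str], list]]) -> list:
    """Approach 1: route each text to a language-specific tokenizer."""
    return tokenizers[lang](text)

def tokenize_utf8_bytes(text: str) -> list[int]:
    """Approach 2: unified tokenization over UTF-8 byte values (0-255),
    so one fixed vocabulary covers every language."""
    return list(text.encode("utf-8"))

# Hypothetical per-language tokenizers, for illustration only.
tokenizers = {
    "en": str.split,  # whitespace tokenization for English
    "zh": list,       # character-level tokenization for Chinese
}

print(tokenize_by_language("hello world", "en", tokenizers))  # ['hello', 'world']
print(tokenize_utf8_bytes("é"))  # [195, 169]: one character, two byte tokens
```

The byte-level variant trades longer token sequences for a small, language-agnostic vocabulary, while per-language tokenizers can use language-specific segmentation rules.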

Evaluation

Development Setup

Prerequisites

  • Use the Dev Container for easy setup
  • Install dev dependencies
    pip install poetry
    poetry install
    

Linting, Formatting and Type Checking

  • Mark the repository as a safe Git directory (required when the Dev Container workspace is owned by a different user)
    git config --global --add safe.directory /workspaces/multi-tokenizer
    
  • Run the following command to lint and format the code
    pre-commit run --all-files
    
  • (Recommended) Install the pre-commit hooks so the checks run automatically on every commit
    pre-commit install
    

Running the tests

Run the tests using the following command (`-n auto` runs them in parallel via the pytest-xdist plugin):

pytest -n "auto"

Download files

Download the file for your platform.

Source Distribution

multi_tokenizer-0.1.0.tar.gz (2.0 kB)

Built Distribution

multi_tokenizer-0.1.0-py3-none-any.whl (2.6 kB)

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page