Tokenization of Multilingual Texts using Language-Specific Tokenizers
Approaches
- Approach 1: Individual tokenizers for each language
- Approach 2: Unified tokenization approach across languages using UTF-8 encodings
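Neither approach is spelled out in this README, but the contrast can be sketched as follows. This is a minimal illustration only: the `PerLanguageTokenizer` and `Utf8ByteTokenizer` classes, the whitespace-based tokenizers, and the language tags are hypothetical stand-ins, not the package's actual API.

```python
class PerLanguageTokenizer:
    """Approach 1: route each text to a language-specific tokenizer."""

    def __init__(self):
        # Placeholder tokenizers; real ones would be trained per language.
        self.tokenizers = {
            "en": lambda text: text.lower().split(),
            "de": lambda text: text.split(),
        }

    def tokenize(self, text, lang):
        # The caller must supply (or detect) the language tag.
        return self.tokenizers[lang](text)


class Utf8ByteTokenizer:
    """Approach 2: one language-agnostic tokenizer over UTF-8 bytes."""

    def tokenize(self, text):
        # Every string maps into the same 256-symbol byte vocabulary,
        # so no per-language logic or language detection is needed.
        return list(text.encode("utf-8"))


per_lang = PerLanguageTokenizer()
unified = Utf8ByteTokenizer()

print(per_lang.tokenize("Hello World", lang="en"))  # ['hello', 'world']
print(unified.tokenize("héllo"))  # 6 byte values: 'é' spans two bytes
```

The trade-off the two approaches explore: per-language tokenizers need routing logic but yield linguistically meaningful tokens, while byte-level tokenization is universal but produces longer sequences.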
Evaluation
- Evaluation Methodologies
- Data Collection and Analysis
- Comparative Analysis
- Implementation Plan
- Future Expansion
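The README does not define its evaluation metrics, but one common way to run the comparative analysis is tokenizer fertility, the average number of tokens produced per whitespace-separated word, computed per language and per tokenizer. The metric choice and the `fertility` helper below are assumptions for illustration, not part of the project:

```python
def fertility(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


texts = ["the quick brown fox", "jumps over the lazy dog"]

# A byte-level tokenizer yields far higher fertility than a word splitter,
# which is one axis on which the two approaches above can be compared.
byte_fertility = fertility(lambda t: list(t.encode("utf-8")), texts)
word_fertility = fertility(str.split, texts)
print(round(byte_fertility, 2), word_fertility)  # 4.67 1.0
```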
Development Setup
Prerequisites
- Use the Dev Container for easy setup
- Install dev dependencies
pip install poetry
poetry install
Linting, Formatting and Type Checking
- Add the directory to safe.directory
git config --global --add safe.directory /workspaces/multi-tokenizer
- Run the following command to lint and format the code
pre-commit run --all-files
- To install the pre-commit hooks, run the following command (recommended)
pre-commit install
Running the tests
Run the tests using the following command (the -n "auto" flag parallelizes them via the pytest-xdist plugin)
pytest -n "auto"
Download files
Source Distribution: multi_tokenizer-0.1.0.tar.gz (2.0 kB)
Built Distribution: multi_tokenizer-0.1.0-py3-none-any.whl
Hashes for multi_tokenizer-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 57e0c7db4603138ad565da11a5e2caaf6cf381dd1892f2d8ff5b975cba4615d2
MD5 | dba89fd0c061af187b5f7729a5113d70
BLAKE2b-256 | 2cf6459c1117fdae910dd45ae3da6e7ec3fe4a3b4ae30fd3f8a134a21a9c1ca2