Universal pre-token language adaptation layer for text-based LLMs

pretok

Python 3.11+, MIT License

pretok lets any Large Language Model receive input in any human language by automatically translating the input text into a language the model supports. The translation happens before tokenization, without modifying the model or tokenizer.

✨ Features

  • Model-Agnostic: Works with any text-based LLM (local, remote, open-source, proprietary)
  • Pre-Token Boundary: All transformations occur on raw text before tokenization
  • Prompt Structure Preservation: Role markers, delimiters, code blocks, and control tokens are preserved
  • Flexible Translation: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
  • Pluggable Backends: Support for multiple detection and translation engines
  • Explicit Capability Contracts: Models declare their supported languages

🚀 Installation

pip install pretok

Or with uv:

uv add pretok

Optional Dependencies

# Language detection
pip install "pretok[fasttext]"      # FastText (high accuracy)
pip install "pretok[langdetect]"    # langdetect (pure Python)

# Translation backends
pip install "pretok[nllb]"          # Meta's NLLB model (local)
pip install "pretok[openai]"        # OpenAI API

# All features (quotes keep shells such as zsh from expanding the brackets)
pip install "pretok[all]"
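
To see which optional backends are importable in the current environment, a small check along these lines may help. This is a sketch, not part of pretok: the mapping from extras to module names (`EXTRA_MODULES`) is an assumption made for illustration and is not taken from the package metadata.

```python
from importlib.util import find_spec

# Assumed mapping from pretok extras to a top-level module each one provides.
# These module names are guesses for illustration only.
EXTRA_MODULES = {
    "fasttext": "fasttext",
    "langdetect": "langdetect",
    "nllb": "transformers",
    "openai": "openai",
}


def available_extras(modules=EXTRA_MODULES):
    """Return {extra_name: bool} telling whether each module can be imported."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra:<12} {'installed' if ok else 'missing'}")
```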

📖 Quick Start

from pretok import Pretok, create_pretok

# Create with default settings
pretok = Pretok(target_language="en")

# Process text
result = pretok.process("Bonjour, comment ça va?")

print(result.processed_text)  # "Hello, how are you?"
print(result.was_modified)    # True

With Model-Specific Optimization

from pretok import create_pretok

# Auto-detect optimal language from model capabilities
pretok = create_pretok(model_id="gpt-4")     # Uses English
pretok = create_pretok(model_id="qwen-7b")   # Uses Chinese
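
The idea of a capability contract can be sketched independently of pretok. The code below is an illustration under assumptions: `ModelCapability`, `REGISTRY`, and `pick_target_language` are hypothetical names, and the language lists are placeholders that only mirror the two comments above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapability:
    """Hypothetical contract: what a model declares about its languages."""
    model_id: str
    supported_languages: tuple[str, ...]
    primary_language: str


# Placeholder entries for illustration; not pretok's real capability data.
REGISTRY = {
    "gpt-4": ModelCapability("gpt-4", ("en",), "en"),
    "qwen-7b": ModelCapability("qwen-7b", ("zh", "en"), "zh"),
}


def pick_target_language(model_id, detected_language, registry=REGISTRY, fallback="en"):
    """Choose the language the input should end up in for the given model."""
    capability = registry.get(model_id)
    if capability is None:
        return fallback  # unknown model: fall back to a default language
    if detected_language in capability.supported_languages:
        return detected_language  # already supported, no translation needed
    return capability.primary_language


if __name__ == "__main__":
    print(pick_target_language("qwen-7b", "fr"))
```

In this sketch, input that is already in a supported language is left alone, and everything else is routed to the model's primary language.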

With Custom Translation Backend

from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator

# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
    base_url="https://api.openai.com/v1",  # Or OpenRouter, Ollama, vLLM
    model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)
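
For orientation, this is roughly what a translation call against an OpenAI-compatible endpoint could look like on the wire. It is a sketch using only the standard library: `build_translation_request`, the prompt wording, and the `YOUR_KEY` placeholder are assumptions, and pretok's own `LLMTranslator` may build its requests differently. The request is constructed but never sent.

```python
import json
from urllib.request import Request


def build_translation_request(text, target_language, base_url, model, api_key="YOUR_KEY"):
    """Build (but do not send) a POST to {base_url}/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": f"Translate the user's text into {target_language}. "
                           "Return only the translation.",
            },
            {"role": "user", "content": text},
        ],
        "temperature": 0,
    }
    return Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_translation_request(
        "Bonjour", "en", "https://api.openai.com/v1", "gpt-4o-mini"
    )
    print(req.get_method(), req.full_url)
```

Because only `base_url` and `model` change between providers, the same request shape should work for any service that implements the chat completions route.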

Preserving Prompt Structure

prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of Japan?
<|im_end|>"""

result = pretok.process(prompt)
# Role markers preserved, only content translated
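
One way to picture structure preservation is a splitter that sets control tokens and role names aside and hands only the remaining text to the translator. The sketch below is an assumption about the general technique, not pretok's implementation: it only recognizes ChatML-style `<|...|>` tokens, ignores code blocks, and `translate_preserving_structure` is a hypothetical name.

```python
import re

# Matches ChatML-style control tokens such as <|im_start|> and <|im_end|>.
CONTROL_TOKEN = re.compile(r"(<\|[^|>]+\|>)")


def _translate_keep_whitespace(text, translate):
    """Translate the stripped core of `text`, keeping surrounding whitespace."""
    core = text.strip()
    if not core:
        return text
    start = text.index(core)
    return text[:start] + translate(core) + text[start + len(core):]


def translate_preserving_structure(prompt, translate):
    """Apply `translate` to message content only; tokens and role names pass through."""
    out = []
    previous = ""
    # Splitting on a capturing group keeps the control tokens in the result.
    for part in CONTROL_TOKEN.split(prompt):
        if CONTROL_TOKEN.fullmatch(part):
            out.append(part)
        else:
            role, body = "", part
            # The first line after <|im_start|> is the role name; leave it as-is.
            if previous == "<|im_start|>" and "\n" in part:
                role, body = part.split("\n", 1)
                role += "\n"
            out.append(role + _translate_keep_whitespace(body, translate))
        previous = part
    return "".join(out)


if __name__ == "__main__":
    demo = "<|im_start|>user\nhola\n<|im_end|>"
    # str.upper stands in for a real translator.
    print(translate_preserving_structure(demo, str.upper))
```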

Configuration

Create a pretok.yaml:

version: "1.0"

pipeline:
  default_detector: langdetect
  cache_enabled: true

translation:
  llm:
    base_url: "https://api.openai.com/v1"
    model: "gpt-4o-mini"

cache:
  memory:
    max_size: 1000
    ttl: 3600

Then load it:

from pretok import Pretok
from pretok.config import load_config

config = load_config("pretok.yaml")
pretok = Pretok(config=config)
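
The `max_size` and `ttl` settings suggest a bounded in-memory cache with expiry. A minimal sketch of that behaviour follows, assuming `ttl` is in seconds and that the oldest entry is evicted first; `TTLCache` is a hypothetical class, and pretok's actual cache may differ.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Bounded cache whose entries expire `ttl` seconds after being set."""

    def __init__(self, max_size=1000, ttl=3600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        self._items.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._items[key] = (self._clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry


if __name__ == "__main__":
    cache = TTLCache(max_size=1000, ttl=3600)
    cache.set(("fr", "en", "Bonjour"), "Hello")
    print(cache.get(("fr", "en", "Bonjour")))
```

Keying on (source language, target language, text) means a repeated prompt can skip the translation call entirely until the entry expires.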

🏗️ Architecture

Input Text (any language)
        ↓
Segment Parsing (roles, code, text)
        ↓
Language Detection
        ↓
Translation Decision
        ↓
Translation (if needed)
        ↓
Prompt Reconstruction
        ↓
Tokenizer (unchanged)
        ↓
LLM Inference
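
The stages above, up to prompt reconstruction, can be sketched as a single function. Everything here is a toy stand-in under stated assumptions: `parse_segments`, `process`, `toy_detect`, and `toy_translate` are hypothetical, the detector is a dictionary lookup, and only the `processed_text` and `was_modified` field names are borrowed from the Quick Start.

```python
from dataclasses import dataclass


@dataclass
class Result:
    processed_text: str
    was_modified: bool


def parse_segments(text):
    """Split into (translatable, line) pairs; lines starting with '<|' count as markers."""
    return [(not line.startswith("<|"), line) for line in text.split("\n")]


def process(text, target_language, detect, translate):
    """Detect, decide, translate if needed, then rebuild the prompt line by line."""
    out = []
    for translatable, line in parse_segments(text):
        # Translation decision: skip markers, blank lines, and text already in the target.
        if translatable and line.strip() and detect(line) != target_language:
            line = translate(line, target_language)
        out.append(line)
    processed = "\n".join(out)
    return Result(processed, processed != text)


# Toy backends so the sketch runs without any model or network access.
TOY_DICTIONARY = {"Bonjour": "Hello"}


def toy_detect(line):
    return "fr" if line in TOY_DICTIONARY else "en"


def toy_translate(line, target_language):
    return TOY_DICTIONARY.get(line, line)


if __name__ == "__main__":
    result = process("<|im_start|>user\nBonjour\n<|im_end|>", "en", toy_detect, toy_translate)
    print(result.processed_text)
    print(result.was_modified)
```

The output of this function is still plain text, which is what lets the tokenizer and model stay unchanged downstream.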

🛠️ Development

# Clone the repository
git clone https://github.com/yen0304/pretok.git
cd pretok

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src/ tests/

# Run type checking
uv run mypy src/

📄 License

MIT License - see LICENSE for details.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
