Universal pre-token language adaptation layer for text-based LLMs

pretok

Requires Python 3.11+. License: MIT.

pretok lets any Large Language Model accept input in any human language by automatically translating the input text into a language the model supports. Translation happens on raw text before tokenization, so neither the model nor the tokenizer is modified.

✨ Features

  • Model-Agnostic: Works with any text-based LLM (local, remote, open-source, proprietary)
  • Pre-Token Boundary: All transformations occur on raw text before tokenization
  • Prompt Structure Preservation: Role markers, delimiters, code blocks, and control tokens are preserved
  • Flexible Translation: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
  • Pluggable Backends: Support for multiple detection and translation engines
  • Explicit Capability Contracts: Models declare their supported languages

🚀 Installation

pip install pretok

Or with uv:

uv add pretok

Optional Dependencies

# Language detection
pip install pretok[fasttext]      # FastText (high accuracy)
pip install pretok[langdetect]    # langdetect (pure Python)

# Translation backends
pip install pretok[nllb]          # Meta's NLLB model (local)
pip install pretok[openai]        # OpenAI API

# All features
pip install pretok[all]

📖 Quick Start

from pretok import Pretok, create_pretok

# Create with default settings
pretok = Pretok(target_language="en")

# Process text
result = pretok.process("Bonjour, comment ça va?")

print(result.processed_text)  # "Hello, how are you?"
print(result.was_modified)    # True

With Model-Specific Optimization

from pretok import create_pretok

# Auto-detect optimal language from model capabilities
pretok = create_pretok(model_id="gpt-4")     # Uses English
pretok = create_pretok(model_id="qwen-7b")   # Uses Chinese

With Custom Translation Backend

from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator

# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
    base_url="https://api.openai.com/v1",  # Or OpenRouter, Ollama, vLLM
    model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)

Preserving Prompt Structure

prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of Japan?
<|im_end|>"""

result = pretok.process(prompt)
# Role markers preserved, only content translated
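
To make the preservation step concrete, here is a rough sketch of how a ChatML-style prompt could be split so that only message content reaches the translator. This is not pretok's parser: `STRUCTURE`, `split_segments`, and `translate_prompt` are hypothetical, and the regular expression covers only the `<|im_start|>role` and `<|im_end|>` markers shown above.

```python
import re

# Matches "<|im_start|>role\n" (marker plus role name) or "<|im_end|>".
STRUCTURE = re.compile(r"(<\|im_start\|>\w+\n|<\|im_end\|>)")


def split_segments(prompt: str) -> list[tuple[str, str]]:
    """Split a prompt into ("control", marker) and ("text", content) segments."""
    segments = []
    for part in STRUCTURE.split(prompt):
        if not part:
            continue  # re.split yields empty strings at the edges
        kind = "control" if STRUCTURE.fullmatch(part) else "text"
        segments.append((kind, part))
    return segments


def translate_prompt(prompt: str, translate) -> str:
    """Apply `translate` to text segments only, then reassemble the prompt."""
    return "".join(
        translate(part) if kind == "text" else part
        for kind, part in split_segments(prompt)
    )
```

Because the pattern uses a capturing group, `re.split` keeps the markers in its output, so joining the segments reproduces the original prompt byte for byte when `translate` is the identity function.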

Configuration

Create a pretok.yaml:

version: "1.0"

pipeline:
  default_detector: langdetect
  cache_enabled: true

translation:
  llm:
    base_url: "https://api.openai.com/v1"
    model: "gpt-4o-mini"

cache:
  memory:
    max_size: 1000
    ttl: 3600

Then load the file from Python:

from pretok import Pretok
from pretok.config import load_config

config = load_config("pretok.yaml")
pretok = Pretok(config=config)

🏗️ Architecture

Input Text (any language)
        ↓
Segment Parsing (roles, code, text)
        ↓
Language Detection
        ↓
Translation Decision
        ↓
Translation (if needed)
        ↓
Prompt Reconstruction
        ↓
Tokenizer (unchanged)
        ↓
LLM Inference
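
The middle stages of this flow (detection, translation decision, translation, reconstruction) can be sketched as a single loop over parsed segments. This is an illustrative outline, not pretok's implementation: `run_pipeline`, `toy_detect`, and `toy_translate` are hypothetical, and the toy detector and translator exist only so the example runs without external services.

```python
from typing import Callable


def run_pipeline(
    segments: list[tuple[str, str]],
    detect: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    supported: set[str],
    target: str,
) -> str:
    """Translate only text segments whose detected language is unsupported."""
    out = []
    for kind, content in segments:
        if kind != "text" or not content.strip():
            out.append(content)  # roles, code, and whitespace pass through
            continue
        source = detect(content)
        if source in supported:
            out.append(content)  # translation decision: nothing to do
        else:
            out.append(translate(content, source, target))
    return "".join(out)  # prompt reconstruction


def toy_detect(text: str) -> str:
    """Stand-in detector: treats anything containing 'Bonjour' as French."""
    return "fr" if "Bonjour" in text else "en"


def toy_translate(text: str, source: str, target: str) -> str:
    """Stand-in translator backed by a one-entry lookup table."""
    return {"Bonjour": "Hello"}.get(text, text)
```

A call such as `run_pipeline(segments, toy_detect, toy_translate, {"en"}, "en")` leaves control segments and English text alone and rewrites only the French segment.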

🛠️ Development

# Clone the repository
git clone https://github.com/yen0304/pretok.git
cd pretok

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src/ tests/

# Run type checking
uv run mypy src/

📄 License

MIT License - see LICENSE for details.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
