Skip to main content

Universal pre-token language adaptation layer for text-based LLMs

Project description

pretok

CI PyPI version Python 3.11+ License: MIT Ruff

Universal pre-token language adaptation layer for text-based LLMs.

pretok enables any Large Language Model to receive input in any human language by automatically translating input text into a language the model supports—all before tokenization, without modifying the model or tokenizer.

✨ Features

  • Model-Agnostic: Works with any text-based LLM (local, remote, open-source, proprietary)
  • Pre-Token Boundary: All transformations occur on raw text before tokenization
  • Prompt Structure Preservation: Role markers, delimiters, code blocks, and control tokens are preserved
  • Flexible Translation: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
  • Pluggable Backends: Support for multiple detection and translation engines
  • Explicit Capability Contracts: Models declare their supported languages

🚀 Installation

pip install pretok

Or with uv:

uv add pretok

Optional Dependencies

# Language detection
pip install pretok[fasttext]      # FastText (high accuracy)
pip install pretok[langdetect]    # langdetect (pure Python)

# Translation backends
pip install pretok[nllb]          # Meta's NLLB model (local)
pip install pretok[openai]        # OpenAI API

# All features
pip install pretok[all]

📖 Quick Start

from pretok import Pretok, create_pretok

# Create with default settings
pretok = Pretok(target_language="en")

# Process text
result = pretok.process("Bonjour, comment ca va?")

print(result.processed_text)  # "Hello, how are you?"
print(result.was_modified)    # True

With Model-Specific Optimization

# Auto-detect optimal language from model capabilities
pretok = create_pretok(model_id="gpt-4")     # Uses English
pretok = create_pretok(model_id="qwen-7b")   # Uses Chinese

With Custom Translation Backend

from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator

# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
    base_url="https://api.openai.com/v1",  # Or OpenRouter, Ollama, vLLM
    model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)

Preserving Prompt Structure

prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of Japan?
<|im_end|>"""

result = pretok.process(prompt)
# Role markers preserved, only content translated

Configuration

Create a pretok.yaml:

version: "1.0"

pipeline:
  default_detector: langdetect
  cache_enabled: true

translation:
  llm:
    base_url: "https://api.openai.com/v1"
    model: "gpt-4o-mini"

cache:
  memory:
    max_size: 1000
    ttl: 3600
from pretok import Pretok
from pretok.config import load_config

config = load_config("pretok.yaml")
pretok = Pretok(config=config)

🏗️ Architecture

Input Text (any language)
        ↓
Segment Parsing (roles, code, text)
        ↓
Language Detection
        ↓
Translation Decision
        ↓
Translation (if needed)
        ↓
Prompt Reconstruction
        ↓
Tokenizer (unchanged)
        ↓
LLM Inference

📚 Documentation

🛠️ Development

# Clone the repository
git clone https://github.com/yen0304/pretok.git
cd pretok

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src/ tests/

# Run type checking
uv run mypy src/

📄 License

MIT License - see LICENSE for details.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pretok-0.1.0.tar.gz (57.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pretok-0.1.0-py3-none-any.whl (40.1 kB view details)

Uploaded Python 3

File details

Details for the file pretok-0.1.0.tar.gz.

File metadata

  • Download URL: pretok-0.1.0.tar.gz
  • Upload date:
  • Size: 57.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.9

File hashes

Hashes for pretok-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9a0eb181c0b8a26601146d8d4fbe9bdb1cdc2ad2692aadcc77eecb42a48187bc
MD5 45bda68e9e1fa1f3702e3ab48d044f16
BLAKE2b-256 1edc0c0cf768ce878f84169090dbfb22c5338703649b06984c395931dc98ddf7

See more details on using hashes here.

File details

Details for the file pretok-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pretok-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 40.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.9

File hashes

Hashes for pretok-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f7f67b68be892ec7bb15ecc2469c649d4add59735404e36fd2e42568a6298671
MD5 61ef620ae511b6a780256045c928ee6a
BLAKE2b-256 71d52ff1e8316f6f3bd6eee04966c70690e352d578e624a89a8b00171ac01e7b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page