Universal pre-token language adaptation layer for text-based LLMs
pretok
pretok lets any Large Language Model accept input in any human language by automatically translating the input text into a language the model supports. This happens entirely before tokenization, without modifying the model or the tokenizer.
✨ Features
- Model-Agnostic: Works with any text-based LLM (local, remote, open-source, proprietary)
- Pre-Token Boundary: All transformations occur on raw text before tokenization
- Prompt Structure Preservation: Role markers, delimiters, code blocks, and control tokens are preserved
- Flexible Translation: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
- Pluggable Backends: Support for multiple detection and translation engines
- Explicit Capability Contracts: Models declare their supported languages
🚀 Installation
pip install pretok
Or with uv:
uv add pretok
Optional Dependencies
# Language detection (extras are quoted so shells such as zsh do not expand the brackets)
pip install "pretok[fasttext]" # FastText (high accuracy)
pip install "pretok[langdetect]" # langdetect (pure Python)
# Translation backends
pip install "pretok[nllb]" # Meta's NLLB model (local)
pip install "pretok[openai]" # OpenAI API
# All features
pip install "pretok[all]"
📖 Quick Start
from pretok import Pretok, create_pretok
# Create with default settings
pretok = Pretok(target_language="en")
# Process text
result = pretok.process("Bonjour, comment ça va ?")
print(result.processed_text) # "Hello, how are you?"
print(result.was_modified) # True
With Model-Specific Optimization
# Auto-detect optimal language from model capabilities
pretok = create_pretok(model_id="gpt-4") # Uses English
pretok = create_pretok(model_id="qwen-7b") # Uses Chinese
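The selection step above can be pictured as a lookup against declared model capabilities. The following is a minimal sketch, not pretok's implementation: `ModelCapability`, `REGISTRY`, and `choose_target_language` are hypothetical names, and the language lists are illustrative assumptions chosen only to match the two examples above.

```python
# Hypothetical sketch; none of these names are part of pretok's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapability:
    """Languages a model declares support for, primary language first."""
    model_id: str
    supported_languages: tuple


# Illustrative registry (assumed values, not pretok's real data).
REGISTRY = {
    "gpt-4": ModelCapability("gpt-4", ("en", "fr", "de")),
    "qwen-7b": ModelCapability("qwen-7b", ("zh", "en")),
}


def choose_target_language(model_id: str, detected: str, fallback: str = "en") -> str:
    """Return the language the prompt should be in for this model."""
    capability = REGISTRY.get(model_id)
    if capability is None:
        return fallback  # unknown model: use a default
    if detected in capability.supported_languages:
        return detected  # already supported, no translation needed
    return capability.supported_languages[0]  # translate to the primary language


print(choose_target_language("qwen-7b", "ja"))
```

In this sketch a Japanese prompt sent to the hypothetical `qwen-7b` entry would be translated to its primary language, while an English prompt would pass through unchanged.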
With Custom Translation Backend
from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator
# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
base_url="https://api.openai.com/v1", # Or OpenRouter, Ollama, vLLM
model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)
Preserving Prompt Structure
prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of Japan?
<|im_end|>"""
result = pretok.process(prompt)
# Role markers preserved, only content translated
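To illustrate what "only content translated" could mean for a ChatML-style prompt like the one above, here is a rough sketch that splits each line on control tokens and passes only the remaining content to a translation callback. It is an assumption about the general technique, not pretok's parser; `CONTROL_TOKEN`, `ROLES`, and `translate_preserving_markers` are hypothetical names.

```python
import re

# Matches ChatML-style control tokens such as <|im_start|> and <|im_end|>.
CONTROL_TOKEN = re.compile(r"(<\|[a-z_]+\|>)")
ROLES = {"system", "user", "assistant"}


def translate_preserving_markers(prompt, translate):
    """Apply `translate` to content only; keep control tokens and role names."""
    out = []
    for line in prompt.split("\n"):
        rebuilt = []
        after_start = False
        for part in CONTROL_TOKEN.split(line):
            if CONTROL_TOKEN.fullmatch(part):
                rebuilt.append(part)  # control token, kept verbatim
                after_start = part == "<|im_start|>"
            elif after_start and part.strip() in ROLES:
                rebuilt.append(part)  # role name directly after <|im_start|>
                after_start = False
            elif part.strip():
                rebuilt.append(translate(part))  # ordinary content
            else:
                rebuilt.append(part)  # empty or whitespace-only piece
        out.append("".join(rebuilt))
    return "\n".join(out)


demo = "<|im_start|>user\nWhat is the capital of Japan?\n<|im_end|>"
# str.upper stands in for a real translator so the effect is visible.
print(translate_preserving_markers(demo, str.upper))
```

A real parser would also need to handle code blocks and other delimiters, which this sketch leaves out.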
Configuration
Create a pretok.yaml:
version: "1.0"
pipeline:
  default_detector: langdetect
  cache_enabled: true
translation:
  llm:
    base_url: "https://api.openai.com/v1"
    model: "gpt-4o-mini"
cache:
  memory:
    max_size: 1000
    ttl: 3600
from pretok import Pretok
from pretok.config import load_config
config = load_config("pretok.yaml")
pretok = Pretok(config=config)
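The `max_size` and `ttl` settings in the cache section suggest a bounded in-memory cache with expiring entries. The sketch below shows one common way to build such a cache; it is not pretok's code, `TTLCache` is a hypothetical name, and treating `ttl` as seconds is an assumption the configuration above does not confirm.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small in-memory LRU cache with per-entry expiry (illustrative only)."""

    def __init__(self, max_size=1000, ttl=3600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl  # assumed to be seconds
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # entry has expired
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (self._clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = TTLCache(max_size=1000, ttl=3600)
cache.set(("fr", "en", "Bonjour"), "Hello")
print(cache.get(("fr", "en", "Bonjour")))
```

Keying on source language, target language, and text, as in the usage lines, is one plausible choice for avoiding repeated translation calls.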
🏗️ Architecture
Input Text (any language)
↓
Segment Parsing (roles, code, text)
↓
Language Detection
↓
Translation Decision
↓
Translation (if needed)
↓
Prompt Reconstruction
↓
Tokenizer (unchanged)
↓
LLM Inference
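The stages in the diagram can be sketched as a single function over pre-parsed segments. This is a simplified illustration under stated assumptions, not pretok's internals: `Segment`, `run_pipeline`, and the `detect` and `translate` callables are hypothetical stand-ins for the real detection and translation backends.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    text: str
    translatable: bool  # False for role markers, code, and control tokens


def run_pipeline(
    segments: List[Segment],
    target_language: str,
    detect: Callable[[str], str],
    translate: Callable[[str, str, str], str],
) -> str:
    """Detect, decide, translate, and reconstruct; tokenization comes afterwards."""
    out = []
    for seg in segments:
        if not seg.translatable:
            out.append(seg.text)  # structure is preserved verbatim
            continue
        source = detect(seg.text)  # language detection
        if source == target_language:  # translation decision
            out.append(seg.text)
        else:
            out.append(translate(seg.text, source, target_language))
    return "".join(out)  # prompt reconstruction


# Stub backends so the sketch runs without any external service.
def fake_detect(text):
    return "fr" if "Bonjour" in text else "en"


def fake_translate(text, source, target):
    return {"Bonjour": "Hello"}.get(text, text)


segments = [
    Segment("<|im_start|>user\n", False),
    Segment("Bonjour", True),
    Segment("\n<|im_end|>", False),
]
print(run_pipeline(segments, "en", fake_detect, fake_translate))
```

Because the function returns plain text, the tokenizer and model downstream see an ordinary prompt and need no changes.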
🛠️ Development
# Clone the repository
git clone https://github.com/yen0304/pretok.git
cd pretok
# Install dependencies
uv sync --dev
# Run tests
uv run pytest
# Run linting
uv run ruff check src/ tests/
# Run type checking
uv run mypy src/
📄 License
MIT License - see LICENSE for details.
🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.