A CLI translation tool using LLMs for document translation
Tinbox: Your Ultimate CLI Translation Tool
Tinbox is a command-line tool for translating large documents, especially PDFs, using Large Language Models (LLMs). It splits documents that are too large for a single model request and works around model refusals related to size and copyright, so long documents can be translated end to end.
Why Choose Tinbox?
- Handles Large Documents: Efficiently processes large PDFs and other document types.
- Overcomes Model Limitations: Bypasses common model refusals due to size or copyright concerns.
- No OCR Needed: Directly translates PDFs using advanced multimodal models.
- Smart Algorithms: Page-by-page translation with seam repair and a sliding window for long texts keep translations consistent across sections.
- Local and Cloud Support: Use models locally or in the cloud, depending on your preference.
Quick Start Example:
tinbox --to es document.pdf
The Problems Tinbox Solves
- PDF Translation Challenges
  - Most tools require OCR, leading to formatting loss and errors
  - Tinbox uses multimodal models to directly understand PDFs as images
- Large Document Limitations
  - Traditional tools often fail with large documents
  - Models frequently refuse or time out on big files
  - Tinbox splits and processes documents while maintaining context
- Model Refusal Issues
  - Many models refuse translation tasks due to:
    - Copyright concerns
    - Document size limitations
    - Rate limiting
  - Tinbox's algorithms work around these limitations
- Quality and Consistency
  - Smart algorithms ensure consistent translations across document sections
  - Maintains context between pages and segments
  - Repairs potential inconsistencies at section boundaries
Key Highlights:
- Translate PDFs without OCR using advanced AI models
- Handle documents of any size with smart splitting algorithms
- Work around common model limitations and refusals
- Track costs and performance with built-in benchmarking
Features
Smart Document Handling
- PDFs: Processed directly as images - no OCR needed!
- Word (docx): Preserves formatting while translating
- Text files: Efficient processing for large files
Intelligent Translation
- Smart Algorithms:
  - Page-by-Page with Seam Repair (default for PDF)
  - Sliding Window for long text documents
  - Automatic context preservation between sections
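The page-by-page strategy with context preservation can be pictured roughly as follows. This is a minimal sketch, not Tinbox's actual implementation: `translate_pages`, the `translate_fn` callback, and the `context_chars` parameter are hypothetical names introduced for illustration, and real seam repair would likely need an extra pass over the text around each page boundary.

```python
from typing import Callable, List


def translate_pages(
    pages: List[str],
    translate_fn: Callable[[str, str], str],
    context_chars: int = 200,
) -> List[str]:
    """Translate pages one at a time, handing the tail of the previous
    translation to the next call as context (hypothetical helper)."""
    translated: List[str] = []
    context = ""
    for page in pages:
        result = translate_fn(page, context)
        translated.append(result)
        # Keep only the end of the latest translation as context for the next page.
        context = result[-context_chars:] if context_chars > 0 else ""
    return translated


def fake_translate(text: str, context: str) -> str:
    """Stand-in for a model call: upper-cases the text instead of translating."""
    return text.upper()


if __name__ == "__main__":
    print(translate_pages(["first page", "second page"], fake_translate))
```

Passing the translation function in as a callback keeps the loop independent of any particular model provider, so the same loop could drive a cloud model or a local one.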
Flexible Model Support
- Use powerful cloud models (GPT-4V, Claude 3.5 Sonnet)
- Run translations locally with Ollama
- Mix and match models for different tasks
Language Support
- Flexible source/target language specification using ISO 639-1 codes (e.g., 'en', 'zh', 'es')
- Common three-letter aliases are also accepted (e.g., 'eng', 'chi', 'spa')
Benchmarking
- Track overall translation time and token usage/cost
- Compare algorithms or model providers side-by-side
Getting Started
Quick Install
# Install base package
pip install tinbox
# For PDF support (recommended); the quotes keep shells such as zsh from expanding the brackets
pip install "tinbox[pdf]"
# For Word document support
pip install "tinbox[docx]"
# Install everything
pip install "tinbox[all]"
Basic Usage
- Translate a PDF to Spanish
  tinbox --to es document.pdf
- Translate a Word document from Chinese to English
  tinbox --from zh --to en document.docx
- Handle a large text file with custom settings
  tinbox --to fr --algorithm sliding-window large_document.txt
Tips for Best Results
- For Large Documents
  - Use the sliding window algorithm: --algorithm sliding-window
  - Adjust window size if needed: --window-size 3000
- For PDFs
  - The default page-by-page algorithm works best
  - No OCR needed - just point to your PDF!
- For Best Performance
  - Use local models via Ollama for faster processing
  - Cloud models (GPT-4V, Claude) for highest quality
Detailed Documentation
Command-Line Options
Core Options
| Option | Description | Example |
|---|---|---|
| --from, -f | Source language (auto-detect if not specified) | --from zh |
| --to, -t | Target language (default: English) | --to es |
| --model | Model to use for translation | --model gpt-4v |
| --output, -o | Output file (default: print to console) | --output translated.txt |
Algorithm Options
| Option | Description | Default |
|---|---|---|
| --algorithm, -a | Translation algorithm (page or sliding-window) | page for PDF |
| --window-size | Size of translation window | 2000 tokens |
| --overlap-size | Overlap between windows | 200 tokens |
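To make the interaction between window size and overlap concrete, here is a minimal sketch of sliding-window chunking. It is an illustration under stated assumptions rather than Tinbox's own code: `sliding_windows` is a hypothetical helper, and it operates on an already tokenized list, since the tokenizer Tinbox uses is not documented here.

```python
from typing import List


def sliding_windows(
    tokens: List[str],
    window_size: int = 2000,
    overlap_size: int = 200,
) -> List[List[str]]:
    """Split tokens into windows of at most window_size items, where each
    window repeats the last overlap_size tokens of the previous one."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap_size < window_size:
        raise ValueError("overlap_size must be in [0, window_size)")

    step = window_size - overlap_size
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the input.
        if start + window_size >= len(tokens):
            break
        start += step
    return windows


if __name__ == "__main__":
    words = "one two three four five six seven eight nine ten".split()
    for window in sliding_windows(words, window_size=4, overlap_size=1):
        print(window)
```

With the documented defaults (2000 and 200), each new window starts 1800 tokens after the previous one, so a 5000-token text would be covered by three windows. A larger overlap gives the model more shared context at each boundary at the price of translating more tokens twice.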
Output Format Options
| Option | Description | Example Output |
|---|---|---|
| --format, -F | Output format (text, json, markdown) | See examples below |
| --benchmark, -b | Include performance metrics | Translation time, costs |
Supported Languages
Common language codes (ISO 639-1):
| Code | Language | Also Accepts |
|---|---|---|
| en | English | eng |
| es | Spanish | spa |
| zh | Chinese | chi, cmn |
| fr | French | fra |
| de | German | deu, ger |
| ja | Japanese | jpn |
| ko | Korean | kor |
| ru | Russian | rus |
| ar | Arabic | ara |
| hi | Hindi | hin |
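The table above amounts to a lookup from accepted aliases to ISO 639-1 codes. The sketch below shows one way such a lookup could work; `normalize_language` and the two constants are hypothetical names, and how Tinbox resolves codes internally is an assumption.

```python
# Two-letter ISO 639-1 codes listed in the table above.
ISO_639_1_CODES = {"en", "es", "zh", "fr", "de", "ja", "ko", "ru", "ar", "hi"}

# Three-letter aliases from the "Also Accepts" column.
LANGUAGE_ALIASES = {
    "eng": "en",
    "spa": "es",
    "chi": "zh",
    "cmn": "zh",
    "fra": "fr",
    "deu": "de",
    "ger": "de",
    "jpn": "ja",
    "kor": "ko",
    "rus": "ru",
    "ara": "ar",
    "hin": "hi",
}


def normalize_language(code: str) -> str:
    """Return the ISO 639-1 code for a two-letter code or a known alias."""
    key = code.strip().lower()
    if key in ISO_639_1_CODES:
        return key
    if key in LANGUAGE_ALIASES:
        return LANGUAGE_ALIASES[key]
    raise ValueError(f"Unsupported language code: {code!r}")


if __name__ == "__main__":
    for raw in ["en", "GER", " cmn "]:
        print(raw.strip(), "->", normalize_language(raw))
```

Normalizing once at the command-line boundary means the rest of a program only ever has to deal with the two-letter form.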
Output Format Examples
1. Plain Text (Default)
tinbox translate document.pdf --to es
# Output: Translated text...
2. JSON Output
tinbox translate document.pdf --to es --format json
Example response:
{
  "metadata": {
    "source_lang": "en",
    "target_lang": "es",
    "model": "claude-3-sonnet",
    "algorithm": "page"
  },
  "result": {
    "text": "Translated text...",
    "tokens_used": 1500,
    "cost": 0.045,
    "time_taken": 12.5
  }
}
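A script that consumes this output might read it as shown below. This is a small sketch that assumes the JSON has exactly the shape of the example response; `summarize` is a hypothetical helper, and the cost is printed without a currency because the example does not state one.

```python
import json

# The example response from above, embedded so the sketch runs on its own.
SAMPLE_RESPONSE = """
{
  "metadata": {
    "source_lang": "en",
    "target_lang": "es",
    "model": "claude-3-sonnet",
    "algorithm": "page"
  },
  "result": {
    "text": "Translated text...",
    "tokens_used": 1500,
    "cost": 0.045,
    "time_taken": 12.5
  }
}
"""


def summarize(raw_json: str) -> str:
    """Build a one-line summary from a JSON translation report."""
    data = json.loads(raw_json)
    meta = data["metadata"]
    result = data["result"]
    return (
        f"{meta['source_lang']} -> {meta['target_lang']} via {meta['model']}: "
        f"{result['tokens_used']} tokens, cost {result['cost']:.3f}, "
        f"{result['time_taken']:.1f}s"
    )


if __name__ == "__main__":
    print(summarize(SAMPLE_RESPONSE))
    # en -> es via claude-3-sonnet: 1500 tokens, cost 0.045, 12.5s
```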
3. Markdown Report
tinbox translate document.pdf --to es --format markdown
Advanced Usage
- Handling Very Large Documents
  tinbox --to es --algorithm sliding-window --window-size 3000 --overlap-size 300 large_document.pdf
- Using Local Models
  tinbox --to fr --model ollama:mistral-small document.txt
- Benchmarking Different Models
  tinbox --to de --benchmark --model gpt-4v document.pdf
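To benchmark several models against the same document, the command line can be assembled in a small script. The sketch below only builds argument lists from the flags documented above (`--to`, `--benchmark`, `--model`); `build_command` is a hypothetical helper, and the invocation form follows the Advanced Usage examples.

```python
import shlex
from typing import List, Optional


def build_command(
    path: str,
    to: str,
    model: Optional[str] = None,
    benchmark: bool = False,
) -> List[str]:
    """Assemble a tinbox argument list from the documented options."""
    cmd = ["tinbox", "--to", to]
    if benchmark:
        cmd.append("--benchmark")
    if model:
        cmd += ["--model", model]
    cmd.append(path)
    return cmd


if __name__ == "__main__":
    # Print one benchmark command per model, quoted for safe copy and paste.
    for model in ["gpt-4v", "ollama:mistral-small"]:
        print(shlex.join(build_command("document.pdf", "de", model=model, benchmark=True)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to actually run the translation; building a list instead of a single string avoids shell quoting problems with file names that contain spaces.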
Project Structure
tinbox/
├── src/
│   └── tinbox/
│       ├── cli.py            # Command-line interface
│       ├── core/             # Core functionality
│       │   ├── cost.py       # Cost tracking
│       │   ├── processor/    # Document processors
│       │   └── translation/  # Translation algorithms
│       └── utils/            # Utilities
└── tests/                    # Test suite
Future Plans
- Enhanced Output Formats
  - PDF output with original formatting
  - Word document export
  - HTML with parallel text
- Advanced Features
  - AI-powered section detection
  - Custom terminology support
  - Interactive translation review
  - Domain-specific model fine-tuning
- Performance Improvements
  - Parallel processing
  - Better caching
  - Reduced API costs