Enhancing translation with RAG-powered large language models
🦖 T-Ragx
Enhancing Translation with RAG-Powered Large Language Models
🚧 T-Ragx Demo [colab demo]🚧
🚧 T-Ragx Colab Tool [colab tool]🚧
TL;DR
Overview
- Democratize high-quality machine translations
- Open-source system-level translation framework
- Fluent, natural translations using LLMs
- Private and secure local translations
- Zero-shot in-task translations
Methods
- QLoRA fine-tuned models
- General and in-task translation memory and glossary
- Include preceding text for document-level translations for additional context
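As a rough illustration of how the three context sources listed above (translation memory, glossary, and preceding text) could be combined into a single LLM prompt, here is a minimal sketch. The `RetrievedContext` class, the `build_prompt` function, the prompt wording, and the sample data are all hypothetical and are not taken from the T-Ragx code base; the actual prompt format used by the fine-tuned models may differ.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedContext:
    """Hypothetical container for what a retrieval step might return."""
    memory: list = field(default_factory=list)          # (source, target) pairs
    glossary: dict = field(default_factory=dict)        # source term -> target term
    preceding_text: list = field(default_factory=list)  # earlier source sentences


def build_prompt(source, ctx, source_lang="ja", target_lang="en"):
    """Assemble a translation prompt from retrieved context (illustrative format only)."""
    parts = [f"Translate the following text from {source_lang} to {target_lang}."]
    if ctx.preceding_text:
        parts.append("Preceding text:\n" + "\n".join(ctx.preceding_text))
    if ctx.memory:
        lines = [f"- {src} => {tgt}" for src, tgt in ctx.memory]
        parts.append("Similar past translations:\n" + "\n".join(lines))
    # Keep only glossary entries whose source term occurs in the input text
    hits = {term: tr for term, tr in ctx.glossary.items() if term in source}
    if hits:
        lines = [f"- {term}: {tr}" for term, tr in hits.items()]
        parts.append("Glossary:\n" + "\n".join(lines))
    parts.append("Text to translate:\n" + source)
    return "\n\n".join(parts)


# Illustrative data only
ctx = RetrievedContext(
    memory=[("こんにちは", "Hello")],
    glossary={"リムル": "Rimuru", "ヴェルドラ": "Veldora"},
    preceding_text=["前の文です。"],
)
prompt = build_prompt("リムルは笑った。", ctx)
print(prompt)
```

Filtering the glossary down to terms that appear in the input keeps the prompt short, which matters with a small context window.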
Results
- In-task translation memory and glossary achieved a significant (~45%) increase in aggregated translation scores on the QLoRA Mistral 7B model
- Great recall for valid translation memory and glossary entries (e.g. previous translations and character names)
- Outperforms native TowerInstruct on all 6 fine-tuned language directions (Ja x Zh x En)
- Outperforms DeepL in translating a Japanese web novel (That Time I Got Reincarnated as a Slime) to Chinese with in-task memories
  - Japanese -> Chinese
    - +29% by sacrebleu
    - +0.4% by comet22
Getting Started
Install
Environment
Conda / Mamba (Recommended)
pip
Use your favourite virtual environment, and run:
pip install -r requirment.txt
Examples
Initialize the input processor:
import t_ragx
# Initialize the input processor, which retrieves the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()
# Load/point to the demo resources
input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host="t-ragx-fossil.rayliu.ca", elasticsearch_port=80)
Using the llama-cpp-python backend:
import t_ragx
# T-Ragx currently supports:
# Huggingface transformers: MistralModel, InternLM2Model
# Ollama API: OllamaModel
# OpenAI API: OpenAIModel
# Llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
    repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
    filename="*Q4_K_M*",
    # see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
    # for other files
    chat_format="mistral-instruct",
    model_config={'n_ctx': 2048},  # increase the context window
)
t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)
Translate!
t_ragx_translator.batch_translate(
    source_text_list,  # the list of input texts to translate
    pre_text_list=pre_text_list,  # optional, the preceding text used as document-level context
    # Can be generated via:
    # pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
    source_lang_code='ja',
    target_lang_code='en',
    memory_search_args={'top_k': 3}  # optional, additional arguments passed to input_processor.search_memory
)
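To make the idea of preceding text more concrete, the sketch below builds a sliding window of earlier sentences for each input sentence. It is a standalone illustration, not the implementation of `t_ragx.utils.helper.get_preceding_text`; the function name `build_preceding_text` and the list-of-lists return format are assumptions, so check the library helper for the format that `batch_translate` actually expects.

```python
def build_preceding_text(sentences, max_sent=3):
    """For each sentence, collect up to `max_sent` sentences that come directly before it.

    Hypothetical helper for illustration; returns one list of earlier
    sentences per input sentence (an empty list for the first sentence).
    """
    preceding = []
    for i in range(len(sentences)):
        # Slice the window that ends just before sentence i
        preceding.append(sentences[max(0, i - max_sent):i])
    return preceding


sentences = ["a", "b", "c", "d", "e"]
windows = build_preceding_text(sentences, max_sent=2)
print(windows)
```

A larger `max_sent` gives the model more document-level context, at the cost of a longer prompt.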
Data Sources
Dataset | Translation Memory | Glossary | Training | Testing | License |
---|---|---|---|---|---|
OpenMantra | ✅ | ✅ | CC BY-NC 4.0 | ||
WMT < 2023 | ✅ | ✅ | for research | ||
ParaMed | ✅ | ✅ | cc-by-4.0 | ||
ted_talks_iwslt | ✅ | ✅ | cc-by-nc-nd-4.0 | ||
JESC | ✅ | ✅ | CC BY-SA 4.0 | ||
MTNT | ✅ | Custom/ Reddit API | |||
WCC-JC | ✅ | ✅ | for research | ||
ASPEC | ✅ | custom, for research | |||
All other ja-en/zh-en OPUS data | ✅ | mix of open licenses: check https://opus.nlpl.eu/ | |||
Wikidata | ✅ | CC0 | |||
Tensei Shitara Slime Datta Ken Wiki | ☑️ in task | CC BY-SA | |||
WMT 2023 | ✅ | for research | |||
Tensei Shitara Slime Datta Ken Web Novel & web translations | ☑️ in task | ✅ | Not included translation memory |
Elasticsearch
Note: you can access a read-only preview T-Ragx Elasticsearch service at http://t-ragx-fossil.rayliu.ca:80
(But you will need a personal Elasticsearch service to add your in-task memories)
Install using Docker
See the T-Rex-Fossil repo
Install Locally
Note: this project was built with Elasticsearch 7
- Download the Elasticsearch binary
- Unzip
- Enter into the unzipped folder
- Install the plugins
bin/elasticsearch-plugin install repository-s3
bin/elasticsearch-plugin install analysis-icu
bin/elasticsearch-plugin install analysis-kuromoji
bin/elasticsearch-plugin install analysis-smartcn
- Add the S3 keys
(The snapshot is stored on IDrive e2, which is apparently not S3-compatible enough for a read-only Elasticsearch S3 repository to work.)
These read-only keys let you connect to the snapshot:
bin/elasticsearch-keystore add s3.client.default.access_key
CG4KwcrNPefWdJcsBIUp
bin/elasticsearch-keystore add s3.client.default.secret_key
Cau5uITwZ7Ke9YHKvWE9cXuTy5chdapBLhqVaI3C
- Add the snapshot
curl -X PUT "http://localhost:9200/_snapshot/public_t_ragx_translation_memory" -H "Content-Type: application/json" -d "{\"type\":\"s3\",\"settings\":{\"bucket\":\"t-ragx-public\",\"base_path\":\"elastic\",\"endpoint\":\"o3t0.or.idrivee2-37.com\"}}"
Note: this is the JSON body:
{
"type": "s3",
"settings": {
"bucket": "t-ragx-public",
"base_path": "elastic",
"endpoint": "o3t0.or.idrivee2-37.com"
}
}
- Restore the Snapshot
If you use a GUI client (e.g. elasticvue), you can likely do this through its interface.
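If you prefer a script over curl or a GUI client, the snapshot steps above can be sketched with Python's standard library. This assumes Elasticsearch is listening on `http://localhost:9200` and that the plugins and keystore entries are already in place. The helper names (`make_request`, `send`, `restore_request`, `main`) are hypothetical. The snapshot name is not given here, so the sketch lists the repository's snapshots first instead of guessing one.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"           # assumption: default local Elasticsearch address
REPO = "public_t_ragx_translation_memory"  # repository name used in the curl command


def make_request(method, path, body=None):
    """Build (but do not send) a JSON request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Step 1: register the S3 snapshot repository (same JSON body as the curl command)
register = make_request("PUT", f"/_snapshot/{REPO}", {
    "type": "s3",
    "settings": {
        "bucket": "t-ragx-public",
        "base_path": "elastic",
        "endpoint": "o3t0.or.idrivee2-37.com",
    },
})

# Step 2: list the snapshots stored in the repository
list_snapshots = make_request("GET", f"/_snapshot/{REPO}/_all")


def restore_request(snapshot_name):
    """Step 3: build the restore request for one snapshot found in step 2."""
    return make_request("POST", f"/_snapshot/{REPO}/{snapshot_name}/_restore")


def main():
    print(send(register))
    names = [s["snapshot"] for s in send(list_snapshots)["snapshots"]]
    print("Available snapshots:", names)
    # Pick a name from the list, then run:
    # print(send(restore_request(names[0])))
```

Call `main()` once Elasticsearch is running. The restore call is left commented out so that nothing is restored until you have checked the snapshot list.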