🦖 T-Ragx
Enhancing Translation with RAG-Powered Large Language Models
🚧 T-Ragx Colab Tool 🚧
TL;DR
Overview
- Democratize high-quality machine translations
- Open-sourced system-level translation framework
- Fluent/ natural translations using LLMs
- Private and secure local translations
- Zero-shot in-task translations
Methods
- QLoRA fine-tuned models
- General + in-task translation memory/ glossary
- Include preceding text as additional context for document-level translations
Results
- In-task translation memory and glossary achieved a significant (~45%) increase in aggregated translation scores on the QLoRA Mistral 7b model
- Great recall for valid translation memory/ glossary (i.e. previous translations/ character names)
- Outperforms the native TowerInstruct model on all 6 fine-tuned language directions (Ja x Zh x En)
- Outperforms DeepL in translating a Japanese web novel (That Time I Got Reincarnated as a Slime) to Chinese with in-task memories
  - Japanese -> Chinese
    - +29% by sacrebleu
    - +0.4% by comet22
Getting Started
Install
Simply run:
pip install t-ragx
or if you are feeling lucky:
pip install git+https://github.com/rayliuca/T-Ragx.git
Environment
Conda / Mamba (Recommended)
Download the conda environment.yml file and run:
mamba env create -f environment.yml
which will create a t_ragx environment that's compatible with this project.
pip
Use your favourite virtual environment, and run:
pip install -r requirements.txt
Examples
Initiate the input processor:
import t_ragx
# Initiate the input processor which will retrieve the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()
# Load/ point to the demo resources
input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host="t-ragx-fossil.rayliu.ca", elasticsearch_port=80)
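Once the resources are loaded, the processor can also be queried on its own. Below is a minimal sketch of a direct translation-memory lookup; only the top_k argument is confirmed by the batch_translate example further down, so the query text and language parameter names here are assumptions:
# Hypothetical direct lookup: parameter names other than top_k are assumed
memory_results = input_processor.search_memory(
    "転生したらスライムだった件",  # query text
    source_lang_code='ja',
    target_lang_code='en',
    top_k=3,
)
print(memory_results)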
Using the llama-cpp-python backend:
import t_ragx
# T-Ragx currently supports:
# Huggingface transformers: MistralModel, InternLM2Model
# Ollama API: OllamaModel
# OpenAI API: OpenAIModel
# Llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
filename="*Q4_K_M*",
# see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
# for other files
chat_format="mistral-instruct",
model_config={'n_ctx':2048}, # increase the context window
)
t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)
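The other backends listed in the comment above follow the same pattern, and since TRagx accepts a list of models, they can be swapped in or combined. For illustration, a hypothetical OllamaModel setup; the constructor arguments shown are assumptions, not the confirmed API:
# Hypothetical sketch: check the t_ragx docs for the actual OllamaModel signature
ollama_model = t_ragx.models.OllamaModel(
    model="mistral",  # a model tag served by your local Ollama instance
    host="http://localhost:11434",  # the default Ollama API endpoint
)
t_ragx_translator = t_ragx.TRagx([ollama_model], input_processor=input_processor)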
Translate!
t_ragx_translator.batch_translate(
source_text_list, # the input text list to translate
pre_text_list=pre_text_list, # optional: include the preceding context for document-level translation
# Can generate via:
# pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
source_lang_code='ja',
target_lang_code='en',
memory_search_args={'top_k': 3} # optional, pass additional arguments to input_processor.search_memory
)
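As referenced in the comment above, the pre_text_list argument can be generated from the source list itself with the bundled helper:
# Use up to the 3 preceding source sentences as document-level context
pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)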
Data Sources
Dataset | Translation Memory | Glossary | Training | Testing | License
---|---|---|---|---|---
OpenMantra | ✅ | | | ✅ | CC BY-NC 4.0
WMT < 2023 | ✅ | | ✅ | | for research
ParaMed | ✅ | | ✅ | | cc-by-4.0
ted_talks_iwslt | ✅ | | ✅ | | cc-by-nc-nd-4.0
JESC | ✅ | | ✅ | | CC BY-SA 4.0
MTNT | ✅ | | | | Custom/ Reddit API
WCC-JC | ✅ | | ✅ | | for research
ASPEC | ✅ | | | | custom, for research
All other ja-en/zh-en OPUS data | ✅ | | | | mix of open licenses: check https://opus.nlpl.eu/
Wikidata | | ✅ | | | CC0
Tensei Shitara Slime Datta Ken Wiki | | ☑️ in task | | | CC BY-SA
WMT 2023 | | | | ✅ | for research
Tensei Shitara Slime Datta Ken Web Novel & web translations | ☑️ in task | | | ✅ | Not included in translation memory
Elasticsearch
Note: you can access a read-only preview of the T-Ragx Elasticsearch service at http://t-ragx-fossil.rayliu.ca:80
(But you will need your own Elasticsearch service to add in-task memories)
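Once your own service is up (see the options below), point the input processor at it with the same call as in the examples above; the host and port here assume a default local install:
import t_ragx

input_processor = t_ragx.Processors.ElasticInputProcessor()
# Point at your own Elasticsearch service instead of the public preview
input_processor.load_general_translation(elasticsearch_host="localhost", elasticsearch_port=9200)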
Install using Docker
See the T-Ragx-Fossil repo
Install Locally
Note: this project was built with Elasticsearch 7
- Download the Elasticsearch binary
- Unzip
- Enter the unzipped folder
- Install the plugins
bin/elasticsearch-plugin install repository-s3
bin/elasticsearch-plugin install analysis-icu
bin/elasticsearch-plugin install analysis-kuromoji
bin/elasticsearch-plugin install analysis-smartcn
- Add the S3 keys

(The snapshot is stored on IDrive e2, which is not S3-compatible enough for Elasticsearch's read-only S3 repository access to work)
These read-only keys will let you connect to the snapshot:
bin/elasticsearch-keystore add s3.client.default.access_key
CG4KwcrNPefWdJcsBIUp
bin/elasticsearch-keystore add s3.client.default.secret_key
Cau5uITwZ7Ke9YHKvWE9cXuTy5chdapBLhqVaI3C
- Add the snapshot repository
curl -X PUT "http://localhost:9200/_snapshot/public_t_ragx_translation_memory" -H "Content-Type: application/json" -d "{\"type\":\"s3\",\"settings\":{\"bucket\":\"t-ragx-public\",\"base_path\":\"elastic\",\"endpoint\":\"o3t0.or.idrivee2-37.com\"}}"
Note: this is the JSON body:
{
"type": "s3",
"settings": {
"bucket": "t-ragx-public",
"base_path": "elastic",
"endpoint": "o3t0.or.idrivee2-37.com"
}
}
- Restore the snapshot

If you use a GUI client, e.g. elasticvue, you can likely do this through its interface.
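Without a GUI, the standard Elasticsearch snapshot REST API works as well. A sketch using the stock endpoints; the snapshot name is not documented here, so take it from the listing:
# List the snapshots available in the registered repository
curl -X GET "http://localhost:9200/_snapshot/public_t_ragx_translation_memory/_all"

# Restore one of them (replace <snapshot_name> with a name from the listing)
curl -X POST "http://localhost:9200/_snapshot/public_t_ragx_translation_memory/<snapshot_name>/_restore"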