🦖 T-Ragx
Enhancing Translation with RAG-Powered Large Language Models
TL;DR
Overview
- Open-source system-level translation framework
- Provides fluent and natural translations utilizing LLMs
- Ensures privacy and security with local translation processes
- Capable of zero-shot in-task translations
Methods
- Utilizes QLoRA fine-tuned models for enhanced accuracy
- Employs both general and in-task specific translation memories and glossaries
- Incorporates preceding text in document-level translations for improved context understanding
Results
- Combining QLoRA with in-task translation memory and glossary resulted in a ~45% increase in aggregated WMT23 translation scores, benchmarked against the base Mistral 7B Instruct model
- Demonstrated high recall for valid translation memories and glossaries, including previous translations and character names
- Surpassed the performance of the native TowerInstruct model in three (Ja<->En, Zh->En) of the four WMT23 language directions tested
- Outperformed DeepL in translating the Japanese web novel "That Time I Got Reincarnated as a Slime" into Chinese using in-task RAG
- Japanese to Chinese translation improvements:
- +29% sacrebleu
- +0.4% comet22
👉 See the write-up for more details 📜
Getting Started
Install
Simply run:
pip install t-ragx
or if you are feeling lucky:
pip install git+https://github.com/rayliuca/T-Ragx.git
Elasticsearch
See the wiki page instructions
Note: you can access the preview, read-only T-Ragx Elasticsearch services at https://t-ragx-fossil.rayliu.ca
and https://t-ragx-fossil2.rayliu.ca
(But you will need a personal Elasticsearch service to add your own in-task memories; see the sketch below)
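If you run your own Elasticsearch instance, you can index in-task memories into it directly. Below is a minimal sketch using the official elasticsearch Python client; the index name and document fields are illustrative assumptions, so match them to whatever your ElasticInputProcessor is configured to search:
from elasticsearch import Elasticsearch

# Connect to your own Elasticsearch instance (the preview services above are read-only)
es = Elasticsearch("http://localhost:9200")

# Index one translation pair as an in-task memory entry.
# The index name and field names are illustrative assumptions,
# not a schema mandated by T-Ragx.
es.index(
    index="my_in_task_memory",
    document={
        "source_text": "転生したらスライムだった件",
        "target_text": "That Time I Got Reincarnated as a Slime",
        "source_lang": "ja",
        "target_lang": "zh",
    },
)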
Environment
(Recommended) Conda / Mamba
Download the conda environment.yml file and run:
conda env create -f environment.yml
# or with mamba:
# mamba env create -f environment.yml
This will create a t_ragx environment that's compatible with this project.
pip
Download the requirment.txt file, activate your favourite virtual environment, and run:
pip install -r requirment.txt
Examples
Initialize the input processor:
import t_ragx
# Initialize the input processor, which will retrieve the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()
# Load/ point to the demo resources
input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host=["https://t-ragx-fossil.rayliu.ca", "https://t-ragx-fossil2.rayliu.ca"])
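You can also query the retriever directly to preview what will be fed to the model. A small sketch: search_memory and the top_k argument appear in the batch_translate example below, while the query and language argument names here are assumptions:
# Peek at the retrieved translation memory for a single sentence.
# search_memory and top_k are referenced by batch_translate below;
# the query/language argument names are illustrative assumptions.
results = input_processor.search_memory(
    "転生したらスライムだった件",  # source-language query text
    source_lang_code="ja",
    target_lang_code="en",
    top_k=3,
)
print(results)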
Using the llama-cpp-python
backend:
import t_ragx
# T-Ragx currently supports:
#   Hugging Face transformers: MistralModel, InternLM2Model
#   Ollama API: OllamaModel
#   OpenAI API: OpenAIModel
#   llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
filename="*Q4_K_M*",
# see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
# for other files
chat_format="mistral-instruct",
model_config={'n_ctx':2048}, # increase the context window
)
t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)
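Prefer to serve the model through Ollama? OllamaModel is one of the supported backends listed above, but its constructor arguments are not documented here, so the ones below are assumptions to adapt:
# Hypothetical sketch: OllamaModel is a documented backend, but these
# constructor arguments are assumptions, not its published signature.
ollama_model = t_ragx.models.OllamaModel(
    model="mistral",                # assumed: the Ollama model tag to serve
    host="http://localhost:11434",  # assumed: the default Ollama endpoint
)

# TRagx takes a list of models, so any backend can slot in here
t_ragx_translator = t_ragx.TRagx([ollama_model], input_processor=input_processor)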
Translate!
t_ragx_translator.batch_translate(
source_text_list, # the input text list to translate
pre_text_list=pre_text_list, # optional: include the preceding context for document-level translation
# Can generate via:
# pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
source_lang_code='ja',
target_lang_code='en',
memory_search_args={'top_k': 3} # optional, pass additional arguments to input_processor.search_memory
)
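Putting the pieces together, a minimal end-to-end sketch (the Japanese source sentences are only illustrative):
# A minimal end-to-end sketch; the source sentences are illustrative.
source_text_list = [
    "彼はギルドに向かった。",
    "受付嬢は彼を見て驚いた。",
]

# Build document-level context from the preceding sentences
# (this helper is referenced in the comment above)
pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)

translations = t_ragx_translator.batch_translate(
    source_text_list,
    pre_text_list=pre_text_list,
    source_lang_code='ja',
    target_lang_code='en',
)
print(translations)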
Models
Note: you can use any LLM via the API models (e.g. OllamaModel or OpenAIModel) or by extending the t_ragx.models.BaseModel class, as sketched below.
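For instance, a hypothetical custom backend might look like the following; BaseModel's abstract interface is not documented in this README, so the method name and signature here are assumptions:
import requests
import t_ragx

# Hypothetical sketch: BaseModel's abstract interface isn't documented in
# this README, so the method name and signature below are assumptions.
class MyAPIModel(t_ragx.models.BaseModel):
    def __init__(self, endpoint):
        self.endpoint = endpoint  # your own inference service

    def generate(self, prompt, **kwargs):
        # Forward the T-Ragx prompt to your backend and return plain text
        response = requests.post(self.endpoint, json={"prompt": prompt, **kwargs})
        return response.json()["text"]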
The following models were fine-tuned using the T-Ragx prompts, so they might work a bit better with T-Ragx than some off-the-shelf models.
QLoRA Models:
| Source Model | Model Type | Quantization | Fine-tuned Model |
|---|---|---|---|
| mistralai/Mistral-7B-Instruct-v0.2 | LoRA | | rayliuca/TRagx-Mistral-7B-Instruct-v0.2 |
| | merged AWQ | AWQ | rayliuca/TRagx-AWQ-Mistral-7B-Instruct-v0.2 |
| | merged GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2 |
| mlabonne/NeuralOmniBeagle-7B | LoRA | | rayliuca/TRagx-NeuralOmniBeagle-7B |
| | merged AWQ | AWQ | rayliuca/TRagx-AWQ-NeuralOmniBeagle-7B |
| | merged GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-NeuralOmniBeagle-7B |
| internlm/internlm2-7b | LoRA | | rayliuca/TRagx-internlm2-7b |
| | merged GPTQ | GPTQ | rayliuca/TRagx-GPTQ-internlm2-7b |
| Unbabel/TowerInstruct-7B-v0.2 | LoRA | | rayliuca/TRagx-TowerInstruct-7B-v0.2 |
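For example, the NeuralOmniBeagle GGUF release above can be loaded with the same documented LlamaCppPythonModel constructor from the example earlier; the chat_format value here is an assumption, so match it to the model's prompt template:
import t_ragx

# Same documented constructor as the earlier example; only the repo changes.
# chat_format is an assumption: match it to the model's prompt template.
beagle_model = t_ragx.models.LlamaCppPythonModel(
    repo_id="rayliuca/TRagx-GGUF-NeuralOmniBeagle-7B",
    filename="*Q4_K_M*",  # one of the quantizations listed in the table
    chat_format="mistral-instruct",
    model_config={'n_ctx': 2048},
)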
Data Sources
All of the datasets used in the project:
| Dataset | Translation Memory | Glossary | Training | Testing | License |
|---|---|---|---|---|---|
| OpenMantra | ✅ | | ✅ | | CC BY-NC 4.0 |
| WMT < 2023 | ✅ | | ✅ | | for research |
| ParaMed | ✅ | | ✅ | | cc-by-4.0 |
| ted_talks_iwslt | ✅ | | ✅ | | cc-by-nc-nd-4.0 |
| JESC | ✅ | | ✅ | | CC BY-SA 4.0 |
| MTNT | ✅ | | | | Custom / Reddit API |
| WCC-JC | ✅ | | ✅ | | for research |
| ASPEC | ✅ | | | | custom, for research |
| All other ja-en/zh-en OPUS data | ✅ | | | | mix of open licenses: check https://opus.nlpl.eu/ |
| Wikidata | | ✅ | | | CC0 |
| Tensei Shitara Slime Datta Ken Wiki | | ☑️ in task | | | CC BY-SA |
| WMT 2023 | | | | ✅ | for research |
| Tensei Shitara Slime Datta Ken Web Novel & web translations | ☑️ in task | | | ✅ | Not included in the public translation memory |
File details
Details for the file t_ragx-0.0.10.tar.gz.
File metadata
- Download URL: t_ragx-0.0.10.tar.gz
- Upload date:
- Size: 23.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 790828f76ad6fe2c1046a036ea075843ff34aaac9aa311c2b140b305a2c3f6aa |
| MD5 | eea35348414503af99486d4b0734a680 |
| BLAKE2b-256 | 416e2b1248602e0fc3ff9c873abead803456713c9787f9c9865f1dc53529c53c |
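To check a downloaded artifact against the digests above, a small sketch using Python's standard hashlib:
import hashlib

# Verify the downloaded sdist against the SHA256 digest listed above
with open("t_ragx-0.0.10.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest == "790828f76ad6fe2c1046a036ea075843ff34aaac9aa311c2b140b305a2c3f6aa")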
File details
Details for the file t_ragx-0.0.10-py3-none-any.whl.
File metadata
- Download URL: t_ragx-0.0.10-py3-none-any.whl
- Upload date:
- Size: 28.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bded34c7521baec9c11268f327031c8aa5c608e2b83af9b7102d8684d159aa02 |
| MD5 | d05560265921af44fc323b186cda61ed |
| BLAKE2b-256 | 2edeac3d96e85f99a3e8fba089c412126c99eda3d174e3923e5944282734b1cf |