Enhancing translation with RAG-powered large language models

🦖 T-Ragx


T-Ragx Demo: Open In Colab

TL;DR

Overview

  • Open-source system-level translation framework
  • Provides fluent and natural translations utilizing LLMs
  • Ensures privacy and security with local translation processes
  • Capable of zero-shot in-task translations

Methods

  • Utilizes QLoRA fine-tuned models for enhanced accuracy
  • Employs both general and in-task specific translation memories and glossaries
  • Incorporates preceding text in document-level translations for improved context understanding
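
As a rough illustration of how the pieces listed above could come together, the sketch below assembles one translation prompt from retrieved memories, glossary entries, and preceding text. The `RetrievedContext` class, the `build_prompt` function, and the prompt layout are all hypothetical and are not T-Ragx's actual template or API.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedContext:
    """Hypothetical container for everything retrieved for one source sentence."""
    memories: list[tuple[str, str]] = field(default_factory=list)  # (source, translation) pairs
    glossary: dict[str, str] = field(default_factory=dict)         # source term -> target term
    preceding: list[str] = field(default_factory=list)             # earlier sentences in the document


def build_prompt(source: str, ctx: RetrievedContext, src_lang: str, tgt_lang: str) -> str:
    """Assemble a translation prompt; the layout is illustrative only."""
    parts = []
    if ctx.preceding:
        parts.append("Preceding text:\n" + "\n".join(ctx.preceding))
    if ctx.memories:
        lines = [f"{s} => {t}" for s, t in ctx.memories]
        parts.append("Similar past translations:\n" + "\n".join(lines))
    # Keep only glossary entries whose source term actually occurs in the sentence
    hits = {k: v for k, v in ctx.glossary.items() if k in source}
    if hits:
        parts.append("Glossary:\n" + "\n".join(f"{k} = {v}" for k, v in hits.items()))
    parts.append(f"Translate from {src_lang} to {tgt_lang}:\n{source}")
    return "\n\n".join(parts)


ctx = RetrievedContext(
    memories=[("リムルは笑った。", "Rimuru laughed.")],
    glossary={"リムル": "Rimuru", "ヴェルドラ": "Veldora"},
    preceding=["森は静かだった。"],
)
prompt = build_prompt("リムルは歩き出した。", ctx, "ja", "en")
print(prompt)
```

Filtering the glossary down to terms that occur in the source sentence keeps the prompt short, which matters when the model's context window is small.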

Results

  • Combining QLoRA with in-task translation memory and glossary resulted in a ~45% increase in aggregated WMT23 translation scores, benchmarked against the Mistral 7b Instruct model
  • Demonstrated high recall for valid translation memories and glossaries, including previous translations and character names
  • Surpassed the performance of the native TowerInstruct model in three (Ja<->En, Zh->En) of the four WMT23 language directions tested
  • Outperformed DeepL in translating the Japanese web novel "That Time I Got Reincarnated as a Slime" into Chinese using in-task RAG
    • Japanese to Chinese translation improvements:
      • +29% sacrebleu
      • +0.4% comet22

👉See the write-up for more details📜

Getting Started

Install

Simply run:

pip install t-ragx

or if you are feeling lucky:

pip install git+https://github.com/rayliuca/T-Ragx.git

Elasticsearch

See the instructions on the wiki page.

Note: read-only preview T-Ragx Elasticsearch services are available at https://t-ragx-fossil.rayliu.ca and https://t-ragx-fossil2.rayliu.ca. To add your own in-task memories, you will need a personal Elasticsearch service.

Environment

(Recommended) Conda / Mamba

Download the conda environment.yml file and run:

conda env create -f environment.yml

## or with mamba
# mamba env create -f environment.yml

This creates a t_ragx environment that is compatible with this project.

pip

Download the requirment.txt file, activate your favourite virtual environment, and run:

pip install -r requirment.txt

Examples

Initialize the input processor:

import t_ragx

# Initialize the input processor, which retrieves the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()

# Load / point to the demo resources
input_processor.load_general_glossary("https://t-ragx-public.s3.us-west-004.backblazeb2.com/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host=["https://t-ragx-fossil.rayliu.ca", "https://t-ragx-fossil2.rayliu.ca"])
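
To give a feel for what a translation memory lookup returns, here is a minimal, self-contained sketch of top-k fuzzy retrieval. It is not how T-Ragx searches: the project uses Elasticsearch, whereas this stand-in scores entries with `difflib` from the Python standard library. The `search_memory` function here, its `min_score` parameter, and the sample sentences are assumptions made for illustration.

```python
import difflib


def search_memory(query: str, memory: list[tuple[str, str]], top_k: int = 3, min_score: float = 0.3):
    """Return up to top_k (score, source, translation) entries most similar to query.

    Hypothetical stand-in: difflib character similarity replaces Elasticsearch scoring.
    """
    scored = []
    for src, tgt in memory:
        score = difflib.SequenceMatcher(None, query, src).ratio()
        if score >= min_score:
            scored.append((score, src, tgt))
    # Highest similarity first, then cut to the requested number of results
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


memory = [
    ("The cat sat on the mat.", "Le chat était assis sur le tapis."),
    ("The dog sat on the mat.", "Le chien était assis sur le tapis."),
    ("Stock prices fell sharply.", "Les cours des actions ont fortement chuté."),
]
results = search_memory("The cat sat on the rug.", memory, top_k=2)
for score, src, tgt in results:
    print(f"{score:.2f}  {src} -> {tgt}")
```

The retrieved pairs are the kind of material that gets passed to the model as examples, so a small `top_k` keeps the prompt focused on the closest matches.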

Using the llama-cpp-python backend:

import t_ragx

# T-Ragx currently supports the following backends:
# Huggingface transformers: MistralModel, InternLM2Model
# Ollama API: OllamaModel
# OpenAI API: OpenAIModel
# Llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
    repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
    filename="*Q4_K_M*",
    # see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
    # for other files
    chat_format="mistral-instruct",
    model_config={'n_ctx':2048}, # increase the context window
)

t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)

Translate!

t_ragx_translator.batch_translate(
    source_text_list,  # the input text list to translate
    pre_text_list=pre_text_list,  # optional: the preceding context of each input, used for document-level translation
    # It can be generated via:
    # pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
    source_lang_code='ja',
    target_lang_code='en',
    memory_search_args={'top_k': 3}  # optional, pass additional arguments to input_processor.search_memory
)
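
The preceding-text idea used for document-level translation can be sketched as a sliding window over the document: each sentence is paired with the few sentences that come right before it. The `preceding_text` function below is a hypothetical re-implementation for illustration; the real `t_ragx.utils.helper.get_preceding_text` helper may return a different shape (for example, joined strings).

```python
def preceding_text(sentences: list[str], max_sent: int = 3) -> list[list[str]]:
    """For each sentence, collect up to max_sent sentences that come right before it."""
    return [sentences[max(0, i - max_sent):i] for i in range(len(sentences))]


doc = ["S1.", "S2.", "S3.", "S4.", "S5."]
windows = preceding_text(doc, max_sent=2)
for sentence, context in zip(doc, windows):
    print(sentence, "<-", context)
```

The first sentence has no context, and later sentences never see more than `max_sent` predecessors, which bounds how much the prompt grows.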

Models

Note: you can use any LLM through the API models (i.e. OllamaModel or OpenAIModel) or by extending the t_ragx.models.BaseModel class

The following models were fine-tuned using the T-Ragx prompts, so they might work a bit better with T-Ragx than some of the off-the-shelf models

QLoRA Models:

| Source Model | Model Type | Quantization | Fine-tuned Model |
|---|---|---|---|
| mistralai/Mistral-7B-Instruct-v0.2 | LoRA | | rayliuca/TRagx-Mistral-7B-Instruct-v0.2 |
| mistralai/Mistral-7B-Instruct-v0.2 | merged AWQ | AWQ | rayliuca/TRagx-AWQ-Mistral-7B-Instruct-v0.2 |
| mistralai/Mistral-7B-Instruct-v0.2 | merged GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2 |
| mlabonne/NeuralOmniBeagle-7B | LoRA | | rayliuca/TRagx-NeuralOmniBeagle-7B |
| mlabonne/NeuralOmniBeagle-7B | merged AWQ | AWQ | rayliuca/TRagx-AWQ-NeuralOmniBeagle-7B |
| mlabonne/NeuralOmniBeagle-7B | merged GGUF | Q3_K, Q4_K_M, Q5_K_M, Q5_K_S, Q6_K, F32 | rayliuca/TRagx-GGUF-NeuralOmniBeagle-7B |
| internlm/internlm2-7b | LoRA | | rayliuca/TRagx-internlm2-7b |
| internlm/internlm2-7b | merged GPTQ | GPTQ | rayliuca/TRagx-GPTQ-internlm2-7b |
| Unbabel/TowerInstruct-7B-v0.2 | LoRA | | rayliuca/TRagx-TowerInstruct-7B-v0.2 |

Data Sources

All of the datasets used in the project (as translation memory, glossary, training, or testing data):

| Dataset | Usage | License |
|---|---|---|
| OpenMantra | | CC BY-NC 4.0 |
| WMT < 2023 | | for research |
| ParaMed | | cc-by-4.0 |
| ted_talks_iwslt | | cc-by-nc-nd-4.0 |
| JESC | | CC BY-SA 4.0 |
| MTNT | | Custom/ Reddit API |
| WCC-JC | | for research |
| ASPEC | | custom, for research |
| All other ja-en/zh-en OPUS data | | mix of open licenses: check https://opus.nlpl.eu/ |
| Wikidata | | CC0 |
| Tensei Shitara Slime Datta Ken Wiki | ☑️ in task | CC BY-SA |
| WMT 2023 | | for research |
| Tensei Shitara Slime Datta Ken Web Novel & web translations | ☑️ in task | Not used for training or redistribution |
