
Enhancing translation with RAG-powered large language models


🦖 T-Ragx



🚧 T-Ragx Demo [colab demo]🚧

🚧 T-Ragx Colab Tool [colab tool]🚧

TL;DR

Overview

  • Democratize high-quality machine translations
  • Open-sourced system-level translation framework
  • Fluent, natural translations using LLMs
  • Private and secure local translations
  • Zero-shot in-task translations

Methods

  • QLoRA fine-tuned models
  • General and in-task translation memory and glossary
  • Preceding text included as additional context for document-level translation
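
To make the ingredients listed above concrete, here is a minimal sketch of how retrieved translation memory, glossary entries, and preceding text could be combined into a single prompt. The function name `build_prompt`, the prompt layout, and the sample sentences are all hypothetical and chosen for illustration; the prompt template T-Ragx actually uses may differ.

```python
from typing import List, Tuple


def build_prompt(
    source_text: str,
    memory: List[Tuple[str, str]],
    glossary: List[Tuple[str, str]],
    pre_text: str = "",
    source_lang: str = "ja",
    target_lang: str = "en",
) -> str:
    """Assemble a translation prompt (hypothetical layout, not T-Ragx's template)."""
    parts = [f"Translate from {source_lang} to {target_lang}."]
    if glossary:
        # Glossary entries pin down names and terms.
        parts.append("Glossary:")
        parts.extend(f"- {src} -> {tgt}" for src, tgt in glossary)
    if memory:
        # Translation memory supplies similar, previously translated sentences.
        parts.append("Translation memory:")
        parts.extend(f"- {src} => {tgt}" for src, tgt in memory)
    if pre_text:
        # Preceding text gives document-level context.
        parts.append(f"Preceding text: {pre_text}")
    parts.append(f"Source: {source_text}")
    return "\n".join(parts)


prompt = build_prompt(
    "リムルは笑った。",
    memory=[("リムルは言った。", "Rimuru said.")],
    glossary=[("リムル", "Rimuru")],
    pre_text="ヴェルドラが現れた。",
)
print(prompt)
```

Each optional section is skipped when it is empty, so the same function covers the plain zero-shot case.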

Results

  • In-task translation memory and glossary achieved a significant (~45%) increase in aggregated translation scores on the QLoRA Mistral 7B model
  • High recall of valid translation memory and glossary entries (such as previous translations and character names)
  • Outperforms native TowerInstruct on all 6 fine-tuned language directions (Ja x Zh x En)
  • Outperforms DeepL in translating a Japanese web novel (That Time I Got Reincarnated as a Slime) into Chinese with in-task memories
    • Japanese -> Chinese
      • +29% by sacrebleu
      • +0.4% by comet22
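
Assuming the percentages above are relative changes over a baseline score (the source does not say so explicitly), the arithmetic can be sketched as follows. The helper name `relative_change_percent` is hypothetical, and the numbers in the example are placeholders, not scores measured by this project.

```python
def relative_change_percent(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Placeholder scores for illustration only.
print(relative_change_percent(40.0, 50.0))
```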

Getting Started

Install

Environment

Conda / Mamba (Recommended)

pip

Use your favourite virtual environment, and run:

pip install -r requirment.txt

Examples

Initiate the input processor:

import t_ragx

# Initiate the input processor which will retrieve the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()

# Load/ point to the demo resources
input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host="t-ragx-fossil.rayliu.ca", elasticsearch_port=80)
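
For intuition about what a glossary lookup does, here is a toy in-memory stand-in. It is not how `ElasticInputProcessor` works internally; the function `find_glossary_matches` and the sample glossary are hypothetical, and the real processor loads its data from the resources shown above.

```python
from typing import Dict, List, Tuple


def find_glossary_matches(text: str, glossary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source_term, target_term) pairs whose source term occurs in text."""
    # Check longer terms first so more specific entries are listed first.
    terms = sorted(glossary, key=len, reverse=True)
    return [(term, glossary[term]) for term in terms if term in text]


# Toy glossary for illustration only.
toy_glossary = {"リムル": "Rimuru", "ヴェルドラ": "Veldora", "スライム": "slime"}
matches = find_glossary_matches("リムルはスライムだ。", toy_glossary)
print(matches)
```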

Using the llama-cpp-python backend:

import t_ragx

# T-Ragx currently supports:
# Huggingface transformers: MistralModel, InternLM2Model
# Ollama API: OllamaModel
# OpenAI API: OpenAIModel
# Llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
    repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
    filename="*Q4_K_M*",
    # see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
    # for other files
    chat_format="mistral-instruct",
    model_config={'n_ctx':2048}, # increase the context window
)

t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)

Translate!

t_ragx_translator.batch_translate(
    source_text_list,  # the input text list to translate
    pre_text_list=pre_text_list,  # optional: preceding context for document-level translation
    # Can generate via:
    # pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
    source_lang_code='ja',
    target_lang_code='en',
    memory_search_args={'top_k': 3}  # optional, pass additional arguments to input_processor.search_memory
)
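
The `pre_text_list` argument pairs each source sentence with the text that comes before it. The sketch below shows one plausible way to build such a list with the standard library. It is not the implementation of `t_ragx.utils.helper.get_preceding_text`; the function name `preceding_text`, the separator, and the exact return shape are assumptions.

```python
from typing import List


def preceding_text(sentences: List[str], max_sent: int = 3, sep: str = " ") -> List[str]:
    """For each sentence, join up to `max_sent` sentences that come before it."""
    result = []
    for i in range(len(sentences)):
        start = max(0, i - max_sent)
        # The slice excludes sentence i itself, so the first entry is empty.
        result.append(sep.join(sentences[start:i]))
    return result


doc = ["A.", "B.", "C.", "D.", "E."]
print(preceding_text(doc, max_sent=3))
```

The first sentence has no context, and later sentences see a sliding window of at most `max_sent` earlier sentences.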

Data Sources

Dataset Translation Memory Glossary Training Testing License
OpenMantra CC BY-NC 4.0
WMT < 2023 for research
ParaMed cc-by-4.0
ted_talks_iwslt cc-by-nc-nd-4.0
JESC CC BY-SA 4.0
MTNT Custom/ Reddit API
WCC-JC for research
ASPEC custom, for research
All other ja-en/zh-en OPUS data mix of open licenses: check https://opus.nlpl.eu/
Wikidata CC0
Tensei Shitara Slime Datta Ken Wiki ☑️ in task CC BY-SA
WMT 2023 for research
Tensei Shitara Slime Datta Ken Web Novel & web translations ☑️ in task Not included translation memory

Elasticsearch

Note: a read-only preview of the T-Ragx Elasticsearch service is available at http://t-ragx-fossil.rayliu.ca:80, but you will need your own Elasticsearch service to add in-task memories.

Install using Docker

See the T-Ragx-Fossil repo

Install Locally

Note: this project was built with Elasticsearch 7

  1. Download the Elasticsearch binary
  2. Unzip
  3. Enter into the unzipped folder
  4. Install the plugins
bin/elasticsearch-plugin install repository-s3
bin/elasticsearch-plugin install analysis-icu
bin/elasticsearch-plugin install analysis-kuromoji
bin/elasticsearch-plugin install analysis-smartcn
  5. Add the S3 keys

    The snapshot is stored on IDrive e2, which is apparently not S3-compatible enough for a read-only Elasticsearch S3 repository to work.

    This read-only key will let you connect to the snapshot:

bin/elasticsearch-keystore add s3.client.default.access_key
CG4KwcrNPefWdJcsBIUp

bin/elasticsearch-keystore add s3.client.default.secret_key
Cau5uITwZ7Ke9YHKvWE9cXuTy5chdapBLhqVaI3C
  6. Add the snapshot
curl -X PUT "http://localhost:9200/_snapshot/public_t_ragx_translation_memory" -H "Content-Type: application/json" -d "{\"type\":\"s3\",\"settings\":{\"bucket\":\"t-ragx-public\",\"base_path\":\"elastic\",\"endpoint\":\"o3t0.or.idrivee2-37.com\"}}"

Note: this is the JSON body:

{
  "type": "s3",
  "settings": {
    "bucket": "t-ragx-public",
    "base_path": "elastic",
    "endpoint": "o3t0.or.idrivee2-37.com"
  }
}
  7. Restore the snapshot

    If you use a GUI client (e.g. elasticvue), you can likely do this through its interface
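
The snapshot registration and restore steps above can also be scripted. The sketch below uses only the Python standard library and the standard Elasticsearch snapshot endpoints (`PUT /_snapshot/<repo>`, `GET /_snapshot/<repo>/_all`, `POST /_snapshot/<repo>/<snapshot>/_restore`). The helper names are hypothetical, and the snapshot name is left as a parameter because the source does not state it.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # local node, as in the curl command above
REPO = "public_t_ragx_translation_memory"

REPO_BODY = {
    "type": "s3",
    "settings": {
        "bucket": "t-ragx-public",
        "base_path": "elastic",
        "endpoint": "o3t0.or.idrivee2-37.com",
    },
}


def register_repo_request() -> urllib.request.Request:
    """Build the PUT request that registers the snapshot repository."""
    return urllib.request.Request(
        f"{ES_URL}/_snapshot/{REPO}",
        data=json.dumps(REPO_BODY).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def list_snapshots_request() -> urllib.request.Request:
    """Build the GET request that lists snapshots in the repository."""
    return urllib.request.Request(f"{ES_URL}/_snapshot/{REPO}/_all", method="GET")


def restore_request(snapshot_name: str) -> urllib.request.Request:
    """Build the POST request that restores one snapshot by name."""
    name = urllib.parse.quote(snapshot_name, safe="")
    return urllib.request.Request(
        f"{ES_URL}/_snapshot/{REPO}/{name}/_restore", method="POST"
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request to a running Elasticsearch node and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Only build and show the requests here; nothing is sent.
for req in (register_repo_request(), list_snapshots_request()):
    print(req.get_method(), req.full_url)
```

With a node running, call `send(register_repo_request())`, then `send(list_snapshots_request())` to find the snapshot name, and finally `send(restore_request(name))` with that name.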


