🦖 T-Ragx
Enhancing Translation with RAG-Powered Large Language Models
🚧 T-Ragx Colab Tool 🚧
TL;DR
Overview
- Democratize high-quality machine translations
- Open-sourced system-level translation framework
- Fluent/ natural translations using LLMs
- Private and secure local translations
- Zero-shot in-task translations
Methods
- QLoRA fine-tuned models
- General + in-task translation memory/ glossary
- Include preceding text as additional context for document-level translations
Results
- In-task translation memory and glossary achieved a significant (~45%) increase in aggregated translation scores on the QLoRA Mistral 7b model
- Great recall for valid translation memory/ glossary (i.e. previous translations/ character names)
- Outperforms the native TowerInstruct model on all 6 fine-tuned language directions (Ja x Zh x En)
- Outperforms DeepL in translating a Japanese web novel (That Time I Got Reincarnated as a Slime) to Chinese with in-task memories
  - Japanese -> Chinese
    - +29% by sacrebleu
    - +0.4% by comet22
Getting Started
Install
Simply run:
pip install t-ragx
or if you are feeling lucky:
pip install git+https://github.com/rayliuca/T-Ragx.git
Environment
Conda / Mamba (Recommended)
Download the conda environment.yml file and run:
mamba env create -f environment.yml
which will create a t_ragx environment that's compatible with this project.
pip
Use your favourite virtual environment, and run:
pip install -r requirements.txt
Examples
Initiate the input processor:
import t_ragx
# Initiate the input processor which will retrieve the memory and glossary results for us
input_processor = t_ragx.Processors.ElasticInputProcessor()
# Load/ point to the demo resources
input_processor.load_general_glossary("https://l8u0.c18.e2-1.dev/t-ragx-public/glossary")
input_processor.load_general_translation(elasticsearch_host="t-ragx-fossil.rayliu.ca", elasticsearch_port=80)
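Once the resources are loaded, the processor can also be queried on its own. Below is a minimal sketch of a direct translation-memory lookup; only the top_k argument is confirmed by the batch_translate example further down, so the query text and language parameter names here are assumptions:
# Hypothetical direct lookup: parameter names other than top_k are assumed
memory_results = input_processor.search_memory(
    "転生したらスライムだった件",  # query text
    source_lang_code='ja',
    target_lang_code='en',
    top_k=3,
)
print(memory_results)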
Using the llama-cpp-python backend:
import t_ragx
# T-Ragx currently supports:
# Huggingface transformers: MistralModel, InternLM2Model
# Ollama API: OllamaModel
# OpenAI API: OpenAIModel
# Llama-cpp-python backend: LlamaCppPythonModel
mistral_model = t_ragx.models.LlamaCppPythonModel(
repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
filename="*Q4_K_M*",
# see https://huggingface.co/rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2
# for other files
chat_format="mistral-instruct",
model_config={'n_ctx':2048}, # increase the context window
)
t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)
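The other backends listed in the comment above follow the same pattern, and since TRagx accepts a list of models, they can be swapped in or combined. For illustration, a hypothetical OllamaModel setup; the constructor arguments shown are assumptions, not the confirmed API:
# Hypothetical sketch: check the t_ragx docs for the actual OllamaModel signature
ollama_model = t_ragx.models.OllamaModel(
    model="mistral",  # a model tag served by your local Ollama instance
    host="http://localhost:11434",  # the default Ollama API endpoint
)
t_ragx_translator = t_ragx.TRagx([ollama_model], input_processor=input_processor)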
Translate!
t_ragx_translator.batch_translate(
source_text_list, # the input text list to translate
pre_text_list=pre_text_list, # optional: include the preceding context for document-level translation
# Can generate via:
# pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)
source_lang_code='ja',
target_lang_code='en',
memory_search_args={'top_k': 3} # optional, pass additional arguments to input_processor.search_memory
)
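As referenced in the comment above, the pre_text_list argument can be generated from the source list itself with the bundled helper:
# Use up to the 3 preceding source sentences as document-level context
pre_text_list = t_ragx.utils.helper.get_preceding_text(source_text_list, max_sent=3)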
Data Sources
Dataset | Translation Memory | Glossary | Training | Testing | License
---|---|---|---|---|---
OpenMantra | ✅ | | | ✅ | CC BY-NC 4.0
WMT < 2023 | ✅ | | ✅ | | for research
ParaMed | ✅ | | ✅ | | cc-by-4.0
ted_talks_iwslt | ✅ | | ✅ | | cc-by-nc-nd-4.0
JESC | ✅ | | ✅ | | CC BY-SA 4.0
MTNT | ✅ | | | | Custom/ Reddit API
WCC-JC | ✅ | | ✅ | | for research
ASPEC | ✅ | | | | custom, for research
All other ja-en/zh-en OPUS data | ✅ | | | | mix of open licenses: check https://opus.nlpl.eu/
Wikidata | | ✅ | | | CC0
Tensei Shitara Slime Datta Ken Wiki | | ☑️ in task | | | CC BY-SA
WMT 2023 | | | | ✅ | for research
Tensei Shitara Slime Datta Ken Web Novel & web translations | ☑️ in task | | | ✅ | Not included in translation memory
Elasticsearch
Note: you can access a read-only preview of the T-Ragx Elasticsearch service at http://t-ragx-fossil.rayliu.ca:80
(But you will need your own Elasticsearch service to add in-task memories)
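Once your own service is up (see the options below), point the input processor at it with the same call as in the examples above; the host and port here assume a default local install:
import t_ragx

input_processor = t_ragx.Processors.ElasticInputProcessor()
# Point at your own Elasticsearch service instead of the public preview
input_processor.load_general_translation(elasticsearch_host="localhost", elasticsearch_port=9200)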
Install using Docker
See the T-Ragx-Fossil repo
Install Locally
Note: this project was built with Elasticsearch 7
- Download the Elasticsearch binary
- Unzip
- Enter the unzipped folder
- Install the plugins
bin/elasticsearch-plugin install repository-s3
bin/elasticsearch-plugin install analysis-icu
bin/elasticsearch-plugin install analysis-kuromoji
bin/elasticsearch-plugin install analysis-smartcn
- Add the S3 keys

(The snapshot is stored on IDrive e2, which is not S3-compatible enough for Elasticsearch's read-only S3 repository access to work)
These read-only keys will let you connect to the snapshot:
bin/elasticsearch-keystore add s3.client.default.access_key
CG4KwcrNPefWdJcsBIUp
bin/elasticsearch-keystore add s3.client.default.secret_key
Cau5uITwZ7Ke9YHKvWE9cXuTy5chdapBLhqVaI3C
- Add the snapshot repository
curl -X PUT "http://localhost:9200/_snapshot/public_t_ragx_translation_memory" -H "Content-Type: application/json" -d "{\"type\":\"s3\",\"settings\":{\"bucket\":\"t-ragx-public\",\"base_path\":\"elastic\",\"endpoint\":\"o3t0.or.idrivee2-37.com\"}}"
Note: this is the JSON body:
{
"type": "s3",
"settings": {
"bucket": "t-ragx-public",
"base_path": "elastic",
"endpoint": "o3t0.or.idrivee2-37.com"
}
}
- Restore the snapshot

If you use a GUI client, e.g. elasticvue, you can likely do this through its interface.
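Without a GUI, the standard Elasticsearch snapshot REST API works as well. A sketch using the stock endpoints; the snapshot name is not documented here, so take it from the listing:
# List the snapshots available in the registered repository
curl -X GET "http://localhost:9200/_snapshot/public_t_ragx_translation_memory/_all"

# Restore one of them (replace <snapshot_name> with a name from the listing)
curl -X POST "http://localhost:9200/_snapshot/public_t_ragx_translation_memory/<snapshot_name>/_restore"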